{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"],"fields":{"title":{"boost":1000.0},"text":{"boost":1.0},"tags":{"boost":1000000.0}}},"docs":[{"location":"","title":"Umbra Notes","text":""},{"location":"books/api_design_patterns/part1/chapter1/","title":"Introduction to APIs","text":"<p>API: Application Programming Interface</p>"},{"location":"books/api_design_patterns/part1/chapter1/#what-are-web-apis","title":"What are web APIs?","text":"<ul> <li>An API defines the way in which computer systems interact.</li> <li>We can find APIs in the standard libraries.</li> <li>A special type of API is built to be exposed over a network and used remotely: \"web APIs\".</li> <li>Those building the API have a great deal of control, whereas the users have relatively little.</li> <li>Web APIs allow you to expose functionality without exposing the implementation.<ul> <li>Sometimes they allow users to take advantage of massive compute.</li> </ul> </li> </ul>"},{"location":"books/api_design_patterns/part1/chapter1/#what-are-resource-oriented-apis","title":"What are resource-oriented APIs?","text":"<ul> <li>Many web APIs act like servants.<ul> <li>You ask them to do something, and they go off and do it.</li> </ul> </li> <li>This is called a remote procedure call (RPC).</li> </ul>"},{"location":"books/api_design_patterns/part1/chapter1/#so-why-arent-all-apis-rpc-orinented","title":"So why aren't all APIs RPC-oriented?","text":"<p>One of the main reasons is the idea of statefulness.</p> <ul> <li>Stateless: When an API call can be made independently from all other API requests, with no additional context.</li> <li>Stateful: When a web API stores context about a user from previous API requests. 
For example, a web API that stores a user's favourite cities and provides weather forecasts for just those has no runtime inputs but requires a state to be set by the user.</li> </ul> <p>Consider the following API method names:</p> <ol> <li><code>ScheduleFlight()</code></li> <li><code>GetFlightDetails()</code></li> <li><code>ShowAllFlights()</code></li> <li><code>CancelReservation()</code></li> <li><code>RescheduleFlight()</code></li> <li><code>UpgradeTrip()</code></li> </ol> <p>Each one of these RPCs is pretty descriptive, but we have to memorise these methods, each of which is subtly different. </p> <ul> <li>e.g. sometimes we talk about a flight, other times we talk about a trip or a reservation.</li> <li>We also need to memorise which action is used in the method.<ul> <li>Was it <code>ShowFlights()</code>, <code>ShowAllFlights()</code>, <code>ListFlights()</code>, etc.?</li> </ul> </li> </ul> <p>We need to standardise by providing a standard set of building blocks: a standard method applied to a resource.</p> <ol> <li><code>CreateFlightReservation()</code></li> <li><code>GetFlightReservation()</code></li> <li><code>ListFlightReservations()</code></li> <li><code>DeleteFlightReservation()</code></li> <li><code>UpdateFlightReservation()</code></li> </ol> <p>Resource-oriented APIs will be much easier for users to learn, understand and remember.</p> <ul> <li>Standardisation makes it easy to combine what you already know (the set of standard actions) with the resource, which is easy to learn.</li> </ul>"},{"location":"books/api_design_patterns/part1/chapter1/#what-makes-an-api-good","title":"What makes an API \"good\"?","text":"<p>What is the purpose of building an API in the first place?</p> <ol> <li>We have some functionality that some users want.</li> <li>Those users want to use this functionality programmatically.</li> </ol>"},{"location":"books/api_design_patterns/part1/chapter1/#operational","title":"Operational","text":"<ul> <li>The system as a whole must be operational.<ul> <li>It must do the 
thing users actually want.</li> </ul> </li> <li>Non-operational requirements: It must perform how the user expects.<ul> <li>e.g. latency</li> </ul> </li> </ul>"},{"location":"books/api_design_patterns/part1/chapter1/#expressive","title":"Expressive","text":"<ul> <li>The system needs to allow users to express the thing they want to do clearly and simply.</li> <li>The API should be designed such that there is a clear and simple way to do so.</li> <li>Avoid workarounds - when users want some functionality that the API does not offer directly and have to reach it in a roundabout way, this is called a workaround.<ul> <li>e.g. If you have a translation API, users can create a language-detection feature by repeatedly calling the translate endpoint.</li> </ul> </li> </ul>"},{"location":"books/api_design_patterns/part1/chapter1/#simple","title":"Simple","text":"<ul> <li>We could think of simplicity as the number of endpoints.<ul> <li>However, an API that relies on a single <code>ExecuteAction()</code> method just shifts complexity from one place to another.</li> </ul> </li> <li>APIs should aim to expose the functionality users want in the most straightforward way possible, making the API as simple as possible, but no simpler.</li> <li>Make the common case fast<ul> <li>Whenever you add something that might complicate the API for the benefit of an advanced user, it is best to keep this complexity hidden from a basic user.</li> <li>This keeps the more frequent scenarios simple and easy whilst enabling advanced features for those who want them.</li> <li>e.g. Imagine a translation API. 
<code>GET /translate?lang=en</code>: requiring the user to specify a language model as a mandatory field is complex for the average user and will slow down basic scenarios.</li> </ul> </li> </ul>"},{"location":"books/api_design_patterns/part1/chapter1/#predictable","title":"Predictable","text":"<p>Predictable APIs rely on repeated patterns applied to both the API surface definition and the behaviour.</p> <p>Users very rarely learn an entire API; they learn the parts they need and make assumptions when they need to make additions. e.g. if a query parameter is called text in one endpoint, it should not be called string or query in another.</p> <p>APIs that rely on repeated, predictable patterns are easier and faster to learn, and therefore better.</p>"},{"location":"books/api_design_patterns/part1/chapter1/#summary","title":"Summary","text":"<ul> <li>Interfaces are contracts that define how two systems should interact with one another.</li> <li>APIs are a special type of interface.</li> <li>Web APIs are again a special type of API that is exposed over a network.</li> <li>Resource-oriented APIs are a way of designing APIs to reduce complexity by relying on a standard set of actions, called methods, across a limited set of resources.</li> <li>Good APIs are generally: operational, expressive, simple and predictable.</li> </ul>"},{"location":"books/api_design_patterns/part1/chapter2/","title":"Introduction to API Design Patterns","text":""},{"location":"books/api_design_patterns/part1/chapter2/#what-are-api-design-patterns","title":"What are API Design Patterns?","text":"<p>A software design pattern is a particular design that can be applied over and over to lots of similar software problems, with only minor adjustments. It is not a pre-built library but more of a blueprint for solving similarly structured problems.</p> <ul> <li>Most often, design patterns focus on specific components rather than entire systems.<ul> <li>e.g. 
If you want to add a logging system, you can use the singleton design pattern.</li> <li>This pattern is not a complete logging system.</li> <li>However, it's a well-defined and well-tested pattern to follow when you need to solve the small, compartmentalised problem of always having a single instance of a class.</li> </ul> </li> </ul>"},{"location":"books/api_design_patterns/part1/chapter2/#why-are-api-design-patterns-important","title":"Why are API Design Patterns Important?","text":"<ul> <li>While having programmatic access to a system is very valuable, it's also much more fragile and brittle.<ul> <li>Changes to the interface can easily cause failures for those using the interface.</li> </ul> </li> <li>We refer to this aspect as flexibility.<ul> <li>Interfaces where users can easily accommodate changes are flexible.<ul> <li>GUIs are flexible: e.g. moving a button.</li> </ul> </li> <li>Interfaces where even small changes cause complete failures are rigid.<ul> <li>Backend APIs: changing a query parameter breaks old client code.</li> </ul> </li> </ul> </li> <li>Rigid interfaces make it much more difficult to iterate toward a great design.<ul> <li>We are often stuck with all design decisions, both good and bad.</li> </ul> </li> </ul> <p>Pagination Pattern: The pagination pattern is a way of retrieving a long list of items in smaller, more manageable chunks. The pattern relies on extra fields on both the request and response.</p> <p>Moving from a non-paginated to paginated response pattern:</p> <p>Q. 
What happens if we don't start with the pattern?</p> <ol> <li>All previously written clients expect all the data in one list - they have no way of requesting subsequent pages.</li> <li>Clients are left to think they have all the data - which can lead to incorrect conclusions.</li> </ol>"},{"location":"books/api_design_patterns/part2/chapter3/","title":"Naming","text":"<p>In every software system we build, and every API we design or use, there are names that will live far longer than we ever intend them to. It is important to choose great names.</p>"},{"location":"books/api_design_patterns/part2/chapter3/#why-do-names-matter","title":"Why do names matter?","text":"<p>When designing and building an API, the names we use will be seen by and interacted with by all users of the API.</p>"},{"location":"books/api_design_patterns/part2/chapter3/#what-makes-a-name-good","title":"What makes a name \"good\"?","text":""},{"location":"books/api_design_patterns/part2/chapter3/#expressive","title":"Expressive","text":"<p>It is critical that a name clearly convey the thing it is naming.</p> <ul> <li>e.g. 
The term topic is used in both messaging and machine learning.</li> <li>If your project includes both, using the name topic will be confusing.</li> <li>A more expressive name is required:<ul> <li><code>topic_model</code></li> <li><code>topic_message</code></li> </ul> </li> </ul>"},{"location":"books/api_design_patterns/part2/chapter3/#simple","title":"Simple","text":"<ul> <li>While an expressive name is important, it can also become burdensome if the name is excessively long without adding additional clarity.</li> <li>Names should be expressive but only to the extent that each additional part of a name adds value to justify its presence.</li> <li>On the other hand, names shouldn't be oversimplified.</li> </ul> <table> <thead> <tr> <th>Name</th> <th>Note</th> </tr> </thead> <tbody> <tr> <td><code>UserSpecifiedPreferences</code></td> <td>Expressive, but not simple enough</td> </tr> <tr> <td><code>UserPreferences</code></td> <td>Both simple &amp; expressive</td> </tr> <tr> <td><code>Preferences</code></td> <td>Too simple</td> </tr> </tbody> </table>"},{"location":"books/api_design_patterns/part2/chapter3/#predictable","title":"Predictable","text":"<ul> <li>In general, we should use the same name to represent the same thing, and different names to represent different things.</li> <li>The basic goal is to allow users of an API to learn one name and continue building on that knowledge to be able to predict what future names would look like.</li> </ul>"},{"location":"books/api_design_patterns/part2/chapter3/#language-grammar-syntax","title":"Language, Grammar &amp; Syntax","text":"<p>Language being inherently flexible and ambiguous can be a good thing and a bad thing.</p> <ul> <li>On the one hand, ambiguity allows us to name things to be general enough to support future work.<ul> <li>Naming <code>image_url</code> rather than <code>jpeg_url</code> prevents us from limiting ourselves to a single image format.</li> </ul> </li> <li>On the other hand, when there are multiple ways to express the same thing, we often tend to use them interchangeably, which ultimately makes our naming choices unpredictable.</li> 
</ul>"},{"location":"books/api_design_patterns/part2/chapter3/#language","title":"Language","text":"<p>Use American English.</p>"},{"location":"books/api_design_patterns/part2/chapter3/#grammar","title":"Grammar","text":""},{"location":"books/api_design_patterns/part2/chapter3/#imperative-actions","title":"Imperative Actions","text":"<p>REST standard verbs should use the imperative mood. They are all commands or orders.</p> <ul> <li><code>isValid()</code>: Should it return a simple boolean field? Should it return a list of errors?</li> <li><code>GetValidationErrors()</code>: Clear that it will return a list of errors, or an empty list if the input is valid.</li> </ul>"},{"location":"books/api_design_patterns/part2/chapter3/#prepositions","title":"Prepositions","text":"<ul> <li>If a Library API wants to list <code>Book</code> resources with the <code>Author</code>, it's tempting to name the method <code>BooksWithAuthor</code>.</li> <li>This falls apart when we add in all our additional resources.</li> <li>We will end up with many function names to call.</li> <li>The preposition <code>with</code> is indicative of a more fundamental problem.</li> <li>Prepositions act like a code smell, hinting at something not being quite right.</li> </ul>"},{"location":"books/api_design_patterns/part2/chapter3/#pluralisation","title":"Pluralisation","text":"<ul> <li>Most often, we should use the singular.</li> <li>However, collection names might be pluralised.<ul> <li>Use American English to pluralise.</li> </ul> </li> </ul>"},{"location":"books/api_design_patterns/part2/chapter3/#context","title":"Context","text":"<ul> <li>When we use <code>book</code> in the library API, we are referring to the resource; however, in a flight booking API we are referring to an action.</li> </ul> <p>This means we need to keep the context of our API in mind.</p> <ul> <li>Context can impart additional value to a name that might otherwise lack a specific meaning.</li> <li>It can also lead us astray when we use words with a specific 
meaning but don't make sense without the context.<ul> <li>record is very generic, until you consider the context of an audio recording API.</li> </ul> </li> </ul>"},{"location":"books/api_design_patterns/part2/chapter3/#data-types-and-units","title":"Data types and units","text":"<p>A name can become clearer when using a richer data type.</p> <ul> <li><code>dimensions: String;</code> - this is ambiguous</li> <li><code>dimensions: Dimensions;</code> (where <code>Dimensions</code> is an object)</li> </ul>"},{"location":"books/api_design_patterns/part2/chapter4/","title":"Resource Scope and Hierarchy","text":""},{"location":"books/api_design_patterns/part2/chapter4/#what-is-a-resource-layout","title":"What is a resource layout?","text":"<p>The arrangement of resources in our API, the fields that define those resources, and how those resources relate to one another through those fields.</p> <p>In other words, resource layout is the entity (resource) relationship model for a particular design of an API.</p>"},{"location":"books/api_design_patterns/part2/chapter4/#types-of-relationships","title":"Types of Relationships","text":""},{"location":"books/api_design_patterns/part2/chapter4/#reference-relationships","title":"Reference Relationships","text":"<p>The simplest way for two resources to relate to one another is by a simple reference.</p> A message resource contains a reference to a specific user who authored the message. 
<ul> <li>This reference relationship is sometimes referred to as a foreign key relationship.</li> <li>As a result, this can also be considered a many-to-one relationship.<ul> <li>A user might write many messages, but a message always has one user as the author.</li> </ul> </li> </ul>"},{"location":"books/api_design_patterns/part2/chapter4/#self-reference-relationships","title":"Self-Reference Relationships","text":"An employee resource points at other employee resources as managers and assistants."},{"location":"books/api_design_patterns/part2/chapter4/#hierarchical-relationships","title":"Hierarchical Relationships","text":"<ul> <li>Hierarchical relationships are sort of like one resource having a pointer to another.<ul> <li>But that pointer aims upward and implies more than just one resource pointing at another.</li> </ul> </li> <li>Hierarchies also tend to reflect containment or ownership between resources.</li> </ul> ChatRoom resources act as the owner of Message resources through a hierarchical relationship. <p>In this case, there is an implied hierarchy of <code>ChatRooms</code> containing or owning <code>Messages</code>.</p>"},{"location":"books/api_design_patterns/part2/chapter4/#choosing-the-right-relationship","title":"Choosing the Right Relationship","text":""},{"location":"books/api_design_patterns/part2/chapter4/#do-you-need-a-relationship-at-all","title":"Do you need a relationship at all?","text":"<p>When building an API, after we've chosen the list of things or resources that matter to us, the next step is to decide how these resources relate to one another.</p> <ul> <li>Consider a self-reference relationship between <code>Users</code>. A single change to one resource can affect millions of other related resources.</li> <li>e.g. if someone famous deletes their Instagram account, millions of records might need to be removed/updated.</li> </ul> <p>Reference relationships should be purposeful and fundamental to the desired behaviour. 
Any reference relationship should be something important for the API to accomplish its primary goal.</p>"},{"location":"books/api_design_patterns/part2/chapter4/#references-or-in-line-data","title":"References or in-line data","text":"<ul> <li>Where data is in-lined, we only need a single API call to retrieve all the relevant information.</li> <li>But what if we aren't interested in that information very often?<ul> <li>Then our response is bloated.</li> </ul> </li> </ul> <p>Optimise for the common case - without compromising the feasibility of the advanced case.</p>"},{"location":"books/api_design_patterns/part2/chapter4/#hierarchy","title":"Hierarchy","text":"<p>The biggest differences with this type of relationship are the cascading effect of actions and the inheritance of behaviours and properties from parent to child.</p> <ul> <li>Deleting a parent resource typically implies deleting its child resources.</li> <li>Access to a parent generally implies the same level of access to its child resources.</li> </ul>"},{"location":"books/api_design_patterns/part2/chapter4/#anti-patterns","title":"Anti-patterns","text":""},{"location":"books/api_design_patterns/part2/chapter4/#resources-for-everything","title":"Resources for Everything","text":"<p>It can often be tempting to create resources for even the tiniest concept you might want to model.</p> <p>Rule of thumb: If you don't need to interact with one of your resources independently of the resource it's associated with, then it can probably be a data type.</p>"},{"location":"books/api_design_patterns/part2/chapter4/#deep-hierarchies","title":"Deep Hierarchies","text":"<p>Overly deep hierarchies can be confusing and difficult to manage.</p> <p>Page 63 4.3.3 in-line everything</p>"},{"location":"books/designing_data_intensive_applications/part1/chapter1/","title":"Chapter 1: Reliable, Scalable and Maintainable Applications","text":"<p>Many applications today are data-intensive, as opposed to compute-intensive. 
Raw CPU power is rarely a limiting factor for these applications.</p> <p>A data-intensive application is built from the following building blocks:</p> <ul> <li>Store data so that they, or another application, can find it again later (databases)</li> <li>Remember the result of an expensive operation, to speed up reads (caches)</li> <li>Allow users to search data by keyword or filter it in various ways (search indexes)</li> <li>Send a message to another process, to be handled asynchronously (stream processing)</li> <li>Periodically crunch a large amount of accumulated data (batch processing)</li> </ul>"},{"location":"books/designing_data_intensive_applications/part1/chapter1/#thinking-about-data-systems","title":"Thinking about Data Systems","text":"<p>A database and a message queue are quite similar. They both store data for some time - though they have very different access patterns, which means different performance characteristics and thus very different implementations.</p> <p>Boundaries between these implementations are becoming slightly blurred. 
There are data-stores that are also used as message queues (Redis) and there are message queues with database-like durability guarantees (Apache Kafka).</p> One possible architecture for a data system that combines several components <p>When you combine several tools in order to provide a service, the service's interface or application programming interface (API) usually hides those implementation details from clients.</p> <ul> <li>Reliability: The system should continue to work correctly (performing the correct function at the desired level of performance) even in the face of adversity (hardware or software faults, even human error).</li> <li>Scalability: As the system grows (in data volume, traffic volume or complexity), there should be reasonable ways of dealing with that growth.</li> <li>Maintainability: Over time, many different people will work on the system (engineering and operations, both maintaining current behaviour and adapting the system to new use cases), and they should all be able to work on it productively.</li> </ul>"},{"location":"books/designing_data_intensive_applications/part1/chapter1/#reliability","title":"Reliability","text":"<ul> <li>The application performs the function that the user expected.</li> <li>It can tolerate the user making mistakes or using the software in unexpected ways.</li> <li>Its performance is good enough for the required use case, under the expected load and data volume.</li> <li>The system prevents any unauthorized access and abuse.</li> </ul> <p>Things that can go wrong are called faults. Systems that anticipate faults and can cope with them are called fault-tolerant or resilient. Fault tolerance does not mean making a system tolerant of all faults, but only tolerating certain types of faults.</p> <p>NOTE: A fault is not the same as a failure. 
</p> <ul> <li>A fault is defined as one component of the system deviating from its spec.</li> <li>A failure is when the system as a whole stops providing the required service to the user.</li> </ul> <p>It is impossible to reduce the probability of a fault to zero; therefore it is best to design fault-tolerance mechanisms that prevent faults from causing failures.</p>"},{"location":"books/designing_data_intensive_applications/part1/chapter1/#hardware-faults","title":"Hardware Faults","text":"<p>Hard disks are reported as having a mean time to failure (MTTF) of about 10 to 50 years. So on a storage cluster with 10,000 disks, we should expect on average one disk to die per day.</p> <p>A good countermeasure is redundancy. Disks may be set up in RAID configurations, servers can have dual power supplies, etc. When a component dies, the redundant component can take its place whilst the broken one is being replaced. This approach cannot completely prevent hardware problems from causing failures, but it is well understood and can often keep a machine running uninterrupted for years.</p> <p>However, as data volumes and applications' computing demands have increased, more applications have begun using larger numbers of machines, which proportionally increases the rate of hardware faults. Moreover, in some cloud platforms such as AWS it is fairly common for virtual machine instances to become unavailable without warning, as the platforms are designed to prioritise flexibility and elasticity over single-machine reliability.</p> <p>Hence there is a move toward systems that can tolerate the loss of entire machines, by using software fault-tolerance techniques in preference or in addition to hardware redundancy. 
Such systems also have operational advantages: a single-server system requires planned downtime, whereas a system that can tolerate machine failure can be patched one node at a time with no downtime of the entire system (rolling upgrade).</p>"},{"location":"books/designing_data_intensive_applications/part1/chapter1/#software-faults","title":"Software Faults","text":"<p>Hardware faults are normally random and independent from each other. This is not the case for software faults. Software faults can lie dormant for a long time until they are triggered by an unusual set of circumstances. Though there is no quick solution, there are lots of small ones:</p> <ul> <li>Testing</li> <li>Process isolation</li> <li>Allowing crash &amp; restart</li> <li>Measuring and monitoring</li> </ul>"},{"location":"books/designing_data_intensive_applications/part1/chapter1/#human-errors","title":"Human Errors","text":"<p>Humans design and build software systems, and the operators are also human. Humans are unreliable.</p> <p>One study found that hardware faults played a role in only 10%-25% of outages; configuration errors by operators were the leading cause.</p> <ul> <li>Design systems in a way that minimises opportunities for error.<ul> <li>e.g. 
well-designed abstractions, APIs and admin interfaces that make it easy to do \"the right thing\"</li> </ul> </li> <li>Decouple the places where people make the most mistakes from places where they can cause failures.<ul> <li>Provide fully featured non-production sandbox environments.</li> </ul> </li> <li>Test thoroughly at all levels, from unit tests to whole-system integration tests &amp; manual tests.</li> <li>Allow quick and easy recovery from human errors to minimise the impact in the case of failure.<ul> <li>Make it fast to roll back configuration changes</li> <li>Roll out new code gradually</li> <li>Provide tools to recompute data</li> </ul> </li> <li>Set up detailed and clear monitoring, such as performance metrics and error rates.</li> </ul>"},{"location":"books/designing_data_intensive_applications/part1/chapter1/#scalability","title":"Scalability","text":"<p>Even if a system is working reliably today, that doesn't mean it will necessarily work reliably in the future.</p> <p>Scalability is the term we use to describe a system's ability to cope with increased load.</p>"},{"location":"books/designing_data_intensive_applications/part1/chapter1/#describing-load","title":"Describing Load","text":"<p>Load can be described with a few numbers which we call load parameters. These parameters depend on the architecture of the system. They might be:</p> <ul> <li>Requests per second</li> <li>Ratio of reads to writes</li> <li>Number of simultaneous active users</li> <li>Hit rate on cache</li> </ul> <p>Consider Twitter as an example: it has two main operations, post tweet and home timeline. There are two ways of implementing these.</p> <p>Approach 1: Posting a tweet simply inserts the new tweet into a global collection of tweets. When a user requests their home timeline, look up all the people they follow, find all the tweets for each of those users and merge them (sorting on time). 
In a relational database, this could be written as: <pre><code>SELECT tweets.*, users.*\nFROM tweets\nJOIN users ON tweets.sender_id = users.id\nJOIN follows ON follows.followee_id = users.id\nWHERE follows.follower_id = current_user\n</code></pre> Approach 2: Maintain a cache for each user's home timeline - like a mailbox of tweets for each user. When a user posts a tweet, look up all the people who follow that user, and insert the new tweet into each of their home timeline caches. The request to read the home timeline is then cheap, because its result has been computed ahead of time.</p> Twitter's data pipeline for delivering tweets to followers, with load parameters <p>The first version of Twitter used approach 1, but the systems struggled to keep up with the load of home timeline queries, so the company switched to approach 2. The average rate of published tweets is almost two orders of magnitude lower than the rate of home timeline reads, so in this case it's preferable to do more work at write time and less at read time.</p> <p>However, the downside of approach 2 is that posting a tweet now requires a lot of extra work. On average a tweet is delivered to about 75 followers, so 4.6K tweets/second become 345K writes/second to home timeline caches. Moreover, some accounts have over 30 million followers, so a single tweet from such an account may result in over 30 million writes.</p> <p>Twitter uses a hybrid of both solutions. 
For users with smaller follower counts approach 2 is used; however, for celebrity accounts approach 1 is used, and the two timelines are merged together.</p>"},{"location":"books/designing_data_intensive_applications/part1/chapter1/#describing-performance","title":"Describing Performance","text":"<p>Once you have described the load on your system, you can investigate what happens when load increases.</p> <ul> <li>When you increase a load parameter and keep the system resources unchanged, how is the performance of your system affected?</li> <li>When you increase a load parameter, how much do you need to increase the resources if you want to keep performance unchanged?</li> </ul> <p>LATENCY AND RESPONSE TIME</p> <p>Latency and response time are often used synonymously, but they are not the same.  Response time: What the client sees: the sum of service time, network delays and queuing delays. Latency: The duration that a request is waiting to be handled - during which it is latent, awaiting service.</p> Illustrating mean and percentiles: response times for a sample of 100 requests to a service <p>Most requests are reasonably fast, but there are occasional outliers that take much longer. Perhaps these requests are intrinsically more expensive - however, even the same request will see variations due to all manner of reasons.</p> <p>The average response time of a service is commonly reported; however, it is not a very good metric if you want to know your \"typical\" response time - it doesn't tell you how many users actually experienced that delay.</p> <p>Percentiles are a better metric. </p> <ul> <li>Take all response times and sort them; the median is the halfway point. </li> <li>This makes the median a good metric if you want to know how long users typically have to wait: half of users are served in less than the median, the other half longer. The median is also known as the 50<sup>th</sup> percentile and abbreviated as p50.</li> <li>Note this refers to a single request. 
If a user makes multiple requests, the probability that at least one of them is slower than the p50 is much greater than 50%.</li> <li>In order to figure out how bad your outliers are you can look at higher percentiles: the 95<sup>th</sup>, 99<sup>th</sup> and 99.9<sup>th</sup> (abbreviated to p95, p99 and p999).<ul> <li>e.g. if p95 is 1.5 seconds, that means 95 out of 100 requests are served quicker than 1.5 seconds, and 5 are served slower.</li> </ul> </li> <li>High percentiles of response times (also known as tail latencies) are important because they directly affect users' experience of the service.</li> </ul> <p>Amazon describes response time requirements for internal services in terms of p999 even though it only affects 1 in 1000 requests. This is because customers with the slowest requests are often those who have the most data in their accounts (valuable customers).</p> <p>Queuing delays often account for a large part of the response time at high percentiles. It only takes a small number of slow requests to hold up the processing of subsequent requests - known as head-of-line blocking. Due to this it is important to measure response times on the client side.</p> When several back end calls are needed to serve a request, it takes just a single slow back end request to slow down the entire end-user request."},{"location":"books/designing_data_intensive_applications/part1/chapter1/#approaches-for-coping-with-load","title":"Approaches for Coping with Load","text":"<p>Vertical Scaling: Moving to a more powerful machine.</p> <p>Horizontal Scaling: Distributing the load across multiple smaller machines.</p> <p>Some systems are elastic, meaning that they can automatically add computing resources when they detect a load increase. 
Elastic systems are useful if load is unpredictable, but manually scaled systems are simpler and have fewer operational surprises.</p> <p>While distributing stateless services across multiple machines is fairly straightforward, taking stateful data systems from a single node to a distributed setup can introduce a lot of additional complexity. Common wisdom (until recently) was to keep your database on a single node and vertically scale until cost dictated horizontal scaling.</p>"},{"location":"books/designing_data_intensive_applications/part1/chapter1/#maintainability","title":"Maintainability","text":"<p>The majority of the cost of software is not in its initial development, but in its ongoing maintenance:</p> <ul> <li>Fixing bugs</li> <li>Keeping systems operational</li> <li>Investigating failures</li> <li>Adapting to new platforms</li> <li>Modifying it for new use cases</li> <li>Repaying technical debt</li> </ul> <p>Operability: Make it easy for operations teams to keep the system running smoothly.</p> <p>Simplicity: Make it easy for new engineers to understand the system, by removing as much complexity as possible from the system.</p> <p>Evolvability: Make it easy for engineers to make changes to the system in the future, adapting it for unanticipated use cases as requirements change. 
(Also known as extensibility, modifiability or plasticity)</p>"},{"location":"books/designing_data_intensive_applications/part1/chapter1/#operability-making-life-easy-for-operations","title":"Operability: Making Life Easy for Operations","text":"<p>\"Good operations can work around the limitations of bad software, but good software cannot run reliably with bad operations\"</p> <p>Operations teams are responsible for the following:</p> <ul> <li>Monitoring the health of the system and quickly restoring services.</li> <li>Tracking down the cause of problems.</li> <li>Keeping software and platforms up to date, including security patches.</li> <li>Keeping tabs on how different systems affect each other.</li> <li>Anticipating future problems and solving them before they occur.</li> <li>Establishing good practices and tools for deployment and configuration management.</li> <li>Performing complex maintenance tasks such as moving an application from one platform to another.</li> <li>Maintaining the security of the system.</li> <li>Defining processes that make operations predictable and help keep the production environment stable.</li> <li>Preserving the organisation's knowledge about the system, even as individuals come and go.</li> </ul> <p>Good operability means making routine tasks easy - allowing the operations team to focus their efforts on high-value activities. 
Data systems can do various things to make routine tasks easy:</p> <ul> <li>Providing visibility into the runtime behaviour and internals of the system.</li> <li>Providing good support for automation and integration with standard tools.</li> <li>Avoiding dependency on individual machines.</li> <li>Providing good documentation and an easy-to-understand operational model.</li> <li>Providing good default behaviour.</li> <li>Self-healing where appropriate.</li> <li>Exhibiting predictable behaviour, minimising surprises.</li> </ul>"},{"location":"books/designing_data_intensive_applications/part1/chapter1/#simplicity-managing-complexity","title":"Simplicity: Managing Complexity","text":"<p>In complex software, there is a greater risk of introducing bugs when making a change: when the system is harder for developers to understand and reason about, hidden assumptions, unintended consequences, and unexpected interactions are more easily overlooked.</p> <p>Complexity can be accidental. Complexity is accidental if it is not inherent in the problem the software is trying to solve, but arises only from the implementation. One of the best tools for removing accidental complexity is abstraction.</p>"},{"location":"books/designing_data_intensive_applications/part1/chapter1/#evolvability-making-change-easy","title":"Evolvability: Making Change Easy","text":"<p>The ease with which you can modify a data system, and adapt it to changing requirements, is closely linked to its simplicity and its abstractions: simple and easy-to-understand systems are usually easier to modify than complex ones.</p> <p>Evolvability can be thought of as agility at the data system level.</p>"},{"location":"books/designing_data_intensive_applications/part1/chapter2/","title":"Chapter 2: Data Models and Query Languages","text":"<p>Data models are perhaps the most important part of developing software. 
They shape how we think about the problem we are solving.</p> <p>Most applications are built by layering one data model on top of another. For each layer the key question is: how is it represented in terms of the next-lower layer? For example:</p> <ol> <li>An application developer looks at the real world and models it in terms of objects/data structures and APIs that manipulate those data structures.</li> <li>To store those data structures, they are expressed in terms of a general-purpose data model such as JSON documents, tables in a relational database or a graph model.</li> <li>Database engineers then decide how to represent that data in terms of bytes in memory, on disk or on a network. This representation needs to allow querying, updating, deletion etc.</li> <li>Then there is the physical layer of actual electrical signals.</li> </ol>"},{"location":"books/designing_data_intensive_applications/part1/chapter2/#relational-model-vs-document-model","title":"Relational Model Vs Document Model","text":"<p>In a relational model, data is organised into relations (called tables in SQL), where each relation is an unordered collection of tuples (rows in SQL).</p>"},{"location":"books/designing_data_intensive_applications/part1/chapter2/#the-birth-of-nosql","title":"The Birth of NoSQL","text":"<p>#NoSQL is retroactively interpreted as Not Only SQL.</p> <p>There are several driving forces behind the adoption of NoSQL databases:</p> <ul> <li>A need for greater scalability than relational databases can easily achieve, including very large datasets or very high write throughput.</li> <li>A widespread preference for free and open source software over commercial database products.</li> <li>Specialised query operations that are not well supported by the relational model.</li> <li>Frustration with the restrictiveness of relational schemas, and a desire for a more dynamic and expressive data model.</li> </ul>"},{"location":"books/designing_data_intensive_applications/part1/chapter2/#the-object-relational-mismatch","title":"The Object-Relational Mismatch","text":"<p>Most application development today is 
done in OOP, meaning if data is stored in relational tables, an awkward translation layer is required between the objects in application code and the database model of tables, rows and columns. The disconnect between the models is sometimes called an impedance mismatch.</p> <p>Object-relational mapping (ORM) frameworks reduce the amount of boilerplate required for this translation layer, but they cannot completely hide it.</p> <p>For example, storing a resume in a relational schema can be tricky. The profile as a whole can be identified by a unique identifier <code>user_id</code>. Fields like <code>first_name</code> and <code>last_name</code> appear exactly once per user so they can be modeled as columns in the table. However most people have had <code>n</code> jobs: this is a one-to-many relationship, which can be represented in several ways:</p> <ol> <li>In the traditional SQL model, jobs are put in a separate table, with a foreign key referencing the users table.</li> <li>Some databases have added support for multi-valued data to be stored in a single row.</li> <li>Encode this information as JSON in a string field.</li> </ol> Representing a LinkedIn profile using a relational schema. <p>Here is the same data stored as a JSON object:</p> <pre><code>{\n  \"user_id\": 251,\n  \"first_name\": \"Bill\",\n  \"last_name\": \"Gates\",\n  \"summary\": \"Co-chair of the Bill &amp; Melinda Gates... 
Active blogger.\",\n  \"region_id\": \"us:91\",\n  \"industry_id\": 131,\n  \"photo_url\": \"/p/7/000/253/05b/308dd6e.jpg\",\n  \"positions\": [\n    {\n      \"job_title\": \"Co-chair\",\n      \"organization\": \"Bill &amp; Melinda Gates Foundation\"\n    },\n    {\n      \"job_title\": \"Co-founder, Chairman\",\n      \"organization\": \"Microsoft\"\n    }\n  ],\n  \"education\": [\n    {\n      \"school_name\": \"Harvard University\",\n      \"start\": 1973,\n      \"end\": 1975\n    },\n    {\n      \"school_name\": \"Lakeside School, Seattle\",\n      \"start\": null,\n      \"end\": null\n    }\n  ],\n  \"contact_info\": {\n    \"blog\": \"http://thegatesnotes.com\",\n    \"twitter\": \"http://twitter.com/BillGates\"\n  }\n}\n</code></pre> <p>The JSON model reduces the impedance mismatch between the application code and the storage layer. The lack of schema is often cited as an advantage.</p> <p>The JSON representation has better locality than the multi-table schema. If you want to fetch a profile in the relational example, you need to perform multiple queries or a join between two or more tables. In the JSON format all relevant data is in one place.</p> <p>The one-to-many relationships from the user profile to the user's positions, education, contact information etc imply a tree-like structure, and the JSON representation makes this tree structure explicit.</p> One-to-many relationships forming a tree structure"},{"location":"books/designing_data_intensive_applications/part1/chapter2/#many-to-one-and-many-to-many-relationships","title":"Many-to-One and Many-to-Many Relationships","text":"<p>In the previous example <code>region_id</code> and <code>industry_id</code> are given as IDs, not as plain-text strings. 
This has several advantages:</p> <ul> <li>Consistent style</li> <li>Avoids ambiguity (if there are several similarly named cities)</li> <li>Ease of updating - name is only stored in one place</li> <li>Localisation support</li> </ul> <p>Whether you store an ID or a text string is a question of duplication. When you use an ID, the information that is meaningful to humans is stored in only one place and everything that refers to it uses an ID.</p> <p>The advantage of using an ID is that because it has no meaning to humans, it never needs to change: the ID can remain the same, even if the information it identifies changes.</p> <p>Anything that is meaningful to humans may need to change sometime in the future - and if that information is duplicated, all the redundant copies need to be updated.</p> <p>Removing such duplication is the key idea behind normalisation in databases.</p> <p>Even if the initial version of an application fits well in a join-free document model, data has a tendency of becoming more interconnected as features are added to applications. 
See below how adding two extra features turns one-to-many relationships into many-to-many relationships.</p> Extending resumes with many-to-many relationships"},{"location":"books/designing_data_intensive_applications/part1/chapter2/#are-document-databases-repeating-history","title":"Are Document Databases Repeating History","text":"<p>While many-to-many relationships and joins are routinely used in relational databases, document databases and NoSQL reopened the debate on how best to represent such relationships in a database.</p> <p>This debate is much older than NoSQL - going back to the 1970s.</p>"},{"location":"books/designing_data_intensive_applications/part1/chapter2/#the-network-model","title":"The Network Model","text":"<p>In the tree structure of the hierarchical model, every record has exactly one parent; in the network model, a record could have multiple parents.</p> <p>For example, there could be one record for the <code>\"Greater Seattle Area\"</code> region and every user who lived in that region could be linked to it. This allowed many-to-one and many-to-many relationships to be modeled.</p> <p>The links between records in the network model were not foreign keys, but more like pointers in a programming language. The only way of accessing a record was to follow a path from a root record along these chains of links. This was called an access path.</p> <p>In the simplest case, an access path could be like the traversal of a linked list: start at the head of the list and look at one record at a time until you find the one you want. But in a world of many-to-many relationships, several different paths can lead to the same record, and a programmer working with the network model had to keep track of these different access paths in their head.</p> <p>A query was performed by moving a cursor through the database by iterating over lists of records and following access paths. If a record had multiple parents (i.e. 
multiple incoming pointers from other records), the application code had to keep track of all the various relationships.</p>"},{"location":"books/designing_data_intensive_applications/part1/chapter2/#the-relational-model","title":"The Relational Model","text":"<p>What the relational model did, by contrast, was to lay out all the data in the open: a relation (table) is simply a collection of tuples (rows), and that's it. There are no labyrinthine nested structures and no complicated access paths to follow. If you want to query data you can:</p> <ul> <li>Read any or all of the rows in a table, selecting those that match your conditions.</li> <li>Read a particular row by designating some columns as a key and matching on those.</li> <li>Insert a new row into any table without worrying about foreign key relationships to and from other tables.</li> </ul> <p>The query optimiser automatically decides which parts of the query to execute in which order, and which indexes to use.</p> <p>Those choices are effectively the equivalent of the \"access path\", but the big difference is that they are made by the query optimiser, not the application developer.</p>"},{"location":"books/designing_data_intensive_applications/part1/chapter2/#comparison-to-document-databases","title":"Comparison to Document Databases","text":"<p>Document databases reverted to the hierarchical model in one aspect: storing nested records (one-to-many relationships) within their parent record rather than in a separate table.</p> <p>However, when it comes to representing many-to-one and many-to-many relationships, relational and document databases are not fundamentally different: in both cases the related item is referenced by a unique identifier, called a foreign key in the relational model and a document reference in the document model.</p>"},{"location":"books/designing_data_intensive_applications/part1/chapter2/#relational-versus-document-databases-today","title":"Relational Versus Document Databases today","text":"<p>The main arguments in favour of the document data model are schema flexibility, better performance due to locality, and that for some applications it is closer to 
the data structures used by the application.</p> <p>The relational model counters by providing better support for joins, and many-to-one and many-to-many relationships.</p>"},{"location":"books/designing_data_intensive_applications/part1/chapter2/#which-data-model-leads-to-simpler-application-code","title":"Which data model leads to simpler application code?","text":"<p>If data in your application has a document-like structure (i.e. a tree of one-to-many relationships where typically the entire tree is loaded at once), then the document model makes sense.</p> <p>The relational technique of shredding - splitting a document-like structure into multiple tables - can lead to cumbersome schemas and complex code.</p> <p>If a document is deeply nested it can cause problems, as nested items cannot be referenced directly. Instead you have to say something like \"the second item in the list of positions for user 251\".</p> <p>However if your application does use many-to-many relationships, the document model is less appealing. It's possible to reduce the need for joins by denormalising, but then the application code needs to do additional work to keep the denormalised data consistent. Joins can be emulated in application code by making multiple requests to the database, but that moves complexity into the application code, and multiple calls are usually slower than an optimised JOIN performed inside the database.</p>"},{"location":"books/designing_data_intensive_applications/part1/chapter2/#schema-flexibility-in-the-document-model","title":"Schema Flexibility in the Document Model","text":"<p>No schema means that arbitrary keys and values can be added to a document, and when reading, clients have no guarantees as to what fields the documents may contain.</p> <p>Document databases are sometimes called schemaless, but that's misleading, as the code that reads the data usually assumes some kind of structure. A more accurate term is schema-on-read. 
In contrast, with schema-on-write the schema is explicit and the database enforces it on every write.</p> <p>For example, say you are currently storing each user's full name in one field, but now want to store the first name and last name separately. In a document database, the application handles old documents when they are read:</p> <pre><code>if (user &amp;&amp; user.name &amp;&amp; !user.first_name) {\n    // Documents written before Dec 8, 2013 don't have first_name\n    user.first_name = user.name.split(\" \")[0];\n}\n</code></pre> <p>On the other hand, in a \"statically typed\" database schema (the schema-on-write approach), you would typically perform a migration:</p> <pre><code>ALTER TABLE users\nADD COLUMN first_name text;\nUPDATE users\nSET first_name = split_part(name, ' ', 1);\n</code></pre> <p>Altering the table is usually relatively quick; however, running the UPDATE to set every row in a large table is time consuming.</p> <p>The schema-on-read approach is advantageous if the items in the collection don't all have the same structure.</p>"},{"location":"books/designing_data_intensive_applications/part1/chapter2/#data-locality-for-queries","title":"Data Locality for Queries","text":"<p>A document is usually stored as a single continuous string, encoded as JSON or a binary variant of it (such as MongoDB's BSON). If your application often needs access to the entire document (e.g. rendering to a web page), there is a performance advantage to this storage locality. If data is split across multiple tables, multiple index lookups are required to retrieve it all.</p> <p>The database typically needs to load the entire document, even if you access only a small portion of it. On updates to a document, the entire document usually needs to be rewritten - only modifications that don't change the encoded size can be performed in place (rare).</p> <p>For this reason it's recommended to keep documents small and avoid frequent updates.</p> <p>Some relational databases can offer this kind of locality too. Oracle, for example, offers multi-table index cluster tables, which store related rows from different tables close together. 
There is also the column-family concept in Cassandra.</p>"},{"location":"books/designing_data_intensive_applications/part1/chapter2/#convergence-of-document-and-relational-databases","title":"Convergence of document and relational databases","text":"<p>Most relational databases have supported XML since the mid-2000s, and many now support JSON as well.</p> <p>Some document databases now support relational-like joins in their query languages, and some MongoDB drivers automatically resolve database references.</p> <p>It seems that relational and document databases are becoming more similar over time, and that is a good thing: the data models complement each other. If a database is able to handle document-like data and also perform relational queries on it, applications can use the combination of features that best fits their needs.</p>"},{"location":"books/designing_data_intensive_applications/part1/chapter2/#query-languages-for-data","title":"Query Languages for Data","text":"<p>SQL is a declarative query language.</p> <p>Imperative example: <pre><code>function getSharks() {\n    var sharks = [];\n    for (var i = 0; i &lt; animals.length; i++) {\n        if (animals[i].family === \"Sharks\") {\n            sharks.push(animals[i]);\n        }\n    }\n    return sharks;\n}\n</code></pre> In relational algebra, you would instead write: $$ sharks = \\sigma_{family =''Sharks''} (animals) $$</p> <p>Where \\(\\sigma\\) is the selection operator, returning only those animals that match the condition \\(family = ''Sharks''\\). SQL follows this closely.</p> <pre><code>SELECT * FROM animals WHERE family = 'Sharks';\n</code></pre> <p>An imperative language tells the computer to perform certain operations in a certain order.</p> <p>In a declarative query language, you just specify the pattern of the data you want, e.g. what conditions should be met and how the data should be transformed - but not how to achieve that goal. 
The declarative query language hides the implementation details of the database engine. This allows the database engine to be optimised and improved without the need to change any queries.</p> <p>Declarative languages often lend themselves to parallel execution - they specify only the pattern of the results, not the algorithm used to determine them.</p>"},{"location":"books/designing_data_intensive_applications/part1/chapter2/#declarative-queries-on-the-web","title":"Declarative Queries on the Web","text":"<pre><code>&lt;ul&gt;\n    &lt;li class=\"selected\"&gt;&lt;p&gt;Sharks&lt;/p&gt;&lt;/li&gt;\n    &lt;li&gt;&lt;p&gt;Whales&lt;/p&gt;&lt;/li&gt;\n    &lt;li&gt;&lt;p&gt;Fish&lt;/p&gt;&lt;/li&gt;\n&lt;/ul&gt;\n</code></pre> <pre><code>li.selected &gt; p {\n    background-color: blue;\n}\n</code></pre> <p>Here the CSS selector <code>li.selected &gt; p</code> declares the pattern of elements to colour blue: all <code>&lt;p&gt;</code> elements whose direct parent is a <code>&lt;li&gt;</code> element with a class of <code>selected</code>.</p> <p>Doing this with an imperative approach is a nightmare. <pre><code>const liElements = document.getElementsByTagName(\"li\");\nfor (const li of liElements) {\n    if (li.className === \"selected\") {\n        for (const child of li.childNodes) {\n            // tagName is upper case for elements in HTML documents\n            if (child.tagName === \"P\") {\n                child.setAttribute(\"style\", \"background-color: blue\");\n            }\n        }\n    }\n}\n</code></pre></p> <ul> <li>If the selected class is removed because the user clicks onto a different page, the colour won't be removed - even if the code is re-run - so the item will remain highlighted until the page is refreshed. With CSS the browser automatically detects when the rule no longer applies.</li> <li>If you want to take advantage of a new API, such as <code>document.getElementsByClassName()</code>, the code will have to be re-written. 
On the other hand browsers can improve the performance of CSS without breaking compatibility.</li> </ul>"},{"location":"books/designing_data_intensive_applications/part1/chapter2/#mapreduce-querying","title":"MapReduce Querying","text":"<p>MapReduce is a programming model for processing large amounts of data in bulk across many machines. A limited form of it is supported by MongoDB as a mechanism for performing read-only queries across many documents.</p> <p>MapReduce is neither declarative nor imperative but somewhere in between.</p> <p>Example in PostgreSQL <pre><code>SELECT date_trunc('month', observation_timestamp) AS observation_month, sum(num_animals) AS total_animals\nFROM observations\nWHERE family = 'Sharks'\nGROUP BY observation_month;\n</code></pre></p> <p>Example in MongoDB using MapReduce <pre><code>db.observations.mapReduce(\n    function map() {\n        var year = this.observationTimestamp.getFullYear();\n        var month = this.observationTimestamp.getMonth() + 1; // getMonth() is zero-based\n\n        // emit a key-value pair for this document\n        emit(year + \"-\" + month, this.numAnimals);\n    },\n    function reduce(key, values) {\n        return Array.sum(values);\n    },\n    {\n        query: { family: \"Sharks\" },\n        out: \"monthlySharkReport\"\n    }\n);\n</code></pre></p> <p>The <code>map</code> function would be called once for each matching document (e.g. emitting <code>(\"2026-1\", 3)</code> and <code>(\"2026-1\", 4)</code>). Subsequently the <code>reduce</code> function would be called with <code>(\"2026-1\", [3, 4])</code>, returning 7.</p> <p>Map and Reduce functions must be pure with no side effects (no additional db calls). 
This allows them to be run anywhere, in any order and re-run on failure.</p> <p>MongoDB later added the aggregation pipeline as a declarative alternative to MapReduce.</p> <pre><code>db.observations.aggregate([\n    {\n        \"$match\": {\n            \"family\": \"Sharks\"\n        }\n    },\n    {\n        \"$group\": {\n            \"_id\": {\n                \"year\": {\n                    \"$year\": \"$observationTimestamp\"\n                },\n                \"month\": {\n                    \"$month\": \"$observationTimestamp\"\n                }\n            },\n            \"totalAnimals\": {\n                \"$sum\": \"$numAnimals\"\n            }\n        }\n    }\n]);\n</code></pre> <p>The aggregation pipeline language is similar in expressiveness to a subset of SQL, but it uses JSON syntax rather than SQL's English sentence style.</p>"},{"location":"books/designing_data_intensive_applications/preface/","title":"Preface","text":"<p>There have been many developments in distributed systems, databases and the applications built on top of them. There are various driving forces:</p> <ol> <li>Handling huge volumes of data.</li> <li>Businesses need to be agile, test hypotheses cheaply and respond quickly to markets.</li> <li>Free &amp; open source software has become very successful and is now often preferred to commercial or in-house solutions.</li> <li>CPU clock speeds are barely increasing, but multi-core processors are standard and networks are getting faster. Parallelism is only going to increase.</li> <li>Even small teams can now build systems that are distributed across machines and regions - thanks to IaaS (think AWS).</li> <li>Many services are expected to be highly available. Extended downtime is unacceptable.</li> </ol> <p>An application is data-intensive if data is its primary challenge:</p> <ul> <li>The quantity of data.</li> <li>The complexity of data.</li> <li>The speed at which data changes.</li> </ul> <p>This is opposed to compute-intensive, where the CPU is the bottleneck.</p>"}]}