Chapter 1: Reliable, Scalable and Maintainable Applications¶
Many applications today are data-intensive, as opposed to compute-intensive. Raw CPU power is rarely a limiting factor for these applications.
A data-intensive application is built from the following building blocks:
- Store data so that it, or another application, can find it again later (databases)
- Remember the result of an expensive operation, to speed up reads (caches)
- Allow users to search data by keyword or filter it in various ways (search indexes)
- Send a message to another process, to be handled asynchronously (stream processing)
- Periodically crunch a large amount of accumulated data (batch processing)
Thinking about Data Systems¶
A database and a message queue are superficially similar - both store data for some time - but they have very different access patterns, which means different performance characteristics and thus very different implementations.
The boundaries between these categories are becoming blurred. There are datastores that are also used as message queues (Redis), and there are message queues with database-like durability guarantees (Apache Kafka).
When you combine several tools in order to provide a service, the service's interface or application programming interface (API) usually hides those implementation details from clients.
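For example, an application-managed cache in front of a database can be hidden behind a single function, so clients never see which store answered. A minimal read-through sketch, in which plain dicts stand in for the cache and the database (a real service might use Redis and a relational database behind the same API):

```python
cache = {}                          # fast, volatile store (stand-in for e.g. Redis)
database = {1: "alice", 2: "bob"}   # slow, durable store (stand-in for a database)

def get_user(user_id):
    """Return the user record, consulting the cache before the database.

    Clients of this API cannot tell which store served the request -
    the composition of tools is an implementation detail.
    """
    if user_id in cache:
        return cache[user_id]           # cache hit: cheap read
    value = database.get(user_id)       # cache miss: expensive lookup
    if value is not None:
        cache[user_id] = value          # remember the result for next time
    return value

print(get_user(1))  # miss: read from the database, then cached
print(get_user(1))  # hit: served from the cache
```

The caller sees one interface; whether a cache, database, or both sit behind it can change without affecting clients.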
- Reliability: The system should continue to work correctly (performing the correct function at the desired level of performance) even in the face of adversity (hardware or software faults, even human error).
- Scalability: As the system grows (in data volume, traffic volume or complexity), there should be reasonable ways of dealing with that growth.
- Maintainability: Over time, many different people will work on the system (engineering and operations, both maintaining current behaviour and adapting the system to new use cases), and they should all be able to work in it productively.
Reliability¶
- The application performs the function that the user expected.
- It can tolerate the user making mistakes or using the software in unexpected ways.
- Its performance is good enough for the required use case, under the expected load and data volume.
- The system prevents any unauthorized access and abuse.
Things that can go wrong are called faults. Systems that anticipate faults and can cope with them are called fault-tolerant or resilient. Fault tolerance does not mean making a system tolerant of all faults, but only tolerating certain types of faults.
NOTE: A fault is not the same as a failure.
- A fault is defined as one component of the system deviating from its spec.
- A failure is when the system as a whole stops providing the required service to the user.
It is impossible to reduce the probability of a fault to zero; therefore it is best to design fault-tolerance mechanisms that prevent faults from causing failures.
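For example, a transient I/O fault can be prevented from becoming a failure by retrying the operation. A minimal sketch (the `flaky_read` component and its fault are invented for illustration):

```python
import time

def with_retries(operation, attempts=3, delay=0.01):
    """Retry a flaky operation so a transient fault need not become a failure."""
    for attempt in range(attempts):
        try:
            return operation()
        except IOError:
            if attempt == attempts - 1:
                raise           # out of retries: the fault surfaces as a failure
            time.sleep(delay)   # brief pause before trying again

# A component that deviates from its spec (a fault) on the first two calls:
calls = {"n": 0}
def flaky_read():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient fault")
    return "data"

print(with_retries(flaky_read))  # prints "data": the fault never became a failure
```

From the caller's point of view the system kept providing its service, even though a component deviated from its spec twice.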
Hardware Faults¶
Hard disks are reported as having a mean time to failure (MTTF) of about 10 to 50 years. So on a storage cluster with 10,000 disks, we should expect on average one disk to die per day.
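That figure is a straightforward back-of-envelope calculation, assuming an MTTF of 30 years (the middle of the 10-50 year range) and independent failures:

```python
# Expected daily disk failures in a 10,000-disk cluster with a 30-year MTTF.
disks = 10_000
mttf_years = 30
failures_per_day = disks / (mttf_years * 365)
print(f"{failures_per_day:.2f} disk failures per day")  # roughly one per day
```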
A good countermeasure is redundancy. Disks may be set up in RAID configurations, servers can have dual power supplies, etc. When a component dies, the redundant component can take its place while the broken one is replaced. This approach cannot completely prevent hardware problems from causing failures, but it is well understood and can often keep a machine running uninterrupted for years.
However, as data volumes and applications' computing demands have increased, more applications have begun using larger numbers of machines, which proportionally increases the rate of hardware faults. Moreover, in some cloud platforms such as AWS it is fairly common for virtual machine instances to become unavailable without warning, as the platforms are designed to prioritise flexibility and elasticity over single-machine reliability.
Hence there is a move toward systems that can tolerate the loss of entire machines, by using software fault-tolerance techniques in preference or in addition to hardware redundancy. Such systems also have operational advantages: a single-server system requires planned downtime, whereas a system that can tolerate machine failure can be patched one node at a time with no downtime of the entire system (rolling upgrade).
Software Faults¶
Hardware faults are normally random and independent from each other. This is not the case for software faults. Software faults can lie dormant for a long time until they are triggered by an unusual set of circumstances. Though there is no quick solution, there are lots of small ones:
- Testing
- Process isolation
- Allowing crash & restart
- Measuring and monitoring
Human Errors¶
Humans design and build software systems, and the operators are also human. Humans are unreliable.
One study of large internet services found that hardware faults played a role in only 10-25% of outages; configuration errors by operators were the leading cause.
- Design systems in a way that minimises opportunities for error.
- e.g. well designed abstractions, APIs and admin interfaces that make it easy to do "the right thing"
- Decouple the places where people make the most mistakes from places where they can cause failures
- Provide fully featured non-production sandbox environments.
- Test thoroughly at all levels, from unit tests to whole-system integration tests & manual tests.
- Allow quick and easy recovery from human errors to minimise the impact in the case of failure.
- Make it fast to roll back configuration changes
- Roll out new code gradually
- Provide tools to recompute data
- Set up detailed and clear monitoring, such as performance metrics and error rates.
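As an illustration of rolling out new code gradually, a deterministic hash of the user ID can place a stable fraction of users on the new code path. The helper below is hypothetical (production systems usually rely on a feature-flag or deployment service for this):

```python
import hashlib

def in_rollout(user_id: str, percent: int) -> bool:
    """Deterministically place a user in the first `percent` of a rollout.

    Hashing gives each user a stable bucket, so the same users stay on the
    new code path as the percentage is ramped up - and rolling back is as
    simple as lowering the percentage.
    """
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = digest[0] * 256 + digest[1]    # stable bucket in 0..65535
    return bucket < 65536 * percent // 100

users = [f"user-{i}" for i in range(1000)]
enabled = sum(in_rollout(u, 10) for u in users)
print(f"{enabled} of {len(users)} users see the new code at 10%")
```

Because bucket assignment is deterministic, every user enabled at 10% remains enabled at any higher percentage.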
Scalability¶
Even if a system is working reliably today, that doesn't mean it will necessarily work reliably in the future.
Scalability is the term we use to describe a system's ability to cope with increased load.
Describing Load¶
Load can be described with a few numbers which we call load parameters. The best choice of parameters depends on the architecture of the system. They might be:
- Requests per second
- Ratio of reads to writes
- Number of simultaneous active users
- Hit rate on cache
Consider Twitter as an example: its two main operations are post tweet and home timeline, and there are two ways of implementing them.
Approach 1: Posting a tweet simply inserts the new tweet into a global collection of tweets. When a user requests their home timeline, look up all the people they follow, find all the tweets for each of those users and merge them (sorted by time). In a relational database, the query would look like:
SELECT tweets.*, users.*
FROM tweets
JOIN users ON tweets.sender_id = users.id
JOIN follows ON follows.followee_id = users.id
WHERE follows.follower_id = current_user
Approach 2: Maintain a cache for each user's home timeline. When a user posts a tweet, look up all the people who follow them and insert the new tweet into each of their home timeline caches. Reading the home timeline is then cheap, because its result has been computed ahead of time.
The first version of Twitter used approach 1, but the systems struggled to keep up with the load of home timeline queries, so the company switched to approach 2. The average rate of published tweets is almost two orders of magnitude lower than the rate of home timeline reads, so in this case it's preferable to do more work at write time and less at read time.
However, the downside of approach 2 is that posting a tweet now requires a lot of extra work. On average a tweet is delivered to about 75 followers, so 4.6k tweets/second become 345k writes/second to the home timeline caches. Now consider that some accounts have over 30 million followers: a single tweet from such an account results in over 30 million writes.
Twitter uses a hybrid of both approaches. For users with smaller follower counts, approach 2 is used; for celebrity accounts, approach 1 is used instead, and tweets from both sources are merged together when the home timeline is read.
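The hybrid scheme can be sketched in a few lines of Python. This is a hypothetical in-memory model, not Twitter's implementation: ordinary users' tweets are fanned out to follower timeline caches at write time (approach 2), while tweets from accounts above a follower threshold are merged in at read time (approach 1).

```python
from collections import defaultdict

FANOUT_LIMIT = 2                      # toy threshold; the real one is far higher
followers = {}                        # author -> set of follower ids
timelines = defaultdict(list)         # user -> precomputed timeline cache
celebrity_tweets = defaultdict(list)  # celebrity author -> their tweets

def post_tweet(author, text):
    if len(followers.get(author, ())) > FANOUT_LIMIT:
        celebrity_tweets[author].append(text)   # approach 1: no fan-out on write
    else:
        for f in followers.get(author, ()):     # approach 2: fan-out on write
            timelines[f].append(text)

def home_timeline(user, following):
    merged = list(timelines[user])              # cheap: precomputed at write time
    for author in following:                    # merge celebrity tweets at read time
        merged.extend(celebrity_tweets[author])
    return merged

followers["alice"] = {"bob"}                    # small account: fanned out
followers["celeb"] = {"bob", "carol", "dave"}   # above the fan-out limit
post_tweet("alice", "hi from alice")
post_tweet("celeb", "hi from celeb")
print(home_timeline("bob", ["alice", "celeb"]))
```

The write-time cost is bounded for celebrity accounts, while ordinary reads stay cheap - each approach is used where its cost profile is favourable.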
Describing Performance¶
Once you have described the load on your system, you can investigate what happens when load increases.
- When you increase a load parameter and keep the system resources unchanged, how is the performance of your system affected?
- When you increase a load parameter, how much do you need to increase the resources if you want to keep performance unchanged?
LATENCY AND RESPONSE TIME
Latency and response time are often used synonymously, but they are not the same.
- Response time is what the client sees: the sum of the service time, network delays and queueing delays.
- Latency is the duration that a request is waiting to be handled - during which it is latent, awaiting service.
Most requests are reasonably fast, but there are occasional outliers that take much longer. Perhaps these requests are intrinsically more expensive - however even the same request will see variations due to all manner of reasons.
The average response time of a service is commonly reported, but it is not a very good metric if you want to know your "typical" response time: it doesn't tell you how many users actually experienced that delay.
Percentiles are a better metric.
- Take all the response times, sort them, and the median is the halfway point.
- This makes the median a good metric if you want to know how long users typically have to wait: half of users are served in less than the median, the other half longer. The median is also known as the 50th percentile and abbreviated as p50.
- Note this refers to a single request. If a user makes multiple requests, the probability that at least one of them is slower than the p50 is much greater than 50%.
- In order to figure out how bad your outliers are you can look at higher percentiles: the 95th, 99th and 99.9th (abbreviated to p95, p99 and p999).
- e.g. if p95 is 1.5 seconds, that means 95 out of 100 requests are served quicker than 1.5 seconds, and 5 are served slower.
- High percentiles of response times (also known as tail latencies), are important because they directly affect users' experience of the service.
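Percentiles are straightforward to compute from raw response times. The sketch below uses the nearest-rank method on a sorted list; real monitoring systems typically use streaming approximations over a rolling window rather than storing every measurement.

```python
def percentile(times, p):
    """Return the p-th percentile of `times` using the nearest-rank method."""
    ordered = sorted(times)
    rank = max(1, round(p / 100 * len(ordered)))  # 1-based nearest-rank index
    return ordered[rank - 1]

# 94 fast requests plus a handful of slow outliers (made-up data):
response_times_ms = [100] * 94 + [1500, 2000, 2500, 3000, 5000, 9000]
print("p50 =", percentile(response_times_ms, 50), "ms")  # the typical request
print("p95 =", percentile(response_times_ms, 95), "ms")
print("p99 =", percentile(response_times_ms, 99), "ms")  # dominated by the tail
```

Note how the median stays at 100 ms while the high percentiles expose the outliers - exactly the information an average would wash out (the mean here is about 290 ms, a delay almost no request actually experienced).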
Amazon describes response time requirements for internal services in terms of the p999, even though it only affects 1 in 1,000 requests. This is because the customers with the slowest requests are often those who have the most data in their accounts, i.e. the most valuable customers.
Queueing delays often account for a large part of the response time at high percentiles. It only takes a small number of slow requests to hold up the processing of subsequent requests - known as head-of-line blocking. For this reason it is important to measure response times on the client side.
Approaches for Coping with Load¶
Vertical Scaling: Moving to a more powerful machine.
Horizontal Scaling: Distributing the load across multiple smaller machines.
Some systems are elastic, meaning that they can automatically add computing resources when they detect a load increase. Elastic systems are useful if load is unpredictable, but manually scaled systems are simpler and have fewer operational surprises.
While distributing stateless services across multiple machines is fairly straightforward, taking stateful data systems from a single node to a distributed setup can introduce a lot of additional complexity. Common wisdom (until recently) was to keep your database on a single node and scale vertically until cost dictated horizontal scaling.
Maintainability¶
The majority of the cost of software is not in its initial development, but in its ongoing maintenance:
- Fixing bugs
- Keeping systems operational
- Investigating failures
- Adapting to new platforms
- Modifying it for new use cases
- Repaying technical debt
Operability: Make it easy for operations teams to keep the system running smoothly.
Simplicity: Make it easy for new engineers to understand the system, by removing as much complexity as possible from the system.
Evolvability: Make it easy for engineers to make changes to the system in the future, adapting it for unanticipated use cases as requirements change. (Also known as extensibility, modifiability or plasticity.)
Operability: Making Life Easy for Operations¶
"Good operations can work around the limitations of bad software, but good software cannot run reliably with bad operations"
Operations teams are responsible for the following:
- Monitoring the health of the system and quickly restoring services.
- Tracking down the cause of problems.
- Keeping software and platforms up to date, including security patches.
- Keeping tabs on how different systems affect each other.
- Anticipating future problems and applying fixes before they occur.
- Establishing good practices and tools for deployment and configuration management.
- Performing complex maintenance tasks such as moving an application from one platform to another.
- Maintaining the security of the system.
- Defining processes that make operations predictable and help keep the production environment stable.
- Preserving the organisation's knowledge about the system, even as individuals come and go.
Good operability means making routine tasks easy - allowing the operations team to focus their efforts on high-value activities. Data systems can do various things to make routine tasks easy:
- Providing visibility into the runtime behaviour and internals of the system.
- Providing good support for automation and integration with standard tools.
- Avoiding dependency on individual machines.
- Providing good documentation and an easy-to-understand operational model.
- Providing good default behaviour.
- Self-healing where appropriate.
- Exhibiting predictable behaviour, minimising surprises.
Simplicity: Managing Complexity¶
In complex software, there is a greater risk of introducing bugs when making a change: when the system is harder for developers to understand and reason about, hidden assumptions, unintended consequences, and unexpected interactions are more easily overlooked.
Complexity can be accidental: accidental complexity is not inherent in the problem the software is trying to solve, but arises only from the implementation. One of the best tools for removing accidental complexity is abstraction.
Evolvability: Making Change Easy¶
The ease with which you can modify a data system, and adapt it to changing requirements, is closely linked to its simplicity and its abstractions: simple and easy-to-understand systems are usually easier to modify than complex ones.
Evolvability can be thought of as agility at the data-system level.