hacker / welder / mechanic / carpenter / photographer / musician / writer / teacher / student
Musings of an Earth-bound carbon-based life form.
While working at Amazon I’ve had the chance to work on some pretty high-volume systems – some to the tune of 1M+ requests / minute – where uptime was critical along with supporting customers with varying versions of software and hardware. Since then I’ve done a lot of thinking about what has made the teams I worked on successful (and not so successful) and so I would like to take the time to talk about some of the principles that I have learned while building high-stakes services.
Everything has an API and those APIs need to be solid, versioned, and well-documented. In addition, teams building APIs need to also build clients for those APIs – who else knows the API better? At the minimum they should be a simple reference implementation that doesn’t do a lot of extra work; at best they should be fully-supported API clients that other teams can depend on.
All of your interactions with data should be through an API - data storage should be done through a single owner that can make decisions about how to optimally store, cache various parts of, and how to represent that data. In particular, for every data storage engine in use, only one service should directly interact with that data store. The driving principle here is that you have one actor that is interacting with the data store; you can easily understand access patterns and account for connectivity and better manage performance when you only have one service talking to the data store.
REST is great - using path-based resources means someone can talk to your
service with little more than curl
or Postman. Using HTTP is a great way to
make your API accessible to anyone and it’s really easy to secure via TLS/SSL
which means it’s a safe and convenient way for external parties to communicate
with your service. However, when it comes to internal, service-to-service API
calls you are more likely to want RPC-style interfaces for a number of reasons.
However, I have seen teams spend a lot of time building RESTful clients for internal APIs that are constantly changing when they could have spent far less time using something like Thrift to auto-generate internal clients that other teams can consume. It supports versioning so other teams can be loosely coupled without you needing to generate multiple versions of your endpoints, and RPC is often a more natural way of viewing a distributed system from the inside. When you are talking to internal services then you want to treat your other services as extensions of your own. When you are an external customer then you are more likely to treat the service as a collection of resources rather than operations.
If you use REST, use the correct HTTP status codes for your API - 200s for normal responses, 400s when the client has done something wrong, 500s when the server is at fault. Track how many of these happen over time, alert when you start to see a lot more of any of these than normal, particularly 500s.
I have seen APIs that send back a 200 with a blob of XML saying that there was an error when things go wrong. Avoid doing this at all cost, it is very confusing to consumers to say “OK” and then at the same time say “something went wrong”.
One of the fastest ways to build a legacy horror is to permit consumers direct access to the various parts of your system. Doing so prevents your team from changing the way your internal API contracts operate by accruing consumers of APIs that are not intended for the public.
This is why Apple, for example, has rules about accepting iOS applications that consume private APIs; those APIs are very low-level and are not guaranteed to remain constant over time between hardware revisions and OS updates. As a result they have built a public API available via Swift and Objective-C for developers to build software on top of. To ensure a solid customer experience, Apple reserves the right to alter how those internal APIs behave while keeping the public API consistent.
Think of an API gateway as your public interface to your service’s private methods; you wouldn’t permit anyone to call private methods in your library so why would you let them talk to your APIs this way?
Measure twice, cut once – if you can’t measure it, you can’t understand it, if you don’t understand it, you can’t fix or improve it; if you can’t fix or improve it then you might as well just stop now. Metrics collection is so easy these days that you have little-to-no excuse for not measuring almost everything (if not everything) about your system. There are great libraries out there for nearly every language that permit you to collect measurements of your system.
Some easy things to measure, if you are not already, that will tell you a significant amount about your service:
P90 and P99 are often quite interesting latency metrics because that’s where the weird stuff happens and it’s where anywhere from one-in-one hundred to one-in-ten requests are affected. That’s a significant population of consumers that are experiencing that level of latency and should be addressed. Spikes in these ends of the latency spectrum are often associated with abnormal upstream request behavior, larger than normal requests, or other interesting edge cases in your APIs.
Dashboards can be hard to get right; there is rarely one-size-fits-all when it comes to the data that you display. Typically a good dashboard will show:
Within each of these you should be able to drill-down to more detail and see information such as:
Systems fail; software crashes, RAM fails, CPUs overheat, networks flip bits, someone cuts a wire – anything can happen. Part of building systems at scale means that you are building a large enough system with many moving parts that may behave in new and interesting ways.
Don’t plan for the “happy path”; plan for the failure case and be thankful when the right thing does happen. Even simple things like retrying requests can increase your resiliency by a large margin and requires very little additional work on your part. Of course, make sure that you measure when these things happen – emit counters when you retry; track the overall time it takes to fulfill the request (including retries); know how long it takes for you to fail when there’s no response; know how many open connections you have; keep track of everything you can so that you know where, and how, your system is being strained.
When you build a system or infrastructure, ensure that you can withstand some level of failure in your complete system. To achieve this leverage multiple availability zones in AWS or multiple data centers with other providers. Make sure you have enough computing and throughput to withstand a third of your capacity disappearing. Use load balancers with sane algorithms to distribute your load across your fleet so that not any one host is drowning in connections.
Systems should be a collection of interchangeable parts, there should be no snowflakes or singletons. These types of actors represent a single-point-of-failure because they are unique and you cannot easily replace or reconstruct them. In addition to not being replaceable, they are also harder to scale due to their unique traits; You cannot simply add more of them without introducing other issues.
Use tools that encourage automation and homogeneity to ensure that your systems are all configured the same way. Treat deployments as atomic; either your fleet has been upgraded to the new version or it has not been – there should be no middle-ground. To this end, you need to be prepared for a loss of capacity - what if your deployment partially succeeds and now a portion of your fleet is out of service?
Build APIs that can handle smaller amounts of data in a stream, this allows your clients to better handle partial failure and retry accordingly. Similarly, services should avoid bulk-loading data to process into memory - it creates scenarios where you cannot complete your work because of limited resources and it makes your system harder to scale horizontally. If your service crashes while processing one small segment of data it’s easier to pick up where you left off or reprocess it; if the service dies while processing a bulk load of thousands of records, how does the client know where it stopped?
If you have lots of large data to process, look at map-reduce and other streaming algorithms that let you spread the load across multiple nodes to increase both performance and resilience.
When you deploy software at a large enough scale, you cannot afford to take down your entire system to perform an upgrade. It doesn’t matter if you do green-blue deployment strategies or if you simply do rolling upgrades. Every service has some acceptable threshold of lost capacity before it becomes unavailable (for a variety of reasons) so you can’t do a complete replacement of every host’s software simultaneously while keeping the system up. To keep the system up you will inevitably have some two populations providing different versions of your API.
This requires that you focus on a few different key points:
You’ll notice that all of these points emphasize backwards compatibility. This is because in a loosely coupled system you do cannot replace everything at once
State violates the no-snowflake rule; it implicitly requires coordination of many entities to ensure that it is consistent. You cannot avoid state entirely – you will likely need to process transactions somewhere (purchases, reservations, work completion, and so on), save user preferences, or do other things that require coordination. The trick here is to minimize your reliance on this type of data and to isolate your access to it – it’s much easier to coordinate and organize operations on data from a single place than from many places.
Systems that store state internally are not interchangeable; losing a system means you lose a core component of your stack. If you need to store state, store it somewhere that is resilient to loss – Riak, DynamoDB, a master-slave MySQL or PostgreSQL setup, even Redis with snapshots and sharding are good solutions. Control all access to state data through an API and isolate data access in that repository to one, and only one service. Do not permit direct access to data storage to cross service boundaries; not reads and definitely not writes. If performance becomes a concern, which it very well may, then introduce caching mechanisms as appropriate (see below).
Caching is an incredibly important part of achieving high-performance services. Without caching it would be very challenging if not impossible to achieve the high level of throughput that sites like Amazon, Facebook and other high-traffic websites sustain. Pacing caches close to the consumer and choosing caching technologies and strategies that best match your traffic patterns are critical to achieving high-throughput services.
Without caches, there is a lot of additional pressure being placed on upstream data storage mechanisms. Databases, which are really good at keeping transactional data consistent, are not a great mechanism for fast lookup of read-heavy data that does not change. Consider, for a second, a whale user in Twitter; that user may have 10,000,000 viewers. Even if that user published a new tweet once per second, and their fans re-loaded the whale user’s twitter feed once every ten seconds, their write volume is dramatically lower than the read volume on that particular feed. If you had to reconstruct that whale’s twitter feed on every read-request it would become unbearably expensive.
One of the best approaches to caching is to use consistent-hashing of your key-space to spread the cache across a reasonable number of cache hosts. In this strategy if you lose one of your hosts you will lose 1/N of your cached data, where N is the number of hosts in the cache fleet. A lot of modern cache clients support this mechanism of sharding your cache data with storage like memcache or Redis.
Every system is unique and has its own emergent behavior, quirks and patterns. Learn to identify the parts of your system that cause you the most pain and put fixes in place to prevent them from recurring in the future. Putting out fires is important when things are broken but if you don’t have visibility into your system you can’t put long-term fixes in place because you aren’t able to root-cause the problem.