
The Calculus of Service Availability

The Calculus of Service Availability by Ben Treynor, Mike Dahlin, Vivek Rau, and Betsy Beyer explores a simple but important truth in large systems: a service is only as available as the dependencies it relies on.

The article shows why reliability cannot be treated as a vague goal. It has to be understood in practical terms and built into the design of the system. When a service depends on databases, storage, caches, RPC services, or other internal platforms, each of those dependencies shapes what users actually experience. Strong availability does not come from making one service look reliable on its own. It comes from understanding the full chain behind it and designing around its limits.

One of the most useful ideas in the article is the "rule of the extra 9." If a service aims for 99.99% availability, each of its critical dependencies needs to offer one additional 9, that is, 99.999%. Otherwise, too much of the service’s allowed downtime is lost to dependency failures before its own problems are even counted. This makes reliability feel much more concrete: it is not just a target for one team, but a shared responsibility across the whole system.
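The arithmetic behind this rule can be sketched in a few lines of Python. This is a minimal illustration, not code from the article: the function names are hypothetical, and the model assumes that every dependency is critical and that failures are independent, so availabilities simply multiply.

```python
from typing import Sequence

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability: float) -> float:
    """Downtime budget implied by an availability target."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def serial_availability(own: float, dependencies: Sequence[float]) -> float:
    """Availability seen by users when every dependency is critical.

    Assumption: failures are independent, so the service is up only when
    its own code and all of its dependencies are up at the same time.
    """
    result = own
    for dep in dependencies:
        result *= dep
    return result


if __name__ == "__main__":
    target = 0.9999
    print("budget (min/year):", downtime_minutes_per_year(target))

    # Three critical dependencies that are only as good as the service itself.
    same_nines = serial_availability(0.9999, [0.9999] * 3)
    # The same dependencies with one extra 9 each.
    extra_nine = serial_availability(0.9999, [0.99999] * 3)

    print("deps at 99.99%: ", same_nines)
    print("deps at 99.999%:", extra_nine)
```

A 99.99% target leaves roughly 52.6 minutes of downtime per year. Under these assumptions, three critical dependencies at 99.99% pull the combined figure down to about 99.96%, while the same dependencies at 99.999% keep it near 99.987%, which is why the extra 9 matters.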

The article is especially valuable because it connects availability to day-to-day operations. Reliability is shaped by how often outages happen, how wide their impact is, and how long they last, which in turn depends on how quickly they are detected and repaired. In practice, that means careful rollouts, isolation between shards or regions, fast detection, quick rollback, graceful degradation, and designs that reduce the blast radius of failure. A strong system is not one that never has problems, but one that limits damage and recovers quickly.
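These three levers can be put into a rough formula. The sketch below uses the standard relationship between mean time to failure (MTTF) and mean time to repair (MTTR); the `fraction_affected` parameter is an assumption added here to model blast radius, and the function name is hypothetical.

```python
def availability(mttf_hours: float, mttr_hours: float,
                 fraction_affected: float = 1.0) -> float:
    """Estimate availability from outage frequency, duration, and scope.

    mttf_hours:        mean time between the end of one outage and the
                       start of the next (higher means fewer outages)
    mttr_hours:        mean time to detect and repair an outage
    fraction_affected: share of users or requests hit by a typical outage
                       (assumption: 1.0 means every outage is global)
    """
    unavailable = mttr_hours / (mttf_hours + mttr_hours)
    return 1.0 - fraction_affected * unavailable


if __name__ == "__main__":
    baseline = availability(mttf_hours=999, mttr_hours=1)
    faster_recovery = availability(mttf_hours=999, mttr_hours=0.5)
    smaller_blast_radius = availability(mttf_hours=999, mttr_hours=1,
                                        fraction_affected=0.1)
    print(baseline, faster_recovery, smaller_blast_radius)
```

In this simplified model, halving recovery time or confining each outage to a tenth of the users both improve availability without making outages any rarer, which is the case for investing in rollback speed and shard isolation.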

Another important lesson is that not every dependency should remain critical. Whenever possible, a system should keep working in a reduced but acceptable way instead of failing completely. The article highlights approaches such as fallback behavior, asynchronous design, redundancy, and automated failover. These choices make systems more resilient because they reduce the chance that one broken part will bring down everything else.
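Fallback behavior, the first of these approaches, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the article's own design: `call_with_fallback`, `fetch_recommendations`, and `popular_items` are hypothetical names, and the recommendations backend is simulated as being down.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def call_with_fallback(primary: Callable[[], T],
                       fallback: Callable[[], T]) -> Tuple[T, bool]:
    """Try the primary dependency; on any failure, serve a degraded result.

    Returns the value plus a flag saying whether the response is degraded,
    so callers and monitoring can tell the two cases apart.
    """
    try:
        return primary(), False
    except Exception:
        # The dependency is treated as non-critical: its failure reduces
        # quality but does not fail the whole request.
        return fallback(), True


def fetch_recommendations() -> List[str]:
    """Hypothetical personalized backend, simulated here as unavailable."""
    raise ConnectionError("recommendations backend unavailable")


def popular_items() -> List[str]:
    """Hypothetical static fallback that needs no remote call."""
    return ["item-a", "item-b", "item-c"]


if __name__ == "__main__":
    items, degraded = call_with_fallback(fetch_recommendations, popular_items)
    print(items, "degraded" if degraded else "personalized")
```

The design choice is that the caller still gets a usable response when the backend fails, so the backend no longer counts against the service's own downtime budget in the way a critical dependency would.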

The article also makes clear that better reliability comes with trade-offs. More resilience often means more planning, more testing, more careful engineering, and sometimes higher cost. Redundancy, geographic isolation, and extra capacity can improve availability, but they also add complexity. The strength of the article is that it presents this honestly and shows that reliability becomes far more achievable when it is designed in from the beginning instead of added later as a fix.
