Google provides a comprehensive guide to cloud reliability

24 January, 2023

Google recently published a cloud infrastructure reliability guide for its customers that combines best practices with the expertise of its engineers.

Reliable infrastructure is a critical requirement for cloud workloads. To design reliable infrastructure for your workloads as a cloud architect, you need a thorough understanding of the reliability capabilities of your chosen cloud provider.
The guide also offers guidelines for assessing the reliability requirements of customer workloads and presents architectural recommendations for building and managing reliable infrastructure on Google Cloud.

Reliability Overview
An application or workload is reliable when it meets its current goals for availability and fault tolerance.

Availability (or uptime) is the percentage of time an application is usable. For example, for an application that has an availability target of 99.99%, the total downtime should not exceed 8.64 seconds over a 24-hour period.
Availability is sometimes measured as the proportion of requests that the application serves successfully during a given period. For example, for an application that has an availability goal of 99.99%, no more than ten requests can fail for every 100,000 requests received. Availability is often expressed as the number of nines in a percentage. For example, 99.99% availability is expressed as “4 nines”.
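
The arithmetic behind both definitions can be checked with a short sketch. The function names below are illustrative and are not taken from Google's guide. The target is passed as a string so that Python's decimal module keeps the calculation exact.

```python
from decimal import Decimal

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds in a 24-hour period


def unavailable_fraction(target_percent: str) -> Decimal:
    """Fraction of time (or of requests) that is allowed to fail."""
    return 1 - Decimal(target_percent) / 100


def max_downtime_seconds(target_percent: str, period_seconds: int) -> Decimal:
    """Time-based availability: downtime budget for the given period."""
    return unavailable_fraction(target_percent) * period_seconds


def max_failed_requests(target_percent: str, total_requests: int) -> Decimal:
    """Request-based availability: failure budget for the given request count."""
    return unavailable_fraction(target_percent) * total_requests


if __name__ == "__main__":
    # Downtime budget for a 99.99% target over one day.
    print(float(max_downtime_seconds("99.99", SECONDS_PER_DAY)))
    # Failure budget for a 99.99% target over 100,000 requests.
    print(float(max_failed_requests("99.99", 100_000)))
```

For a 99.99% target, the two calls reproduce the figures above: 8.64 seconds of downtime per 24 hours and ten failed requests per 100,000.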

Depending on its purpose, an application may be judged against a different set of reliability metrics. Examples include:

For applications that serve content, availability, latency, and throughput are important reliability metrics.
For databases and storage systems, the reliability metrics are latency, throughput, availability, and durability (how well the data is protected against loss or corruption).
For big data and analytics workloads, such as data processing pipelines, consistent pipeline performance (throughput and latency) is an important reliability metric because it ensures the freshness of data products. Throughput indicates how much data can be processed, and latency indicates how long the pipeline takes to go from ingesting data to processing it.
For most applications, data correctness is an essential reliability metric.

Finally, Google offers further guidance, including models and best practices, for building scalable and resilient applications.