In today’s hyperconnected world, meeting rapidly growing customer expectations is daunting, yet it is crucial for companies that want to stay competitive and relevant. Adopting SRE fundamentals to align rising customer expectations with organizational capabilities is a practical way to achieve superior service reliability. It is therefore strategically important for businesses to plan and develop a robust SRE practice built on its fundamentals: SLAs, SLOs, and SLIs.
SLAs, SLOs, and SLIs are the primary metrics of Site Reliability Engineering (SRE). Together, these SRE metrics provide a framework to define, measure, and manage the level of service provided to customers.
Let’s look at what these terms stand for and how the metrics relate to one another in SRE.
What is SLA?
An SLA, or service level agreement, is an agreement between an organization and its customers that clearly outlines the product or service on offer in terms of its functionality, uptime, responsiveness, reliability, and performance.
An SLA is all about what an organization promises to the customer. It therefore also includes the remediation the organization provides, such as refunds or free credits, if the SLOs are not met. For this reason, site reliability engineers focus on defining the SLOs included in SLAs. An SLA helps establish transparency and trust between the organization and the customer.
What is SLO?
An SLO, or service level objective, is a specific target within an SLA shared with clients. It sets a target for a metric such as the uptime or response time of the system. You can think of the SLA as the formal agreement between the service provider and its customer, whereas the SLOs are the measurable outcomes the vendor has to deliver to that customer. SLOs tell IT and DevOps teams which goals they have to achieve and give them a yardstick to measure their strategies against.
An SLO is the key threshold value defined for each SLI. Based on SLI metrics, it sets a precise numerical target for reliability or performance. Any change to the product or service must therefore keep the service within these target values.
SLOs are used to keep the service customer-centric and to quantify the reliability of products and services.
What is SLI?
SLIs, or service level indicators, are metrics used to keep the health of a service in check. Defining SLIs takes business objectives and customer expectations into account and produces a playbook of which aspects of the service should be measured, along with a checklist of how to measure them.
An SLI also measures compliance with an SLO. For example, if your SLA states that your systems should be available 99.95% of the time, your SLO is likely 99.95% uptime and your SLI is the actual measurement of your uptime. Maybe it’s 99.96%. Maybe 99.99%. To stay compliant with the SLA, the SLI has to meet or exceed the promises made in that document.
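The uptime example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 30-day window, the 15 minutes of downtime, and the function names `availability_sli` and `meets_slo` are all hypothetical and do not come from any particular monitoring tool.

```python
def availability_sli(total_minutes: int, downtime_minutes: float) -> float:
    """Return measured availability as a percentage of the window."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


def meets_slo(sli_percent: float, slo_percent: float) -> bool:
    """An SLI complies when it meets or exceeds the SLO target."""
    return sli_percent >= slo_percent


# Assumed values: a 30-day window with 15 minutes of recorded downtime.
WINDOW_MINUTES = 30 * 24 * 60
sli = availability_sli(WINDOW_MINUTES, downtime_minutes=15)
print(f"SLI: {sli:.3f}%  meets 99.95% SLO: {meets_slo(sli, 99.95)}")
```

With a 99.95% target over 30 days, the allowed downtime is about 21.6 minutes, so 15 minutes of downtime still complies while 30 minutes would not.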
According to Google’s Site Reliability Workbook, SLIs measure system characteristics such as:
- Error rates
- System throughput
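As a rough sketch of how these two indicators might be computed from raw request records, consider the following. The `Request` record, its fields, and the choice to count status codes of 500 and above as errors are assumptions made for illustration; a real service would define "error" according to its own SLI specification.

```python
from dataclasses import dataclass


@dataclass
class Request:
    timestamp: float  # seconds since the start of the window (assumed field)
    status: int       # HTTP-style status code (assumed field)


def error_rate(requests: list[Request]) -> float:
    """Fraction of requests that failed; 5xx is treated as an error here."""
    if not requests:
        return 0.0
    failed = sum(1 for r in requests if r.status >= 500)
    return failed / len(requests)


def throughput(requests: list[Request], window_seconds: float) -> float:
    """Requests served per second over the measurement window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return len(requests) / window_seconds


# Hypothetical sample: four requests in a four-second window, one failure.
sample = [Request(0.5, 200), Request(1.2, 200), Request(2.0, 503), Request(3.7, 200)]
print(error_rate(sample), throughput(sample, window_seconds=4.0))
```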
SLAs vs. SLOs vs. SLIs: What’s The Difference?
SLAs, SLOs, and SLIs are related concepts, yet they differ considerably from one another.
- SLA or Service Level Agreement stands as a contract where the service provider promises customers service availability, performance, etc.
- SLO or Service Level Objective defines realistic goals that the service provider strives to reach.
- SLI or Service Level Indicator is a measurable metric the service provider tracks to verify that it is achieving the goals defined in the agreement with the client.
An SLA therefore includes reliability values that help customers understand the product’s capabilities. An SLO, on the other hand, defines the threshold values the service must achieve to deliver the reliability that customers and stakeholders expect. An SLI directly measures the system’s behavior at every stage of business operations.
How SRE Fundamentals Help Improve Customer Experience
SRE metrics give SRE teams an insightful perspective. They help teams make decisions with data instead of fuzzy definitions and let them measure whether the service is meeting its reliability targets.
1) Help Define Error Budget
An error budget in SRE helps balance release velocity against system stability. It is tempting to aim for 100% reliability, but the SRE best practice is to keep the SLO target below 100%. The cost and technical complexity of sustaining 100% reliability are immense, and in practice a user often cannot distinguish between 100% reliability and 99.9%, or even 99.0%.
Keeping your SLO below 100% gives you an error budget that allows you to confidently manage risk and make decisions about when to release features without sacrificing user happiness. Ideally, you should not exhaust or come close to exhausting the error budget.
The best practice here is to pause further feature development and focus on restoring stability when you are close to exhausting your error budget. Conversely, while error budget remains, you can continue innovating and adding new features without compromising reliability or system stability.
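The arithmetic behind an error budget, and the stop-or-ship decision it supports, can be sketched as follows. The function names, the 30-day window, and the 25% freeze threshold are assumptions chosen for illustration; each team would set its own policy.

```python
def error_budget_minutes(slo_percent: float, window_minutes: int) -> float:
    """Allowed downtime in the window: the share of time the SLO does not cover."""
    return window_minutes * (100.0 - slo_percent) / 100.0


def budget_remaining(slo_percent: float, window_minutes: int,
                     downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_percent, window_minutes)
    if budget <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return (budget - downtime_minutes) / budget


def release_decision(remaining_fraction: float,
                     freeze_threshold: float = 0.25) -> str:
    """Return 'freeze' when the budget is nearly spent, otherwise 'ship'.

    The 25% threshold is an assumed policy value, not a standard.
    """
    return "freeze" if remaining_fraction <= freeze_threshold else "ship"


# Assumed values: a 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime.
WINDOW_MINUTES = 30 * 24 * 60
remaining = budget_remaining(99.9, WINDOW_MINUTES, downtime_minutes=10)
print(f"{remaining:.0%} of budget left -> {release_decision(remaining)}")
```

Under these assumptions, 10 minutes of downtime leaves most of the budget intact and releases continue, while 40 minutes of downtime would leave under a tenth of the budget and trigger a freeze.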
2) Focus On Customer Happiness
SRE metrics allow you to keep business decisions focused on customer happiness. Teams often find metrics such as CPU and memory utilization intriguing, and graphing them does help diagnose reliability issues, but these metrics mean nothing to end users.
Hence the SLO helps identify the service attributes that matter most to the end-users. Defining SLO also ensures that SRE teams, developer teams, operations teams, and executives understand and measure reliability as it matters to the customer.
3) Help Set the Right Expectations
The purpose of SRE fundamental metrics is to measure customer happiness while protecting the organization from SLA breaches. These metrics help in setting a shared understanding of reliability across product, engineering, and business leadership.
An SLO works as an internal playbook for teams, guiding them toward the goal of keeping the customer happy with just the right level of reliability. It also keeps the team from spending effort on further reliability gains when the customer is already satisfied, so the team can move on to new features instead.
Using SLAs, SLOs, and SLIs helps clearly define expectations for system reliability for both your customers and your SRE teams. Accurately documented SLAs and SLOs are derived from customer needs, and appropriate SLIs are used to verify that those needs are being met. Setting up an error budget against a well-defined internal SLO further enables SRE teams to improve overall system performance by addressing reliability issues.