SRE SLO Tools
As an SRE, you need a solid understanding of the tools available for monitoring and measuring the reliability of critical user journeys. To help with that, I have curated a list of tools for managing and analyzing SLIs and SLOs. Some of them not only measure reliability but also help manage the Error Budget and put clear protocols in place for when the budget is exceeded. Here are the tools I have found so far. Did I miss anything?
+++ Update 05.01.2023: Datadog, Honeycomb
+++ Update 27.01.2023: Keptn
+++ Update 11.02.2023: Uptime, Flux Ninja
+++ Update 14.02.2023: Added a table and some more details to each tool.
+++ Update 11.03.2023: Added features and status to each tool and a diagram
Disclaimer: The information shown is not final. This is a working draft; I'm still working out the details, and all feedback is welcome. Do you think a tool should be described or placed differently? Write a comment! Would a different status or wording be more helpful? Write a comment!
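For readers new to the term, here is a minimal sketch of the Error Budget idea mentioned in the introduction. It assumes an event-based SLO (good events divided by total events); the function names and the example numbers are made up for illustration and are not taken from any of the tools listed below.

```python
def error_budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Return the fraction of the error budget still unspent.

    1.0 means nothing has been spent, 0.0 means the budget is used up,
    and a negative value means the budget is exceeded.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        return 1.0  # no traffic yet, so nothing has been spent
    # The budget is the number of bad events the SLO tolerates.
    allowed_bad = (1.0 - slo_target) * total_events
    return 1.0 - bad_events / allowed_bad


def budget_exceeded(slo_target: float, total_events: int, bad_events: int) -> bool:
    """True when more bad events occurred than the SLO allows."""
    return error_budget_remaining(slo_target, total_events, bad_events) < 0.0


if __name__ == "__main__":
    # Hypothetical numbers: 99.9% target, 1,000,000 requests, 250 failures.
    remaining = error_budget_remaining(0.999, 1_000_000, 250)
    print(f"Error budget remaining: {remaining:.0%}")
```

A check like budget_exceeded is the point where an agreed protocol, for example a feature freeze, would kick in.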
Blameless
A reliability engineering platform that brings together AI-driven incident resolution, blameless retrospectives, SLOs/Error Budgets, and reliability insights reports and dashboards.
Cloud Only: No
Open Source: No
Features: Wide Range
Status: Examine & Test
Datadog SLOs
Track, manage, and monitor the status of all your SLOs and error budgets.
https://www.datadoghq.com/blog/define-and-manage-slos/
Cloud Only: Yes
Open Source: No
Features: Focused
Status: Common Use
Flux Ninja
Reliability automation for cloud native apps.
Cloud Only: Yes
Open Source: No
Features: Focused
Status: Examine & Test
Google Service Mesh SLO
The SLO tool in the Google service monitoring toolkit (GCP only).
https://cloud.google.com/service-mesh/docs/observability/slo-overview
Cloud Only: Yes
Open Source: No
Features: Focused
Status: Common Use
Harness SLO
Full service reliability suite with a platform for modern software delivery.
https://www.harness.io/products/service-reliability-management
Cloud Only: No
Open Source: No
Features: Wide Range
Status: Common Use
Honeycomb SLOs
Observability solution incl. SLOs.
Cloud Only: No
Open Source: No
Features: Wide Range
Status: Examine & Test
Keptn
Automated configuration of observability tools, creation of dashboards, and alerting based on Service-Level Objectives.
Cloud Only: Yes
Open Source: Yes
Features: Wide Range
Status: Examine & Test
Last9
SRE platform to gain visibility and adopt SLOs.
Cloud Only: Yes
Open Source: No
Features: Focused
Status: Interesting & Trending
Nobl9
SLOs from existing monitoring incl. a service health dashboard.
Cloud Only: No
Open Source: No
Features: Focused
Status: Examine & Test
OpenSLO
SLO language that declaratively defines reliability and performance targets using a simple YAML specification.
Cloud Only: No
Open Source: Yes
Features: Focused
Status: Interesting & Trending
Prometheus + Grafana (custom)
Open-source monitoring solutions with which you can build your own custom SLO dashboards.
https://prometheus.io/ and https://grafana.com/
Cloud Only: Yes
Open Source: Yes
Features: Wide Range
Status: Common Use
Rely.io
SLO dashboard incl. alerting from SLOs and Error Budget.
https://www.rely.io/features/slos-and-error-budgets
Cloud Only: No
Open Source: No
Features: Focused
Status: Examine & Test
RunWhen
Visualization of the reliability dependencies between teams and technologies. It also connects SLOs to automatically generated Runbooks.
Cloud Only: No
Open Source: No
Features: Focused
Status: Interesting & Trending
SLO Computer
Set and monitor SLOs for all services seamlessly and quickly.
https://github.com/last9/slo-computer
Cloud Only: No
Open Source: Yes
Features: Focused
Status: Interesting & Trending
SLOGen
Tool to create and manage content for reliability tracking from logs/event data.
https://github.com/OpenSLO/slogen
Cloud Only: No
Open Source: Yes
Features: Focused
Status: Interesting & Trending
SLO Exporter
Slo-exporter computes standardized SLI and SLO metrics based on events coming from various data sources.
https://github.com/seznam/slo-exporter
Cloud Only: No
Open Source: Yes
Features: Focused
Status: Interesting & Trending
Sloth.dev
Simple Prometheus SLO generator.
Cloud Only: No
Open Source: Yes
Features: Focused
Status: Interesting & Trending
SLO Tracker
An open-source tool that tracks SLOs and burn rate, designed to make Error Budget and SLO tracking simpler.
Cloud Only: No
Open Source: Yes
Features: Focused
Status: Interesting & Trending
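To make "burn rate" concrete, the following is a rough sketch of the usual definition: the observed error rate divided by the error rate the SLO allows. It is not how SLO Tracker itself is implemented; the function names and numbers are assumptions chosen for illustration.

```python
def burn_rate(slo_target: float, total_events: int, bad_events: int) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 spends the budget exactly over the SLO window;
    2.0 spends it twice as fast.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        return 0.0  # no traffic, nothing burned
    observed_error_rate = bad_events / total_events
    allowed_error_rate = 1.0 - slo_target
    return observed_error_rate / allowed_error_rate


def hours_until_exhausted(rate: float, window_hours: float) -> float:
    """How long a full budget lasts if the burn rate stays constant."""
    if rate <= 0.0:
        return float("inf")
    return window_hours / rate


if __name__ == "__main__":
    # Hypothetical numbers: 99% target, 10,000 requests, 200 failures,
    # measured against a 30-day (720-hour) SLO window.
    rate = burn_rate(0.99, 10_000, 200)
    print(f"Burn rate: {rate:.1f}")
    print(f"Budget lasts about {hours_until_exhausted(rate, 720):.0f} hours")
```

Alerting on a high burn rate rather than on raw error counts ties the alert directly to how quickly the Error Budget is being consumed.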
Uptime
Simple SLA calculator.
Cloud Only: No
Open Source: Yes
Features: Focused
Status: Examine & Test
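The arithmetic behind an SLA calculator is short enough to sketch here. This is an illustration of the general calculation, not the Uptime tool's own code; the function name and the example targets are assumptions.

```python
from datetime import timedelta


def allowed_downtime(availability_percent: float, period: timedelta) -> timedelta:
    """Return the downtime that an availability target permits within a period."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    # The unavailable fraction of the period is whatever the target leaves over.
    return period * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    month = timedelta(days=30)  # simplifying assumption: a 30-day month
    for target in (99.0, 99.9, 99.99):
        print(f"{target}% over 30 days allows {allowed_downtime(target, month)} of downtime")
```

Running the numbers like this before committing to a target shows how quickly each additional "nine" shrinks the room for incidents and maintenance.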
Key for the table:
Features focused: The scope of the application is focused on some specific features.
Features wide range: The application has a broad scope and features in multiple directions.
Status Interesting & Trending: The tool has unique or innovative features that make it stand out from other tools, and it accomplishes its intended purpose effectively and efficiently. It is gaining popularity with a particular group of people because of factors such as affordability, ease of use, or the distinctive way it solves a particular problem, which leads to increased adoption.
Status Examine & Test: It is worth taking a closer look to explore the applicability with your team. Put the tool through various scenarios to evaluate how it performs under your specific conditions, such as regulatory requirements, security, speed, capacity, scalability, and compatibility. Try it out and make sure that it meets your needs and performs effectively and efficiently.
Status Common Use: The tool is widely used and accepted by many people, organizations, or industries. It is typically well established, with a proven track record and reliability that has been demonstrated over time. It is often supported by a large user community that provides feedback and support, and by a wide range of resources such as documentation, tutorials, and support forums, which make it more accessible to users.
Common use tools can also have industry-wide standards and regulations associated with them, which may be used to evaluate their performance and ensure their compatibility with other systems and tools. Overall, this status indicates a high level of acceptance and usage within a particular context or industry, and that the tool is likely to be a reliable and effective solution for its intended purpose.