SRE SLO Tools

SRE

As an SRE, you need a solid understanding of the tools used to monitor and measure the reliability of critical user journeys. To help with that, I have curated a list of tools for managing and analyzing SLIs and SLOs. Some of them not only measure reliability but also help manage the Error Budget and put clear protocols in place for when the budget is exceeded. Here are the tools I have found so far. Did I miss anything?

+++ Update 05.01.2023: Datadog, Honeycomb

+++ Update 27.01.2023: Keptn

+++ Update 11.02.2023: Uptime, Flux Ninja

+++ Update 14.02.2023: Added a table and some more details to each tool.

+++ Update 11.03.2023: Added features and status to each tool and a diagram



Disclaimer: The information shown is not final. This is a working draft; I'm still working out the details, and all feedback is welcome. Do you have an opinion, or do you think a tool should be described or placed differently? Write a comment! Would a different status or different wording be helpful? Write a comment!





Blameless

A reliability engineering platform that brings together AI-driven incident resolution, blameless retrospectives, SLOs/Error Budgets, and reliability insights reports and dashboards.

https://www.blameless.io/

  • Cloud Only: No

  • Open Source: No

  • Features: Wide Range

  • Status: Examine & Test



Datadog SLOs

Track, manage, and monitor the status of all your SLOs and error budgets.

https://www.datadoghq.com/blog/define-and-manage-slos/

  • Cloud Only: Yes

  • Open Source: No

  • Features: Focused

  • Status: Common Use



Flux Ninja

Reliability automation for cloud native apps.

https://www.fluxninja.com/

  • Cloud Only: Yes

  • Open Source: No

  • Features: Focused

  • Status: Examine & Test



Google Service Mesh SLO

The SLO tool in the Google service monitoring toolkit (GCP only).

https://cloud.google.com/service-mesh/docs/observability/slo-overview

  • Cloud Only: Yes

  • Open Source: No

  • Features: Focused

  • Status: Common Use



Harness SLO

Full service reliability suite with a platform for modern software delivery.

https://www.harness.io/products/service-reliability-management

  • Cloud Only: No

  • Open Source: No

  • Features: Wide Range

  • Status: Common Use



Honeycomb SLOs

Observability solution incl. SLOs.

https://www.honeycomb.io/slo

  • Cloud Only: No

  • Open Source: No

  • Features: Wide Range

  • Status: Examine & Test



Keptn

Automated configuration of observability tools, creation of dashboards, and alerting based on Service-Level Objectives.

https://keptn.sh

  • Cloud Only: Yes

  • Open Source: Yes

  • Features: Wide Range

  • Status: Examine & Test



Last9

SRE platform to gain visibility and adopt SLOs.

https://last9.io

  • Cloud Only: Yes

  • Open Source: No

  • Features: Focused

  • Status: Interesting & Trending



Nobl9

SLOs from existing monitoring incl. a service health dashboard.

https://www.nobl9.com/

  • Cloud Only: No

  • Open Source: No

  • Features: Focused

  • Status: Examine & Test



OpenSLO

SLO language that declaratively defines reliability and performance targets using a simple YAML specification.

https://openslo.com/

  • Cloud Only: No

  • Open Source: Yes

  • Features: Focused

  • Status: Interesting & Trending



Prometheus + Grafana (custom)

Open Source monitoring solutions where you can custom build your SLO dashboards.

https://prometheus.io/ and https://grafana.com/

  • Cloud Only: Yes

  • Open Source: Yes

  • Features: Wide Range

  • Status: Common Use



Rely.io

SLO dashboard incl. alerting from SLOs and Error Budget.

https://www.rely.io/features/slos-and-error-budgets

  • Cloud Only: No

  • Open Source: No

  • Features: Focused

  • Status: Examine & Test



RunWhen

Visualization of the reliability dependencies between teams and technologies. It also connects SLOs to automatically generated Runbooks.

https://www.runwhen.com

  • Cloud Only: No

  • Open Source: No

  • Features: Focused

  • Status: Interesting & Trending



SLO Computer

Set and monitor SLOs for all services seamlessly and quickly.

https://github.com/last9/slo-computer

  • Cloud Only: No

  • Open Source: Yes

  • Features: Focused

  • Status: Interesting & Trending



SLOGen

Tool to create and manage content for reliability tracking from logs/event data.

https://github.com/OpenSLO/slogen

  • Cloud Only: No

  • Open Source: Yes

  • Features: Focused

  • Status: Interesting & Trending



SLO Exporter

Slo-exporter computes standardized SLI and SLO metrics based on events coming from various data sources.

https://github.com/seznam/slo-exporter

  • Cloud Only: No

  • Open Source: Yes

  • Features: Focused

  • Status: Interesting & Trending
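To make "SLI metrics based on events" concrete, here is a minimal sketch of the general idea: classify each event as good or bad, then divide good events by total events. This is not slo-exporter's implementation; the `Event` type, the `is_good` rule, and the 300 ms latency threshold are assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Event:
    """Hypothetical request event; field names are illustrative only."""
    status_code: int
    latency_ms: float


def is_good(event: Event, latency_threshold_ms: float = 300.0) -> bool:
    # Assumed rule: no server error and fast enough.
    return event.status_code < 500 and event.latency_ms <= latency_threshold_ms


def compute_sli(events: List[Event], latency_threshold_ms: float = 300.0) -> Optional[float]:
    """Return the fraction of good events, or None if there are no events."""
    if not events:
        return None
    good = sum(1 for e in events if is_good(e, latency_threshold_ms))
    return good / len(events)


events = [
    Event(200, 120.0),  # good
    Event(200, 450.0),  # bad: too slow
    Event(503, 80.0),   # bad: server error
    Event(204, 90.0),   # good
]
sli = compute_sli(events)
slo_target = 0.99  # assumed target for the example
print(f"SLI: {sli:.2%}, meets SLO: {sli >= slo_target}")
```

The SLO is then simply a target for this ratio over a time window; which events count as "good" is the part each team has to define for its own user journeys.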



Sloth.dev

Simple Prometheus SLO generator.

https://sloth.dev

  • Cloud Only: No

  • Open Source: Yes

  • Features: Focused

  • Status: Interesting & Trending



SLO Tracker

Track SLO and burn rate, an open-source tool designed to make Error Budget and SLO tracking simpler.

https://slotracker.com

  • Cloud Only: No

  • Open Source: Yes

  • Features: Focused

  • Status: Interesting & Trending
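For readers new to the terms, the arithmetic behind Error Budget and burn rate can be sketched as follows. This is a generic illustration, not code from SLO Tracker; the function names and the sample numbers are assumptions.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of events allowed to fail, e.g. about 0.001 for a 99.9% SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed error rate divided by the allowed error rate.

    A value of 1.0 means the budget is consumed exactly at the pace the SLO
    allows; a value above 1.0 means it will run out before the window ends.
    """
    if total_events == 0:
        return 0.0
    return (bad_events / total_events) / error_budget(slo_target)


def budget_remaining(bad_events: int, total_events: int, slo_target: float) -> float:
    """Share of the error budget still unspent (negative once it is exceeded)."""
    if total_events == 0:
        return 1.0
    allowed_bad = error_budget(slo_target) * total_events
    return 1.0 - bad_events / allowed_bad


# Assumed sample numbers: 10,000 requests against a 99.9% SLO.
print(f"burn rate: {burn_rate(20, 10_000, 0.999):.2f}")
print(f"budget remaining: {budget_remaining(5, 10_000, 0.999):.0%}")
```

With 20 failures in 10,000 requests the error rate is twice what a 99.9% target allows, so the burn rate is about 2; with only 5 failures, about half of the budget is left.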



Uptime

Simple SLA calculator.

https://uptime.is

  • Cloud Only: No

  • Open Source: Yes

  • Features: Focused

  • Status: Examine & Test
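What such an SLA calculator does can be sketched in a few lines: convert an availability percentage into the downtime it permits per period. This is an independent illustration, not the code behind uptime.is; the period lengths (a 30-day month, a 365-day year) are assumptions.

```python
from datetime import timedelta

# Assumed period lengths; a real calculator may define them differently.
PERIODS = {
    "day": timedelta(days=1),
    "week": timedelta(weeks=1),
    "month": timedelta(days=30),
    "year": timedelta(days=365),
}


def allowed_downtime(availability_percent: float, period: timedelta) -> timedelta:
    """Return the downtime permitted by an availability target over a period."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return period * unavailable_fraction


for name, length in PERIODS.items():
    print(f"99.9% per {name}: {allowed_downtime(99.9, length)}")
```

For a 99.9% target this comes to roughly 43 minutes per 30-day month, which shows how quickly the allowance shrinks with every additional nine.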



Key for the table:

  • Features focused: The scope of the application is focused on some specific features.

  • Features wide range: The application has a broad scope and features in multiple directions.

  • Status Interesting & Trending: A tool that is interesting and trending typically has unique or innovative features that make it stand out from other tools, as well as being effective and efficient at accomplishing its intended purpose. It gains popularity or attention from a particular group of people due to factors such as affordability, ease of use, or unique features. Its ability to solve a particular problem in a unique way or provide innovative and effective features captures the attention of a significant number of people, leading to increased adoption and popularity.

  • Status Examine & Test: It is worth taking a closer look to explore the applicability with your team. Put it through various scenarios to evaluate its performance under unique conditions like regulatory requirements, security, speed, capacity, scalability, and compatibility. Try it out. Ensure that it meets your needs and performs effectively and efficiently.

  • Status Common Use: The tool is widely used and accepted by many people, organizations, or industries. A common use tool is typically well-established, and its benefits and effectiveness are widely recognized. Common use tools often have a proven track record of success, and their reliability and effectiveness have been demonstrated over time. They are often supported by a large user community, which can provide valuable feedback and support to users. Additionally, common use tools may have a wide range of available resources, such as documentation, tutorials, and support forums, which make them more accessible to users.

    Common use tools can also have industry-wide standards and regulations associated with them, which may be used to evaluate their performance and ensure their compatibility with other systems and tools. Overall, a tool being common use indicates that it has a high level of acceptance and usage within a particular context or industry, and that it is likely to be a reliable and effective solution for the intended purpose.
