How to Find the Bottom of the Reliability U-Curve (Without Chasing Five Nines)

Most teams argue about “more nines.” The better question is: where is the bottom of your reliability U-curve — and are you left or right of it?

Let’s find out, without chasing the nines.

As a short reminder, the U-curve represents the total cost of reliability, which is the sum of two components:

  • Failure cost (reactive): incidents, war rooms, hotfixes, customer impact, escalation overhead

  • Prevention cost (proactive): SLOs, automation, resiliency patterns, testing, observability, compliance-by-design

(More details here: The Reliability-Cost Inversion Law: Why Reliability Gets Cheaper at Scale — Tech Acceleration & Resilience)

If most of your spend goes into failure remediation, you’re basically funding firefighting: that is the left side of the curve. If you invest too much in prevention, you may be chasing “more nines” with diminishing returns (or building process friction instead of reliability): that is the right side. The bottom of the U-curve is where prevention and failure are balanced and total cost is minimized.
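To make the shape concrete, here is a toy sketch of a U-curve. Both cost functions are assumed shapes chosen only for illustration (failure cost falling with prevention investment, prevention cost growing linearly); they are not measured data and not the model behind the linked article.

```python
# Toy U-curve: total cost = failure cost + prevention cost.
# Both curve shapes below are assumptions for illustration only.

def failure_cost(p: float) -> float:
    """Assumed: failure cost falls as prevention investment p grows."""
    return 1000.0 / (1.0 + p)

def prevention_cost(p: float) -> float:
    """Assumed: prevention cost grows linearly with investment p."""
    return 10.0 * p

def total_cost(p: float) -> float:
    return failure_cost(p) + prevention_cost(p)

# Scan investment levels 0.0, 0.5, ..., 30.0 and pick the cheapest total.
levels = [i * 0.5 for i in range(61)]
best = min(levels, key=total_cost)

print(f"bottom of the curve at p={best}, total cost={total_cost(best)}")
```

Left of `best`, every extra unit of prevention removes more failure cost than it adds; right of it, the extra unit costs more than it saves.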

A lightweight model to place a service on the U-curve

To place one customer journey / service on the curve, I look at three dimensions:

  1. Capability maturity — are we structurally geared toward firefighting or prevention?

  2. Financial baseline — what do failure and prevention cost per month?

  3. Reality check via SLOs — is the service actually healthy or burning error budget?

The ambition is simple: 15 minutes max, so this becomes a habit — not a consulting project.

1) Capability maturity rating

We rate five signals on each side from 1 to 5, so each side totals between 5 and 25. Comparing the two totals tells us whether the organization’s capacity is skewed toward failure or prevention.

  • FC (Failure Cost signals): how much firefighting drag is the org currently paying? Interpretation: higher FC = more firefighting tax (bad)

  • PC (Prevention signals): do we have the prevention mechanisms that reliably reduce failure cost? Interpretation: higher PC = stronger guardrails (good)

Failure Cost signals (reactive bias)

  • Incident blast radius / coordination overhead

  • Incident frequency & severity (customer impact reality)

  • Manual vs automated recovery (MTTR driver)

  • Change failure rate (rollback/hotfix driver)

  • Recurrence / learning loop (root cause removal)

Prevention Cost signals (proactive guardrails)

  • SLOs + error budgets used for decisions (control knob)

  • Actionable observability + alerting (signal > noise)

  • Standardized resilience patterns (reusable guardrails)

  • Resilience validation (pre-prod readiness & game days)

  • Scalable ops (toil reduction + automated controls/compliance + paved roads)

Interpretation rule:

  • Higher Failure score = higher firefighting tax

  • Higher Prevention score = stronger guardrails
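The rating above can be sketched as a small scoring helper. The signal keys are shortened versions of the bullets, and the per-signal ratings are hypothetical values picked so the totals come out at 16 and 10; the function names are made up for this sketch.

```python
# Minimal scoring sketch: five signals per side, each rated 1-5.
FAILURE_SIGNALS = [
    "blast_radius", "incident_frequency", "manual_recovery",
    "change_failure_rate", "recurrence",
]
PREVENTION_SIGNALS = [
    "slo_error_budgets", "observability", "resilience_patterns",
    "resilience_validation", "scalable_ops",
]

def side_score(ratings: dict, signals: list) -> int:
    """Sum the 1-5 ratings for one side; the result is between 5 and 25."""
    total = 0
    for name in signals:
        value = ratings[name]
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be rated 1-5, got {value}")
        total += value
    return total

# Hypothetical ratings, for illustration only.
failure_ratings = dict(zip(FAILURE_SIGNALS, [4, 3, 3, 3, 3]))
prevention_ratings = dict(zip(PREVENTION_SIGNALS, [2, 2, 2, 2, 2]))

fc = side_score(failure_ratings, FAILURE_SIGNALS)
pc = side_score(prevention_ratings, PREVENTION_SIGNALS)
bias = "firefighting" if fc > pc else "prevention" if pc > fc else "balanced"

print(f"FC {fc}/25, PC {pc}/25 -> {bias} bias")
```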

2) Financials

At service level, you can get surprisingly far with numbers you already have.

  • RFC — Reactive Failure Cost (monthly) = incident hours (incl. war room tax) + customer impact + change fallout

  • PPC — Proactive Prevention Cost (monthly) = reliability engineering time + tooling + compliance/controls overhead

That last part matters in Financial Services: compliance effort is real prevention cost — and if it’s mostly manual, it becomes expensive friction.
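As a rough sketch, the two monthly figures are simple sums. The hourly rate used to turn hours into money is an assumption of this sketch (the definitions above do not prescribe one), and every input number below is a placeholder, not real data.

```python
# RFC and PPC as plain monthly sums. All inputs are placeholders.

def reactive_failure_cost(incident_hours: float, hourly_rate: float,
                          customer_impact: float, change_fallout: float) -> float:
    """RFC = incident hours (incl. war room tax), priced per hour,
    plus customer impact plus change fallout."""
    return incident_hours * hourly_rate + customer_impact + change_fallout

def proactive_prevention_cost(engineering_hours: float, hourly_rate: float,
                              tooling: float, compliance_overhead: float) -> float:
    """PPC = reliability engineering time, priced per hour,
    plus tooling plus compliance/controls overhead."""
    return engineering_hours * hourly_rate + tooling + compliance_overhead

rfc = reactive_failure_cost(incident_hours=400, hourly_rate=150,
                            customer_impact=50_000, change_fallout=20_000)
ppc = proactive_prevention_cost(engineering_hours=200, hourly_rate=150,
                                tooling=10_000, compliance_overhead=5_000)

print(f"RFC ${rfc:,.0f}/month, PPC ${ppc:,.0f}/month, ratio {rfc / ppc:.1f}x")
```

Keeping compliance overhead as its own input makes it easy to see how much of PPC is manual control work rather than engineering.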

3) Reality check via SLOs

Finally: how is the service actually doing?

  • Are we green / amber / red?

  • What does SLO compliance say?

  • Is the error budget burning fast?

RO (Reliability Outcome) = SLO compliance and error budget burn. This prevents the classic failure mode: “We invest a lot, therefore we must be reliable.” No, outcomes decide.
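For the error budget question, a minimal sketch using the common burn-rate definition (observed error rate divided by the error rate the SLO allows). The green/amber/red thresholds are assumptions for illustration, and the request counts are placeholders.

```python
# Error budget burn for a request-based SLO. Thresholds are assumed.

def burn_rate(slo_target: float, good_events: int, total_events: int) -> float:
    """Observed error rate divided by the allowed error rate (1 - SLO).
    A value above 1 means the budget runs out before the window ends."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    observed_error_rate = (total_events - good_events) / total_events
    return observed_error_rate / (1.0 - slo_target)

def rag_status(rate: float) -> str:
    """Map burn rate to green/amber/red using assumed thresholds."""
    if rate < 1.0:
        return "green"
    if rate < 2.0:
        return "amber"
    return "red"

# Placeholder numbers: 99.9% SLO, 1,500 bad requests out of 1,000,000.
rate = burn_rate(slo_target=0.999, good_events=998_500, total_events=1_000_000)
status = rag_status(rate)

print(f"burn rate {rate:.2f} -> {status}")
```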

Example: Card Transaction Service (Financial Services)

Here’s a simplified example.

Capability maturity

  • Failure scoring: 16 / 25

  • Prevention scoring: 10 / 25

Interpretation: failure signals dominate; guardrails are weak → classic firefighting bias.

Financial baseline

  • RFC: ~$760k/month (war rooms + customer impact minutes + emergency fixes + change fallout)

  • PPC: ~$78k/month (SLO/observability + automation + control overhead)

Reality check

  • RO: SLO breached; error budget burns fast

Conclusion: you’re left of optimum → prevention is still dramatically cheaper than failure. In other words: you don’t need “five nines.” You need basic guardrails that compound.
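Putting the three dimensions together for this example might look like the sketch below. The decision rule is an assumption for illustration, not a formal definition: it calls a service "left of optimum" only when all three dimensions point the same way, and treats mixed signals as a prompt to look closer.

```python
# Combine maturity, financials, and outcome into a rough U-curve position.
# The rule below is an assumed heuristic, not a formal definition.

def u_curve_position(fc_score: int, pc_score: int,
                     rfc_monthly: float, ppc_monthly: float,
                     slo_breached: bool) -> str:
    firefighting_bias = fc_score > pc_score
    failure_dominates_spend = rfc_monthly > ppc_monthly
    if slo_breached and firefighting_bias and failure_dominates_spend:
        return "left of optimum"
    if not slo_breached and not firefighting_bias and ppc_monthly > rfc_monthly:
        return "right of optimum"
    return "unclear, look closer"

# Figures from the Card Transaction Service example.
position = u_curve_position(fc_score=16, pc_score=10,
                            rfc_monthly=760_000, ppc_monthly=78_000,
                            slo_breached=True)
ratio = 760_000 / 78_000

print(f"{position}: failure costs about {ratio:.1f}x prevention")
```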


Want the worksheet?

If you want the worksheet I use to estimate U-curve position in 15 minutes, DM or comment “INVERSION” and I’ll share it.

→ We explore these ideas in much more depth in our book, Mastering Site Reliability Engineering in Enterprise, a complete guide to building resilient, chaos-tolerant systems, available on Amazon and Springer.

