SLO Calculator + Burn-Rate Alerts

Compute SLO error budgets and emit Prometheus burn-rate alert rules following the Google SRE multi-window recipe. The complete toolkit, not just a calculator. Runs entirely in your browser.

SLO definition

cascades into everything below

Availability / Latency / Event rate

Target (%)

Window (days)

Good-events metric (PromQL)

Total-events metric (PromQL)

These metrics drive the PromQL alert rules below. They should match what your service exports to Prometheus.

Alert name prefix

Expected events per window

Error budget

over the 30-day window

Allowed in window

43m 12s downtime
1,000 bad events
0.100% of all requests

Nines reference

your tier highlighted

Availability	Per day	Per 7 days	Per 30 days	Per year
99% (two nines)	14m 24s	1h 40m 48s	7h 12m 0s	3d 15h 36m 0s
99.5%	7m 12s	50m 24s	3h 36m 0s	1d 19h 48m 0s
99.9% (three nines)	1m 26.4s	10m 4.8s	43m 12s	8h 45m 36s
99.95%	43.2s	5m 2.4s	21m 36s	4h 22m 48s
99.99% (four nines)	8.64s	1m 0.5s	4m 19.2s	52m 33.6s
99.999% (five nines)	0.86s	6.05s	25.92s	5m 15.4s

Burn-rate alerts

multi-window / multi-burn-rate

Severity	Triggers when…	Budget at trigger
● Critical (page)	Error rate exceeds 14.4× the budgeted error rate (1 − SLO target) over 1h, and still exceeds it over the last 5m	2.0% in 1h
● Critical (page)	Error rate exceeds 6× the budgeted error rate over 6h, and still exceeds it over the last 30m	5.0% in 6h
◯ Warning (ticket)	Error rate exceeds 2× the budgeted error rate over 24h, and still exceeds it over the last 2h	6.7% in 24h
◯ Warning (ticket)	Error rate exceeds 0.5× the budgeted error rate over 3d, and still exceeds it over the last 6h	5.0% in 3d

checkout-slo-burn-alerts.yaml

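# Burn-rate rules for a 99.9% target (error budget 0.001) over a 30-day window.
# Each rule ANDs a long window (is the burn sustained?) with a short window (is it still happening?).
# PromQL parses "1 - a / b" as 1 - (a / b), so each inner expression is the error ratio.
groups:
  - name: checkout-slo-burn-alerts
    rules: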
      - alert: CheckoutSLOBurnRate-Critical-1h
        expr: |-
          (
            (1 - sum(rate(http_requests_total{code!~"5..",service="checkout"}[1h]))
               / sum(rate(http_requests_total{service="checkout"}[1h])))
            > (14.4 * 0.001)
          )
          and
          (
            (1 - sum(rate(http_requests_total{code!~"5..",service="checkout"}[5m]))
               / sum(rate(http_requests_total{service="checkout"}[5m])))
            > (14.4 * 0.001)
          )
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "CheckoutSLO burning 14.4× over 1h (confirming over 5m)"
          description: "2% of the 30-day budget (0.1% of requests) burned in 1h. Page on-call."
      - alert: CheckoutSLOBurnRate-Critical-6h
        expr: |-
          (
            (1 - sum(rate(http_requests_total{code!~"5..",service="checkout"}[6h]))
               / sum(rate(http_requests_total{service="checkout"}[6h])))
            > (6 * 0.001)
          )
          and
          (
            (1 - sum(rate(http_requests_total{code!~"5..",service="checkout"}[30m]))
               / sum(rate(http_requests_total{service="checkout"}[30m])))
            > (6 * 0.001)
          )
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "CheckoutSLO burning 6× over 6h (confirming over 30m)"
          description: "5% of the 30-day budget (0.1% of requests) burned in 6h. Page on-call."
      - alert: CheckoutSLOBurnRate-Warning-24h
        expr: |-
          (
            (1 - sum(rate(http_requests_total{code!~"5..",service="checkout"}[24h]))
               / sum(rate(http_requests_total{service="checkout"}[24h])))
            > (2 * 0.001)
          )
          and
          (
            (1 - sum(rate(http_requests_total{code!~"5..",service="checkout"}[2h]))
               / sum(rate(http_requests_total{service="checkout"}[2h])))
            > (2 * 0.001)
          )
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CheckoutSLO burning 2× over 24h (confirming over 2h)"
          description: "6.67% of the 30-day budget (0.1% of requests) burned in 24h. File a ticket and investigate."
      - alert: CheckoutSLOBurnRate-Warning-3d
        expr: |-
          (
            (1 - sum(rate(http_requests_total{code!~"5..",service="checkout"}[3d]))
               / sum(rate(http_requests_total{service="checkout"}[3d])))
            > (0.5 * 0.001)
          )
          and
          (
            (1 - sum(rate(http_requests_total{code!~"5..",service="checkout"}[6h]))
               / sum(rate(http_requests_total{service="checkout"}[6h])))
            > (0.5 * 0.001)
          )
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CheckoutSLO burning 0.5× over 3d (confirming over 6h)"
          description: "5% of the 30-day budget (0.1% of requests) burned in 3d. File a ticket and investigate."
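
The four rules above differ only in their windows and burn-rate multiplier, so the expressions can be produced from a small table. The following is a minimal Python sketch of that idea, not the generator this page actually uses: the `RULES` table, `error_ratio`, and `burn_rate_expr` are hypothetical names, and the output adds explicit parentheses around the good/total ratio for readability.

```python
# Hypothetical rule table: (name suffix, long window, short window, burn rate).
RULES = [
    ("Critical-1h", "1h", "5m", 14.4),
    ("Critical-6h", "6h", "30m", 6),
    ("Warning-24h", "24h", "2h", 2),
    ("Warning-3d", "3d", "6h", 0.5),
]


def error_ratio(good, total, window):
    """PromQL for the error ratio (1 - good/total) over one window."""
    return f"(1 - (sum(rate({good}[{window}])) / sum(rate({total}[{window}]))))"


def burn_rate_expr(good, total, long_window, short_window, burn_rate, slo_target):
    """Build the long-window AND short-window burn-rate condition."""
    budget = round(1 - slo_target, 10)  # e.g. 0.001 for a 99.9% target
    threshold = f"({burn_rate:g} * {budget:g})"
    long_check = f"({error_ratio(good, total, long_window)} > {threshold})"
    short_check = f"({error_ratio(good, total, short_window)} > {threshold})"
    return f"{long_check}\nand\n{short_check}"


# Selectors copied from the rule file above.
GOOD = 'http_requests_total{code!~"5..",service="checkout"}'
TOTAL = 'http_requests_total{service="checkout"}'

for name, long_w, short_w, rate in RULES:
    print(f"# {name}")
    print(burn_rate_expr(GOOD, TOTAL, long_w, short_w, rate, slo_target=0.999))
```

Keeping the windows and multipliers in one table makes it harder for the long-window and short-window thresholds of a rule to drift apart when someone retunes them.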

Composite SLO calculator

for service trees with multiple upstream dependencies

Sequential (every step succeeds)Parallel (any one succeeds)

Upstream 1 (%)

Upstream 2 (%)

Effective SLO: 99.85005% — equivalent to 1h 4m 46.7s over a 30-day window.

What's an SLO and why should you have one?

The most-confused triad in SRE is also the most foundational. An SLA (Service Level Agreement) is a contract with a customer — when you miss it, you owe them something concrete: a refund, a credit, a public post-mortem. An SLO (Service Level Objective) is the internal target your team commits to — when you miss it, you change priorities, freeze risky work, redirect engineering to reliability. An SLI (Service Level Indicator) is the actual measurement: request success rate, latency percentile, queue depth.

SLOs matter because they create a shared language for "is this service healthy enough?" that survives team turnover and product pressure. Instead of arguing about whether 27 alerts last week were too many, the team can ask: did we burn the error budget? If yes, slow down. If no, keep shipping.

The complement of the SLO is the error budget: the amount of failure you're allowed before action. A 99.9% SLO over 30 days leaves a 0.1% budget, which is about 43 minutes of full downtime. That budget is what you spend on deploys, experiments, planned maintenance, and the inevitable surprises. When it runs out, the team's options change.
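
The budget arithmetic is small enough to sketch in a few lines of Python. The function name `error_budget` is illustrative, and the figure of 1,000,000 expected events per window is an assumption chosen so that a 0.1% budget works out to 1,000 bad events.

```python
def error_budget(target_pct, window_days, expected_events=0):
    """Return (budget fraction, allowed downtime in seconds, allowed bad events)."""
    budget_fraction = 1.0 - target_pct / 100.0
    downtime_seconds = budget_fraction * window_days * 24 * 60 * 60
    bad_events = budget_fraction * expected_events
    return budget_fraction, downtime_seconds, bad_events


# Assumed traffic: 1,000,000 events over the 30-day window.
fraction, downtime, bad = error_budget(99.9, 30, expected_events=1_000_000)
print(f"{fraction:.3%} of requests, {downtime / 60:.1f} min downtime, {bad:.0f} bad events")
```

The downtime figure treats the budget as time-based (every request fails during an outage); the bad-events figure treats it as request-based. Both come from the same 0.1% fraction.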

Key insight

The SLO is what you commit to internally. The SLA is what you commit to externally. Keep the SLO tighter than the SLA — the gap is your safety margin and the room you have to fix incidents before customers notice.

How burn-rate alerting works

The naive approach to alerting on SLOs is "page on every error." That generates noise, on-call burnout, and ignored alerts the day you actually need them. The error-budget approach inverts the question: don't ask whether any errors happened; ask whether they're happening fast enough to matter.

Pick a budget consumption rate that should fire a page, say 2% of the monthly budget in 1 hour. The error rate corresponding to that consumption is your alert threshold. If the running 1-hour error rate exceeds the threshold, you're burning fast and someone needs to know now. If it doesn't, the service is still within its budget even if some users are seeing errors.

The trick that makes burn-rate alerts quiet as well as responsive is the multi-window check. Each alert is a conjunction of two conditions — a long window (1 hour) confirms "is this real?" and a short window (5 minutes) confirms "is it still happening?" Without the short-window check, you keep paging during recovery. Without the long-window check, you page on every twitch. Both together: you page on sustained problems that are still active, and the alert silences itself when the situation recovers.
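
The conjunction can be illustrated with a short Python sketch. This is not how Prometheus evaluates the rule internally; `should_fire` is a hypothetical helper that takes error rates already measured over the two windows, and the sample rates are made up to show the three cases described above.

```python
def should_fire(long_window_error_rate, short_window_error_rate, burn_rate, slo_target):
    """Fire only if BOTH windows exceed the burn-rate threshold."""
    threshold = burn_rate * (1.0 - slo_target)
    return long_window_error_rate > threshold and short_window_error_rate > threshold


# Ongoing incident: both the 1h and 5m error rates are above 1.44%.
print(should_fire(0.02, 0.03, burn_rate=14.4, slo_target=0.999))    # True

# Recovered: the 1h average is still elevated, but the last 5m are clean.
print(should_fire(0.02, 0.0005, burn_rate=14.4, slo_target=0.999))  # False

# Brief blip: the last 5m spiked, but the 1h average stays under the threshold.
print(should_fire(0.002, 0.05, burn_rate=14.4, slo_target=0.999))   # False
```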

Concrete walkthrough for the canonical 14.4× / 1h / 5m alert at 99.9% SLO:

Error rate threshold: 14.4 × 0.001 = 0.0144 (a 1.44% error rate).
If both the 1-hour error rate and the 5-minute error rate exceed 1.44%, fire the alert.
Budget consumed during that hour: 14.4 × 1h / 720h = 2% of the 30-day budget.
14.4 isn't a round number because it is exactly the multiplier that produces "2% in 1h": 0.02 × 720h / 1h = 14.4.
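
The same walkthrough as a Python sketch, assuming a 99.9% target and a 30-day (720-hour) window. The helper names `burn_rate` and `budget_consumed` are illustrative.

```python
SLO_TARGET = 0.999
SLO_WINDOW_HOURS = 30 * 24  # 720 hours in a 30-day window


def burn_rate(budget_fraction, alert_window_hours):
    """Burn rate that consumes `budget_fraction` of the budget within the alert window."""
    return budget_fraction * SLO_WINDOW_HOURS / alert_window_hours


def budget_consumed(rate, alert_window_hours):
    """Fraction of the budget consumed when burning at `rate` for the alert window."""
    return rate * alert_window_hours / SLO_WINDOW_HOURS


rate = burn_rate(0.02, 1)              # "2% of the budget in 1 hour"
threshold = rate * (1 - SLO_TARGET)    # error rate that corresponds to that burn
print(f"burn rate {rate:.1f}x, threshold {threshold:.2%}, "
      f"consumed {budget_consumed(rate, 1):.1%}")
# prints: burn rate 14.4x, threshold 1.44%, consumed 2.0%
```

Calling `burn_rate(0.05, 6)` gives the 6× multiplier of the 6-hour rule in the same way.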

Tuning burn-rate thresholds

The four rules in the canonical recipe — fast burn at 1h, sustained fast burn at 6h, slow burn at 24h, very slow burn at 3d — cover the realistic ways an SLO gets violated. Every team starts there. The variants below are calibrations, not replacements.

Conservative is the right starting point for most teams. Pages on real problems, files tickets on slow burns. The 1× / 3d ticket alert is intentional: even a tiny sustained drift will eventually consume the budget, and 3 days is the right horizon to do something about it.
Recommended lowers the warning thresholds (2× over 24h, 0.5× over 3d) so that slow burns open a ticket sooner. The reasoning: at 30-day windows, the canonical 3× ticket alert often fires after the team has already noticed via dashboards or customer reports. Trade-off: lower multipliers are more sensitive, so expect more warning-tier tickets rather than fewer.
Aggressive halves the long windows on fast burns (30m, 2h) for high-criticality services running 99.99%+, where 1h to detect is too slow. Trade-off: more sensitivity to short bursts.
Custom is for batch jobs and other non-request-rate workloads where the canonical windows don't fit naturally.

The thing not to do: tune thresholds to silence pages. Threshold tuning is for calibration, not avoidance. If you're paging too often on a service, the answer is to fix the underlying reliability or rethink the SLO target — not to widen the alert windows until the same incident no longer trips them.

Composite SLOs and dependency math

Real services depend on other services. When yours depends on N upstream APIs and any one of them failing means your request fails, the effective SLO is the product of the upstream availabilities, assuming their failures are independent. For example, upstreams at 99.9% and 99.95% give 0.999 × 0.9995 = 0.9985005, or 99.85%. Five sequential 99.9% dependencies cap your achievable SLO at 0.999⁵ ≈ 99.5%. This is why architects sketching out a service tree have to think carefully about dependency depth.

Parallel composition works the other way: when redundant replicas mean any one succeeding is enough, the effective unavailability is the product of the replicas' unavailabilities (again assuming independent failures), so availability improves much faster than linearly. Two 99% replicas in parallel give you 1 − 0.01² = 99.99%; three give you 1 − 0.01³ = 99.9999%.

Practical implication: if your service depends on 4 external APIs at 99.9% each, you can't realistically claim 99.9% — the math caps you at 99.6%. Either absorb the risk into your published SLO (claim 99.5%), invest in redundancy (multiple providers, fallback paths), or pick the dependencies more carefully.
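
Both composition rules fit in a few lines of Python. This sketch assumes failures are independent, which real dependencies sharing a network, region, or provider often violate; the function names `sequential` and `parallel` are illustrative.

```python
from math import prod


def sequential(availabilities):
    """Every dependency must succeed: multiply the availabilities."""
    return prod(availabilities)


def parallel(availabilities):
    """Any one replica succeeding is enough: multiply the unavailabilities."""
    return 1 - prod(1 - a for a in availabilities)


print(f"{sequential([0.999, 0.9995]):.5%}")  # 99.85005%
print(f"{sequential([0.999] * 5):.2%}")      # 99.50%
print(f"{sequential([0.999] * 4):.2%}")      # 99.60%
print(f"{parallel([0.99, 0.99]):.4%}")       # 99.9900%
```

With correlated failures the parallel figure is optimistic and should be read as an upper bound.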

SLO anti-patterns

The 100% target

Impossible — and even attempting it wastes effort. The cost of going from 99.9% to 99.99% is roughly 10× the cost of the original 99.9%; from 99.99% to 99.999% is another 10×. Pick the SLO that customers actually need, not the highest number you can imagine. A 99.9% SLO covers most internal tools; 99.95% is the sweet spot for consumer-facing reads; 99.99% is for transactional systems where downtime is genuinely expensive.

Alerting on every error

Without burn-rate awareness, every failed request becomes a page. Teams burn out within a quarter, alerts get ignored, the next real incident hits an exhausted on-call rotation. The point of an SLO is that you've already decided what level of failure is tolerable — the alerting should match.

Window too long

90-day windows hide acute incidents inside budget noise. A two-hour outage is barely visible against a 90-day budget; the same outage shows up clearly against 7 or 30 days. 30 days is the sweet spot for most services. 7 days fits very high-traffic services where individual incidents are consequential.
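
A quick Python calculation shows the dilution effect. It assumes a single full outage and otherwise perfect availability with uniform traffic; `availability_with_outage` is a hypothetical helper.

```python
def availability_with_outage(outage_hours, window_days):
    """Measured availability over the window if the only downtime is one full outage."""
    return 1 - outage_hours / (window_days * 24)


for days in (7, 30, 90):
    print(f"{days:>2}-day window: {availability_with_outage(2, days):.3%}")
# prints:
#  7-day window: 98.810%
# 30-day window: 99.722%
# 90-day window: 99.907%
```

Against a 99.9% target, the same two-hour outage is a clear miss over 7 or 30 days but still passes over 90 days, which is how a long window can hide an acute incident.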

Confusing SLO with SLA

SLA is for customers; SLO is for engineering. Don't share SLO numbers externally — share SLA numbers, which should be more conservative. The point of the gap is that you'd rather miss your internal SLO and learn about it than miss your SLA and owe someone money.

No error budget policy

What happens when the budget is burned? "Freeze deploys for the rest of the period." "No new feature work until the next window." "Incident review required for any new change." The policy is what makes the SLO actionable — without it, the SLO is just a report nobody reads.

FAQ

What's the difference between SLA, SLO, and SLI?

SLA is the contract you sign with a customer — failure has business consequences (refund, credit). SLO is the internal target your engineering team commits to — failure means the team changes priorities. SLI is the underlying measurement (request success rate, latency percentile). SLAs should be more conservative than your SLOs; the gap is your safety margin.

Should I aim for 100% availability?

No. 100% is impossible (every dependency adds risk you can't control), and the cost of going from 99.9% to 99.99% is roughly 10× the cost of the original 99.9%. Pick the SLO your customers actually need, not the highest number you can imagine. A 99.9% SLO is appropriate for most internal services; consumer-facing tiers often need 99.95% or 99.99%.

What window should I use — 7 days, 28 days, 30 days?

30 days is the sweet spot for most services — long enough to absorb noise, short enough to act on. 7-day windows fit very high-traffic services where a single hour-long incident is consequential. Multiples of 7 (28, 35) align cleanly with weekly release cadences. Avoid 90+ day windows: they hide acute incidents in budget noise.

What does 'burn rate' actually measure?

Burn rate is the ratio of the current error rate to the error rate the SLO allows (1 − SLO target). A 14.4× burn rate means errors are happening 14.4× faster than the budget allows; at that rate, you'd consume the entire 30-day budget in 30/14.4 ≈ 2.1 days. The Google SRE workbook picks 14.4 specifically because it consumes 2% of a 30-day budget in 1 hour.
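
As a rough Python sketch, the time to exhaust the budget at a constant burn rate is the window length divided by the burn rate. `days_to_exhaustion` is an illustrative name, and the 30-day window is the one assumed throughout this page.

```python
def days_to_exhaustion(burn_rate, slo_window_days=30):
    """Days until the whole budget is gone if the burn rate stays constant."""
    return slo_window_days / burn_rate


for rate in (14.4, 6, 2, 0.5):
    print(f"{rate:>4}x burn: budget gone in {days_to_exhaustion(rate):.2f} days")
```

A burn rate below 1× (such as 0.5×) takes longer than the window itself to exhaust the budget, so on its own it never causes an SLO miss; it is a drift signal rather than an emergency.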

Why do I need both a long-window AND short-window check in burn-rate alerts?

The long window (e.g. 1h) confirms the problem is sustained — not a 30-second blip in error rates. The short window (e.g. 5m) confirms the problem is still happening — not a recovered incident. If you alert on the long window alone, you keep paging during recovery; if you alert on the short window alone, you page on every twitch. The conjunction is what makes burn-rate alerts both responsive and quiet.

How do I pick burn-rate thresholds for my service?

Start with the Recommended profile: the canonical 14.4×/6× fast burns plus lowered slow-burn ticket thresholds (2× over 24h, 0.5× over 3d) that surface slow burns earlier than the canonical 3× and 1×. Move to Aggressive for high-criticality services (99.99%+) where 1h to detect a fast burn is too slow. Use Custom only when none of the above fits your traffic pattern.

What's an error budget policy and why do I need one?

An error budget policy is the written rule for what happens when you burn the budget: freeze deploys for the rest of the period; require post-incident review for any new change; pull engineering off feature work onto reliability work. Without a policy, the SLO is just a report. With one, it's the lever that keeps the team honest about reliability vs. velocity.