Service-level objectives at Remote

How we turned the daily stream of alerts into an actionable signal for team decisions, with a framework built for company-wide adoption.

28 June 2026 – Goulven CLEC'H

  1. What and why
  2. Remote’s framework
    1. Reusable module
    2. Multi-dimensional SLIs
    3. Multi-window alerting
    4. Pushing adoption
  3. Lessons learned

What and why

How reliable is your system? Reliable for whom, and measured how? When something breaks, does your monitoring catch it first, or your customers? And when prioritising shipping features against stabilising what already exists, what tells you the right call?

These aren’t abstract questions. They are the alerts channel so noisy everyone has muted it, or the endpoint that grows a little slower every month. They are the trust that erodes quietly, long before it surfaces as churn, and the goal that sounds right but can’t be acted on, like « keep the system fast ».

Service-level objectives (SLO) give teams a shared, measurable definition of « good enough ». You start from a concrete action a user is trying to accomplish, a critical user journey (CUJ) such as « a workflow runs to completion » or « an employee views their payslip », whose failure has a measurable business cost, from a support ticket to a lost customer. You measure it with a service-level indicator (SLI), the ratio of good events to total, which tells you whether the journey succeeded from the user’s perspective. The SLO is the threshold on that indicator, written as « x% of events meet the SLI over a rolling window », for instance « 99.5% of workflow runs succeed over 30 days ».

That target mechanically turns the stream of errors into an error budget. If the goal is 99.5%, then 0.5% of events are allowed to fail, and that remaining margin is a budget you are free to spend. While budget remains, you focus on shipping; as it runs low, you slow down and shift toward reliability work; once it is spent, stabilising the system becomes the priority. This can (and should) be formalised into a shared policy, negotiated between the developers, the PM, and the stakeholders.
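The arithmetic behind an error budget can be sketched in a few lines of Python. This is a minimal illustration, not part of Remote’s module: the function name `error_budget` and the event counts (200,000 runs, 250 failures) are hypothetical values chosen only to make the numbers easy to follow.

```python
def error_budget(target_percentage: float, total_events: int, bad_events: int) -> dict:
    """Summarise an event-based error budget for one SLO window."""
    # Share of events allowed to fail, e.g. 0.5% for a 99.5% target.
    allowed_bad = total_events * (100 - target_percentage) / 100
    spent_fraction = bad_events / allowed_bad if allowed_bad else float("inf")
    return {
        "allowed_bad_events": allowed_bad,
        "budget_spent_percent": spent_fraction * 100,
        "budget_remaining_percent": (1 - spent_fraction) * 100,
    }


# Hypothetical window: 200,000 workflow runs, 250 of them failed.
budget = error_budget(target_percentage=99.5, total_events=200_000, bad_events=250)
print(budget)
```

With a 99.5% target, 1,000 of those 200,000 runs may fail, so 250 failures spend a quarter of the budget and leave three quarters to spend on shipping.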

The language of service levels predates software, and has framed telecoms and outsourcing contracts since the 1980s. But Google’s Site Reliability Engineering (SRE) practice added the discipline around it, recasting SLOs as user-facing targets governed by an error budget and laying out the SLI / SLO vocabulary. Their book remains a reference today (if very Google-centric), free to read online or available as an audiobook on your favourite platform.

Site Reliability Engineering

  • By Betsy Beyer & al
  • O'Reilly Media, Inc. (2016)
  • ISBN: 9781491929124

From there the idea spread well past Google, carried by a now-mature ecosystem of tooling and open specifications, and deepened by Alex Hidalgo’s famous book. While some chapters were very technical and maths-heavy, forcing me to lean on Claude or skip them, this book remains the main source of my own grounding. If you prefer a more concrete and free-to-read take, I also leaned heavily on Nobl9’s SLODLC, which Hidalgo contributed to.

Implementing Service Level Objectives

  • By Alex Hidalgo
  • O'Reilly Media, Incorporated (2020)
  • ISBN: 9781492076810

Remote’s framework

In January 2026, after five years at a small startup, I joined Remote as a Senior Backend Engineer. Remote is building the modern infrastructure to hire, manage, and pay anyone, anywhere in the world, and we’re hiring full-time remote Elixir developers to achieve this goal.

There I met Dünya Kirkali, who would be my team lead for the first half of the year, tasked with taking over Remote’s automation and approvals engine and bringing it to the next level. On his blog, he published « Your system is fine. Your users aren’t », arguing for business-level SLOs anchored in the user outcome, rather than purely technical observability metrics. After discussing it in a team meeting, I took on the task, under his supervision, of setting up this tooling for our domain.

We quickly got in touch with Jan Niederhumer, Director of Platform Engineering, and Marco Micera, Senior Site Reliability Engineer, who already had a plan to refactor Remote’s very uneven existing SLOs into a coherent solution adopted company-wide. Jan would oversee the framework’s implementation, shaping its philosophy especially in the early days of the project. Marco would be my main point of contact: reviewing my work, helping me settle technical questions, writing the human-facing documentation, and driving adoption across the other teams.

The framework rests on Honeycomb, the observability platform our telemetry lands in, and Terraform, which manages our infrastructure.1 The goal was to build a Terraform module2 to help teams get started, without reinventing any of the wiring, and without SRE expertise.

1 – Terraform is an infrastructure-as-code tool where, rather than managing your resources through a graphical interface, you declare the resources you want in configuration files that are versioned, reviewed, and reproducible, then Terraform creates and updates the resources to match.
2 – A Terraform module is one such configuration packaged as a reusable, parameterised unit, instantiated with different inputs per team.

The module exposes two inputs: a team name and a map of slo_definitions. What a team must decide, and what it can leave to a default, both live in the typed shape of a single definition:

# the shape of one SLO definition (trimmed)
slo_definitions = map(object({
  name                    = string
  dataset                 = string
  sli_total_expression    = string # which events count
  sli_good_expression     = string # which of those are good
  target_percentage       = number # e.g. 99.5
  time_period             = optional(number, 30) # rolling window, days
  slack_channel           = optional(string, null)
  use_default_burn_alerts = optional(bool, true)
  burn_alerts             = optional(map(object({ … })), {})
}))

The optional(…) fields are the shared opinion: a 30-day window, alerts on by default, and the like. They’re meant to stay as-is for most teams, yet remain overridable for specific needs (more on that in the next section).

Instantiation is per team, through Terragrunt, a thin wrapper around Terraform that removes the duplication across teams and environments:

terraform {
  source = "…/infra-modules.git//stacks/honeycomb_slos?ref=v1.315.1"
}

include "root" { path = find_in_parent_folders("root.hcl") }
include "provider" { path = "…/generators/configuration_provider.hcl" }

inputs = {
  team = "AI Team"
  slo_definitions = {
    mcp_availability = {
      name                 = "MCP Availability"
      dataset              = "…"
      sli_total_expression = "…" # see next section
      sli_good_expression  = "…"
      target_percentage    = 99.5
      time_period          = 30
      slack_channel        = "#your-team-alerts"
    }
  }
}

The two includes pull in the shared state backend and the Honeycomb provider; everything below is the team’s own definitions. Ideally, that file is the entire surface a team touches.

Early on, Jan pushed for multi-dimensional SLIs, an approach Danyel Fisher set out in his « Working Toward SLOs » series,3 and that Honeycomb have pushed since then.4 Rather than track error rate, availability, and latency as three separate indicators,5 you fold every condition that defines a good outcome into one criterion scoped to a business behaviour, an SLI both easier to declare and closer to what users actually experience.

3 – The original Honeycomb URL returns a 404, but you can find these articles on Medium or the Web Archive.
4 – The Honeycomb team generalised the event-based SLI and tied it to error-budget burn alerting ☞ Majors C & al (2022). Observability Engineering, ch. 11 « Using Service Level Objectives for Reliability ». O’Reilly
5 – The canonical Google model keeps availability, latency, and quality as distinct SLIs, each with its own SLO(s) ☞ Thurgood S & al (2018). The Site Reliability Workbook, ch. 2 « Implementing SLOs ». O’Reilly

In Honeycomb that criterion is a derived column evaluated over each span, returning good or bad; our module wraps the team’s two expressions into one as IF(sli_total_expression, sli_good_expression, null), where null events fall outside the budget entirely. The good expression is where the dimensions meet, as in this clear-cut example on a file_download span:

# a good event = succeeded AND fast
sli_good_expression = <<-EOT
  AND(
    LT(COALESCE($http.response.status_code, $http.status_code), 500),
    LT($duration_ms, 5000)
  )
EOT

For endpoints where latency is harder to judge, or that are still being explored, a good event can just as well rest on refined success conditions alone. Our first SLO in production, « Workflow Run Executed Correctly », counts a run as good only if it reaches a terminal state without a system error:

# a good event = terminal state reached, no system error
sli_good_expression = <<-EOT
  IF(
    EXISTS($workflow_engine.workflow_run_executed.has_system_error),
    NOT(EQUALS($workflow_engine.workflow_run_executed.has_system_error, true)),
    NOT(EQUALS($workflow_engine.workflow_run_executed.has_error, true))
  )
EOT

It’s worth saying that folding everything into one boolean multi-dimensional SLI doesn’t destroy the diagnostic information. When you come to investigate, the SLI heatmap visually separates the causes (slow events versus error status codes) and BubbleUp contrasts successful and failed events across every dimension of the dataset. In the same way, an event-based SLI normalises by volume — 50% of four events failing triggers nothing — sparing you manual thresholds and false alarms, yet those quiet failures can still surface at investigation.
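To make the three-valued logic of IF(sli_total_expression, sli_good_expression, null) concrete, here is a small Python sketch of how events could be scored. It is an illustration under assumptions, not Honeycomb’s evaluator: the scope test on a `name` field is hypothetical (the article elides the real total expression), and treating a missing status or duration as « not good » is also an assumption.

```python
from typing import Optional


def coalesce(*values):
    """Return the first value that is not None, like COALESCE."""
    return next((v for v in values if v is not None), None)


def is_good(event: dict) -> bool:
    """Good = succeeded (status < 500) AND fast (duration < 5000 ms)."""
    status = coalesce(event.get("http.response.status_code"), event.get("http.status_code"))
    duration = event.get("duration_ms")
    if status is None or duration is None:
        return False  # assumption: incomplete events do not count as good
    return status < 500 and duration < 5000


def sli_value(event: dict) -> Optional[bool]:
    """None = out of scope (ignored by the budget), otherwise good/bad."""
    if event.get("name") != "file_download":  # hypothetical total expression
        return None
    return is_good(event)


def compliance(events: list) -> Optional[float]:
    """Percentage of in-scope events that are good."""
    scored = [v for v in map(sli_value, events) if v is not None]
    return 100 * sum(scored) / len(scored) if scored else None


events = [
    {"name": "file_download", "http.response.status_code": 200, "duration_ms": 120},
    {"name": "file_download", "http.status_code": 200, "duration_ms": 7200},  # too slow
    {"name": "file_download", "http.response.status_code": 503, "duration_ms": 40},  # server error
    {"name": "health_check", "http.response.status_code": 500, "duration_ms": 5},  # out of scope
    {"name": "file_download", "http.response.status_code": 404, "duration_ms": 80},  # client error, still good
]
print(compliance(events))
```

The out-of-scope health check neither helps nor hurts the ratio, which is the point of returning null rather than false.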

Once a team has declared its SLOs, the module needs to transform each into an actionable signal, a burn alert, watching how fast you spend the budget.

The unit is the burn rate: a multiple of the budget’s sustainable pace. At a rate of 1 the budget lasts exactly its window, while at 14.4 a 30-day budget is gone in just over two days (30 / 14.4 ≈ 2.08). The catch is the window you measure it over: too long and it reacts to an error spike too late, too short and it misses the slow erosion caused by an edge case…
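As a rough sketch of the burn-rate arithmetic, the two helpers below compute a burn rate from an observed failure ratio and the time a budget lasts at a given rate. The function names are hypothetical and the 7.2% failure ratio is an invented example.

```python
def burn_rate(bad_fraction: float, target_percentage: float) -> float:
    """How many times faster than the sustainable pace the budget is burning."""
    allowed_fraction = (100 - target_percentage) / 100  # e.g. 0.005 for 99.5%
    return bad_fraction / allowed_fraction


def days_to_exhaustion(time_period_days: float, rate: float) -> float:
    """How long a full budget lasts if the burn rate stays constant."""
    return time_period_days / rate


# Hypothetical: 7.2% of events failing against a 99.5% target.
rate = burn_rate(bad_fraction=0.072, target_percentage=99.5)
print(round(rate, 1))                          # burn rate
print(round(days_to_exhaustion(30, rate), 2))  # days until a 30-day budget is gone
print(days_to_exhaustion(30, 1))               # a rate of 1 lasts the whole window
```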

That’s why the Site Reliability Workbook recommends a multiwindow multi-burn-rate approach6 that we enforce in our defaults:7 a fast 14.4× over one hour, a medium 6× over six hours, and a slow 1× over three days.8

6 – Popularised in public by Jamie Wilkinson’s talk « SLO Burn—Reducing Alert Fatigue and Maintenance Cost in Systems of Any Size » ☞ Wilkinson J (2018). LISA18, USENIX
7 – Based on the SRE@Google formulation, with its table of recommended burn rates and windows ☞ Thurgood S & al (2018). The Site Reliability Workbook, ch. 5 « Alerting on SLOs ». O’Reilly
8 – The slower two are emitted only when the time_period is long enough to hold them, roughly two days and up for 6× and four for 1×, below which their thresholds or windows would overflow the budget.

But none of them reaches Honeycomb as a multiple. The budget_rate alert type triggers on the share of the budget that may burn within its window, and that share depends on the SLO’s time_period. A shorter time_period holds a smaller budget, so the same 14.4× burns 8.57% of a 7-day budget but only 2% of a 30-day one.9 Our module runs that conversion for every SLO, so a team can just declare a target and a window and let the thresholds follow.

9 – decrease_percent = burn_rate × window_minutes × 100 / (time_period × 1440), with time_period in days and 1440 the minutes in a day. For 14.4× over 1h it reduces to 60 / time_period: 2% at 30 days, 8.57% at 7.
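The conversion can be checked with a short Python sketch. It reimplements the formula as stated, not the module’s actual Terraform code; the name `decrease_percent` mirrors the formula, and the list of (burn rate, window) pairs restates the three defaults quoted earlier.

```python
MINUTES_PER_DAY = 1440

# (burn rate, alert window in minutes): 14.4x over 1h, 6x over 6h, 1x over 3 days.
DEFAULT_BURN_RATES = [(14.4, 60), (6, 360), (1, 4320)]


def decrease_percent(burn_rate: float, window_minutes: int, time_period_days: int) -> float:
    """Share of the whole budget burned within the alert window, in percent."""
    return burn_rate * window_minutes * 100 / (time_period_days * MINUTES_PER_DAY)


for time_period in (30, 7):
    thresholds = [
        round(decrease_percent(rate, window, time_period), 2)
        for rate, window in DEFAULT_BURN_RATES
    ]
    print(time_period, thresholds)
```

For a 30-day SLO the three thresholds come out at 2%, 5%, and 10% of the budget, which matches the table in the Site Reliability Workbook; the same burn rates translate to larger shares of a 7-day budget.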

Defaults also provide budget milestones: three budget_rate alerts as the budget drains (at 50, 75, and 90% spent) then an exhaustion_time alert once it is gone. These hooks are meant to help teams set up their first error budget policy, e.g. investigate at 50%, slow feature work at 75%, freeze deployments at 90%, and write a post-mortem at exhaustion.
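As an illustration of how those milestones could map onto a first error budget policy, here is a minimal Python sketch. The thresholds and actions are the examples given above; the « keep shipping » fallback and the names `POLICY` and `policy_action` are assumptions for the sake of the example, not something the framework ships.

```python
# Ordered from most to least severe: (budget spent in %, agreed action).
POLICY = [
    (100, "write a post-mortem"),
    (90, "freeze deployments"),
    (75, "slow feature work"),
    (50, "investigate"),
]


def policy_action(budget_spent_percent: float) -> str:
    """Return the action agreed for the highest milestone already crossed."""
    for threshold, action in POLICY:
        if budget_spent_percent >= threshold:
            return action
    return "keep shipping"  # assumption: no milestone reached yet


for spent in (20, 50, 80, 95, 100):
    print(spent, "->", policy_action(spent))
```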

So that teams can meet their specific needs, the default alerts are always overridable. The mechanism fits in one line, merge(use_default_burn_alerts ? defaults : {}, burn_alerts), where merge works key by key at the object level and, on a collision, burn_alerts wins and replaces the whole object (no field-by-field merge).

In practice, a team can keep the defaults (use_default_burn_alerts = true, see example below) and pass a burn_alerts map to replace one or more specific alerts (same key → the whole object is replaced) or to add new ones. Otherwise, use_default_burn_alerts = false discards all the defaults and lets the team supply its own complete set.

slo_definitions = {
  my_slo = {
    # …
    slack_channel           = "#your-team-alerts" # required as soon as there are alerts
    use_default_burn_alerts = true                # keep the defaults
    burn_alerts = {
      # REPLACE the default "burn_rate_14_4x_1h": same key → whole object overwritten,
      # so re-supply ALL the alert_type's required fields (no partial merge).
      burn_rate_14_4x_1h = {
        alert_type                   = "budget_rate"
        budget_rate_window_minutes   = 60
        budget_rate_decrease_percent = 5 # custom threshold instead of the computed default
        description                  = "Fast-burn 1h (adjusted threshold)"
      }
      # ADD an alert: new key → adds to the defaults.
      budget_25pct = {
        alert_type                   = "budget_rate"
        budget_rate_window_minutes   = 10800
        budget_rate_decrease_percent = 25
        description                  = "25% of the budget spent"
      }
    }
  }
}
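The « whole object replaced » behaviour of the merge is easy to get wrong, so here is a Python sketch of the same shallow, key-by-key semantics. It only imitates the behaviour described for Terraform’s merge; the default alert bodies and the key `burn_rate_6x_6h` are hypothetical stand-ins, since the module’s real defaults are not shown here.

```python
def effective_alerts(defaults: dict, burn_alerts: dict, use_default_burn_alerts: bool = True) -> dict:
    """Shallow merge: on a key collision the team's object replaces the default whole."""
    base = defaults if use_default_burn_alerts else {}
    return {**base, **burn_alerts}


# Hypothetical defaults for a 30-day SLO.
defaults = {
    "burn_rate_14_4x_1h": {
        "alert_type": "budget_rate",
        "budget_rate_window_minutes": 60,
        "budget_rate_decrease_percent": 2,
        "description": "Fast-burn 1h (default)",
    },
    "burn_rate_6x_6h": {
        "alert_type": "budget_rate",
        "budget_rate_window_minutes": 360,
        "budget_rate_decrease_percent": 5,
        "description": "Medium-burn 6h (default)",
    },
}

# An override that forgets to re-supply the description.
override = {
    "burn_rate_14_4x_1h": {
        "alert_type": "budget_rate",
        "budget_rate_window_minutes": 60,
        "budget_rate_decrease_percent": 5,
    }
}

merged = effective_alerts(defaults, override)
print(merged["burn_rate_14_4x_1h"])  # the default description is gone, not inherited
print(sorted(effective_alerts(defaults, override, use_default_burn_alerts=False)))
```

Because nothing is merged field by field, the override loses the default description, which is why every required field has to be re-supplied when replacing a default alert.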

To ease adoption, the SLO framework’s documentation is structured as an enablement guide: a short introduction to the why and what, then an actionable seven-step path (product understanding → CUJs → SLIs → SLOs → alerts/ownership → error budget policy → continuous iteration) laid out in two columns (theory on the left, concrete per-team examples on the right). Written largely by Marco, this documentation takes a little effort to keep its concrete examples up to date, but gives teams the autonomy to adopt the SLO framework by themselves.

To help teams get started, I wrote an AI skill define-behavior-slos that takes a simplified version of the seven-step path, and quickly drafts the Terragrunt module and first definitions, in particular by calibrating the initial targets against the historical values observed through Honeycomb’s MCP. We’ve been pleasantly surprised by the quality of the results this skill produces so far, which I largely attribute to its concision: it guides the agent through the workflow, redirecting to the human-facing documentation, or the relevant code modules, rather than duplicating information (which can go stale) and clogging the context.10

10 – I cover the importance of a lean context at length in my article on AI agents.

Even so, we’re aware that teams don’t necessarily have the bandwidth to take this work on themselves, and that picking up a new framework in an unfamiliar syntax can be off-putting. That’s why the SRE team has set up a concierge system, run by Marco and Jan, that does ~80% of the initial work (mapping CUJs, drafting the « floor » SLI/SLO, opening the MR), which the team can then correct with its domain knowledge and improve through continuous iteration.

Today this adoption is mostly push-driven, with the SRE team actively reaching out to priority teams.11 But a few teams come of their own accord to ask for reviews through a dedicated Slack channel, where the framework’s contributors also discuss its development.

11 – The SREs keep a backlog ranking teams on five criteria: customer harm, blast radius, downstream dependency, volume / contractual exposure, and readiness. A « deferred » list sets aside teams without clear CUJs, such as those owning internal tooling or infrastructure.

Once in place, Slack alerts are the main surface of interaction with the teams, so they need close attention, but automatic Linear ticket creation and paging (for critical services) are in development. Marco also set up an automatically generated per-team Honeycomb dashboard, a grid overview of our SLOs’ health, and is now building a board to track adoption across the company.

Finally, Honeycomb also offers some handy tools, such as an « Investigate » button right on the notification (which opens their Canvas AI) or, even better, their MCP server which works really well with Claude Code.

Lessons learned

Our SLO framework has been officially available since 30 April and, while company-wide adoption still has a long way to go, it has been taken up across the domains of at least six teams, already yielding interesting lessons and results.

The Workflow Engine (our automation tool) was the pilot team for Remote’s internal SLO framework, with a very broad SLO, « Workflow Run Executed Correctly », set at 99.5%. As soon as it went live, the SLO brought to light an architectural problem in our error handling, which didn’t properly surface whether an error was system-related (which should count against the SLO) or down to user misconfiguration (e.g. a Slack node targeting a channel that doesn’t exist). Once the errors were refactored, the burn rate became healthy again, and the SLO a useful tool: just this week, an incident was caught early thanks to the alerts, and investigated faster thanks to Honeycomb’s MCP.

Another example: earlier this month, the Shared Modules team took over the Contracts & Documents domain (a central piece used by many domains, but neglected for the past few months). My first instinct was to identify the five CUJs covering the main endpoints, and set up SLIs and SLOs to get a clear picture of their usage and health. While the first four held pleasant surprises, building the SLI on the last endpoint exposed a 16% discard rate that had gone unnoticed until then. The OTel instrumentation shipped the same day made the causes queryable, four tickets were opened in short order, and the SLO has been climbing back steadily since.

For an example from a domain I didn’t contribute to, the team in charge of our public APIs and embedded partnerships was the second to adopt the framework after the pilot, with four SLOs deployed in a day. The SLOs surfaced a few metrics that weren’t doing well, useful but not actionable in the strict sense, since the team had to dig deeper to find the causes, and the remediation is still ongoing through separate triggers (Terraform again, but outside the SLO framework). Even so, for little effort, the framework proved a useful stepping stone towards managing reliability through better ways than hand-edited triggers.

A general lesson worth repeating: the first SLO is bound to be imperfect. Whether it’s a target set loose then tightened against what you actually measure, or a clumsy SLI later brought closer to the business behaviour, the true value comes from the feedback loop (regular review, evidence-based adjustment) and from the continuous conversation around reliability.

To close, I’d like to mention a relative failure of the framework: error budget policies, which to my knowledge are still not formalised in any team. Perhaps it’s still too early in the adoption? Or the default alerts aren’t well suited? Or teams don’t yet fully grasp the value of SLOs? Whatever the reason, it’s clearly a subject to address in the future, so that SLOs can truly become a meaningful decision-making tool.