Shift-AI: Patterns for AI-Augmented Incident Response

Incident response has always depended on human judgment as much as technical skill. As systems became more distributed and dynamic, the work also became more fragmented: more tools, more context, more dependencies, and more tacit knowledge to load under pressure.

João Aires shares how his team has been exploring a different approach with AI agents in incident response, not by chasing autonomous fixes, but by building the harness around them: the tools, context, skills, and guardrails that let agents contribute usefully during incidents. What follows is the practical shape that work has taken so far: what felt broken in the old workflow, what changed when agents entered the loop, where they helped, where they failed, and what still has to remain firmly under human judgment.

Why traditional incident response started to break down
From runbooks to agents: making the method executable
What AI-augmented investigation looks like during a real incident
Where AI surprises and where it fails
Keeping humans firmly in control
The most common mistake: going tool-only
How the SRE role evolves with AI

Why traditional incident response started to break down

In the specific context of incident response, three things felt broken: the amount of mental context switching during incidents, the repetitive manual work in the first phase of an investigation, and how much success depended on deep process knowledge, system knowledge, and tribal knowledge.

The team knew how to investigate, but the work was scattered and mechanical. One tab for monitors, another for logs, another for deploy history, then configs, docs, and tickets. Every incident started with the same sweep across the same systems, and under pressure that constant back-and-forth added friction fast.

What made it worse was that much of the required knowledge was not shared by default. A few people carried the mental model: the architecture, which services were noisy, which deploys to distrust, what the usual suspects were for a given symptom. Everyone else was dependent on them. That is a fragile way to run on-call, and it does not scale.

Underneath that was a deeper issue: the method lived in people’s heads. It was undocumented, rebuilt from scratch every incident, and only partially transferred through experience. The learning did not accumulate either. Post-mortems produced documents, but most of them sat in folders. The next incident started from very close to a blank slate.

As the models improved and became more useful in adjacent engineering work, it became reasonable to ask whether, with the right setup, they could also help with investigation work.

From runbooks to agents: making the method executable

A written runbook is codified, but it still depends on someone to follow it under pressure, with the right architectural and operational context loaded at the right moment. An agent can execute the method directly: pull the context, correlate the signals, and surface the evidence consistently, regardless of who is on-call.

That only works if the method is explicit and the context is somewhere the agent can reach. Past incidents, service knowledge, runbooks, deployment history, and investigation routines all have to be available in a form the agent can use. Anything that still lives in chat threads or in people’s heads is invisible to it.

That encoding work is larger than most teams expect. What you are really building is a harness: the tools, context, feedback loops, and guardrails that allow agents to do reliable work.

What AI-augmented investigation looks like during a real incident

In practice, the clearest change has been on the root cause analysis (RCA) side. Before, a human would start opening dashboards, checking logs, looking at deploys, and building the first picture by hand. Once the harness is in place, the agent can do that first sweep quickly and hand a structured starting point back to the human.

A typical first pass looks like this:

Scope: parse the alert or ticket context, including timeframe, impacted components, symptoms, and blast radius.
Signals: correlate recent deployments, release content, and config diffs.
Telemetry: deep-dive into logs, traces, metrics, and dependency edges.
Memory: cross-reference runbooks, wiki pages, and historical incidents.
Output: produce a structured report with ranked hypotheses, confidence level, supporting evidence, and next actions.
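
The output step above can be sketched as a small data structure. This is a minimal illustration, not the team's actual report format: the names Hypothesis, FirstPassReport, ranked, and unsupported are hypothetical, confidence is assumed to be a number between 0 and 1, and the sample values are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    summary: str
    confidence: float  # assumed scale: 0.0 to 1.0, as estimated by the agent
    evidence: list[str] = field(default_factory=list)  # pointers to concrete data
    next_actions: list[str] = field(default_factory=list)


@dataclass
class FirstPassReport:
    scope: str
    hypotheses: list[Hypothesis]

    def ranked(self) -> list[Hypothesis]:
        """Hypotheses ordered from most to least confident."""
        return sorted(self.hypotheses, key=lambda h: h.confidence, reverse=True)

    def unsupported(self) -> list[Hypothesis]:
        """Hypotheses with no evidence attached, which a reviewer cannot audit."""
        return [h for h in self.hypotheses if not h.evidence]


# Illustrative values only; not taken from a real incident.
report = FirstPassReport(
    scope="example-service: elevated 5xx in the last 30 minutes",
    hypotheses=[
        Hypothesis("Connection pool exhaustion", confidence=0.3),
        Hypothesis(
            "Config change shipped in the latest deploy",
            confidence=0.7,
            evidence=["deploy history: config diff for example-service"],
            next_actions=["Compare error onset with deploy time"],
        ),
    ],
)

for h in report.ranked():
    print(f"{h.confidence:.1f}  {h.summary}  evidence={len(h.evidence)}")
```

Keeping evidence as a field on each hypothesis, rather than in a free-text summary, is what makes it possible to flag claims that cannot be traced back to data.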

SREs and developers still investigate, but they do not start from zero. The context is already gathered, a first set of hypotheses is on the table, and the work shifts from assembling context to reacting, refining, and disproving.

In the setups that have held up well, each major claim ties back to concrete data, so people can audit the output instead of relying on a confident summary.

Where AI surprises and where it fails

The positive surprise is speed once the method is codified. Agents can reach a credible first hypothesis quickly because they already know which tools and data to pull and how to connect them.

A good example came from a Kubernetes service stuck in CrashLoopBackOff. The value was the correlation step: the agent connected the container restart pattern in the platform signals with the application logs and narrowed the failure down much faster than a human tab-by-tab sweep would have.

The harder surprise is that the agent does not know what it does not know. If the harness does not expose some source of information, or if a critical bit of organisational context was never encoded into the skills, the agent can produce a plausible analysis while being blind to the key missing piece.

We saw that with an RCA investigation agent analysing an issue in one service while the actual problem sat in a newly introduced direct dependency. The dependency existed in reality, but the harness lacked the architecture and infrastructure context needed to follow that path. The result was a plausible investigation that stayed too local to the wrong service. That failure was useful because it showed where the harness was stale.

In the setups that have held up best, calibration is explicit:

Confidence per hypothesis.
Checks for missing context and disconfirming evidence, not just supporting evidence.
Humans continuously refining the harness when the agent misses something important.
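
One way to make the missing-context check concrete is to compare a service's known direct dependencies against what the harness actually covers. The sketch below assumes a dependency map is available from some source of truth; the function find_blind_spots and the service names are hypothetical.

```python
def find_blind_spots(
    service: str,
    dependencies: dict[str, set[str]],
    covered: set[str],
) -> set[str]:
    """Return direct dependencies of `service` that the harness has no context for."""
    return dependencies.get(service, set()) - covered


# Hypothetical service names, for illustration only.
dependencies = {"orders": {"payments", "pricing"}}
covered = {"orders", "payments"}

missing = find_blind_spots("orders", dependencies, covered)
if missing:
    print(f"Investigation may be blind to: {sorted(missing)}")
```

A check like this only helps if the dependency map itself is kept current, which is the same maintenance problem the stale-harness failure exposed.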

A related failure mode is overconfidence on high-impact actions. Read-only investigator mode is a safe starting point: the agent can be wrong without causing damage. Once agents move into action territory, explicit checkpoints become non-negotiable.

Keeping humans firmly in control

Human control does not happen by accident. A few patterns have helped:

Break incident response into phases: RCA, mitigation, resolution.
Let the agent help within a phase, but keep humans in the loop at the transition between phases.
Require human judgment before moving from analysis to action.
Keep actions, evidence, and rationale fully logged.
Assign accountability to an explicit owner. An agent cannot be responsible for its output; a human must be.

This is where the distinction between human-in-the-loop and human-on-the-loop matters. High-impact or irreversible actions should require explicit approval. Lower-risk, reversible steps can be supervised instead. The mode should depend on the scope and reversibility of the action, not on how polished the system looks in a demo.
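
The rule in the paragraph above can be written as a small policy function. This is a sketch under simplifying assumptions: impact and reversibility are reduced to two booleans that someone has to classify up front, and the names Oversight, ProposedAction, and required_oversight are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Oversight(Enum):
    HUMAN_IN_THE_LOOP = "explicit approval before execution"
    HUMAN_ON_THE_LOOP = "supervised execution, human can intervene"


@dataclass(frozen=True)
class ProposedAction:
    name: str
    reversible: bool
    high_impact: bool


def required_oversight(action: ProposedAction) -> Oversight:
    """High-impact or irreversible actions need approval; the rest can be supervised."""
    if action.high_impact or not action.reversible:
        return Oversight.HUMAN_IN_THE_LOOP
    return Oversight.HUMAN_ON_THE_LOOP


# Hypothetical actions, for illustration only.
restart_pod = ProposedAction("restart one pod", reversible=True, high_impact=False)
drop_table = ProposedAction("drop a database table", reversible=False, high_impact=True)

print(required_oversight(restart_pod).name)
print(required_oversight(drop_table).name)
```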

That phase separation matters because it lets the agent contribute heavily inside RCA without pretending it should also autonomously decide mitigation or resolution. Humans stay at those boundaries and judge whether the output is actually good enough to act on.
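
A phase gate of this kind might look like the sketch below: the workflow cannot move to the next phase without a named human approver, and every transition is logged with its rationale. The class IncidentWorkflow, its fields, and the log format are assumptions made for illustration.

```python
from datetime import datetime, timezone
from typing import Optional

PHASES = ("rca", "mitigation", "resolution")


class IncidentWorkflow:
    def __init__(self) -> None:
        self.phase = PHASES[0]
        self.audit_log: list[dict] = []

    def advance(self, approved_by: Optional[str], rationale: str) -> str:
        """Move to the next phase, but only with a named human approver."""
        if not approved_by:
            raise PermissionError("a named human must approve the phase transition")
        index = PHASES.index(self.phase)
        if index == len(PHASES) - 1:
            raise ValueError("already in the final phase")
        next_phase = PHASES[index + 1]
        # Record who approved the transition and why, so it can be audited later.
        self.audit_log.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "from": self.phase,
            "to": next_phase,
            "approved_by": approved_by,
            "rationale": rationale,
        })
        self.phase = next_phase
        return next_phase


workflow = IncidentWorkflow()
workflow.advance(approved_by="on-call engineer", rationale="Evidence supports a rollback")
print(workflow.phase)
```

The agent can do as much work as it likes inside a phase; what it cannot do in this sketch is call advance on its own behalf.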

Accountability also needs to stay concrete. Agents carry no professional or organisational consequence for their decisions. When something goes wrong, a human still answers for it. That means supervision cannot be vague or distributed.

Humans also sit at both ends of the loop, not just the middle. They provide context, goals, and constraints at the start, and they validate whether the output is actually correct before anything happens. Agents accelerate the middle. When the agent gets something wrong, the owner feeds that back into the skills, tools, and documentation so the harness does not drift out of date.

The most common mistake: going tool-only

The most common trap is going tool-only. Connecting agents to observability, ticketing, and platform data is necessary, but it is only the baseline. The real work is enriching the harness with the right combination of tools, context, and Agent Skills.

That skills layer is where tribal knowledge and operational process have to be encoded. Agents do not come with that by default, and introducing AI before that structure exists just adds another unpredictable variable.

The same mistake shows up when teams treat the harness as a one-time setup project. If it becomes a static library of stale prompts and outdated runbooks, it turns into its own legacy burden. In practice, skills need to be versioned like code and updated after incidents.

A safer progression looks much closer to how you would build trust with a junior engineer: RCA first, then mitigation, then resolution, with tighter controls as the stakes increase.

Start with search, not action.
Keep the agent in suggestion mode first.
Give it narrow, low-risk actions next, and only where there is a clear rollback path.
Treat bad suggestions and failed actions as feedback to improve the harness.
Expand scope only after the agent has earned trust.
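
The progression above can be expressed as explicit autonomy levels that gate what the agent may do. This is a minimal sketch with hypothetical names (Autonomy, Request, is_allowed); how a team decides that an action is low-risk or has a rollback path is assumed to happen elsewhere.

```python
from dataclasses import dataclass
from enum import IntEnum


class Autonomy(IntEnum):
    SEARCH = 1         # read-only lookups
    SUGGEST = 2        # may propose actions for a human to take
    NARROW_ACTION = 3  # may run low-risk actions that have a rollback path


@dataclass(frozen=True)
class Request:
    kind: str  # "search", "suggest", or "act"
    low_risk: bool = True
    has_rollback: bool = False


def is_allowed(level: Autonomy, request: Request) -> bool:
    """Check a request against the autonomy level the agent has earned so far."""
    if request.kind == "search":
        return level >= Autonomy.SEARCH
    if request.kind == "suggest":
        return level >= Autonomy.SUGGEST
    if request.kind == "act":
        return (
            level >= Autonomy.NARROW_ACTION
            and request.low_risk
            and request.has_rollback
        )
    return False  # unknown request kinds are denied by default


print(is_allowed(Autonomy.SUGGEST, Request("act", has_rollback=True)))
print(is_allowed(Autonomy.NARROW_ACTION, Request("act", has_rollback=True)))
```

Raising the level is a deliberate human decision in this model, which keeps "expand scope only after the agent has earned trust" from happening implicitly.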

In our case, we are still very much in the RCA phase and focused on automating RCA well before granting broader action-taking scope.

The SRE Book 2nd Edition lands in the same place: suggest fixes first, act on small things next, and expand scope only after earning trust.

How the SRE role evolves with AI

The role moves toward building and evolving the harness around operational work. The more capable agents become, the more valuable it is to have SREs shaping the context, controls, workflows, and quality bar around them.

That work does not stop at RCA. The same harness idea can extend into mitigation and resolution, with stronger feedback loops around whether the action actually worked. An agent can suggest a rollback, a config change, or a feature-flag adjustment, then observe the system afterwards and check whether the expected recovery happened. The work here is more challenging and higher stakes than the initial investigation, not less.
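
The "check whether the expected recovery happened" step could be as simple as the sketch below. It assumes recovery is judged on a single metric, here an error rate sampled after the action, and that the last few samples must all sit at or below a threshold. The function name recovered, the window size, and the sample values are illustrative assumptions.

```python
def recovered(error_rates: list[float], threshold: float, window: int = 3) -> bool:
    """True if the last `window` samples are all at or below `threshold`."""
    if window <= 0 or len(error_rates) < window:
        return False  # not enough data to claim recovery
    return all(rate <= threshold for rate in error_rates[-window:])


# Illustrative samples taken after a hypothetical rollback.
samples_after_action = [0.20, 0.12, 0.02, 0.01, 0.01]

print(recovered(samples_after_action, threshold=0.05))
print(recovered(samples_after_action[:3], threshold=0.05))
```

Treating "not enough data" as not recovered keeps the agent from declaring success too early.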

The role shifts toward designing the investigation and escalation protocols, defining safety boundaries, curating the domain knowledge agents need, and leading the judgment calls at the decision points that actually matter. Less of the work is doing the first manual sweep. More of it is shaping the environment in which that work happens and evaluating it honestly when it falls short.

Harness improvements also compound. Work done today helps current agents, and the same harness benefits again as the models improve. The term harness engineering was first defined in software development, but the logic transfers directly here as well.

If the SRE Book 2nd Edition is right that AI raises the reliability ceiling, the likely consequence is not less reliability work. It is the same teams being asked to support a broader reliability scope across more services.
