AI Recovery · Debugging · Agent Reliability · Incident Response

Agent Recovery: How to Fix a Broken AI System Fast

When an AI system breaks, every hour of downtime costs money and trust. Here is the systematic approach to diagnosing failing agents, the most common failure modes, and the recovery checklist that actually works.

Sovereign AI · April 30, 2026 · 8 min read

An AI system that was working yesterday and is broken today is one of the most frustrating technical problems to diagnose. Unlike traditional software bugs, AI failures are often probabilistic, context-dependent, and non-reproducible in isolation. The failure happens in production under specific conditions that are difficult to replicate in a debugging environment.

Most teams respond to AI system failures with trial and error — changing prompts, rolling back recent changes, restarting services, hoping something works. This approach sometimes fixes the problem and always wastes time. A systematic recovery process is faster, more reliable, and produces learnings that prevent recurrence.

The First Five Minutes

When an AI system fails, the first priority is not diagnosis — it is containment. Before you understand what went wrong, you need to understand how bad it is and whether it is getting worse.

Scope the failure immediately. Is this affecting all users or some? Is it affecting all tasks or specific task types? Is the failure rate 100% or partial? A 100% failure rate has different root causes than a 20% failure rate. Scoping the failure is not wasted time — it eliminates entire categories of possible causes.

Check the obvious external dependencies. Has anything changed in your model provider's status? Is the failure correlated with an API change, a service degradation notice, or a rate limiting event? External dependency failures account for a significant fraction of AI system incidents and take two minutes to check. Check them first.

Stop the bleeding if possible. If the failure is causing customer-visible errors or data integrity issues, can you disable the affected component and fall back to a manual process? A degraded but functional system is better than a broken system while you diagnose. Make the call to reduce impact quickly, before diagnosis is complete.
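As a concrete illustration, the fallback can be as small as a kill switch checked at the entry point. The sketch below is a minimal example; the flag name and the run_ai_summarizer function are hypothetical stand-ins for your own component.

```python
import os

# Hypothetical kill switch: set AI_SUMMARIZER_ENABLED=false to bypass the
# AI path without redeploying.
AI_ENABLED = os.environ.get("AI_SUMMARIZER_ENABLED", "true").lower() == "true"

def run_ai_summarizer(payload: dict) -> dict:
    # Stand-in for the real model call.
    return {"status": "ok", "summary": "..."}

def handle_request(payload: dict) -> dict:
    if not AI_ENABLED:
        # Degraded but functional: queue the work for manual handling
        # instead of returning errors to customers.
        return {"status": "queued_for_manual_review", "payload": payload}
    return run_ai_summarizer(payload)
```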

The Diagnostic Framework

After containment, systematic diagnosis starts with the question: what changed?

Recent deployments. The single most common cause of AI system failures is a recent change — to prompts, to configuration, to underlying models, to tool dependencies. Review every change made in the 72 hours before the failure began. Not just changes to the AI component directly, but changes to any system it depends on.

Input distribution shift. Has the nature of inputs the system is receiving changed? A surge in a particular request type, a change in user behavior, a new category of input that the system was not designed for — any of these can cause failures that look like system bugs but are actually distribution shift problems.
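A quick way to test this hypothesis is to compare the mix of request types in a recent window against a baseline window. This is a minimal sketch; the request-type labels are whatever categorization your logging already produces.

```python
from collections import Counter

def distribution_shift(baseline: list[str], recent: list[str]) -> dict[str, float]:
    """Change in each request type's share of traffic (recent minus baseline)."""
    base, now = Counter(baseline), Counter(recent)
    return {
        t: now[t] / max(len(recent), 1) - base[t] / max(len(baseline), 1)
        for t in set(base) | set(now)
    }

# Example: "refund" requests jumped from 10% of traffic to 50%.
print(distribution_shift(
    ["faq"] * 90 + ["refund"] * 10,
    ["faq"] * 50 + ["refund"] * 50,
))
```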

Model provider changes. Model providers update their models, change their APIs, adjust their safety filters, and modify their rate limiting policies. Any of these can cause previously working AI systems to fail. Check provider changelogs and status pages for recent changes.

Context length creep. For systems with memory or context accumulation, gradually growing context eventually hits a limit — either a hard token limit that causes truncation, or a soft quality limit where the model's attention becomes too distributed to handle the full context effectively. If your system has been in production for months without context management maintenance, this is worth checking.
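A lightweight guard is to estimate the accumulated context size and trim the oldest entries before each call. The sketch below uses a rough characters-per-token heuristic rather than a real tokenizer, and the budget number is illustrative.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: roughly 4 characters per token for English prose.
    return max(1, len(text) // 4)

def trim_history(messages: list[str], budget_tokens: int = 8000) -> list[str]:
    # Keep the most recent messages that fit within the budget, dropping
    # the oldest first; the newest message is always kept.
    kept, total = [], 0
    for msg in reversed(messages):
        cost = estimate_tokens(msg)
        if kept and total + cost > budget_tokens:
            break
        kept.append(msg)
        total += cost
    return list(reversed(kept))
```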

Common Failure Modes and Their Signatures

Knowing the common failure patterns significantly accelerates diagnosis. Each pattern has a characteristic signature:

Prompt injection or manipulation. Signature: failures that are highly correlated with specific users or input patterns, often involving instructions embedded in user-supplied text. Check: does the failure reproduce when the user-supplied content is replaced with neutral test content?

Context poisoning. Signature: gradual quality degradation over a conversation or session, with earlier outputs from the same session being higher quality. Check: does the failure occur on a fresh context with the same core inputs?

Tool call cascade failure. Signature: agent takes longer than usual before failing, often with an error that mentions a downstream service. Check: run the same task with all tool calls disabled (if possible) to determine whether the failure is in the reasoning or in the tool dependencies.

Rate limiting. Signature: failures that correlate with time of day or high-traffic periods, often with specific error codes (429, 503). Check: does the failure reproduce at a different time or at reduced request volume?
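If rate limiting is the culprit, retrying with exponential backoff and jitter usually restores service while you negotiate higher limits or reduce request volume. A minimal sketch, assuming send() is a callable that performs one request and returns a response with a status_code attribute:

```python
import random
import time

RETRYABLE_STATUS = {429, 503}

def call_with_backoff(send, max_retries: int = 5):
    # Retry rate-limited or overloaded calls with exponential backoff plus
    # jitter, so synchronized clients do not retry in lockstep.
    for attempt in range(max_retries):
        response = send()
        if response.status_code not in RETRYABLE_STATUS:
            return response
        delay = min(2 ** attempt, 30) + random.uniform(0, 1)
        time.sleep(delay)
    return send()  # final attempt; surface a lasting failure to the caller
```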

Schema validation failure. Signature: agent runs complete but produce output that your downstream processing rejects. Often caused by a model update that subtly changed output formatting. Check: examine the raw model output before post-processing to see whether it matches the expected schema.
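Validating the raw output explicitly makes this failure mode loud instead of silent. A minimal sketch using pydantic (v2 API), where the TicketSummary fields are a hypothetical stand-in for whatever schema your pipeline expects:

```python
from pydantic import BaseModel, ValidationError

class TicketSummary(BaseModel):
    # Hypothetical schema the downstream pipeline expects.
    title: str
    priority: str
    tags: list[str]

def parse_model_output(raw: str) -> TicketSummary | None:
    # Validate the raw model text before any post-processing, so a subtle
    # formatting change from a provider update fails explicitly here.
    try:
        return TicketSummary.model_validate_json(raw)
    except ValidationError as err:
        print(f"Raw output does not match schema: {err}")
        return None
```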

Prompt drift. Signature: gradual quality degradation over weeks with no discrete failure event. Caused by accumulated small edits to prompts that individually seem harmless. Check: compare current prompt to a known-good version from your prompt version history.
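If prompts live in version control, surfacing accumulated drift is a one-line diff. A minimal sketch using Python's standard difflib; the prompt strings here stand in for versions pulled from your prompt history:

```python
import difflib

def prompt_diff(known_good: str, current: str) -> str:
    # Unified diff between a known-good prompt and the live prompt, making
    # weeks of small accumulated edits visible in one place.
    return "\n".join(difflib.unified_diff(
        known_good.splitlines(),
        current.splitlines(),
        fromfile="known_good",
        tofile="current",
        lineterm="",
    ))

print(prompt_diff(
    "You are a support agent.\nAnswer concisely.",
    "You are a support agent.\nAnswer concisely and upsell when possible.",
))
```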

The Recovery Checklist

Once the failure mode is identified, recovery follows a standard sequence:

1. Isolate the affected component. Narrow the failure to the smallest possible scope — a specific agent, a specific tool, a specific prompt. Broad failures often have narrow causes.

2. Reproduce in a controlled environment. Before making any changes, reproduce the failure in a test environment. If you cannot reproduce it, you cannot verify that your fix worked.

3. Implement the fix. Apply the minimum change that addresses the root cause. Avoid the temptation to make multiple improvements simultaneously — it makes it impossible to know which change fixed the problem.

4. Verify in test. Confirm the failure no longer reproduces in your test environment. Also confirm that the fix does not degrade performance on cases that were working before.

5. Deploy with monitoring. Deploy the fix with enhanced monitoring active. Watch failure rates, latency, and quality metrics for 30 minutes after deployment (a minimal watch loop is sketched after this checklist). AI system fixes sometimes fix the surface failure while introducing new issues on adjacent cases.

6. Document the root cause. Write down what happened, what caused it, and what fixed it. The documentation takes ten minutes and prevents the same failure from consuming the same time again.
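For step 5, the watch can be as simple as polling whatever failure-rate metric you already expose. A minimal sketch, assuming get_failure_rate() is a callable that returns the recent failure rate as a fraction:

```python
import time

def watch_deploy(get_failure_rate, threshold: float = 0.05,
                 window_minutes: int = 30, poll_seconds: int = 60) -> bool:
    # Poll the failure-rate metric for the first 30 minutes after a deploy.
    # Return False as a signal to roll back if the threshold is crossed.
    deadline = time.time() + window_minutes * 60
    while time.time() < deadline:
        rate = get_failure_rate()
        if rate > threshold:
            print(f"Failure rate {rate:.1%} exceeds {threshold:.1%}; roll back")
            return False
        time.sleep(poll_seconds)
    return True
```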

Building Recovery Into the Architecture

The fastest recovery is one you designed for in advance. Every AI system should include:

Circuit breakers. Automatic degradation when failure rates exceed thresholds — falling back to simpler processing, queuing requests for later, or returning explicit uncertainty rather than producing potentially bad output.
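A minimal circuit breaker can be a small wrapper that counts recent failures and switches to a fallback once a threshold is hit. The sketch below is illustrative; the threshold, cooldown, and fallback behavior are assumptions to adapt to your system.

```python
import time

class CircuitBreaker:
    def __init__(self, max_failures: int = 5, cooldown_seconds: int = 300):
        self.max_failures = max_failures
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.opened_at = None  # timestamp when the breaker tripped

    def call(self, primary, fallback):
        # While the breaker is open (recently tripped), skip the AI path
        # entirely and use the fallback until the cooldown expires.
        if self.opened_at and time.time() - self.opened_at < self.cooldown_seconds:
            return fallback()
        if self.opened_at:
            self.failures, self.opened_at = 0, None  # cooldown over; try again
        try:
            result = primary()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()  # trip the breaker
            return fallback()
```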

Rollback capability. The ability to revert to a known-good version of prompts, configuration, and tool wiring within minutes. This requires version control for all of these components, not just code.
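In practice this can be as small as a registry file mapping each component to its active prompt version, so a rollback is a pointer change rather than a redeploy. A minimal sketch; the registry path and layout are assumptions:

```python
import json
from pathlib import Path

# Hypothetical layout: {"triage": {"active": "v7", "versions": {"v6": "...", "v7": "..."}}}
REGISTRY = Path("prompts/registry.json")

def rollback(component: str, version: str) -> None:
    # Point the component's active prompt back at a known-good version.
    registry = json.loads(REGISTRY.read_text())
    if version not in registry[component]["versions"]:
        raise KeyError(f"{component} has no recorded version {version!r}")
    registry[component]["active"] = version
    REGISTRY.write_text(json.dumps(registry, indent=2))
```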

Runbooks. Documented recovery procedures for common failure modes, specific enough that someone unfamiliar with the system can execute them. Runbooks are written before failures happen, not during them.

The Recovery product provides both the diagnostic framework and the infrastructure to implement these capabilities for your existing AI systems. When something breaks — and eventually, something always breaks — having recovery architecture in place is the difference between a 20-minute incident and a three-day crisis.

