Most teams start with a global retry limit or max-iteration cap on agent runs.
It feels safe. It gives one number to monitor.
In production, a single unstable tool can consume most of that budget while the rest of the workflow makes no progress. The run finishes from the orchestrator’s perspective, metrics look active, and cost per successful outcome silently doubles.
We traced exactly this failure in a multi-step support automation. A knowledge-base search API became intermittent after a provider migration. The agent retried it aggressively, burning through the run’s iteration budget, so downstream steps (CRM update, ticket resolution) had almost no budget left for their own recovery. Every run still “succeeded” on paper. Quality told a different story.
In our previous posts, we covered how execution semantics break first and why orchestration is not execution control. This post is about a specific failure mode inside execution control: where retries accumulate, and why run-level caps cannot tell you.
## The failure mode
Consider a workflow with search, extraction, and write stages.
If search is flaky and retries aggressively, the run-level budget burns early. Downstream steps either receive degraded inputs or never get enough budget for recovery. A global cap cannot tell you which component is leaking reliability budget.
```mermaid
flowchart LR
    subgraph run["run budget: 12 iterations"]
        A["🔴 search<br/>8 retries"] --> B["🟡 extract<br/>2 retries"]
        B --> C["🟡 crm_write<br/>2 retries"]
    end
    style A fill:#7F1D1D,stroke:#FECACA,color:#F87171
    style B fill:#78350F,stroke:#FDE68A,color:#FBBF24
    style C fill:#78350F,stroke:#FDE68A,color:#FBBF24
```
Search consumed 67% of the budget. The run “succeeded.” The CRM write had no room for error.
## Why run-level caps are not enough
Run-level budgets answer one question: “How much total work happened?”
They do not answer: “Which dependency consumed it, and was it worth it?”
If Tool A burns 80% of retries and contributes little incremental value after the third attempt, the system needs to throttle Tool A directly, not punish the entire run.
This is the same reason SRE teams track error budgets by service, not only system-wide uptime. A single degraded dependency should not drain reliability for everything downstream.
## A better model: per-tool retry budgets
Treat retries as a scoped budget at the tool boundary. Aggregate at run level as a safety net.
For each tool, define:
- max retry count
- backoff policy (fixed, exponential, or none)
- timeout budget
- failure classification (retryable vs terminal)
- escalation path when budget is exhausted
Then enforce a run-level ceiling as the final cap.
This gives two control surfaces:
- Local control for unstable dependencies
- Global control for total blast radius
## Policy shape
```yaml
run:
  max_iterations: 12
  max_duration_seconds: 300

tools:
  search_api:
    max_retries: 3
    timeout_ms: 3000
    backoff: exponential
    on_exhaust: degrade    # accept partial result
  web_fetch:
    max_retries: 2
    timeout_ms: 5000
    on_exhaust: skip       # proceed without this step
  crm_write:
    max_retries: 1
    timeout_ms: 10000
    on_exhaust: escalate   # page a human
```
Search can retry three times. If it exhausts that budget, the workflow degrades gracefully instead of draining the run. CRM write gets one retry with a longer timeout. If it fails twice, a human gets paged.
No tool can silently drain the entire run.
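The semantics of such a policy can be sketched in a few lines of Python. This is illustrative, not a reference implementation: `RetryableError`, `page_human`, and the dict-shaped config are made-up names, and per-call timeouts are omitted for brevity.

```python
import time


class RetryableError(Exception):
    """A failure classified as retryable (vs. terminal)."""


def page_human(tool_name, error):
    # Stand-in for a real escalation path (pager, ticket, Slack, ...)
    print(f"ESCALATE: {tool_name} exhausted its retry budget: {error}")


def call_with_budget(tool_name, fn, policy, run_budget):
    """Retry fn under the tool's own budget, charging every attempt
    against the run-level ceiling as the final cap."""
    cfg = policy[tool_name]
    last_error = None
    for attempt in range(cfg["max_retries"] + 1):  # first try + retries
        if run_budget["iterations"] <= 0:
            break  # the global ceiling always wins
        run_budget["iterations"] -= 1
        try:
            return ("ok", fn())
        except RetryableError as e:
            last_error = e
            if cfg.get("backoff") == "exponential":
                time.sleep(min(0.1 * 2 ** attempt, 5.0))
    # Tool budget (or run ceiling) exhausted: dispatch on_exhaust
    action = cfg["on_exhaust"]
    if action == "degrade":
        return ("degraded", None)  # accept a partial result
    if action == "skip":
        return ("skipped", None)   # proceed without this step
    page_human(tool_name, last_error)
    return ("escalated", None)
```

Note that the run-level check sits inside the per-tool loop: local budgets shape behavior, the global ceiling bounds blast radius.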
## Enforcement at the step boundary
Per-tool budgets require enforcement at the point where tools are invoked, not inside the tool itself.
One way to implement this is to gate tool calls at a step boundary. Here is a concrete example using AxonFlow, where each tool invocation passes through a step gate that evaluates per-tool context:
```python
gate = await client.step_gate(
    workflow_id=workflow.workflow_id,
    step_id="search-kb",
    request=StepGateRequest(
        step_name="Search Knowledge Base",
        step_type=StepType.TOOL_CALL,
        tool_context=ToolContext(
            tool_name="search_api",
            tool_type="function",
        ),
    ),
)

if gate.is_allowed():
    result = search_kb(query)
    await client.mark_step_completed(
        workflow.workflow_id, "search-kb",
        MarkStepCompletedRequest(output=result),
    )
```
The gate evaluates policies scoped to search_api, including retry budget, cost threshold, and risk classification, before the tool executes. If the tool has exhausted its budget, the gate blocks re-execution at the boundary.
This is the same pattern from Orchestration Is Not Execution Control: gate first, commit second. Per-tool budgets add a third dimension: budget-aware gating.
```mermaid
flowchart LR
    B[Step Gate] --> C{Budget<br/>Remaining?}
    C -->|yes| D[Execute Tool]
    C -->|exhausted| E{on_exhaust}
    D --> F[Record Outcome]
    F --> G[Decrement Budget]
    E -->|degrade| H[Accept Partial Result]
    E -->|skip| I[Proceed Without Step]
    E -->|escalate| J[Page Human]
    style B fill:#0F766E,stroke:#134E4A,color:#fff
    style C fill:#0F766E,stroke:#134E4A,color:#fff
    style D fill:#166534,stroke:#14532D,color:#fff
    style E fill:#78350F,stroke:#92400E,color:#fff
    style H fill:#F1F5F9,stroke:#E2E8F0
    style I fill:#F1F5F9,stroke:#E2E8F0
    style J fill:#FEF3C7,stroke:#FDE68A
```
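Stripped of any framework, the gate in this flow is just a little state at the step boundary: check the remaining budget before executing, charge the attempt after. A minimal sketch (`BudgetGate` is a hypothetical name, not an AxonFlow API):

```python
class BudgetGate:
    """Gate tool calls on their remaining per-tool retry budget."""

    def __init__(self, policy):
        # policy: {"search_api": {"max_retries": 3, "on_exhaust": "degrade"}, ...}
        self._policy = policy
        self._remaining = {
            tool: cfg["max_retries"] + 1  # first attempt + retries
            for tool, cfg in policy.items()
        }

    def check(self, tool_name):
        """Gate first: ("allow", None), or ("block", on_exhaust action)."""
        if self._remaining[tool_name] > 0:
            return ("allow", None)
        return ("block", self._policy[tool_name]["on_exhaust"])

    def record(self, tool_name):
        """Commit second: charge the attempt only after it actually ran."""
        self._remaining[tool_name] -= 1
```

The two-call shape (check, then record) is the same one the `step_gate` / `mark_step_completed` pair above expresses at workflow scale.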
## Metrics that matter
If you adopt per-tool budgets, track:
| Metric | What it reveals |
|---|---|
| Retry share by tool | Which dependency is consuming reliability budget |
| Cost per successful outcome | Whether retries are improving results or just burning cost |
| Budget exhaustion events by tool | Which tools hit their ceiling most often |
| Quality pass rate after exhaustion | Whether fallback paths produce acceptable results |
| Incident rate per 100 runs | Overall system reliability trend |
The goal is not fewer retries. It is more intentional retries.
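The first of these, retry share by tool, is a one-liner over per-attempt records, assuming each attempt is logged with the tool that consumed it (the record shape here is made up):

```python
from collections import Counter


def retry_share_by_tool(attempts):
    """attempts: per-attempt records, e.g. {"tool": "search_api", ...}.
    Returns each tool's share of all attempts consumed in the run."""
    counts = Counter(a["tool"] for a in attempts)
    total = sum(counts.values())
    return {tool: n / total for tool, n in counts.items()}


# The run from the failure-mode example: 8 + 2 + 2 attempts
attempts = (
    [{"tool": "search_api"}] * 8
    + [{"tool": "extract"}] * 2
    + [{"tool": "crm_write"}] * 2
)
shares = retry_share_by_tool(attempts)  # search_api: 8 of 12 ≈ 67%
```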
## Operational impact
Per-tool budgets change incident response.
Without them, postmortems read:
“Run exceeded iteration budget.”
With them:
“`search_api` exhausted its retry budget in 90 seconds, forcing a low-quality fallback for the downstream write path. `crm_write` succeeded on first attempt.”
That is a fixable engineering finding. The first is noise.
## Litmus test
If your run hits its retry limit, can you identify which tool consumed the budget?
If the answer is “we would have to check logs,” retries are still a global knob. They should be a resource allocation policy.
Budget retries where risk is created, at the tool boundary, and aggregate at the run level as a ceiling.
Are you enforcing retry budgets per run, per tool, or both?
If you want implementation details, here are the relevant links: