Why Retry Budgets Should Be Per Tool, Not Per Run

AxonFlow is a source-available control plane for production AI systems.

engineering production llm cost-control reliability

Most teams start with a global retry limit or max-iteration cap on agent runs.

It feels safe. It gives one number to monitor.

In production, a single unstable tool can consume most of that budget while the rest of the workflow makes no progress. The run finishes from the orchestrator’s perspective, metrics look active, and cost per successful outcome silently doubles.

We traced exactly this in a multi-step support automation. A knowledge-base search API became intermittent after a provider migration, and the search step retried aggressively within the run’s iteration budget. Downstream steps (CRM update, ticket resolution) had almost no budget left for their own recovery. Every run “succeeded” from the orchestrator’s perspective. Quality told a different story.

In our previous posts, we covered how execution semantics break first and why orchestration is not execution control. This post is about a specific failure mode inside execution control: where retries accumulate, and why run-level caps cannot tell you.

The failure mode

Consider a workflow with search, extraction, and write stages.

If search is flaky and retries aggressively, the run-level budget burns early. Downstream steps either receive degraded inputs or never get enough budget for recovery. A global cap cannot tell you which component is leaking reliability budget.

graph LR
    subgraph "Run Budget: 12 iterations"
        A["🔴 search_api<br/>8 retries"] --> B["🟡 extract<br/>2 retries"]
        B --> C["🟡 crm_write<br/>2 retries"]
    end
    style A fill:#7F1D1D,stroke:#FECACA,color:#F87171
    style B fill:#78350F,stroke:#FDE68A,color:#FBBF24
    style C fill:#78350F,stroke:#FDE68A,color:#FBBF24

Search consumed 67% of the budget. The run “succeeded.” The CRM write had no room for error.
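To make the accounting concrete, here is a small standalone simulation (independent of AxonFlow; the tool names and retry counts simply mirror the diagram above) that replays the run and attributes iterations per tool:

```python
from collections import Counter

RUN_BUDGET = 12  # global iteration cap for the whole run

# Iterations actually consumed, in order, during the degraded run:
# search_api retried aggressively before the downstream steps ran.
iterations = ["search_api"] * 8 + ["extract"] * 2 + ["crm_write"] * 2

consumed = Counter(iterations)
total = sum(consumed.values())

assert total <= RUN_BUDGET  # the run-level cap is satisfied...

# ...but only per-tool attribution reveals where the budget went.
for tool, count in consumed.most_common():
    print(f"{tool}: {count} iterations ({count / RUN_BUDGET:.0%} of run budget)")
```

The run-level invariant holds, which is exactly why a global cap alone reports this run as healthy.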

Why run-level caps are not enough

Run-level budgets answer one question: “How much total work happened?”

They do not answer: “Which dependency consumed it, and was it worth it?”

If Tool A burns 80% of retries and contributes little incremental value after the third attempt, the system needs to throttle Tool A directly, not punish the entire run.

This is the same reason SRE teams track error budgets by service, not only system-wide uptime. A single degraded dependency should not drain reliability for everything downstream.

A better model: per-tool retry budgets

Treat retries as a scoped budget at the tool boundary. Aggregate at run level as a safety net.

For each tool, define:

  • max retry count
  • backoff policy (fixed, exponential, or none)
  • timeout budget
  • failure classification (retryable vs terminal)
  • escalation path when budget is exhausted

Then enforce a run-level ceiling as the final cap.

This gives two control surfaces:

  1. Local control for unstable dependencies
  2. Global control for total blast radius
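Both control surfaces can be expressed in a few lines. This is a minimal sketch with illustrative names (`RetryBudgets`, `allow`, `record` are not AxonFlow's API), showing how a per-tool cap and a run ceiling combine:

```python
from dataclasses import dataclass, field

@dataclass
class RetryBudgets:
    """Tracks per-tool retry budgets under a run-level ceiling."""
    run_ceiling: int               # global cap: total attempts across the run
    per_tool: dict[str, int]       # local cap: max retries for each tool
    used: dict[str, int] = field(default_factory=dict)

    def allow(self, tool: str) -> bool:
        """True if `tool` may attempt again under both control surfaces."""
        total_used = sum(self.used.values())
        tool_used = self.used.get(tool, 0)
        return total_used < self.run_ceiling and tool_used < self.per_tool.get(tool, 0)

    def record(self, tool: str) -> None:
        self.used[tool] = self.used.get(tool, 0) + 1

budgets = RetryBudgets(run_ceiling=12, per_tool={"search_api": 3, "crm_write": 1})

# search_api is throttled by its own cap long before the run ceiling is near.
for _ in range(5):
    if budgets.allow("search_api"):
        budgets.record("search_api")

print(budgets.used)
```

The key property: `search_api` stops at 3 attempts, leaving the remaining run budget intact for downstream tools.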

Policy shape

run:
  max_iterations: 12
  max_duration_seconds: 300

tools:
  search_api:
    max_retries: 3
    timeout_ms: 3000
    backoff: exponential
    on_exhaust: degrade    # accept partial result
  web_fetch:
    max_retries: 2
    timeout_ms: 5000
    on_exhaust: skip       # proceed without this step
  crm_write:
    max_retries: 1
    timeout_ms: 10000
    on_exhaust: escalate   # page a human

Search can retry three times. If it exhausts that budget, the workflow degrades gracefully instead of draining the run. CRM write gets one retry with a longer timeout. If it fails twice, a human gets paged.

No tool can silently drain the entire run.
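The three `on_exhaust` behaviors can be sketched as a small dispatcher. This is a standalone illustration whose policy dict mirrors the YAML above; `page_oncall` and the return shapes are hypothetical, not AxonFlow's implementation:

```python
POLICY = {
    "search_api": "degrade",
    "web_fetch": "skip",
    "crm_write": "escalate",
}

def page_oncall(tool: str) -> None:
    # Hypothetical pager hook; in practice, route to your incident tooling.
    print(f"PAGE: {tool} exhausted its retry budget")

def on_exhaust(tool: str, partial_result=None) -> dict:
    """Decide what happens when `tool` runs out of retry budget."""
    action = POLICY[tool]
    if action == "degrade":
        # Accept whatever partial result the last attempt produced.
        return {"status": "degraded", "result": partial_result}
    if action == "skip":
        # Proceed without this step's output.
        return {"status": "skipped", "result": None}
    # escalate: hand off to a human instead of retrying further.
    page_oncall(tool)
    return {"status": "escalated", "result": None}
```

The important design choice is that exhaustion is a first-class outcome with a per-tool policy, not a generic run failure.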

Enforcement at the step boundary

Per-tool budgets require enforcement at the point where tools are invoked, not inside the tool itself.

One way to implement this is to gate tool calls at a step boundary. Here is a concrete example using AxonFlow, where each tool invocation passes through a step gate that evaluates per-tool context:

gate = await client.step_gate(
    workflow_id=workflow.workflow_id,
    step_id="search-kb",
    request=StepGateRequest(
        step_name="Search Knowledge Base",
        step_type=StepType.TOOL_CALL,
        tool_context=ToolContext(
            tool_name="search_api",
            tool_type="function",
        ),
    ),
)

if gate.is_allowed():
    result = search_kb(query)
    await client.mark_step_completed(
        workflow.workflow_id, "search-kb",
        MarkStepCompletedRequest(output=result),
    )

The gate evaluates policies scoped to search_api, including retry budget, cost threshold, and risk classification, before the tool executes. If the tool has exhausted its budget, the gate blocks re-execution at the boundary.

This is the same pattern from Orchestration Is Not Execution Control: gate first, commit second. Per-tool budgets add a third dimension: budget-aware gating.

flowchart TD
    A[Tool Invocation] --> B{Step Gate}
    B --> C{"Budget<br/>Remaining?"}
    C -->|yes| D[Execute Tool]
    C -->|exhausted| E{on_exhaust}
    D --> F[Record Outcome]
    F --> G[Decrement Budget]
    E -->|degrade| H[Accept Partial Result]
    E -->|skip| I[Proceed Without Step]
    E -->|escalate| J[Page Human]
    style B fill:#0F766E,stroke:#134E4A,color:#fff
    style C fill:#0F766E,stroke:#134E4A,color:#fff
    style D fill:#166534,stroke:#14532D,color:#fff
    style E fill:#78350F,stroke:#92400E,color:#fff
    style H fill:#F1F5F9,stroke:#E2E8F0
    style I fill:#F1F5F9,stroke:#E2E8F0
    style J fill:#FEF3C7,stroke:#FDE68A

Metrics that matter

If you adopt per-tool budgets, track:

  • Retry share by tool: which dependency is consuming reliability budget
  • Cost per successful outcome: whether retries are improving results or just burning cost
  • Budget exhaustion events by tool: which tools hit their ceiling most often
  • Quality pass rate after exhaustion: whether fallback paths produce acceptable results
  • Incident rate per 100 runs: overall system reliability trend
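The first of these falls out of a simple aggregation over a per-tool event log. This sketch assumes a flat list of retry events with made-up run IDs, not any particular telemetry format:

```python
from collections import Counter

# Hypothetical retry events collected across recent runs: (run_id, tool_name).
events = [
    ("run-1", "search_api"), ("run-1", "search_api"), ("run-1", "crm_write"),
    ("run-2", "search_api"), ("run-2", "extract"),
]

retries_by_tool = Counter(tool for _, tool in events)
total = sum(retries_by_tool.values())

for tool, count in retries_by_tool.most_common():
    print(f"{tool}: {count / total:.0%} of all retries")
```

A run-level cap can never produce this breakdown, because it discards tool identity at the point of enforcement.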

The goal is not fewer retries. It is more intentional retries.

Operational impact

Per-tool budgets change incident response.

Without them, postmortems read:

“Run exceeded iteration budget.”

With them:

“search_api exhausted retry budget in 90 seconds, forcing low-quality fallback for downstream write path. crm_write succeeded on first attempt.”

That is a fixable engineering finding. The first is noise.

Litmus test

If your run hits its retry limit, can you identify which tool consumed the budget?

If the answer is “we would have to check logs,” retries are still a global knob. They should be a resource allocation policy.

Budget retries where risk is created, at the tool boundary, and aggregate at the run level as a ceiling.

Are you enforcing retry budgets per run, per tool, or both?

If you want implementation details, here are the relevant links: