Cost Per Call Is a Useless Metric for Agent Systems

AxonFlow is a source-available control plane for production AI systems.

engineering production llm cost-control observability

Most teams track cost per LLM call.

It is the easiest metric to compute and the least useful for understanding what your system actually costs.

Every provider returns token counts. Multiply by price per token. Done.

Cost per call treats every invocation equally. A successful extraction that resolved a support ticket costs the same as a failed retry that produced nothing. A cached response that returned in 40ms gets the same weight as a 12-second generation that timed out and was discarded.

We traced this in a document-processing pipeline. Average cost per call held steady at $0.03. The team reported costs as stable. But cost per successfully resolved document had doubled over three weeks, because a provider migration introduced intermittent failures that triggered retries. Each retry was cheap individually. The aggregate cost of producing one good outcome was climbing silently.

In our previous posts, we covered execution semantics, execution control vs orchestration, and per-tool retry budgets. This post is about the cost metric that ties them together: what did you spend to produce one successful outcome?

The problem with cost per call

Cost per call answers: “How much did this invocation cost?”

It does not answer:

  • How many calls did it take to produce one successful result?
  • What fraction of spend went to retries that contributed nothing?
  • Which tools are expensive because they fail often, not because they are priced high?

Consider a workflow with three tool calls. On a good run, each executes once. Total cost: $0.09.

On a bad run, the search tool fails three times before succeeding, and the extraction step fails once. Total cost: $0.21. Cost per call is still $0.03. Cost per successful completion jumped from $0.09 to $0.21.
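The arithmetic above can be sketched in a few lines. This is a toy illustration with hypothetical invocation records, not AxonFlow API code: each entry is one invocation with its cost and whether it succeeded.

```python
# Hypothetical invocation records: (tool, cost_usd, succeeded)
good_run = [
    ("search_api", 0.03, True),
    ("extract", 0.03, True),
    ("crm_write", 0.03, True),
]
bad_run = [
    ("search_api", 0.03, False),  # retry 1
    ("search_api", 0.03, False),  # retry 2
    ("search_api", 0.03, False),  # retry 3
    ("search_api", 0.03, True),
    ("extract", 0.03, False),
    ("extract", 0.03, True),
    ("crm_write", 0.03, True),
]

def cost_per_call(run):
    """Average cost per invocation -- identical for both runs."""
    return sum(cost for _, cost, _ in run) / len(run)

def cost_per_completion(run):
    """Total spend to produce one successful outcome -- the run's full bill."""
    return sum(cost for _, cost, _ in run)

# Both runs average $0.03 per call...
print(f"per call: good ${cost_per_call(good_run):.2f}, bad ${cost_per_call(bad_run):.2f}")
# ...but the bad run costs 2.3x more per completed workflow: $0.09 vs $0.21.
print(f"per completion: good ${cost_per_completion(good_run):.2f}, bad ${cost_per_completion(bad_run):.2f}")
```

Cost per call is constant by construction here; only the completion-level sum moves.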

graph LR
    subgraph good ["Good Run - $0.09"]
        A1["search_api<br/>$0.03 ✓"] --> B1["extract<br/>$0.03 ✓"]
        B1 --> C1["crm_write<br/>$0.03 ✓"]
    end
    subgraph bad ["Bad Run - $0.21"]
        A2["🔴 search_api<br/>3 retries × $0.03 = $0.09"] --> A5["search_api<br/>$0.03 ✓"]
        A5 --> B3["🔴 extract<br/>$0.03 ✗"]
        B3 --> B2["extract<br/>$0.03 ✓"]
        B2 --> C2["crm_write<br/>$0.03 ✓"]
    end
    good ~~~ bad
    style A1 fill:#166534,stroke:#14532D,color:#fff
    style B1 fill:#166534,stroke:#14532D,color:#fff
    style C1 fill:#166534,stroke:#14532D,color:#fff
    style A2 fill:#7F1D1D,stroke:#FECACA,color:#F87171
    style A5 fill:#166534,stroke:#14532D,color:#fff
    style B3 fill:#7F1D1D,stroke:#FECACA,color:#F87171
    style B2 fill:#166534,stroke:#14532D,color:#fff
    style C2 fill:#166534,stroke:#14532D,color:#fff

The dashboard shows “cost per call: $0.03, stable.” The business is paying 2.3x more per resolved document.

Three metrics that cost per call hides

1. Retry waste ratio

What fraction of total spend went to calls that did not contribute to the final outcome?

In the bad run above, 4 of 7 calls were wasted retries. That is 57% retry waste. Cost per call cannot surface this. You need cost attribution per step outcome, not per invocation.
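The attribution is mechanical once each invocation records whether it contributed to the final outcome. A sketch, using the hypothetical bad run above (the `contributed` flag is an assumption about what your step ledger records, not a specific AxonFlow field):

```python
# Hypothetical step-outcome records for the bad run above.
invocations = [
    {"tool": "search_api", "cost_usd": 0.03, "contributed": False},  # retry 1
    {"tool": "search_api", "cost_usd": 0.03, "contributed": False},  # retry 2
    {"tool": "search_api", "cost_usd": 0.03, "contributed": False},  # retry 3
    {"tool": "search_api", "cost_usd": 0.03, "contributed": True},
    {"tool": "extract",    "cost_usd": 0.03, "contributed": False},
    {"tool": "extract",    "cost_usd": 0.03, "contributed": True},
    {"tool": "crm_write",  "cost_usd": 0.03, "contributed": True},
]

def retry_waste_ratio(invocations):
    """Fraction of total spend that went to calls that produced nothing."""
    total = sum(i["cost_usd"] for i in invocations)
    wasted = sum(i["cost_usd"] for i in invocations if not i["contributed"])
    return wasted / total

print(f"retry waste: {retry_waste_ratio(invocations):.0%}")  # 57%
```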

2. Cost per successful completion

The metric that actually matters for unit economics. Not “what did each call cost?” but “what did each resolved ticket, processed document, or completed workflow cost?”

This requires tracking cost at the workflow level, not the call level. Every step records its cost. The workflow sums completed steps. Failed runs still accumulate cost but produce no successful outcome. That keeps the denominator honest.
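In aggregate form, the honest denominator looks like this. A minimal sketch, assuming a flat step ledger of `(workflow_id, cost_usd)` entries and a set of workflows that completed (hypothetical shapes, not the AxonFlow ledger schema):

```python
# Hypothetical step ledger: every step's cost, keyed by workflow.
step_costs = [
    ("wf-1", 0.03), ("wf-1", 0.03), ("wf-1", 0.03),  # clean run: $0.09
    ("wf-2", 0.03), ("wf-2", 0.03), ("wf-2", 0.03),
    ("wf-2", 0.03), ("wf-2", 0.03), ("wf-2", 0.03),
    ("wf-2", 0.03),                                   # retry-heavy run: $0.21
    ("wf-3", 0.06),                                   # failed run: spent, no outcome
]
succeeded = {"wf-1", "wf-2"}  # wf-3 never completed

def cost_per_successful_completion(step_costs, succeeded):
    # Failed runs count in the numerator (the money is spent)...
    total_spend = sum(cost for _, cost in step_costs)
    # ...but only successful runs count in the denominator.
    return total_spend / len(succeeded)

print(f"${cost_per_successful_completion(step_costs, succeeded):.2f} per completion")
```

Here $0.36 of total spend over two successful runs yields $0.18 per completion, even though the cheapest clean run cost only $0.09.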

3. Tool-level cost efficiency

Some tools are expensive because of their per-token price. Others are expensive because they fail often. Cost per call cannot distinguish these.

Tool         Price/call   Failure rate   Avg attempts   Effective cost
search_api   $0.03        25%            2.1            $0.063
extract      $0.05        2%             1.02           $0.051
crm_write    $0.01        1%             1.0            $0.010

search_api looks cheap at $0.03 per call. Its effective cost is $0.063 because it fails frequently. extract is nominally more expensive but more cost-efficient because it almost always succeeds on the first attempt.
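Effective cost is just nominal price times the average number of attempts needed for one success. A sketch using the figures from the table (hypothetical data, not pulled from a live system):

```python
# Per-tool nominal price and average attempts per successful call.
tools = {
    "search_api": {"price_per_call": 0.03, "avg_attempts": 2.1},
    "extract":    {"price_per_call": 0.05, "avg_attempts": 1.02},
    "crm_write":  {"price_per_call": 0.01, "avg_attempts": 1.0},
}

def effective_cost(tool):
    """What one *successful* call actually costs, retries included."""
    return tool["price_per_call"] * tool["avg_attempts"]

# Rank tools by what they really cost, not by their sticker price.
for name, t in sorted(tools.items(), key=lambda kv: -effective_cost(kv[1])):
    print(f"{name}: ${effective_cost(t):.3f}")
```

Sorting by effective cost puts search_api first despite its low sticker price, which is exactly the inversion cost per call hides.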

Tracking cost at the step boundary

To compute cost per successful completion, you need cost recorded at each step, not just at each LLM call.

One way to implement this is to record cost alongside step completion. Here is a concrete example using AxonFlow, where each step records its cost as part of the completion:

gate = await client.step_gate(
    workflow_id=workflow.workflow_id,
    step_id="search-kb",
    request=StepGateRequest(
        step_name="Search Knowledge Base",
        step_type=StepType.TOOL_CALL,
        tool_context=ToolContext(
            tool_name="search_api",
            tool_type="function",
        ),
    ),
)

if gate.is_allowed():
    result = search_kb(query)
    await client.mark_step_completed(
        workflow.workflow_id, "search-kb",
        MarkStepCompletedRequest(
            output=result,
            cost_usd=result.usage.total_cost,
        ),
    )

Each step’s cost is persisted to the step ledger. The workflow’s total cost is the sum of all step costs, including retries. A failed step that retried three times before succeeding has its cost reflected in the workflow total.

This is the same step gate pattern from our previous posts. Cost tracking adds one field to the completion record. The infrastructure already exists if you have per-step tracking.

Budget enforcement with cost visibility

Once you have per-step cost, budget enforcement becomes precise. Instead of a single spending cap on the entire system, you can scope budgets by workflow, agent, team, or organization:

Example policy shape (pattern, not current config schema):

budgets:
  support-automation:
    scope: agent
    limit_usd: 50.00
    period: daily
    alert_thresholds: [50, 80, 100]
    on_exceed: warn

  document-processing:
    scope: workflow
    limit_usd: 2.00
    period: per_run
    on_exceed: block

The support automation agent gets $50/day across all runs. Each document-processing workflow gets $2 per run. If a single run burns through retries and hits $2, the system blocks further steps rather than letting cost accumulate silently.

flowchart TD
    A[Step Invocation] --> B{Step Gate}
    B --> C{Budget Check}
    C -->|within budget| D[Execute Step]
    C -->|threshold reached| E[Alert + Continue]
    C -->|exceeded| F{on_exceed}
    D --> G[Record Step Cost]
    G --> H[Update Budget Usage]
    F -->|warn| E
    F -->|block| I[Block Step]
    F -->|downgrade| J[Route to Cheaper Model]
    style B fill:#0F766E,stroke:#134E4A,color:#fff
    style C fill:#0F766E,stroke:#134E4A,color:#fff
    style D fill:#166534,stroke:#14532D,color:#fff
    style F fill:#78350F,stroke:#92400E,color:#fff
    style I fill:#7F1D1D,stroke:#FECACA,color:#F87171
    style J fill:#F1F5F9,stroke:#E2E8F0

The budget check happens at the step gate, before execution. If the workflow has already spent $1.80 of its $2 budget, the gate can block an expensive LLM step or route it to a cheaper model. This is proactive cost control, not retroactive reporting.
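The gate-side decision reduces to a small function. This is a minimal sketch of the pattern, with a hypothetical signature and decision strings; AxonFlow's actual policy engine is not shown here:

```python
def gate_decision(spent_usd, limit_usd, step_cost_estimate, on_exceed="block"):
    """Decide BEFORE execution, based on what the step is expected to cost.

    Hypothetical sketch: spent_usd is the run's accumulated step cost,
    limit_usd its per-run budget, on_exceed the configured policy action.
    """
    if spent_usd + step_cost_estimate <= limit_usd:
        return "allow"
    # Over budget: apply the policy's on_exceed action.
    return "downgrade" if on_exceed == "downgrade" else "block"

# A $2/run workflow that has already spent $1.80 cannot afford a $0.40 step...
print(gate_decision(1.80, 2.00, 0.40))               # block
# ...unless policy says to route it to a cheaper model instead.
print(gate_decision(1.80, 2.00, 0.40, "downgrade"))  # downgrade
# A cheap step still fits under the ceiling.
print(gate_decision(1.80, 2.00, 0.15))               # allow
```

The key design choice is that the check consumes an estimate before the spend happens, rather than reconciling after the bill arrives.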

The metrics that replace cost per call

If you shift from cost per call to cost per successful completion, track these:

Metric                           What it reveals
Cost per successful completion   True unit economics of your AI system
Retry waste ratio                Fraction of spend that produced no outcome
Cost efficiency by tool          Which tools are expensive due to failure, not price
Budget utilization rate          How close runs come to their cost ceiling
Cost variance (P50 vs P95)       Whether cost is predictable or spiky

The first metric is the one your finance team actually cares about. The rest tell engineering where to optimize.

Operational impact

Cost per call produces reports like:

“Average cost per call: $0.03. Monthly spend: $12,400. Within budget.”

Cost per successful completion produces:

“Cost per resolved ticket: $0.47 (up 38% from last month). search_api retry rate increased from 5% to 25% after provider migration. Retry waste: $2,100/month. Recommendation: add fallback provider or reduce search_api retry budget.”

The first is a finance summary. The second is an engineering action item.

Litmus test

If your LLM costs increase 40% next month, can you identify whether it was:

  • More traffic (good)?
  • Higher failure rates driving retries (fixable)?
  • A model price increase (vendor decision)?
  • A single tool consuming disproportionate budget (optimization target)?

If the answer is “we would need to dig into logs,” you are tracking cost per call. You should be tracking cost per successful completion, with per-step attribution.

Measure cost where outcomes are produced (the workflow boundary) and attribute it where spend occurs (the step boundary).

What is your cost per successfully completed workflow, and do you know which steps drive it?

If you want implementation details, here are the relevant links: