Most LLM systems start life as request-response calls.
You send a prompt. You get a response. If it fails, you retry.
This works beautifully in demos.
It breaks quickly in production.
Not because the model is inaccurate. Not because latency is high.
But because execution semantics become implicit. By “execution semantics” we mean: what exactly happens on retries, failures, cancellation, and partial completion. Most frameworks leave this undefined.
Once an LLM call is embedded inside a long-running workflow that touches real systems — databases, ticketing tools, approval flows, external APIs — the system stops behaving like a stateless API and starts behaving like a distributed system.
And most tooling still treats it like a function call.
This is the set of primitives we ended up building in AxonFlow — not because we planned to build a workflow engine, but because every deployment we ran kept breaking on the same class of problems.
Before we get into the details, a few terms that will come up throughout:
- Step ledger — a persistent, per-step completion record for a workflow run. It answers: “did this step already execute?” without relying on logs or memory.
- Step gate — a checkpoint before each step that evaluates policies, checks budgets, and records the admission decision before allowing the step to proceed.
- Execution replay — the ability to reconstruct exactly what happened during a workflow run by reading recorded inputs, outputs, and decisions. This is distinct from re-execution: replay reads the ledger; re-execution runs the code again.
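To make the first of these concrete, here is a minimal sketch of what a step-ledger record might hold. The field names are illustrative, not AxonFlow's actual schema:

from dataclasses import dataclass
from enum import Enum
from typing import Optional

class StepStatus(Enum):
    PENDING = "pending"
    COMPLETED = "completed"
    FAILED = "failed"

@dataclass
class StepRecord:
    # One ledger row: a durable answer to "did this step already execute?"
    run_id: str                          # durable run identity, e.g. "wf_9f3a..."
    step_id: str                         # e.g. "create-ticket"
    status: StepStatus
    gate_decision: Optional[str] = None  # "allow" / "block" / "require_approval"
    output: Optional[dict] = None        # persisted output, read back on replay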
Here is what breaks first.
1. Partial Execution Is More Dangerous Than Failure
In early prototypes, a failed LLM call is simple:
try:
    response = call_llm(prompt)
except Exception:
    retry()
In production, that LLM call might be step 3 of 7:
Step 1 Read ticket
Step 2 Classify issue
Step 3 Generate remediation plan (LLM)
Step 4 Create Jira task
Step 5 Notify Slack
Step 6 Update database
Step 7 Close ticket
If step 5 fails, steps 1–4 have already executed.
A blind retry does not “resume.” It replays.
That can mean:
- Duplicate Jira tickets
- Duplicate Slack notifications
- Duplicate database writes
- Conflicting state
The failure is not the model. The failure is that the workflow has no durable execution identity.
We saw this in a real deployment: a customer support automation hit a Slack rate limit on step 5, retried from the top, and created a duplicate Jira ticket. A human had already picked up the first ticket and resolved it. The retry reopened a closed issue. The LLM worked perfectly every time — the execution semantics were the problem.
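A sketch of that failure mode in miniature, with stand-in step functions (nothing here is AxonFlow; the names are hypothetical): the retry loop has no memory of what already ran, so every attempt replays the Jira write.

def create_jira_task(ticket_id):
    # Real side effect: every call opens a new ticket.
    print(f"created Jira task for {ticket_id}")

def notify_slack(ticket_id):
    raise TimeoutError("Slack rate limit")   # the step-5 failure

STEPS = [create_jira_task, notify_slack]     # steps 1-3 and 6-7 omitted for brevity

def run_workflow(ticket_id):
    for step in STEPS:
        step(ticket_id)

for attempt in range(3):
    try:
        run_workflow("TICKET-42")
        break
    except TimeoutError:
        continue   # the "retry" replays create_jira_task: three duplicate Jira tasks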
2. Retries Amplify Side Effects
Retries are safe in stateless systems. They are unsafe in side-effecting systems.
If you do not have explicit idempotency boundaries per step, retries are indistinguishable from replays.
Consider this simplified flow:
if not run.completed("create_ticket"):
    create_ticket()
Without a persistent per-step ledger, the system cannot deterministically answer whether create_ticket() already succeeded.
Many AI frameworks rely on in-memory state or log inspection. That works until:
- A process restarts
- A pod is rescheduled
- A timeout occurs mid-call
At that point, retry becomes archaeology — reconstructing history from logs instead of reading it from a run record.
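The fix in miniature is a completion record that lives outside the process. A sketch using a plain SQLite table as the run record (purely illustrative; AxonFlow's own mechanism is shown next):

import sqlite3

db = sqlite3.connect("run_ledger.db")   # survives restarts and rescheduling
db.execute("""CREATE TABLE IF NOT EXISTS step_ledger
              (run_id TEXT, step_id TEXT, status TEXT,
               PRIMARY KEY (run_id, step_id))""")

def completed(run_id, step_id):
    row = db.execute("SELECT status FROM step_ledger WHERE run_id = ? AND step_id = ?",
                     (run_id, step_id)).fetchone()
    return row is not None and row[0] == "completed"

def mark_completed(run_id, step_id):
    db.execute("INSERT OR REPLACE INTO step_ledger VALUES (?, ?, 'completed')",
               (run_id, step_id))
    db.commit()

if not completed("wf_9f3a", "create_ticket"):
    create_ticket()          # hypothetical side effect; runs at most once per run
    mark_completed("wf_9f3a", "create_ticket")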
This is what led us to build per-step tracking into AxonFlow’s Workflow Control Plane. Each step gets an explicit gate check and completion record:
# Check the gate — has this step already run?
gate = await client.step_gate(
    workflow_id=workflow.workflow_id,
    step_id="create-ticket",
    request=StepGateRequest(
        step_name="Create Jira Ticket",
        step_type=StepType.TOOL_CALL,
    ),
)

if gate.is_allowed():
    result = create_ticket()

    # Output persisted to the ledger
    await client.mark_step_completed(
        workflow_id=workflow.workflow_id,
        step_id="create-ticket",
        request=MarkStepCompletedRequest(
            output={"ticket_id": result.id}
        ),
    )
On retry, the system doesn’t guess. It reads the ledger. Step already completed? Skip it. Step failed? Resume from there.
Run ID: wf_9f3a...
Step 1 [completed] Read Ticket
Step 2 [completed] Classify
Step 3 [completed] Generate Plan
Step 4 [completed] Create Jira
Step 5 [failed] Notify Slack <-- resume here
Step 6 [pending] Update DB
Step 7 [pending] Close Ticket
Without this, retry means re-running step 4. With it, retry means resuming from step 5.
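Resume logic then becomes a loop that asks the gate before every step and skips anything the ledger already shows as complete. A sketch using the same client calls as above (the step list, and the assumption that the gate disallows re-running completed steps, are ours rather than SDK guarantees):

async def resume_run(client, workflow, steps):
    # steps: ordered list of (step_id, step_name, step_type, side_effect_fn)
    for step_id, step_name, step_type, side_effect in steps:
        gate = await client.step_gate(
            workflow_id=workflow.workflow_id,
            step_id=step_id,
            request=StepGateRequest(step_name=step_name, step_type=step_type),
        )
        if not gate.is_allowed():
            continue            # already completed, blocked, or awaiting approval
        output = side_effect()  # on the retry above, only steps 5-7 reach this line
        await client.mark_step_completed(
            workflow_id=workflow.workflow_id,
            step_id=step_id,
            request=MarkStepCompletedRequest(output=output),
        )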
3. Logs Show What Happened. Not Why It Was Allowed.
During incident review, the hardest question is rarely “What happened?”
It is: “Why was this allowed?”
- Which inputs were present?
- Which policy version was active?
- What permissions existed?
- What cost thresholds were in place?
- What decision context admitted this action?
Traditional logging systems capture events. They do not capture admissibility.
That distinction becomes critical under compliance review, audit investigation, regulator inquiry, and internal postmortems.
This is why AxonFlow records a decision chain for every execution step — not just what happened, but the policy state, risk classification, and gate decision at the time of execution:
{
  "execution_id": "wf_9f3a...",
  "step": "create-ticket",
  "decision": "allow",
  "policy_version": "v2.3",
  "risk_level": "medium",
  "cost_usd": 0.02,
  "input_hash": "sha256:a1b2c3...",
  "output_hash": "sha256:d4e5f6...",
  "audit_hash": "sha256:789abc..."
}
The hashes create a tamper-evident chain. If someone asks “was this decision legitimate?” six months later, you can reconstruct exactly what the system knew at execution time.
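The chaining itself is simple to picture: each step's audit hash commits to its decision context and to the previous step's audit hash, so altering any earlier record invalidates every hash after it. A sketch of one plausible construction (not necessarily the exact scheme AxonFlow uses):

import hashlib
import json

def audit_hash(record, prev_audit_hash):
    # Commit to this step's decision context plus everything that came before it.
    payload = json.dumps({
        "step": record["step"],
        "decision": record["decision"],
        "policy_version": record["policy_version"],
        "input_hash": record["input_hash"],
        "output_hash": record["output_hash"],
        "prev": prev_audit_hash,
    }, sort_keys=True)
    return "sha256:" + hashlib.sha256(payload.encode()).hexdigest()

# Verifying the chain later means recomputing every hash in order and
# comparing against the stored audit_hash values.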
4. Long-Running AI Workflows Behave Like Distributed Systems
Once workflows become multi-step, conditional, time-separated, multi-agent, and side-effecting, you have:
- State that must survive process restarts
- Identity that must be globally unique
- Cancellation that must propagate cleanly
- Replay that must be deterministic
- Policy enforcement that must happen at step boundaries
This is no longer prompt engineering. This is workflow control.
            ┌──────────────────┐
User ──────▶│ LLM Orchestrator │
            └────────┬─────────┘
                     │
        ┌────────────┼────────────┐
        ▼            ▼            ▼
   ┌─────────┐  ┌─────────┐  ┌─────────┐
   │  Tool   │  │Database │  │External │
   │  Call   │  │  Write  │  │   API   │
   └─────────┘  └─────────┘  └─────────┘
Side effects. Each one needs governance.
Most systems only observe the top of that diagram (the LLM call). Production discipline requires controlling the bottom (the side effects).
In our Go implementation, each workflow gets a durable run ID (wf_ prefix), and each step transition passes through a gate that evaluates policies, checks budgets, and records the decision — all in under 30ms P95:
gate, err := client.StepGate(
    workflow.WorkflowID, "step-3",
    axonflow.StepGateRequest{
        StepName: "Generate Plan",
        StepType: axonflow.StepTypeLLMCall,
    },
)
if err != nil {
    return err // the gate itself could not be evaluated; do not run the step
}

switch gate.Decision {
case "allow":
    // proceed
case "block":
    // stop, record reason
case "require_approval":
    // pause for human review
}
The gate decision is recorded. The step output is recorded. The entire execution is replayable. If you need to understand what happened at 3am on a Tuesday, you don’t grep logs — you replay the execution.
5. This Is a Control-Plane Problem
In Kubernetes, we separate the data plane (workloads) from the control plane (desired state, reconciliation, policy).
LLM systems today mostly implement the data plane — they call models, chain prompts, run tools. The missing piece is the execution control plane:
- Durable run identity
- Per-step completion ledger
- Explicit cancellation semantics
- Replay from recorded inputs
- Policy enforcement at execution time
- Audit trails that include decision context
Orchestrators decide what to do next. Control planes decide what is allowed to happen next.
Without this separation, reliability and compliance become accidental.
That’s the architectural bet behind AxonFlow. The system sits between your orchestration framework and the side effects, providing execution control at each step transition. LangChain runs your workflow. AxonFlow decides when each step is allowed to proceed.
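One way to picture that separation in code: the two components answer different questions. A structural sketch (these are not AxonFlow's actual interfaces, just the shape of the split):

from typing import Any, Protocol

class Orchestrator(Protocol):
    # Data plane: decides what to do next.
    def next_step(self, state: dict) -> Any: ...

class ControlPlane(Protocol):
    # Control plane: decides whether that step is allowed to happen next,
    # and records the decision and the outcome.
    def admit(self, run_id: str, step_id: str, context: dict) -> bool: ...
    def record_completion(self, run_id: str, step_id: str, output: dict) -> None: ...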
6. The Production Readiness Checklist
If you are running LLM-driven automation in production, here is what you need:
| Requirement | Why |
|---|---|
| A run ID that survives process restarts | Without it, retries replay from scratch |
| A per-step completion ledger | Without it, you can’t resume from failure |
| Idempotency boundaries for side effects | Without them, retries create duplicates |
| Explicit cancellation semantics | Without them, abandoned workflows leave partial state |
| Execution-time policy evaluation | Without it, policy is advisory, not enforced |
| Audit artifacts tied to each decision | Without them, incident review is guesswork |
You need all of these. Not because of regulation, but because of operational safety.
Quick self-check
If you can’t answer “yes” to all of these, you have a gap:
- Can you resume a failed workflow from the exact step that failed — without re-executing earlier steps?
- Can you prove, six months later, which policy version authorized a specific action?
- Can you cancel a running workflow and guarantee no further side effects execute?
- Are your retries idempotent at the step level, not just the request level?
- Do you track decision context (not just events) for every step?
The first production failures in LLM systems are rarely about model quality. They are about execution semantics. And execution semantics require explicit control.
If you want to see this pattern end to end, run the execution-tracking example:
git clone https://github.com/getaxonflow/axonflow.git
cd axonflow/examples/execution-tracking/http
./example.sh
Working code in Go, Python, TypeScript, and Java. The workflow-control example demonstrates step gates and the execution ledger.