June 23, 2026 · 8 min read

Guardrails in the Agent's Memory Can't Police a Runaway Loop

AxonFlow is a source-available runtime control layer for production AI systems.

engineering production llm execution-control governance decision-mode


A nightly reconciliation agent had a guardrail. Before every payout it ran a validation step in its own loop: check the amount against the budget it was tracking in its own working memory, check the destination against an allowlist, then call the tool. It had passed review precisely because that check existed. In staging it behaved.

One night an upstream record came in malformed, and the planning step started to loop. It kept concluding there was one more payout to settle. The guardrail ran every single time, exactly as written. It approved every single time, because from inside the loop each individual payout was small, sent to an allowed account, and within a budget that the loop kept reporting as fine. The check was correct about every transaction and blind to the only thing that mattered: that it was the same transaction, firing in a circle.

By morning the daily payout budget was gone, spread across a few hundred small transfers. The guardrail had not failed in the sense of throwing an error. It had run flawlessly. It just could not police a loop that it was running inside of.
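The failure is easier to see in miniature. The sketch below is a hypothetical reconstruction, not the actual agent: the names `in_loop_guardrail`, `ALLOWLIST`, and `DAILY_BUDGET`, the account id, and the amounts are all invented for illustration. It assumes the wedged planner rebuilds its working memory on every pass, so the spend the check sees never accumulates.

```python
ALLOWLIST = {"acct-001"}   # hypothetical allowed destination
DAILY_BUDGET = 10_000      # hypothetical daily payout budget


def in_loop_guardrail(memory, amount, destination):
    """The check as reviewed: allowed account, and within the tracked budget."""
    within_budget = memory["spent"] + amount <= DAILY_BUDGET
    return destination in ALLOWLIST and within_budget


def run_wedged_loop(iterations):
    """Simulate a planner that keeps finding 'one more payout to settle'."""
    approvals = 0
    actually_sent = 0
    for _ in range(iterations):
        # Assumption: the wedged planner rebuilds its working memory each
        # pass, so the guardrail is always handed a clean slate.
        memory = {"spent": 0}
        if in_loop_guardrail(memory, 50, "acct-001"):
            approvals += 1
            actually_sent += 50
    return approvals, actually_sent


approvals, total = run_wedged_loop(300)
print(approvals, total)  # 300 15000
```

Every one of the 300 checks returns `True`, and every one is right about the payout in front of it. The total still overshoots the budget, because the only state the check consults is state the loop owns.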

A check that shares the agent’s fate is not a control

The now-common way to describe an agent harness is three parts: a loop that calls tools, a context manager that compresses history, and a permission layer that keeps things safe. It is a useful definition, and it hides the whole problem in one word. Notice where the permission layer sits. Inside. Same process, same memory, same context window, same bad night as the loop it is supposed to restrain.

A control that shares the failure modes of the thing it controls is not a control. It is a second feature that breaks at the same time as the first. We accept this nowhere else. A circuit breaker is not a polite request the appliance makes to itself. An auditor does not report to the team being audited. The brakes are not implemented in the engine’s imagination. Yet the default shape of agent safety today is exactly that: the agent checking the agent, in the agent’s own head.

This breaks down in three ways, and they are not the same way three times.

The runaway loop is the case the in-loop check is worst at. The whole point of a safety layer is the moment things go wrong. But a runaway loop is, definitionally, the agent’s own control flow coming off the rails. The in-loop check is part of that control flow. The failure you most need it to catch is the one that takes it down too. It is a smoke detector wired into the circuit that is on fire.

Injected text does not argue with the agent alone. It argues with the guardrail too. We have written before that agent instructions are not a security boundary. The same logic applies one level in. A prompt injection in a tool result, a document, an email the agent reads, lands in the same context the in-loop guardrail reasons over. The text that talks the agent into the bad action talks the in-context check into allowing it, because they are reading the same page. You cannot put the referee in the locker room and expect a clean call.

It does not compose across agents. An in-loop guardrail is reimplemented in every harness, by every team, slightly differently each time. Five agents, five subtly different versions of “the rule,” none of which can see what the others are doing. The thing you wanted, one consistent answer about what is allowed, is the one thing this shape cannot give you.

In the agent’s memory versus a separate decision

Pull the check out of the agent’s reasoning entirely. The agent, or a thin enforcement point next to it, sends the intent of an action to a separate decision point and waits for a verdict before the action runs.

flowchart TB
  subgraph A["In the agent's memory (shared fate)"]
    direction LR
    L[Reasoning loop] --> G[In-loop guardrail]
    G --> T[Tool call]
    G -. same context, same failure .-> L
  end
  subgraph B["A separate decision (separate fate)"]
    direction LR
    L2[Reasoning loop] --> P[Thin enforcement point]
    P -->|may this run?| D[Decision point]
    D -->|verdict + reason + trace| P
    P --> S[Real system]
  end
  A ~~~ B

The difference is not cosmetic. It is whose failure modes the check inherits.

Property	In the agent’s memory	A separate decision
Runs inside the agent’s reasoning and context	Yes	No
Survives a runaway loop	No, it is part of the loop	Yes, it has separate fate
Can be talked out of a verdict by injected text	Yes, same context	No, it does not read the agent’s prompt as instructions
One answer across many agents and harnesses	No, reimplemented each time	Yes, one decision point answers all
Leaves an independent record	Rarely, and in the loop’s own logs	Yes, every verdict recorded outside the agent

One clarification, because this is where the argument is easy to overstate. The line that matters is not the operating-system process. It is the agent’s reasoning. A deterministic check at the tool-call boundary, a hook that fires before the call no matter what the model decided, already escapes most of the trap: the model does not execute it, cannot skip it, and cannot talk it out of a verdict by reasoning differently. Moving that check into a separate process as well is the strongest form, because then a wedged loop or a crash in the agent cannot take the decision down with it. In the agent’s memory is the failure mode. Out of the agent’s reasoning is the fix, and out of its process is the fix with belt and suspenders.
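As a minimal sketch of that in-process form, the wrapper below runs a deterministic check before a tool call regardless of how the caller arrived at it. Everything here is hypothetical: `guarded`, `payout_policy`, `send_payout`, and the allowlisted account are illustrative names, not any particular harness's hook API.

```python
import functools


class Denied(Exception):
    """Raised when the boundary check refuses a tool call."""


def guarded(check):
    """Wrap a tool so `check` runs before every call, whatever the caller decided."""
    def decorator(tool):
        @functools.wraps(tool)
        def wrapper(**kwargs):
            # Keyword-only, so the check sees every argument by name.
            allowed, reason = check(tool.__name__, kwargs)
            if not allowed:
                raise Denied(reason)
            return tool(**kwargs)
        return wrapper
    return decorator


def payout_policy(tool_name, args):
    # Hypothetical deterministic rule: plain code, no model in the path.
    if args["destination"] not in {"acct-001"}:
        return False, "destination not on allowlist"
    return True, "ok"


@guarded(payout_policy)
def send_payout(amount, destination):
    return f"sent {amount} to {destination}"


print(send_payout(amount=50, destination="acct-001"))  # sent 50 to acct-001
```

The agent is only ever handed the wrapped `send_payout`, so there is no version of the tool it can reach without the check. This sketch still lives in the agent's process, which is why it is the weaker of the two forms described above.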

A separate decision point still receives the runaway loop’s request, just like the in-loop check did. The difference is that it can keep a record the loop cannot reach: a count of how many times this action has already fired, a budget the earlier calls already drew down. The hundredth identical payout gets judged against what the first ninety-nine already spent, not against the clean slate the loop keeps handing its own check. And because it sits outside the agent, the injected line in the agent’s context is not an instruction to it. It is just data attached to a request it will judge on its own terms.
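A minimal sketch of that record, assuming a hypothetical `DecisionPoint` class with invented limits. It is held in memory here for brevity; the point of the design is that this state would live outside the agent, where the loop cannot reset it.

```python
from dataclasses import dataclass, field


@dataclass
class DecisionPoint:
    """Hypothetical stateful check: remembers what earlier calls already did."""
    daily_budget: int
    max_repeats: int
    spent: int = 0
    seen: dict = field(default_factory=dict)

    def decide(self, action, amount, destination):
        key = (action, amount, destination)
        count = self.seen.get(key, 0) + 1
        self.seen[key] = count  # denied attempts are counted too
        if count > self.max_repeats:
            return "deny", f"identical request seen {count} times"
        if self.spent + amount > self.daily_budget:
            return "deny", "cumulative budget exhausted"
        self.spent += amount
        return "allow", "within budget"


dp = DecisionPoint(daily_budget=10_000, max_repeats=100)
verdicts = [dp.decide("payout", 50, "acct-001")[0] for _ in range(300)]
allowed = verdicts.count("allow")
print(allowed, dp.spent)  # 100 5000
```

The same 300 identical requests that an in-loop check would approve one by one now stop at the hundredth, because each verdict is computed against the history of the previous ones.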

Where this lives in practice

This is the shape of AxonFlow’s Decision Mode: an external decision point that a thin enforcement point calls over a plain authenticated HTTP request, returning allow, deny, or needs approval, each with a reason and a trace id. Because it is just a call, the same decision point answers for a coding agent, an MCP tool server, a workflow node, or a custom service, and none of them share a framework. Where a harness exposes a hook at the tool-call boundary, the enforcement point can ride that hook; where it exposes none, the enforcement point can be a proxy in front of the tool calls, asking the same question before any of them land. Either way the decision is made outside the agent’s reasoning. Separately, the kind of stateful check the reconciliation story needed, velocity caps and cumulative spend budgets, lives in its own family of policy types that are tracked across calls rather than recomputed per action. That is what lets a control see a burst as a burst instead of a hundred innocent singletons.
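To make the enforcement point concrete, here is a rough sketch of how one might handle the three verdicts. It is not AxonFlow's API: the `ask` callable stands in for the authenticated HTTP request, and the field names `verdict`, `reason`, and `trace_id`, the value spellings, and the fail-closed behavior are all assumptions made for illustration.

```python
class Blocked(Exception):
    """The decision point refused, or no verdict could be obtained."""


class PendingApproval(Exception):
    """The action is held until someone approves it."""


def enforce(ask, intent, run):
    """Ask the decision point about `intent`; call `run` only on an allow."""
    try:
        decision = ask(intent)
    except Exception as exc:
        # Assumption: fail closed when the decision point is unreachable.
        raise Blocked(f"no verdict: {exc}") from exc

    verdict = decision.get("verdict")
    note = f"{decision.get('reason')} (trace {decision.get('trace_id')})"
    if verdict == "allow":
        return run()
    if verdict == "needs_approval":
        raise PendingApproval(note)
    raise Blocked(note)  # deny, or anything unrecognized


def fake_decision_point(intent):
    # Stand-in for the HTTP call; a real one would send `intent` as a request.
    if intent["amount"] > 1000:
        return {"verdict": "needs_approval", "reason": "large payout", "trace_id": "t-2"}
    return {"verdict": "allow", "reason": "within policy", "trace_id": "t-1"}


result = enforce(fake_decision_point, {"tool": "send_payout", "amount": 50}, lambda: "sent")
print(result)  # sent
```

The action itself is passed in as `run` and is only invoked on an explicit allow, so an unknown verdict or a transport error leaves the real system untouched.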

The footprint worth mentioning ends there. The point is not the endpoint. It is the location. The enforcement that can actually stop a runaway agent is the enforcement the runaway cannot reach.

Where this holds, and where it is still hard

Moving the check out of the agent’s reasoning does not make placement free. The enforcement point still has to sit where the action truly crosses into the real system. If an agent has a second, ungoverned path to the same tool, the decision point never sees that traffic and cannot rule on it. The enforcement is only as complete as the set of paths it actually intercepts, and finding all of them in a real system is genuine work.

And a verdict is a statement about admission, not about everything downstream. Allowing a call does not vouch for what the tool does with the bytes after it runs. This buys you one thing precisely: the runtime “no” no longer lives inside the reasoning it is meant to restrain. That is a smaller claim than “nothing can go wrong,” and it is the one that holds on the night the loop comes apart.

The closer

Look at whatever you are counting on to stop your agent from doing the worst thing it could do. Ask one question about it: when your agent next loses the plot in production, spawning, retrying, firing the same action in a circle, will that check be standing outside the wreck, or will it be inside it, going around with everything else?

If it shares the agent’s memory, it shares the agent’s bad night. A control you can trust is one that does not.

This is the second post in a series on execution control: the runtime layer that decides what your AI systems are actually allowed to do. Previous: Who owns the “no” in your AI stack? Next: a decision you cannot explain is not a control.
