Blog

Engineering insights on AI governance, LLM infrastructure, and building production control planes.

June 9, 2026 6 min read

Who Owns the "No" in Your AI Stack?

A workflow had run cleanly in staging for weeks. In production, one downstream call timed out. The orchestrator did the sensible thing and retried. The retry re-issued an action the workflow had …

engineering production llm execution-control governance decision-mode

May 5, 2026 7 min read

The Case for an AI Execution Control Plane

A team had a workflow that retried itself when an LLM call timed out. The retry budget was set on the wrong layer, so a single bad query drained their daily quota in 90 seconds. They added a budget. …

engineering production llm architecture control-plane governance wcp map

April 28, 2026 11 min read

Idempotency Boundaries in Multi-System AI Automation

A workflow tried a wire transfer. The bank API succeeded. The orchestrator crashed before it could record completion.

On retry, the workflow hit the bank again. Same Idempotency-Key. The bank …

engineering production llm workflow idempotency retry wcp governance

April 21, 2026 8 min read

Agent Instructions Are Not a Security Boundary

Previous post in this series: SOUL.md Is Not a Security Boundary, the OpenClaw-specific version of the argument below, with CVE context and the ClawHavoc supply-chain incident.

“We told the …

engineering production security governance plugins openclaw claude-code cursor codex

April 14, 2026 9 min read

SOUL.md Is Not a Security Boundary

“The agent won’t do that. I told it not to in SOUL.md.”

This is the most common response when someone asks how an OpenClaw agent is prevented from running rm -rf /, exfiltrating …

engineering production openclaw security governance plugins

March 31, 2026 7 min read

You Can't Pause an AI Workflow. You Can Only Stop at the Right Boundary.

“Pause the workflow before it sends anything.”

This sounds simple. In a traditional batch system, you checkpoint state, stop the process, and resume later from the checkpoint. In a …

engineering production llm workflow pause-resume hitl governance wcp

March 24, 2026 8 min read

Audit Logs Are Not Enough. You Need Decision Context.

Every production AI system has audit logs.

Most of them record the wrong thing.

They capture what happened: which model was called, what the input was, what the output was, how long it took, how much …

engineering production llm audit compliance governance eu-ai-act rbi mas-feat

March 10, 2026 7 min read

Cost Per Call Is a Useless Metric for Agent Systems

Most teams track cost per LLM call.

It is the easiest metric to compute and the least useful for understanding what your system actually costs.

Every provider returns token counts. Multiply by price …

engineering production llm cost-control observability

March 3, 2026 5 min read

Why Retry Budgets Should Be Per Tool, Not Per Run

Most teams start with a global retry limit or max-iteration cap on agent runs.

It feels safe. It gives one number to monitor.

In production, a single unstable tool can consume most of that budget …

engineering production llm cost-control reliability

February 24, 2026 3 min read

Orchestration Is Not Execution Control

Your orchestration graph can run exactly as designed and still produce incidents.

Tools fire. Models respond. Branches resolve.

Yet you end up with duplicate writes, partial state, and postmortems …

engineering production llm orchestration execution-control

February 12, 2026 7 min read

What Breaks First When LLM Workflows Hit Production

Most LLM systems start life as request-response calls.

You send a prompt. You get a response. If it fails, you retry.

This works beautifully in demos.

It breaks quickly in production.

Not because the …

engineering production llm execution-control