15 June 2026 · AI Agents, Engineering Delivery

What a Production Multi-Agent System Actually Looks Like

The phrase multi-agent system gets used for a lot of things, most of them small. Two prompts calling each other in a notebook is not a multi-agent system. Neither is a single model asked to play several roles in one long conversation. What we mean by it, and what it takes to run one in production, is a set of distinct agents that each own a stage of real work, hand off to each other in a controlled way, and ship a result a team can rely on. This piece describes how that looks in practice, using Hermes, our own multi-agent coding system, as the worked example.

Hermes plans, writes, tests, and moves software toward deployment, with people reviewing at the points that matter. It is live and it builds real systems. That last part is the whole point. A demo answers whether agents can do a task at all. Production asks a harder question: can they do it reliably, day after day, in a way a team trusts enough to leave running?

Specialised agents, not one model wearing many hats

The first design choice is to split the work. Software delivery is not one task. It is planning a change, writing the code, testing it, and getting it safely toward production. Each of those stages rewards a different posture. Planning wants breadth and caution. Writing wants focus on a narrow, well-specified change. Testing wants suspicion, a deliberate hunt for what is wrong rather than confirmation that something works.

So in Hermes those stages are handled by separate agents rather than one model asked to do everything at once. A planning agent turns an instruction into a concrete, scoped change. A writing agent implements that change against the codebase. A testing agent checks the result and tries to break it. Splitting the work keeps each agent's job small and legible, which is what makes its output reviewable. It also means a weak step shows up as a weak step, not as a vague failure somewhere inside a thousand-line conversation.

The handoffs are the architecture

If the agents are the pieces, the handoffs between them are the system. The hard engineering in a multi-agent setup is not the prompts. It is what passes from one stage to the next, and whether the next stage can trust it.

We treat every handoff as structured data, not free text. The planning agent does not hand the writing agent a paragraph of intentions. It hands over a defined shape: what is changing, where, and what the result should satisfy. We parse that against a strict schema before the next agent ever sees it, so a malformed or half-formed output is caught at the boundary rather than turned into bad code three steps later. The same applies coming out of the writing and testing stages. Outputs are validated, not assumed.
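As a rough illustration of validating at the boundary, here is a minimal sketch using only the Python standard library. The field names (summary, target_files, acceptance_criteria) and the HandoffError type are assumptions made for this example, chosen to mirror "what is changing, where, and what the result should satisfy". They are not the actual Hermes schema.

```python
import json
from dataclasses import dataclass


class HandoffError(ValueError):
    """Raised when a stage's output does not match the expected shape."""


@dataclass(frozen=True)
class PlanHandoff:
    # Hypothetical fields; the real schema is not shown in this article.
    summary: str
    target_files: tuple[str, ...]
    acceptance_criteria: tuple[str, ...]


def _non_empty_strings(value, field_name):
    """Accept only a non-empty list of non-blank strings."""
    if not isinstance(value, list) or not value:
        raise HandoffError(f"{field_name} must be a non-empty list")
    for item in value:
        if not isinstance(item, str) or not item.strip():
            raise HandoffError(f"{field_name} must contain non-empty strings")
    return tuple(value)


def parse_plan(raw: str) -> PlanHandoff:
    """Parse planner output strictly: missing or unknown fields are rejected."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise HandoffError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise HandoffError("top-level value must be an object")

    expected = {"summary", "target_files", "acceptance_criteria"}
    if set(data) != expected:
        raise HandoffError(
            f"fields must be exactly {sorted(expected)}, got {sorted(data)}"
        )

    summary = data["summary"]
    if not isinstance(summary, str) or not summary.strip():
        raise HandoffError("summary must be a non-empty string")

    return PlanHandoff(
        summary=summary,
        target_files=_non_empty_strings(data["target_files"], "target_files"),
        acceptance_criteria=_non_empty_strings(
            data["acceptance_criteria"], "acceptance_criteria"
        ),
    )
```

In a sketch like this, the writing agent would only ever receive a PlanHandoff, never the raw text. A half-formed plan raises HandoffError at the boundary, which is the fail-fast behaviour described above.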

Each stage produces a defined, parseable output, not prose the next stage has to interpret.
Outputs are validated against a schema at the boundary, so a bad result fails fast instead of propagating.
Each handoff is recorded, so the path from instruction to shipped change can be reconstructed step by step.
Model choice is per stage: the right model is routed to each job, and it can be swapped without rewriting the system.
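The last point, model choice per stage, can be pictured as a small routing table kept apart from the agents themselves. The sketch below is an assumption-laden illustration: ModelRouter, the stage names, and the model names "small-fast" and "large-careful" are all hypothetical, and the model clients are stand-in functions, not a real provider SDK.

```python
from typing import Callable, Dict

# A model client is modelled here as any function from prompt to text.
ModelFn = Callable[[str], str]


class ModelRouter:
    """Maps each stage to a named model; swapping a model is a config change."""

    def __init__(self, routing: Dict[str, str], registry: Dict[str, ModelFn]):
        unknown = sorted(set(routing.values()) - set(registry))
        if unknown:
            raise KeyError(f"no client registered for: {unknown}")
        self._routing = dict(routing)
        self._registry = dict(registry)

    def model_for(self, stage: str) -> ModelFn:
        """Return the client currently routed to this stage."""
        return self._registry[self._routing[stage]]

    def swapped(self, stage: str, model_name: str) -> "ModelRouter":
        """Return a new router with one stage re-routed; the original is unchanged."""
        if stage not in self._routing:
            raise KeyError(f"unknown stage: {stage}")
        return ModelRouter({**self._routing, stage: model_name}, self._registry)


# Stand-in clients with placeholder names, for illustration only.
registry: Dict[str, ModelFn] = {
    "small-fast": lambda prompt: f"[small-fast] {prompt}",
    "large-careful": lambda prompt: f"[large-careful] {prompt}",
}

router = ModelRouter(
    {"plan": "large-careful", "write": "small-fast", "test": "large-careful"},
    registry,
)
```

Because the agents ask the router for a model rather than naming one, re-routing the writing stage is a single call such as router.swapped("write", "large-careful"), and nothing else in the system changes.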

This is also where the reliability comes from. A long single-prompt agent fails in ways that are hard to localise. A pipeline of validated handoffs fails at a named boundary, with a record of what went in and what came out. When something goes wrong, and it will, you can see exactly which stage produced the bad result and why.

Decision gates: where a person stays accountable

Agents doing the work does not mean agents making every call unattended. Hermes has decision gates, points where the system stops and a person reviews and approves before it proceeds. The skill is putting the gates in the right places. Gate every step and you have rebuilt a manual process with extra latency. Gate nothing and you have shipped an unsupervised system that changes production code on its own.

We place gates where the cost of being wrong is highest and the action is hardest to reverse: a plan before it becomes code, and a change before it moves toward deployment. Low-stakes, easily reversible steps run on their own with the record available afterwards. The result is that a person is never in the loop for everything, but is always accountable for the outcome. That distinction, doing the work versus owning the outcome, is the one that lets a team actually trust the system.
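The placement rule can be written down as a small policy. The sketch below assumes a hypothetical three-point cost scale and a simple reversible flag. The ratings given to each step are illustrative judgments for the example, not measurements from Hermes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    cost_of_error: int  # hypothetical scale: 1 (low) to 3 (high)
    reversible: bool    # can the action be undone cheaply?


def needs_gate(step: Step, cost_threshold: int = 3) -> bool:
    """Gate a step only when being wrong is costly and hard to undo."""
    return step.cost_of_error >= cost_threshold and not step.reversible


# Illustrative ratings only.
steps = [
    Step("turn approved plan into code", cost_of_error=3, reversible=False),
    Step("run the test suite", cost_of_error=1, reversible=True),
    Step("move change toward deployment", cost_of_error=3, reversible=False),
]

gated = [step.name for step in steps if needs_gate(step)]
```

With these ratings, gated contains the plan and deployment steps and leaves the test run unattended, matching the two gates described above. Lowering the threshold gates more steps and adds latency, which is the trade-off the placement decision is managing.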

Agents do the work. People stay accountable for the outcome. The decision gates are where those two facts meet, and getting their placement right is most of the design.

How it ships software

Put together, a run through Hermes looks like a delivery process rather than a chat. An instruction comes in. The planning agent scopes it into a concrete change, which a person reviews at the first gate. The writing agent implements the approved plan against the codebase. The testing agent checks the result and looks for failures. The change then waits at a second gate for a person to approve before it moves toward deployment. At every stage the inputs, outputs, and decisions are recorded, so nothing the system did is unaccountable after the fact.
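The run described above can be sketched as a short orchestration function. This is a simplified illustration under stated assumptions, not the Hermes implementation: the agents and the approver are passed in as plain functions, payloads are strings, and the names run_pipeline, Record, StageFailure, and GateRejected are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Record:
    stage: str
    input: str
    output: str


class StageFailure(RuntimeError):
    """An agent stage failed; carries the name of the stage that failed."""

    def __init__(self, stage: str, reason: str):
        super().__init__(f"{stage}: {reason}")
        self.stage = stage


class GateRejected(RuntimeError):
    """A reviewer declined to approve at a decision gate."""


def run_pipeline(
    instruction: str,
    planner: Callable[[str], str],
    writer: Callable[[str], str],
    tester: Callable[[str], str],       # assumed to raise if the change fails
    approve: Callable[[str, str], bool],  # (gate name, payload) -> decision
    log: List[Record],
) -> str:
    def stage(name: str, fn: Callable[[str], str], payload: str) -> str:
        # Record what went in and what came out, including failures.
        try:
            out = fn(payload)
        except Exception as exc:
            log.append(Record(name, payload, f"error: {exc}"))
            raise StageFailure(name, str(exc)) from exc
        log.append(Record(name, payload, out))
        return out

    def gate(name: str, payload: str) -> None:
        # Stop until a person decides; the decision is recorded either way.
        approved = approve(name, payload)
        log.append(Record(name, payload, "approved" if approved else "rejected"))
        if not approved:
            raise GateRejected(name)

    plan = stage("plan", planner, instruction)
    gate("gate:plan", plan)                      # first gate: plan before code
    change = stage("write", writer, plan)
    report = stage("test", tester, change)
    gate("gate:deploy", f"{change}\n{report}")   # second gate: before deployment
    return change
```

A failure surfaces as a StageFailure naming the stage, and the log holds the input and output of every step up to that point, so the path from instruction to result can be read back in order.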

None of this is exotic. It is the discipline of ordinary software delivery (scoping, review, testing, sign-off) applied to a system whose workers happen to be agents. That is deliberate. The teams who can trust agents are the ones who can see the same controls they already rely on, not a black box that produces commits.

Why we build them this way

Hermes is how we pressure-test what we recommend to clients. The patterns in it (specialised agents, validated handoffs, model choice as a replaceable dependency, and human-in-the-loop gates) are the same ones we bring to agent work for regulated firms. We did not arrive at them because they sounded rigorous. We arrived at them because the alternatives, the long single prompt and the unsupervised agent, do not survive contact with production.

A production multi-agent system, then, is not a cleverer model. It is an architecture: small specialised agents, structured and validated handoffs, gates placed where reversibility runs out, and a complete record underneath. The model is the part that gets the headlines. The structure around it is the part that ships.
