What It Takes to Run AI Agents in a Regulated Environment

Most demonstrations of AI agents end where the hard part begins. An agent that reads a ticket, writes some code, and posts a summary looks finished in a five-minute video. Put the same agent inside a regulated financial-services firm and the questions change. Who approved that action? Where is the record? What stops it doing the same thing wrong a thousand times before anyone notices? In a regulated environment, the model is the easy part. The controls around it are the work.

We build AI agents that run in production for regulated firms, so we live inside these questions. This piece sets out what we have learned about the difference between an agent that demos and an agent your compliance function will sign off.

What a production agent actually is

A production agent in a regulated setting is a piece of software that takes an instruction, decides on one or more actions, and carries them out against real systems. It might triage a support queue, reconcile a ledger, draft a regulated communication, or move a delivery workflow forward. The defining feature is that it acts. It is not a chatbot that answers a question and stops. That is exactly why a regulator, an auditor, or a board will treat it as a system that needs governing, not a productivity gadget.

Once an agent can act, the firm inherits responsibility for every action it takes. The UK's Senior Managers and Certification Regime has no carve-out for decisions made by software. A named individual still owns the outcome. So the practical question for any agent is not how clever it is. It is whether the firm can explain, evidence, and control what it does.

Audit trails: if it is not recorded, it did not happen

The first thing we build into an agent is its record. Every meaningful step is logged: the input it received, the context it retrieved, the action it chose, the action it took, and the result. Not a free-text note after the fact. A structured, timestamped, append-only record that maps one-to-one to what happened.

We design these records so a reviewer can reconstruct a single decision end to end without reading the code. That means capturing:

  • The exact prompt and the data the agent saw, with personal data handled under the firm's retention rules.
  • Which model and version produced the output, so behaviour can be tied to a specific configuration.
  • The validated output the agent acted on, separate from the raw model text.
  • The downstream effect: the API called, the record changed, the message sent.
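To make the shape of such a record concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of any production system: the field names, the `AuditLog` class, and the hash-chaining used to make tampering detectable are all hypothetical choices, and a real log would sit on durable, access-controlled storage with retention rules applied to the prompt field.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone

GENESIS = "0" * 64  # placeholder hash for the first record in the chain


@dataclass(frozen=True)
class AuditRecord:
    decision_id: str
    timestamp: str
    prompt: str             # exact prompt and data the agent saw
    model: str
    model_version: str
    raw_output: str         # untrusted model text, kept for reference
    validated_output: dict  # what the agent actually acted on
    effect: dict            # API called, record changed, message sent
    prev_hash: str          # links this record to the one before it


class AuditLog:
    """In-memory append-only log (assumption: real storage would be durable)."""

    def __init__(self) -> None:
        self._entries = []  # list of (digest, JSON line) pairs

    def append(self, **fields) -> str:
        prev_hash = self._entries[-1][0] if self._entries else GENESIS
        record = AuditRecord(
            timestamp=datetime.now(timezone.utc).isoformat(),
            prev_hash=prev_hash,
            **fields,
        )
        line = json.dumps(asdict(record), sort_keys=True)
        digest = hashlib.sha256(line.encode("utf-8")).hexdigest()
        self._entries.append((digest, line))
        return digest

    def verify(self) -> bool:
        """Return True if no stored line was altered or reordered."""
        prev = GENESIS
        for digest, line in self._entries:
            if hashlib.sha256(line.encode("utf-8")).hexdigest() != digest:
                return False
            if json.loads(line)["prev_hash"] != prev:
                return False
            prev = digest
        return True

    def reconstruct(self, decision_id: str) -> list:
        """Return every record for one decision, in the order written."""
        records = (json.loads(line) for _, line in self._entries)
        return [r for r in records if r["decision_id"] == decision_id]
```

A reviewer-facing tool could call `reconstruct("d-1")` to pull one decision end to end, and `verify()` to check that the chain has not been edited since it was written.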

We treat the model output as untrusted until it has been validated. In our own systems, agent outputs are parsed against a strict schema before anything reaches business logic, so a malformed or unexpected response is rejected rather than acted on. The validation step is also a control point: it is where the record is written and where a bad output is caught.
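As a rough sketch of that validation step, the Python below parses raw model text against a strict, hand-written schema using only the standard library. The schema, the action names, and the `OutputRejected` exception are hypothetical; in practice a schema library could do the same job. The point it illustrates is that anything malformed or unexpected raises before business logic sees it.

```python
import json

# Hypothetical schema: exactly these keys, each with this type.
EXPECTED_FIELDS = {"action": str, "ticket_id": str, "reason": str}
# Hypothetical set of actions this agent is allowed to propose.
ALLOWED_ACTIONS = {"close_ticket", "escalate", "request_info"}


class OutputRejected(ValueError):
    """Raised when model output does not match the schema."""


def validate_output(raw: str) -> dict:
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise OutputRejected(f"not valid JSON: {exc}") from exc

    if not isinstance(data, dict):
        raise OutputRejected("top-level value must be an object")

    # Strict: reject missing keys and unexpected extra keys alike.
    if set(data) != set(EXPECTED_FIELDS):
        raise OutputRejected(f"unexpected fields: {sorted(data)}")

    for key, expected_type in EXPECTED_FIELDS.items():
        if not isinstance(data[key], expected_type):
            raise OutputRejected(f"{key} must be {expected_type.__name__}")

    if data["action"] not in ALLOWED_ACTIONS:
        raise OutputRejected(f"action not permitted: {data['action']}")

    return data
```

Only the dictionary returned by `validate_output` would be passed on to business logic; the `except OutputRejected` branch is a natural place to write the rejection to the record.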

Human in the loop, placed where it matters

Human oversight is not a slogan. It is a set of decisions about which actions an agent may complete on its own and which require a person to approve before anything happens. The skill is putting the gate in the right place. Gate everything and you have rebuilt a manual process with extra steps. Gate nothing and you have shipped an unsupervised system into a regulated firm.

We map each action an agent can take to a risk level, then set the oversight to match. Low-risk, reversible actions run on their own with the record available for review after the fact. Actions that move money, touch a customer, or change a regulated record stop at a checkpoint and wait for a person. Hermes, our multi-agent coding system, follows this pattern: it plans, writes, and tests software on its own, but a person approves at defined decision gates before changes progress. The agents do the work; the gates keep a human accountable for the outcome.
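The mapping from action to risk level to oversight can be sketched in a few lines of Python. This is a simplified illustration, not how Hermes or any other system named here is implemented: the action names, the two-level `Risk` enum, and the `OversightGate` class are assumptions made for the example.

```python
import itertools
from enum import Enum


class Risk(Enum):
    LOW = "low"    # reversible; runs unattended, reviewed after the fact
    HIGH = "high"  # moves money, touches a customer, changes a regulated record


# Hypothetical action names and risk levels, for illustration only.
ACTION_RISK = {
    "add_internal_note": Risk.LOW,
    "send_customer_message": Risk.HIGH,
    "issue_refund": Risk.HIGH,
}


class OversightGate:
    def __init__(self, execute) -> None:
        self._execute = execute       # callable(action, payload)
        self._ids = itertools.count(1)
        self.pending = {}             # ticket -> (action, payload)
        self.approvals = []           # (ticket, approver) pairs

    def submit(self, action: str, payload: dict) -> str:
        # Unmapped actions default to HIGH so nothing new runs unattended.
        risk = ACTION_RISK.get(action, Risk.HIGH)
        if risk is Risk.LOW:
            self._execute(action, payload)
            return "executed"
        ticket = f"approval-{next(self._ids)}"
        self.pending[ticket] = (action, payload)
        return ticket

    def approve(self, ticket: str, approver: str) -> None:
        # Raises KeyError if the ticket is unknown or already handled.
        action, payload = self.pending.pop(ticket)
        self.approvals.append((ticket, approver))
        self._execute(action, payload)
```

Defaulting unknown actions to the high-risk path is a deliberate choice in this sketch: a newly added tool waits for a person until someone has explicitly classified it.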

Change control for systems that change themselves

Regulated firms already have change control for software: review, testing, sign-off, and a record of what shipped and when. Agents stretch that discipline because the thing that governs their behaviour is not only code. It is also the prompts, the tools they can call, the model version, and the policies that decide what they may do unattended.

We put all of those under the same control as code. A prompt change goes through review. A new tool an agent can call is a reviewed addition, not a quiet edit. A model upgrade is a versioned change that is tested before it reaches production, because a newer model can shift behaviour even when nothing else moved. The point is simple: if a change can alter what the agent does, it gets the same scrutiny as a change to the code, with the same record.
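One way to enforce this, sketched below in Python under assumptions of our own choosing, is to fingerprint the whole behavioural configuration (prompt, tools, model version, unattended-action policy) and refuse to start with a fingerprint that has not been through review. The configuration keys and the `require_approved` helper are hypothetical.

```python
import hashlib
import json


class UnapprovedConfig(RuntimeError):
    """Raised when the running configuration has not been reviewed."""


def config_fingerprint(config: dict) -> str:
    # Canonical JSON so the same configuration always hashes the same way.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def require_approved(config: dict, approved: set) -> str:
    fingerprint = config_fingerprint(config)
    if fingerprint not in approved:
        raise UnapprovedConfig(f"configuration {fingerprint[:12]} not approved")
    return fingerprint


# Hypothetical configuration that has been through review.
reviewed = {
    "prompt": "You triage support tickets.",
    "tools": ["tickets.read", "tickets.update"],
    "model": "example-model",
    "model_version": "2024-01",
    "unattended_actions": ["add_internal_note"],
}
approved = {config_fingerprint(reviewed)}
```

With this in place, a model upgrade or a one-word prompt edit produces a different fingerprint, so `require_approved` fails until the new configuration has been reviewed and added to the approved set.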

Scrum Master Agents, which we run inside a regulated insurtech, sit alongside human engineering teams and manage delivery workflows. Treating their configuration as controlled change is what lets them operate next to people in a regulated setting rather than as an experiment off to one side.

Vendor diligence: you own what you depend on

Almost every agent depends on a model provider, and that provider is a third party under the firm's outsourcing and operational-resilience obligations. Choosing a model is a procurement decision with diligence attached, not a default in a config file.

When we select a provider we look at where data is processed, what the provider may do with prompts and outputs, the data-residency and retention terms, and how the firm would carry on if that provider had an outage or changed its terms. We design so a model can be swapped without rewriting the system: a shared provider layer, validated outputs that do not depend on one model's quirks, and per-call cost and usage records so the firm can see what it is spending before the invoice arrives. That is as much a resilience requirement as a commercial one.
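A shared provider layer with per-call usage records might look something like the Python sketch below. Everything in it is assumed for illustration: the `Provider` protocol, the `StubProvider` stand-in (which counts whitespace-separated words as tokens instead of calling a real SDK), the flat per-1k-token price, and the simple ordered fallback on `ConnectionError`.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Completion:
    text: str
    input_tokens: int
    output_tokens: int


class Provider(Protocol):
    name: str
    price_per_1k_tokens: float

    def complete(self, prompt: str) -> Completion: ...


class StubProvider:
    """Stand-in for a real provider SDK; counts words as tokens."""

    def __init__(self, name, price_per_1k_tokens, reply="ok", available=True):
        self.name = name
        self.price_per_1k_tokens = price_per_1k_tokens
        self.reply = reply
        self.available = available

    def complete(self, prompt: str) -> Completion:
        if not self.available:
            raise ConnectionError(f"{self.name} is unavailable")
        return Completion(self.reply, len(prompt.split()), len(self.reply.split()))


class ModelClient:
    """Calls providers in order and records usage for every successful call."""

    def __init__(self, providers: list) -> None:
        self._providers = providers
        self.usage = []  # one dict per call: provider, tokens, cost

    def complete(self, prompt: str) -> str:
        last_error = None
        for provider in self._providers:
            try:
                result = provider.complete(prompt)
            except ConnectionError as exc:
                last_error = exc  # try the next provider in the list
                continue
            tokens = result.input_tokens + result.output_tokens
            self.usage.append({
                "provider": provider.name,
                "tokens": tokens,
                "cost": tokens / 1000 * provider.price_per_1k_tokens,
            })
            return result.text
        raise RuntimeError("no provider available") from last_error
```

Because the rest of the system only sees `ModelClient.complete`, swapping or reordering providers is a configuration change, and summing `usage` gives a running view of spend per provider.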

How we approach it

Our starting point is that the controls come first and the agent is built to fit them. We agree which actions are in scope, what each one is allowed to do unattended, and where a person has to sign off. We build the record before we build the behaviour, so there is never an action the firm cannot account for. We validate every output, version every change, and keep the model as a replaceable dependency rather than a foundation.

An agent that demos answers one question: can it do the task. An agent that ships in a regulated firm answers a harder one: can you prove, control, and stand behind everything it did.

None of this makes agents slower to build in any way that matters. It makes them the kind of system a regulated firm can actually run, because the audit trail, the oversight, and the change control are part of the design rather than something bolted on when a review comes round. That is the difference between a clever prototype and software you can put in front of an auditor.
