Most AI agents demo beautifully and fail in production. The pitch looks clean: connect a capable model, hand it a few tools, watch it book meetings or draft proposals. Then it runs for an hour without supervision, loses the thread, acts on information that was true yesterday, and you spend more time checking its work than the work would have taken you.
The instinct at that point is to blame the model and reach for a bigger one. That is almost always the wrong lever. The thing determining whether an agent is reliable is not the model. It is everything wrapped around it.
That “everything” finally has a name. It is called the harness, and the discipline of building it is harness engineering. We have been doing it since we wrote the first line of Axia.
What harness engineering actually is
Harness engineering is the discipline of designing the system that surrounds an AI model so it keeps doing the right thing over time, rather than doing the right thing once. The shorthand that has taken hold is simple: Agent = Model + Harness. The model reasons. The harness decides what the model sees, which tools it can reach, how it checks its own work, and when it must stop.
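As a minimal sketch of that split, assuming a model that can be called as a plain function and returns a proposed step, the harness below owns the four responsibilities listed above. Every name here (`Harness`, `build_context`, `check`, `max_steps`) is hypothetical and chosen for illustration; it is not Axia's interface.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical type: the model takes a context string and returns a proposed
# step such as {"action": "lookup", "input": "acme"} or {"action": "done"}.
Model = Callable[[str], dict]


@dataclass
class Harness:
    model: Model
    tools: dict[str, Callable[[str], str]]          # the only tools the model can reach
    build_context: Callable[[str, list[str]], str]  # decides what the model sees
    check: Callable[[str], bool]                    # verifies each tool result
    max_steps: int = 5                              # decides when the run must stop

    def run(self, task: str) -> list[str]:
        log: list[str] = []
        for _ in range(self.max_steps):
            step = self.model(self.build_context(task, log))
            if step["action"] == "done":
                break
            tool = self.tools.get(step["action"])
            if tool is None:
                # The model asked for a tool outside its allowlist.
                log.append(f"refused: {step['action']}")
                continue
            result = tool(step["input"])
            log.append(result if self.check(result) else f"rejected: {result}")
        return log
```

The model never calls a tool directly. It can only propose a step, and the harness decides whether that step runs, whether its result is accepted, and when the loop ends.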
The term comes from Mitchell Hashimoto, who co-founded HashiCorp and co-created Terraform. In a February 2026 post documenting his AI adoption journey, he described a habit: every time an agent made a mistake, he engineered a permanent fix into its environment so it could not make that specific mistake the same way again. Within weeks, OpenAI and Anthropic published their own pieces, and Birgitta Böckeler formalised it into a working vocabulary on Martin Fowler’s site. (Hashimoto’s original post, Fowler’s taxonomy.)
It is worth being precise about where this sits:
- Prompt engineering improves a single answer.
- Context engineering manages what the model sees in one turn.
- Harness engineering governs whether an agent can run for hours, across many decisions, without you watching.
These are not competing schools. They are a progression. Most teams arrive at the third only when the first two stop being enough.
The model was never the moat
Here is the finding that should reset how you think about buying AI. Across several 2026 studies, the same underlying model produced sharply different results depending only on the harness around it. In a planning-agent study, researchers noted that harness design alone can shift end-to-end performance on a fixed model by as much as a factor of six. Treat that as a ceiling, not a typical result.
The controlled numbers are still striking, and they come from separate studies measuring different things, so it is worth keeping them distinct. In one benchmark, the same model moved from roughly 38% to 62% accuracy purely on harness design. In a separate result, a new self-improvement method lifted a coding benchmark from 59% to 78% in a single round, with no change to the model at all. (Heavy-lifting study, self-improvement result.)
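To keep those two results in proportion, it can help to separate the percentage-point gain from the relative gain. The short calculation below uses only the rounded figures quoted above; the helper name `lift` is ours, and the inputs are approximate.

```python
def lift(before: float, after: float) -> tuple[float, float]:
    """Return (percentage-point gain, ratio of after to before)."""
    return after - before, after / before


# Figures as quoted in the text; both are rounded.
harness_points, harness_ratio = lift(38, 62)  # 24 points, ratio about 1.63
coding_points, coding_ratio = lift(59, 78)    # 19 points, ratio about 1.32
```

On these numbers the controlled gains are roughly 1.6 times and 1.3 times the starting score: large for a change that leaves the model untouched, and well below the six-fold ceiling mentioned earlier.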
The practical implication is direct. If two systems use the same frontier model, and one is reliable and one is not, the model did not make the difference. The harness did. The model is increasingly a commodity. The defensible work is in the system around it.
We were already living it: how Axia solves the harness problem
We did not adopt harness engineering after reading about it. We arrived at it the hard way, through production failures, building Axia, our autonomous business operating system for SME B2B sales. Our build methodology, Scaffold, has one rule at its centre: humans in at the decisions, AI in at the execution. That is a harness principle stated as a working practice.
Here is how we solve the four problems the discipline names.
Context rot. A large context window does not help if the signal is buried under old logs and noise. We never hand the model the full history. Axia uses the CRM as a context spine: for any given task, the agent retrieves only what that task needs, not everything that ever happened. The window stays clean because we curate it, not because the model is clever about ignoring junk.
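A minimal sketch of task-scoped retrieval, assuming CRM history is available as a list of typed events. The `CrmEvent` shape, the `TASK_NEEDS` mapping, and the `limit` parameter are all hypothetical; they illustrate the idea and do not describe Axia's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrmEvent:
    account_id: str
    kind: str   # e.g. "email", "call", "note"
    day: int    # days since an arbitrary start, to keep the sketch simple
    text: str


# Hypothetical mapping from task type to the record kinds that task needs.
TASK_NEEDS = {
    "draft_follow_up": {"email", "call"},
    "schedule_meeting": {"call", "calendar"},
}


def build_context(events, account_id, task, limit=3):
    """Return only the most recent events relevant to this task and account."""
    wanted = TASK_NEEDS[task]
    relevant = [
        e for e in events
        if e.account_id == account_id and e.kind in wanted
    ]
    relevant.sort(key=lambda e: e.day, reverse=True)
    return relevant[:limit]
```

The filtering happens before the model is called, so the size of the full history has no effect on what reaches the context window.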
Stale memory. An agent that trusts an old note will confidently apply a fix to a situation that changed yesterday. We treat memory as a hint, never as a fact. The live record is the single source of truth, and the agent verifies the current state before it acts. One of the rules baked into our system is simply: do not assume. It earned its place after the system assumed and got it wrong.
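One way to express "memory as a hint" in code, assuming the live record can be read as a simple key-value mapping. `MemoryNote` and `verified_value` are hypothetical names used only for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryNote:
    key: str
    value: str  # what the agent remembered, possibly out of date


def verified_value(note: MemoryNote, live_record: dict) -> tuple[str, bool]:
    """Read the live record and report whether the remembered note was stale.

    The note only tells us which field to look at. The returned value always
    comes from the live record, never from memory.
    """
    if note.key not in live_record:
        raise KeyError(f"live record has no field {note.key!r}; refusing to assume")
    current = live_record[note.key]
    return current, current != note.value
```

If the remembered stage is "negotiation" and the live record says "closed_lost", the function returns `("closed_lost", True)`, and the caller can drop the stale plan instead of acting on it.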
Skill routing. Give an agent more tools and you create a new problem: knowing which one to use, and when. We solve this with a library of named, versioned skills, each with explicit routing triggers, and a two-phase pattern for anything consequential. The agent prepares an action, then confirms it, before it touches the outside world. It does not freelance.
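A simplified sketch of trigger-based routing with a prepare-then-confirm step. It assumes triggers are plain substrings of the request, which is almost certainly cruder than a production router; the `Skill`, `PreparedAction`, `route`, `prepare`, and `execute` names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Skill:
    name: str
    version: str
    triggers: tuple[str, ...]      # explicit routing triggers (assumed substrings)
    prepare: Callable[[str], str]  # phase 1: build the action, no side effects


@dataclass
class PreparedAction:
    skill: str
    payload: str
    confirmed: bool = False


def route(request: str, skills: list[Skill]) -> Optional[Skill]:
    """Return the first skill with a trigger in the request, else None."""
    text = request.lower()
    for skill in skills:
        if any(trigger in text for trigger in skill.triggers):
            return skill
    return None


def prepare(request: str, skills: list[Skill]) -> Optional[PreparedAction]:
    """Phase 1: pick a skill and build the action without touching anything external."""
    skill = route(request, skills)
    if skill is None:
        return None  # no matching skill: the agent does not improvise
    return PreparedAction(f"{skill.name}@{skill.version}", skill.prepare(request))


def execute(action: PreparedAction, send: Callable[[str], None]) -> bool:
    """Phase 2: reach the outside world only once the action is confirmed."""
    if not action.confirmed:
        return False
    send(action.payload)
    return True
```

Recording the skill as `name@version` on the prepared action means a later review can tell exactly which version of which skill produced it.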
Verification and governance. The harness has to stop an agent acting blindly. Ours sorts every action into one of three lanes: the agent executes it, the agent drafts it for review, or the agent only flags a reminder. Anything that can be sent, booked, or changed passes through a confirmation gate first. We also run the system against itself, deliberately attacking it to find failure modes before a client ever does.
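The three lanes and the confirmation gate can be sketched as a small policy function. The action names and the rule that unknown actions fall into the flag-only lane are assumptions made for this example, not a description of Axia's actual policy.

```python
from enum import Enum


class Lane(Enum):
    EXECUTE = "execute"
    DRAFT = "draft_for_review"
    FLAG = "flag_reminder"


# Hypothetical policy tables. Anything that sends, books, or changes something
# outside the system is treated as having an external effect.
EXTERNAL_EFFECTS = {"send_email", "book_meeting", "update_record"}
INTERNAL_SAFE = {"summarise_thread", "look_up_contact"}


def classify(action: str) -> Lane:
    if action in INTERNAL_SAFE:
        return Lane.EXECUTE
    if action in EXTERNAL_EFFECTS:
        return Lane.DRAFT
    return Lane.FLAG  # assumption: unknown actions are only ever flagged


def gate(action: str, approved_by_human: bool = False) -> str:
    """Apply the lane policy; external effects wait for explicit approval."""
    lane = classify(action)
    if lane is Lane.EXECUTE:
        return "executed"
    if lane is Lane.DRAFT:
        return "executed" if approved_by_human else "awaiting_review"
    return "flagged"
```

Defaulting unknown actions to the most restrictive lane is the design choice that matters here: a new tool has to be classified deliberately before the agent can use it unattended.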
Behind all four sits the principle Hashimoto named and we had already written into our build notes: when something breaks, you do not just hope the model does better next time. You change the environment so that entire class of mistake stops happening. We call these constraints scars, not designs. Every one of them is a production failure turned into a permanent rule.
The skill library is the harness, made durable
The clearest proof that this is real, not retrofitted, is the skill library itself. Recent work already names the SKILL.md file as a canonical harness artifact: a structured, machine-readable instruction set that lets a general model behave reliably. We run a working library of exactly these — persona skills that hold a consistent voice across content, and functional skills that govern outreach, proposal building, research verification, and scheduling.
Each skill is a capability or a correction encoded into the environment once, so it does not have to be re-explained every time. That is the whole game. The harness is where institutional knowledge lives, so the model does not have to carry it.
Where this is heading: harnesses that improve themselves
The next move in this field is automating the loop we currently run by hand. A June 2026 method called Retrospective Harness Optimization lets an agent improve its own harness by looking back at past runs, diagnosing what went wrong, and testing fixes, without a labelled answer key. (RHO paper.)
For a one-operator shop building at small-team velocity, that is the long-term prize: a system that tunes itself from its own production logs instead of waiting for us to edit it after each failure. It is unproven in production today, so we are watching it rather than betting on it. But it is the automated version of what Scaffold already does manually, which tells us the direction is right.
What this means if you are buying AI
If you are evaluating any AI system that runs without constant supervision — for sales, for operations, for anything that acts on your behalf — change the question you ask. Do not ask which model it uses. Ask about the harness:
- What happens when it is wrong?
- How does it know it is wrong?
- What stops it making the same mistake twice?
- What does it verify before it acts, and what does it never act on without you?
A system that cannot answer those is a demo. A system that can is a product. Axia is built to answer all four, because we learned what each question costs before we had a name for the discipline.
If you want to see what that looks like running live rather than described on a page, that is a conversation worth having.
V8 builds AI operating systems for sales and marketing — and runs them. Scaffold is how that gets built around your operations.