What Is Harness Engineering — Designing the Reins for AI Agents

In Part 1 of this series, I talked about the decline of prompt engineering. With CLI-based tools on the scene, the value of manually crafting elaborate prompts was fading.

But as 2026 unfolded, I realized that what replaced prompt engineering wasn't simply "better tools." Prompt engineering gave way to context engineering, and now context engineering is giving way to an entirely new paradigm: harness engineering.

In this post, I'll break down what harness engineering is, why it matters right now, and what its key components look like.


A Harness for a Horse, a Harness for an Agent

A harness originally refers to the tack fitted onto a horse. Bridle, saddle, stirrups — equipment designed not to suppress the horse's power, but to channel it in the right direction.

In AI, the term means exactly the same thing. A harness is the entire external system that controls and directs an AI agent's powerful capabilities toward the right outcomes.

Written as an equation:

Agent = Model + Harness

Everything that isn't the model is the harness. System prompts, tool definitions, sandbox environments, orchestration logic, feedback loops, memory management, middleware hooks. The practice of designing all of this around an agent is harness engineering.


Why the Model Alone Isn't Enough

LLMs have inherent limitations:

  • They can't maintain state across sessions
  • They can't execute code directly
  • They can't access real-time information
  • They can't reliably verify their own output

Even a "conversation" with ChatGPT is really the product of a basic harness: something has to track previous messages and replay them through a while loop, because the model itself remembers nothing between turns.
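That primitive harness can be sketched in a few lines of Python. `call_model` here is a hypothetical stub standing in for any chat-completion API; the point is that the "memory" lives entirely in the loop, not in the model.

```python
# A primitive harness: the model itself is stateless, so the loop
# re-sends the accumulated history on every turn.

def call_model(messages):
    # Stub for illustration; a real harness would call an LLM API here.
    return f"(reply to: {messages[-1]['content']})"

def chat_harness(user_inputs):
    history = []  # the only "memory" the model ever sees
    for text in user_inputs:
        history.append({"role": "user", "content": text})
        reply = call_model(history)
        history.append({"role": "assistant", "content": reply})
    return history

transcript = chat_harness(["hello", "what did I just say?"])
print(len(transcript))  # 4 messages: two turns, each user + assistant
```

Everything beyond this loop — tools, sandboxes, persistence — is the same move repeated at larger scale.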

Anthropic experienced these limitations firsthand. When they gave Claude Opus 4.5 a high-level instruction like "build a clone of claude.ai," the same failure patterns kept appearing:

  1. The context window filled up, leaving the implementation incomplete
  2. As the context limit approached, "context anxiety" set in — the model would try to wrap up prematurely
  3. When asked to evaluate its own work, it would confidently praise even terrible results

This isn't a problem with the model's intelligence. Take away handoff documents, progress boards, and a testing environment, and even the most talented engineer would hit the same walls. Agents are no different. They need an environment.


Prompt → Context → Harness: Three Layers

These three concepts don't replace each other. Each one encompasses and extends the last.

Layer               | Core Question                                 | Design Target
Prompt Engineering  | "What should I ask?"                          | The instructions sent to the LLM
Context Engineering | "What should I show?"                         | All tokens provided to the model at inference time
Harness Engineering | "How should I design the entire environment?" | Constraints, feedback, and operational systems external to the agent

Think of it this way. A prompt is the voice command: "Turn right." Context is the map and road signs. The harness is the bridle, saddle, fences, and road maintenance all combined — the entire environment.

2022–2024 was the golden age of prompt engineering. How you structured a question determined the quality of the answer.

In mid-2025, Andrej Karpathy used the phrase "context engineering," and the paradigm shifted. RAG, MCP, memory systems — system-level design that delivers the right information to the model at the right time became the focus.

Then in February 2026, Terraform creator Mitchell Hashimoto explicitly used the term "harness engineering" in a blog post. Around the same time, OpenAI published an experiment report where they completed production software using only agents — no human typing code. The secret wasn't model performance. It was meticulous environment design.

The era of harness engineering had officially begun.


The Five Core Components of a Harness

Synthesizing Anthropic's official engineering blog and various real-world case studies, a harness consists of five core components.

1. Filesystem and Persistent Storage

Giving the agent a workspace. This is the most fundamental harness component.

LLMs forget everything when a session ends. But the filesystem persists. Anthropic leveraged this simple fact to achieve continuity across sessions.

The key pattern is the claude-progress.txt file. Each session records what it accomplished, and the next session reads this file along with the Git log to understand the current state. It's codifying what an effective software engineer does every day — check where you left off yesterday and figure out what to do today.

Combined with Git, it becomes even more powerful. Version control, experimental branches, rollback to a previous commit when things go wrong. Multiple agents can even collaborate in the same repository.
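A minimal sketch of the progress-file pattern, assuming nothing beyond the file name the post mentions (`claude-progress.txt`); the read/write helpers are illustrative, not Anthropic's actual implementation:

```python
# Progress-file pattern: each session appends what it accomplished;
# the next session reads the file back to rebuild working context.
from pathlib import Path

PROGRESS = Path("claude-progress.txt")

def load_progress():
    """What the agent reads at session start (alongside `git log`)."""
    return PROGRESS.read_text() if PROGRESS.exists() else "(fresh start)"

def record_progress(note):
    """What the agent appends at session end."""
    with PROGRESS.open("a") as f:
        f.write(note + "\n")

record_progress("session 1: scaffolded the API routes")
record_progress("session 2: added auth middleware")
print(load_progress())
```

The same file doubles as a handoff document when multiple agents share the repository.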

The filesystem is the agent's long-term memory.

2. Code Execution and Sandboxing

The agent's hands and feet. Tool use fills this role.

You can't pre-build tools for every possible action. That's why general-purpose tools — bash execution, file read/write, code execution — are essential. The agent needs to be able to generate tools on the fly as code.

But blindly executing agent-generated code is dangerous. That's why sandboxing is necessary. Run only permitted commands in an isolated environment, restrict network access, spin up and tear down as needed.

In Anthropic's latest architecture (Managed Agents), this is called the "Brain and Hands separation." The Brain (LLM + control logic) and Hands (sandboxed execution environment) are physically separated. The Hands are stateless, have no access to long-lived credentials, and are created only when needed.

The security benefits of this separation are clear. Previously, untrusted code ran in the same container that held credentials. A single prompt injection could read environment variables. With separation, that attack surface vanishes.

3. Context Management and Preventing Context Rot

"Context Rot" — the phenomenon where the model's reasoning ability degrades as the context window fills up.

Research from Chroma shows that performance degrades as context length increases, and the degradation is worse when semantic similarity between the query and relevant information is low. A bigger context window isn't inherently better. It's just a bigger haystack — it doesn't improve the needle-finding ability.

Harnesses address this problem through several strategies.

Compaction: When context limits are reached, summarize and trim existing content to keep work going. In Claude Code, when token usage hits 98%, earlier history is automatically summarized.

Tool call offloading: Keep only the head and tail of large tool outputs, offloading the full content to a file that's referenced only when needed.

Progressive disclosure via Skills: Instead of loading all instructions into the context upfront, load relevant instructions and tools only when the agent actually needs them.

The "silence on success, noise on failure" principle: A practical lesson from the HumanLayer team. Early on, they ran the entire test suite every time, flooding context with 4,000 lines of passing results. Switching to reporting only failed tests dramatically improved context efficiency.
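The offloading strategy above can be sketched as follows; the `keep` threshold and the file-naming scheme are illustrative choices, not any tool's actual behavior:

```python
# Tool-call offloading sketch: keep only the head and tail of a large
# tool output in context, spill the full text to a file the agent can
# read later if it needs the details.
from pathlib import Path

def offload(output: str, name: str, keep: int = 5) -> str:
    lines = output.splitlines()
    if len(lines) <= 2 * keep:
        return output  # small enough to keep inline
    path = Path(f"{name}.full.log")
    path.write_text(output)
    omitted = len(lines) - 2 * keep
    return "\n".join(
        lines[:keep]
        + [f"... {omitted} lines omitted, full output in {path} ..."]
        + lines[-keep:]
    )

big = "\n".join(f"line {i}" for i in range(1000))
summary = offload(big, "test-run")
print(len(summary.splitlines()))  # 11 lines instead of 1000
```

The same shape works for the "silence on success" principle: filter before it ever reaches the context, not after.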

4. Sub-Agents and Context Isolation

The real value of sub-agents isn't role-based division like "frontend team" and "backend team." It's context isolation.

Sub-agents absorb all the intermediate noise of investigation, exploration, and implementation, then deliver only the final result to the parent agent in a concise summary. They act as a "context firewall."

This is also advantageous for cost control. You can use an expensive model (Opus) for the parent session and cheaper models (Sonnet, Haiku) for sub-agents. Tasks with narrow, well-defined scope work fine with less powerful models.

What you need isn't a longer context — it's better context isolation.
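In code, the firewall idea looks roughly like this; `run_subagent` is a stand-in for spawning a cheaper model on a narrow task, with all intermediate noise confined to its own scratch context:

```python
# Context-firewall sketch: the sub-agent does noisy exploration in a
# private scratch context and hands the parent only a short summary.

def run_subagent(task, corpus):
    scratch = []  # sub-agent's private context; never shown to the parent
    total = 0
    for name, text in corpus.items():
        scratch.append(f"read {name}: {len(text)} chars")  # noise stays here
        total += len(text)
    return f"{task}: {len(corpus)} files, {total} chars total"  # summary only

corpus = {"a.py": "x" * 5000, "b.py": "y" * 8000}
parent_context = [run_subagent("audit imports", corpus)]
print(parent_context[0])
```

The parent's context grows by one line, no matter how much the sub-agent read.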

5. Hooks and Back-Pressure Mechanisms

Hooks are user-defined scripts that run automatically at specific points in the agent's lifecycle. Similar to Git hooks, but more flexible.

Claude Code's hook system supports 21 lifecycle events. PreToolUse, PostToolUse, SessionStart, Stop, and more. For example:

  • Auto-run type checks and formatters after file modifications
  • Block the --no-verify flag on git commit
  • Automatically block modifications to config files
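As a concrete illustration, the `--no-verify` rule could be a small PreToolUse hook script. The event shape (`tool_name`, `tool_input`) and the exit-code-2 convention follow Claude Code's documented hook interface, but treat the details here as a hedged sketch rather than drop-in configuration:

```python
# Sketch of a PreToolUse hook that blocks `git commit --no-verify`.
# Assumption: the hook is registered as a command script and receives
# the tool event as JSON; returning 2 blocks the call, with stderr
# fed back to the agent as the reason.
import sys

def check(event: dict) -> int:
    if event.get("tool_name") != "Bash":
        return 0  # only inspect shell commands
    command = event.get("tool_input", {}).get("command", "")
    if "--no-verify" in command:
        print("blocked: commit hooks must not be skipped", file=sys.stderr)
        return 2  # nonzero verdict blocks the tool call
    return 0

verdict = check({"tool_name": "Bash",
                 "tool_input": {"command": "git commit --no-verify -m 'wip'"}})
print(verdict)  # 2
```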

Back-pressure is a mechanism that makes agents verify their own work. Type checks, test execution, coverage reports, browser automation tests.

The HumanLayer team called this "the highest-leverage investment." An agent's task success rate correlates strongly with its self-verification capability.

When Anthropic gave agents access to Puppeteer MCP (browser automation), the agents found and fixed bugs that would have been impossible to catch from code alone. The key was making agents test like a real user before marking a feature "done."


Experiments That Proved the Power of Harness

LangChain Terminal Bench 2.0

The model stayed the same; only the harness was improved:

  • Score: 52.8 → 66.5 (13.7-point increase)
  • Ranking: ~30th → top 5 (25 places up)

Not a single byte of model weights was changed. Only the system prompt, tools, middleware, and feedback loops were adjusted.

Hashline File Editing Experiment

An experiment by Can Boluk that improved only the file editing format. By appending a hash to each line so the model could reference positions:

  • Grok Code Fast 1: 6.7% → 68.3% (61.6-point increase)
  • All models average: ~20% reduction in output tokens

No model weight changes — just one tool interface change in the harness.
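The core of the idea can be sketched in a few lines; Boluk's actual format may differ, so treat the tag scheme here as illustrative:

```python
# Hashline sketch: prefix each line with a short content hash so the
# model can address edit locations by stable tags instead of fragile
# line numbers or exact-match search strings.
import hashlib

def hashline(text: str) -> str:
    tagged = []
    for line in text.splitlines():
        tag = hashlib.sha1(line.encode()).hexdigest()[:6]
        tagged.append(f"{tag}|{line}")
    return "\n".join(tagged)

tagged = hashline("def add(a, b):\n    return a + b")
print(tagged)
```

The model then emits edits like "replace the line tagged `a3f9c1`", which the harness resolves unambiguously.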

Anthropic's 3-Agent Harness

A one-line prompt — "build a retro game maker" — was tested two ways:

  • Single agent: 20 minutes, $9 — flashy interface but broken game mechanics
  • 3-agent harness: 6 hours, $200 — 16 features across 10 sprints, a polished result

The $9 output looks fine as a demo, but falls apart the moment you actually use it. The $200 output included AI-generated sprites and functional gameplay — close to production quality.


Co-Evolution of Model and Harness, and the Paradox

Here's a fascinating fact. Frontier coding models are post-trained inside their own harnesses. Claude is trained in the Claude Code harness; Codex models are trained in the Codex harness.

This produces a paradoxical result. On Terminal Bench 2.0, Claude Opus 4.6 ranked 33rd within Claude Code (its training harness), but climbed to the top 5 when used with a different harness.

This means a model can become "overfitted" to its default harness. Using the default harness as-is may not be optimal — customizing the harness to match the task characteristics can yield meaningful performance gains.


A Fundamental Shift in the Engineer's Role

The implications of harness engineering are clear.

"Writing code" is becoming "designing the environment where AI writes code correctly."

Borrowing the words of the OpenAI Codex team: "The hardest challenge is now designing the environment, feedback loops, and control systems."

Chad Fowler called this the "relocation of rigor." The rigor of writing code line by line with precision is shifting to the rigor of designing systems that make agents work correctly.

Does that mean harnesses become unnecessary as models get smarter? No. Just as prompt engineering never fully disappeared, harnesses will remain: a well-designed environment, the right tools, persistent state, and verification loops make any agent more effective, regardless of the model's baseline intelligence.

The HumanLayer team's conclusion captures this best.

"The model is probably fine. It's a harness problem."

When a coding agent doesn't perform as expected, check the harness before blaming the model. The contents of the CLAUDE.md file, the connected tools, whether feedback loops exist, how efficiently context is managed. The answer is almost always there.


In the next post, I'll cover how to actually apply harness engineering — Anthropic's specific architecture patterns and implementation examples.
