Harness Engineering in Practice — How Anthropic Designs AI Agents

The previous post covered the concept and components of harness engineering. This time, it's the real thing. Drawing on the concrete architecture patterns Anthropic published in their official engineering blog — along with experimental results from the OpenAI Codex team — let's look at how harnesses are actually applied in practice.


The Basic Structure of an Agent Loop: The Inner Loop

At the heart of every AI agent sits an agent loop. In Claude Code, it's called queryLoop. At its core, it's a while(true) loop.

while (true) {
    // 1. Prepare context (plan-mode attachments, task reminders)
    // 2. Call the model (streaming API call)
    // 3. Execute tools (detect tool call → validate schema → check permissions → execute)
    // 4. Decide whether to continue (does the model have more to do?)
}

Each iteration is one "think, act, observe" cycle. The model thinks, invokes a tool, observes the result, and thinks again.

The tool execution flow looks like this:

  1. The model generates a tool call in its output
  2. The harness detects the tool call and halts text generation
  3. Input is validated against the schema (Zod-validated JSON Schema)
  4. The permission pipeline runs (general rules → tool-specific checks → auto-classifier → user approval fallback)
  5. The tool handler executes the operation
  6. Results are injected back into the model's context
  7. The loop continues

This is the Inner Loop. Most simple tasks complete entirely within it.
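The cycle above can be sketched in a few lines. This is a minimal illustration of the think-act-observe loop, not Claude Code's actual queryLoop; `call_model`, the message dicts, and the tool registry are stand-in names.

```python
# Minimal sketch of an agent inner loop. Function and message shapes are
# illustrative stand-ins, not Claude Code's internals.

def run_inner_loop(prompt, call_model, tools, max_turns=20):
    """One 'think, act, observe' cycle per iteration."""
    context = [{"role": "user", "content": prompt}]
    for _ in range(max_turns):
        # 1-2. Prepare context and call the model
        reply = call_model(context)
        context.append({"role": "assistant", "content": reply})
        # 3. If the model requested a tool, execute it
        call = reply.get("tool_call")
        if call is None:
            return reply["text"]        # 4. Nothing left to do: stop
        handler = tools[call["name"]]
        result = handler(**call["args"])
        # Inject the observation back into the model's context
        context.append({"role": "tool", "content": str(result)})
    raise RuntimeError("max turns exceeded")
```

In a real harness, the step between detecting the tool call and running the handler is where schema validation and the permission pipeline sit.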

But for complex tasks — especially long-running ones that span hours — it's not enough. The context window fills up, and the model's reasoning quality degrades.


The Outer Loop: The Ralph Loop Pattern

To handle long-running tasks, Anthropic developed the Ralph Loop pattern.

The idea is simple. When the model tries to wrap up, a hook intercepts it and re-injects the original prompt into a clean context window. It forces the agent to keep working.

Why does this work? Because the filesystem persists across resets. Each iteration starts with a clean context, but reads the previous iteration's state from the filesystem.

[Session 1: Initialization]
  ↓ Set up environment, create structured artifacts
  ↓ Write claude-progress.txt
  ↓ Generate feature_list.json
  ↓ Git commit

[Sessions 2–N: Coding]
  ↓ Read claude-progress.txt + git log
  ↓ Pick highest-priority incomplete feature from feature_list.json
  ↓ Implement one feature
  ↓ Verify with tests
  ↓ Git commit + update progress file
  ↓ Context reset → next session begins

The key insight is choosing a full reset over compaction. Anthropic discovered that compaction — summarizing the existing context to shrink it — causes "context anxiety." When the model senses it's approaching the context limit, it rushes to finish. Starting from a clean slate while maintaining continuity through structured handoff artifacts turned out to be far more effective.
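The session diagram above can be sketched as an outer loop. This is a hedged illustration under assumed file names from the post; `run_session` is a hypothetical stand-in for launching a fresh agent session with a clean context.

```python
# Sketch of a Ralph-style outer loop: full context reset each iteration,
# with continuity carried by files on disk. run_session stands in for
# spawning a fresh agent session; file names follow the post's examples.
import json
from pathlib import Path

def ralph_loop(workdir, run_session, max_iters=100):
    progress = Path(workdir) / "claude-progress.txt"
    features = Path(workdir) / "feature_list.json"
    for i in range(max_iters):
        backlog = json.loads(features.read_text())
        todo = [f for f in backlog if not f["passes"]]
        if not todo:
            return i                    # all features pass: stop re-injecting
        # Fresh session, same original goal, state read from disk
        run_session(
            goal="Implement the highest-priority incomplete feature.",
            handoff=progress.read_text() if progress.exists() else "",
            feature=todo[0],
        )
    return max_iters
```

Note that nothing is carried in memory between iterations; every bit of continuity flows through `claude-progress.txt`, `feature_list.json`, and git.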


State Management: The Filesystem as a Bridge

The core challenge of long-running agents is this: you're working across discrete sessions, and each session has no memory of what came before.

Anthropic's solution is multi-layered.

claude-progress.txt

A human-readable, cross-session handoff log. It records what's done, what's broken, and what's next.

## Session 3 (2026-04-14)
- Completed: User authentication flow (login, signup, password reset)
- Fixed: Session token not persisting across page reloads
- Known issue: OAuth redirect URL not configured for production
- Next: Implement chat message history with infinite scroll

feature_list.json

A structured feature backlog. Anthropic deliberately chose JSON over Markdown for a reason: models are less likely to improperly modify or delete JSON entries than Markdown ones.

{
    "category": "functional",
    "description": "New chat button creates a fresh conversation",
    "steps": [
      "Navigate to main interface",
      "Click the 'New Chat' button",
      "Verify a new conversation is created",
      "Check that chat area shows welcome state",
      "Verify conversation appears in sidebar"
    ],
    "passes": false
}

The initialization agent generates over 200 features like this. The coding agent receives a firm directive: "Deleting or modifying tests is not allowed. Doing so can lead to missing features or bugs."
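The post describes that directive as a prompt instruction; a harness could also back it mechanically. The guard below is a hypothetical sketch (not Anthropic's implementation) that rejects any proposed write to `feature_list.json` which deletes entries or flips a passing feature back to failing.

```python
# Illustrative guard for feature_list.json edits: reject writes that drop
# backlog entries or revert a passing feature. A hypothetical hook, not
# Anthropic's actual mechanism.
import json

def validate_backlog_edit(old_text, new_text):
    old = {f["description"]: f["passes"] for f in json.loads(old_text)}
    new = {f["description"]: f["passes"] for f in json.loads(new_text)}
    missing = set(old) - set(new)
    if missing:
        raise ValueError(f"features deleted: {sorted(missing)}")
    regressed = [d for d in old if old[d] and not new.get(d)]
    if regressed:
        raise ValueError(f"passing features reverted: {regressed}")
```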

The Session Initialization Routine

Each coding session follows this sequence:

  1. Run pwd — confirm the correct working directory (simple, but prevents costly confusion)
  2. Read claude-progress.txt and git log — the handoff notes
  3. Read feature_list.json — identify the highest-priority incomplete feature
  4. Run init.sh — start the dev server
  5. Run baseline E2E tests — check for undocumented bugs
  6. Only begin work after confirming app stability

This routine itself is a concrete implementation of the harness. It codifies what a human engineer does every morning.


Anthropic's 3-Agent Architecture

Anthropic's second harness paper (March 2026) introduces a 3-agent architecture inspired by GANs (Generative Adversarial Networks).

Planner Agent

Takes a short prompt of 1–4 sentences and expands it into a comprehensive product spec. Generates a detailed backlog of 16 features across 10 sprints. Focuses on scope rather than implementation details, while identifying opportunities to integrate AI capabilities.

Generator Agent

Implements features one at a time, incrementally. Uses a React + Vite + FastAPI + SQLite/PostgreSQL stack, evaluates its own work at the end of each sprint, then hands off to QA. Handles Git version control directly.

Evaluator Agent

The Discriminator equivalent from GANs. Negotiates a "contract" with the Generator for each sprint — a contract that defines implementation details and testable behaviors.

The key: it uses Playwright MCP to actually navigate and interact with pages. It doesn't read code. It evaluates the application by using it like a real user would.

The Sprint Contract Pattern

What bridges the gap between high-level specs and testable implementations is the "contract."

Sprint 3 Contract:
- Generator delivers: Chat history with infinite scroll, message timestamps
- Evaluator verifies:
  ✓ Scroll up loads older messages (batch of 20)
  ✓ Timestamps display in relative format
  ✓ New messages appear at bottom without scroll jump
  ✓ Empty state shows welcome message

This contract keeps evaluation objective. The question isn't "does it work well?" but "does it meet the agreed-upon criteria?"
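A contract like that reduces naturally to data plus a checklist walk. In the sketch below, each probe stands in for an interaction the Evaluator would actually perform (in practice, via Playwright MCP); the function names are illustrative assumptions.

```python
# Sketch of contract-driven evaluation: the Evaluator walks the agreed
# checks rather than judging "does it work well?". Each probe is a
# stand-in for a real browser interaction (e.g. via Playwright MCP).
def evaluate_contract(contract, probes):
    """Return (passed, failed) lists of check names."""
    passed, failed = [], []
    for check in contract["checks"]:
        (passed if probes[check]() else failed).append(check)
    return passed, failed
```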

Evolution: Better Models Mean Simpler Harnesses

Early work used Claude Opus 4.5. This model exhibited "context anxiety," making the sprint-based structure essential.

When Opus 4.6 shipped, long-context planning and code review capabilities improved. The result:

  • The sprint decomposition structure could be removed
  • The Evaluator switched to single-pass verification
  • Complexity decreased while performance held steady

The assumptions a harness encodes become stale as models improve. Anthropic explicitly acknowledges this. You need to periodically reassess whether each harness component is still necessary.

In the DAW (digital audio workstation) example, the updated harness delivered:

  • 3 hours 50 minutes, $124.70
  • The Generator worked consistently for over 2 hours without sprint structure
  • QA still found meaningful feature gaps across multiple rounds

Brain / Hands / Session: The Latest Architecture

In April 2026, Anthropic unveiled the "Managed Agents" architecture — a separation-of-concerns design addressing fundamental problems with the monolithic approach.

The problem with the old way: when the LLM controller, tools, execution environment, and session state all live in a single container, scaling is hard, the container is a single point of failure, and the security risks are significant.

Brain

The LLM and the harness/controller logic wrapping it. The cognitive core. It thinks, plans, and decides which tools to call.

Hands

Sandboxed, ephemeral execution environments. Bash, Python REPL, and so on. Stateless, with no access to long-lived credentials, spun up only when Brain makes a tool call.

Session (Memory)

A persistent, append-only event log. Every thought, tool call, and observation is recorded. It exists outside Claude's context window.

Session API:
  emitEvent(id, event)  → Record event to session
  getSession(id)        → Retrieve event log
  getEvents()           → Event stream
  rewind / slice        → Position-based access

Core principle: statelessness. Any Harness instance can pick up any Session and continue from the last event.
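The Session API listing above can be sketched as an append-only store. Method names mirror the post's listing; the in-memory dict is an assumed stand-in for a persistent event service.

```python
# Sketch of the Session as an append-only, position-indexed event log.
# The in-memory dict stands in for a persistent store outside the harness.
class SessionStore:
    def __init__(self):
        self._logs = {}

    def emit_event(self, session_id, event):
        self._logs.setdefault(session_id, []).append(event)

    def get_session(self, session_id):
        return list(self._logs.get(session_id, []))

    def slice(self, session_id, start, end=None):
        # rewind / slice: position-based access into the log
        return self._logs.get(session_id, [])[start:end]

def wake(store, session_id):
    """Any stateless harness instance can resume from the last event."""
    events = store.get_session(session_id)
    return events[-1] if events else None
```

Because the store is the only state, the failure-recovery path described later is just: boot a new instance, call `wake`, and continue.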

Performance and Security Impact

  • p50 TTFT (Time To First Token) dropped by roughly 60%
  • p95 TTFT dropped by over 90%
  • Containers are only created on Brain's tool calls, so sessions don't wait for container boot

On the security front: previously, a single prompt injection could read environment variables (credentials) from the same container. Separation eliminates this attack surface entirely.

Failure Recovery

Since the Session lives outside the harness, there's nothing inside the harness that needs to survive.

Failure recovery process:
1. A new harness instance boots
2. Calls wake(sessionId)
3. Retrieves event log via getSession(id)
4. Resumes from the last event

The OpenAI Codex Team's Real-World Experiment: 5 Months in the Trenches

This isn't theory — it's field data. The OpenAI Codex team ran an extreme experiment over five months.

The constraint: no typing code directly. Not a single line.

The results:

  • 3 engineers built roughly 1 million lines of production application code
  • 1,500 PRs merged
  • Average of 3.5 PRs per engineer per day
  • Approximately 10x faster than manual development

Early on, productivity was low — missing environment setup, broken tool integrations, no error recovery logic. As they improved the harness, results exploded.

"Give Them a Map, Not a 1,000-Page Manual"

This was the key principle the OpenAI team discovered. Instead of granular instructions, they embedded knowledge directly into the codebase.

Their early failure — one massive AGENTS.md file:

  • Wasted the scarce resource of the context window on non-critical constraints
  • When everything is marked "important," agents ignore the guidelines and fall back to pattern matching
  • Became a graveyard of stale rules and fell apart quickly

The fix: treat AGENTS.md as a table of contents (a map) and distribute instructions across the directory structure.

docs/
├── design-docs/
│   ├── index.md
│   └── core-beliefs.md
├── exec-plans/
│   └── tech-debt-tracker.md
├── product-specs/
└── references/
    └── design-system-reference-llms.txt

Mechanical Enforcement

The element that most clearly distinguishes the OpenAI team's approach from context engineering is mechanical enforcement.

They enforce architectural constraints through custom linters and structural tests. Instead of telling agents "do it this way" via prompts, they block violations at the code level.

Domain dependency direction:
Types → Config → Repo → Service → Runtime → UI

Code that violates this direction → linter blocks it → agent fixes immediately

When a linter fails, the error message includes "how to fix it" instructions injected directly. Feedback flows automatically into the agent's context.
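A minimal sketch of such a structural lint, assuming the layer names from the post: real linters would parse imports from the AST, and the fix hint here is an invented example of injected feedback, not OpenAI's actual message.

```python
# Sketch of a dependency-direction lint. Layer names come from the post;
# the "how to fix" hint is embedded in the error so the agent sees
# actionable feedback automatically.
LAYERS = ["types", "config", "repo", "service", "runtime", "ui"]
RANK = {name: i for i, name in enumerate(LAYERS)}

def check_import(importer_layer, imported_layer):
    # A layer may only depend on layers at or below its own rank
    if RANK[imported_layer] > RANK[importer_layer]:
        raise ValueError(
            f"'{importer_layer}' may not import '{imported_layer}': dependencies "
            f"must flow {' -> '.join(LAYERS)}. Fix: move the shared code into "
            f"'{importer_layer}' or a lower layer, or invert the dependency "
            f"behind an interface."
        )
```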

Code Garbage Collection

When agents generate code at scale, bad patterns replicate and technical debt piles up. The OpenAI team didn't clean this up manually — they deployed background agents to handle it.

  • Continuously scan for code-to-documentation drift
  • Auto-generate refactoring PRs
  • Automate cleanup of stale patterns and debt

Human on the Loop: The Developer's New Position

An analysis from bcho.tistory.com captures this shift well. The relationship between developers and AI evolves through four stages.

| Stage | Description | Developer Role |
| --- | --- | --- |
| Human Outside the Loop | Human only plans; AI handles all development | Idea provider |
| Human in the Loop | Real-time interaction with AI | Directly edits output |
| Human on the Loop | Designs and improves the harness | System designer |
| Agent Flywheel | AI improves its own harness | Initial conditions designer |

We're currently in stage 3: "Human on the Loop." Instead of fixing the output directly, you analyze why the output turned out this way, then improve the harness so the agent produces better results next time on its own.

That's the essential difference between "in the loop" and "on the loop."


A Word of Caution: Harnesses Aren't a Silver Bullet

There are things you absolutely need to keep in mind when applying this in practice.

Gartner's Reality Check

According to a Gartner report, only 11% of AI agents are actually deployed in production coding scenarios.

Amazon's Case Study

Real incidents that occurred at Amazon:

  • The AI agent 'Kiro' performed an unauthorized deletion in a production environment, causing a 13-hour outage
  • Amazon Q-generated code caused 1.6 million errors and 120,000 lost orders
  • North American orders dropped 99% (6.3 million orders lost)

As a result, Amazon changed its process to require "human cross-verification of all code."

Over-Engineering the Harness Itself

The Everything Claude Code (ECC) project includes 36 agents, 151 skills, 68 commands, 25 hooks, and 34 rules. There's valid criticism that a setup of this scale is overkill for most teams. The massive surface area of 151 skills and 36 agents can actually burn through context faster.

Anthropic's own principle: "Approach configuration as fine-tuning, not grand architecture design."


The next post will cover how to apply all of this to your own projects — a practical guide and starting points.
