Harness Engineering in Practice — How to Apply It to Your Project Right Now

You understand the concept (Part 3). You've seen how Anthropic implements it (Part 4). That leaves one question. How do you apply it to your own project?

This post covers concrete methods for putting harness engineering to work in production, and the shifts in the developer's role that this paradigm will bring.


Principle 1: Start from Failure

This is Mitchell Hashimoto's principle — and one the HumanLayer team arrived at independently.

Don't try to design the ideal harness upfront. Every time the agent fails, add a structural safeguard that prevents that failure from recurring.

In HumanLayer's words: "Have a shipping bias. Only touch the harness when the agent actually fails."

The mindset resembles TDD (Test-Driven Development). Just as you write a failing test first and then write the code to make it pass — you observe the agent's failure patterns and add harness elements that prevent them.

Research from ETH Zurich backs this up. After testing 138 agent configuration files:

  • LLM-generated config files: degraded performance + over 20% cost increase
  • Human-written config files: only a 4% improvement
  • Codebase overviews, directory listings: no measurable help at all

Agents are perfectly capable of exploring repository structure on their own. The key is to provide only the minimal, universally applicable guidance.


Principle 2: Less Is More

This is the most counterintuitive principle in harness engineering. More rules, more tools, and more agents don't always produce better results.

The Vercel case: initially they provided every available tool, but when they cut the tool set down, performance actually improved.

Connecting too many MCP servers? The tool definitions themselves consume system prompt tokens. A 200K context window can shrink to effectively 70K when overloaded with MCP tools.
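
To see how quickly tool definitions eat the window, here is a rough back-of-envelope sketch. Every number in it (tokens per tool definition, tool counts per server) is an illustrative assumption, not a measured value:

```javascript
// Back-of-envelope: how much context MCP tool definitions consume.
// All numbers below are illustrative assumptions, not measurements.
const CONTEXT_WINDOW = 200_000;
const TOKENS_PER_TOOL_DEF = 650; // JSON schema + description, rough average

const servers = [
  { name: "github", tools: 35 },
  { name: "postgres", tools: 12 },
  { name: "playwright", tools: 25 },
  // ...a few more servers and the overhead compounds fast
];

const toolTokens = servers.reduce(
  (sum, s) => sum + s.tools * TOKENS_PER_TOOL_DEF,
  0
);
const remaining = CONTEXT_WINDOW - toolTokens;

console.log(`tool definitions: ~${toolTokens} tokens`);
console.log(`usable context:   ~${remaining} tokens`);
```

Three servers with 72 tools already burn tens of thousands of tokens before the agent reads a single line of your code, which is why trimming the tool set is often the highest-leverage change.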

HumanLayer's solution: instead of the Linear MCP server, they built a lightweight CLI wrapper around just the essential features. It saved thousands of tokens.

Recommendation: If the CLI is already well-represented in training data (GitHub, Docker, databases, etc.), prompting the agent to use the CLI directly is more efficient than wrapping it in an MCP server.


Step 1: Write a Context File

The first thing to do is create a context file at the project root: CLAUDE.md for Claude Code, AGENTS.md for OpenAI Codex. A minimal starting point might look like this:

## Build

- `npm run dev` to start the dev server
- `npm test` to run tests
- `npm run build` for production builds

## Coding Conventions

- TypeScript strict mode
- React components as functional components
- State management with Zustand
- API calls cached via React Query

## Architecture

- Package dependency direction: types -> utils -> hooks -> components -> pages
- No reverse dependencies
- Shared components go in src/components/shared/

## Commits

- Follow Conventional Commits
- Commit messages in Korean, no trailing period

The key: start short. When the agent repeatedly gets something wrong, add a rule for it. This is the same approach Mitchell Hashimoto describes — "every time the agent makes a mistake, stack up instructions that prevent the same mistake."

Directory-Level Distribution

As a project grows, a single monolithic file becomes inefficient. Claude Code supports per-directory CLAUDE.md files. When the agent works in a specific directory, only that directory's rules are loaded.

project/
├── CLAUDE.md              # Global rules (build, commits, coding style)
├── src/
│   ├── CLAUDE.md          # src-specific rules
│   ├── components/
│   │   └── CLAUDE.md      # Component authoring rules
│   └── api/
│       └── CLAUDE.md      # API layer rules
└── tests/
    └── CLAUDE.md          # Test writing rules

As the OpenAI team discovered, treating AGENTS.md as a table of contents (map) and structuring it so the agent reads only nearby instruction files is highly effective.


Step 2: Connect MCP (Selectively)

Connect external systems the agent frequently references via MCP (Model Context Protocol).

# GitHub
claude mcp add --transport stdio github -- npx -y @modelcontextprotocol/server-github

# Database
claude mcp add --transport stdio postgres -- npx -y @modelcontextprotocol/server-postgres

# Browser automation (Playwright)
claude mcp add --transport stdio playwright -- npx -y @playwright/mcp@latest

But connect only what you actually need. Always remember that tool definitions themselves consume tokens.

Recommended limits:
- Total MCP servers: 20-30 or fewer
- Simultaneously active: fewer than 10
- Active tools: fewer than 80

GitHub, Docker, and basic database CLIs are already well-covered in model training data. Prompting the agent to use the CLI directly saves tokens compared to wrapping them in MCP servers.
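
This CLI-first guidance can go straight into the context file. The snippet below is a sketch using standard `gh`, `psql`, and `docker` invocations; adapt the list to whatever CLIs your project actually uses:

```markdown
## Tooling

- Use the `gh` CLI for GitHub work (`gh pr create`, `gh pr view`, `gh run view --log-failed`)
- Use `psql` directly for database queries instead of an MCP server
- Use the `docker` / `docker compose` CLIs directly
```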


Step 3: Set Up Hooks

Add automated validation and feedback to the agent's actions.

Essential Hook: Automatic Code Quality Checks

Hook commands receive the tool call as JSON on stdin, and in a PreToolUse hook an exit code of 2 blocks the call. In Claude Code's settings format:

{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          {
            "type": "command",
            "command": "jq -r '.tool_input.file_path // empty' | xargs -r npx biome check --write"
          },
          {
            "type": "command",
            "command": "npx tsc --noEmit"
          }
        ]
      }
    ],
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          {
            "type": "command",
            "command": "jq -e '.tool_input.command | test(\"--no-verify\")' > /dev/null && exit 2 || exit 0"
          }
        ]
      }
    ]
  }
}

Recommended Hook: Session Learning

{
  "hooks": {
    "Stop": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "node scripts/evaluate-session.js"
          }
        ]
      }
    ],
    "SessionStart": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "node scripts/load-context.js"
          }
        ]
      }
    ]
  }
}

The core value of hooks: they run automatically alongside the agent's actions and sit outside the model's control. Unlike instructions in a prompt, they carry enforcement power that cannot be bypassed.


Step 4: Enforce Incremental Work

This is the single biggest improvement Anthropic discovered. Force the agent to work on one feature at a time.

How to implement it:

  1. Require a Git commit after each task
  2. Instruct the agent to leave progress notes
  3. Design sessions so the next one starts from a clean state

You can express this in your CLAUDE.md like so:

## Work Style

- Work on one feature at a time
- Always Git commit after completing a feature
- Run related tests and confirm they pass before committing
- Never modify multiple features simultaneously
- Check current state before starting work (git status, run tests)

The OpenAI team reached the same conclusion: "When an agent struggles, treat it as a signal. Figure out what's missing (tools, guardrails, docs) and let the agent fix it itself."


Step 5: Use Sub-Agents

For complex tasks, spin them off to sub-agents. The goal isn't division of labor — it's context isolation.

Writing Effective Sub-Agent Prompts

Bad: "Investigate this bug"
Good: "A race condition is suspected in the refreshToken function
in src/auth/session.ts during expired token renewal. Analyze the
concurrent request scenario, and if confirmed, propose a fix
using a mutex or debounce pattern."

The key is to pass not just the query, but the objective context. Sub-agents have no visibility into the parent agent's conversation. Brief them like a sharp colleague who just walked into the room.

Model Selection Strategy

Assign different models to sub-agents based on task type:

Task Type                    | Recommended Model | Rationale
-----------------------------|-------------------|-------------------------------
Code exploration/search      | Haiku             | Fast and cheap
General coding (90%)         | Sonnet            | Best cost-to-performance ratio
Architecture/security review | Opus              | Requires deep reasoning
Documentation                | Haiku             | Simple generation task
Complex debugging            | Opus              | Requires multi-step reasoning
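
A routing table like this reduces to a simple lookup. The task-type labels and model names below are this post's categories and placeholders, not an official API:

```javascript
// Sketch: route sub-agent tasks to models per the table above.
// Task-type keys and model names are illustrative placeholders.
const MODEL_BY_TASK = {
  exploration: "claude-haiku",
  coding: "claude-sonnet",
  architecture_review: "claude-opus",
  documentation: "claude-haiku",
  debugging: "claude-opus",
};

function pickModel(taskType) {
  // Unknown task types fall back to the 90% general-coding case.
  return MODEL_BY_TASK[taskType] ?? "claude-sonnet";
}
```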

Iterative Search Pattern

When a sub-agent's results are insufficient:

  1. The orchestrator evaluates the sub-agent's return value
  2. Passes a follow-up question
  3. The sub-agent investigates further and returns
  4. Repeat until satisfactory (up to 3 rounds)
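
The loop above can be sketched as an orchestrator function. `runSubAgent` and `isSufficient` are stand-ins for your real sub-agent invocation and evaluation logic:

```javascript
// Sketch of the iterative search pattern: re-query a sub-agent until
// the answer is judged sufficient, capped at 3 rounds.
// `runSubAgent` and `isSufficient` are caller-supplied stand-ins.
async function iterativeSearch(question, runSubAgent, isSufficient, maxRounds = 3) {
  let query = question;
  let answer = null;
  for (let round = 1; round <= maxRounds; round++) {
    answer = await runSubAgent(query);
    if (isSufficient(answer)) return { answer, rounds: round };
    // The follow-up carries the objective context, not just the raw question,
    // since the sub-agent has no memory of earlier rounds.
    query = `${question}\nPrevious (insufficient) answer: ${answer}\nDig deeper.`;
  }
  return { answer, rounds: maxRounds }; // best effort after the cap
}

module.exports = { iterativeSearch };
```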

Step 6: Integrate Linters and CI

Automatically verify that generated code follows existing architecture rules. Wire your existing linters and CI into the agent's workflow so it can read failure logs; a feedback loop then forms in which the agent reads CI output and fixes the code itself.

## CI Integration (add to CLAUDE.md)

- Run `npm run lint && npm test && npm run build` before creating a PR
- On CI failure, read logs and attempt automatic fixes
- Never ignore lint errors
- Never commit with type errors

The mechanical enforcement the OpenAI team emphasized: telling an agent "do this" via a prompt is far less effective than blocking violations with linters and tests. An agent can ignore instructions, but it can't ignore a linter error.


Step 7: Provide Observability Tools

Give the agent the ability to debug its own work.

  • Log access: Let the agent read runtime logs from the code it generates
  • Browser automation: Use the Playwright MCP to let it see the actual screen
  • Metrics access: Let it check performance data, error rates, etc.

When Anthropic provided the Puppeteer MCP, something remarkable happened: the agent found and fixed a browser-native alert modal bug that was invisible from code alone. The key is making the agent test like a real user before declaring "feature complete."


Security: Non-Negotiables

As the harness grows more powerful, so do the security risks.

Anthropic's Minimum Security Baseline

  1. Separate agent identity from personal accounts
  2. Use short-lived, scoped credentials
  3. Run untrusted work in containers/VMs
  4. Block outbound network access by default
  5. Restrict access to paths containing secrets
  6. Sanitize files, HTML, and screenshots before passing them
  7. Log tool calls, approvals, and network attempts
  8. Implement process-group kill and dead-man switches
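
Baseline item 5 (restricting access to secret-bearing paths) reduces to a deny-list check you can run before any file read. The patterns below are illustrative assumptions; extend them for your project:

```javascript
// Sketch: deny agent file access to secret-bearing paths (baseline item 5).
// The deny-list patterns are illustrative; extend for your project.
const DENIED_PATTERNS = [
  /(^|\/)\.env(\..*)?$/, // .env, .env.local, .env.production, ...
  /(^|\/)\.ssh(\/|$)/,
  /(^|\/)\.aws(\/|$)/,
  /id_rsa/,
  /credentials/i,
];

function isPathAllowed(filePath) {
  return !DENIED_PATTERNS.some((re) => re.test(filePath));
}

module.exports = { isPathAllowed };
```

Run as a PreToolUse check on read/edit tools, this turns "please don't touch secrets" from a prompt instruction into a structural guarantee.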

Real-World Security Incidents

  • CVE-2025-59536 (CVSS 8.7): A startup trust dialog implementation bug allowed premature code execution. Fixed in v1.0.111
  • CVE-2026-21852: Exploitation of ANTHROPIC_BASE_URL enabled API requests before trust verification. Fixed in v2.0.65
  • Snyk ToxicSkills report: Prompt injection found in 36% of 3,984 publicly available skills

Anthropic's philosophy: "Assume malicious text will enter the context. Assume tool descriptions can lie. Build the system so that even if the model is persuaded by malicious input, the system still operates safely."


The Future: What Harness Engineering Will Change

Harness-as-a-Service Templates

Thoughtworks' Birgitta Böckeler raised an interesting question. Most organizations use only 2-3 major tech stacks. Not every application is a unique snowflake.

A plausible future: service template systems where pre-built harnesses for each application type serve as starting points, and teams customize incrementally.

Projects in this direction are already emerging:
- agent-skills (Addy Osmani, Google Cloud AI Director): 19 structured skills
- GBrain (Garry Tan, YC CEO): personal knowledge base with 20 built-in MCP tools
- Awesome Design.MD: 60+ service design systems packaged as markdown

Shifting Criteria for Tech Stack Selection

Today: you pick the framework with the "best developer experience."
Tomorrow: you may pick the framework with the "best harness."

"AI-friendliness" becomes a key criterion in framework selection. Individual developer preference and minor interface inefficiencies matter less.

Self-Improving Harnesses

Meta AI's HyperAgents research: a self-referential framework where agents design their own harnesses. It automatically builds persistent memory, performance tracking, and multi-stage verification pipelines.

Directions LangChain is exploring:
1. Parallel orchestration: hundreds of agents working in parallel on a shared codebase
2. Self-improvement loops: agents analyze their own execution traces to find and fix harness-level failure causes
3. Adaptive harnesses: dynamically assembling tools and context based on the task, with no pre-configuration

The Legacy Code Gap

A gap is forming between codebases built from scratch with AI agents and legacy codebases from the pre-harness era.

Retroactively applying a harness to legacy code is like running a static analysis tool for the first time. Expect a flood of warnings. But you have to start somewhere.


A Checklist to Start Right Now

Don't overthink it. Start with what you can do today.

Immediate:
- [ ] Write a CLAUDE.md at the project root (build commands, coding conventions, commit rules)
- [ ] Every time the agent makes a mistake, add the corresponding rule to CLAUDE.md
- [ ] Set up a hook to block git commit --no-verify

Within one week:
- [ ] Add auto-formatting/type-checking hooks after file modifications
- [ ] Selectively connect MCP servers you need (3-5)
- [ ] Try splitting tasks with sub-agents

Within one month:
- [ ] Distribute CLAUDE.md files across directories
- [ ] Connect CI/CD pipeline to the agent's feedback loop
- [ ] Introduce cross-session state management patterns (progress files, feature lists)
- [ ] Run through the security checklist


Closing: The Bottleneck Is the Environment, Not the Model

Over three posts, we've covered harness engineering. The concept (Part 3), the implementation (Part 4), and the practice (Part 5).

The core message is simple. The bottleneck in agent performance is often environment design, not model intelligence.

Model capability is rapidly converging. When GPT pulls ahead, Claude catches up. When Claude pulls ahead, Gemini catches up. But the harness is an asset your team has to build. Tool integrations, error recovery logic, architectural constraints, documentation structure — these don't ship all at once like a general-purpose model update.

A year ago, in Part 1 of this series, I wrote about the decline of prompt engineering. What I sensed back then has become reality. The cycle of change has compressed from six months to one week, and what matters is no longer what you ask AI, but how you build the environment for AI to work correctly.

"The rigor of writing code" is shifting to "the rigor of designing environments."

We're right in the middle of that transition. And harness engineering is the most practical starting point within it.

If your coding agent isn't performing as expected, check the harness before blaming the model. The answer is almost always there.
