Question Your Defaults — How Model-Harness Overfitting Is Slowing Down Your Agent

In Part 3 of this series, I mentioned a fascinating fact. On Terminal Bench 2.0, Claude Opus 4.6 ranked 33rd inside Claude Code — the very harness it was trained in — but jumped to the top 5 when used with a different harness.

I didn't fully unpack what that number means. While covering Anthropic's architecture in Part 4 and the hands-on guide in Part 5, I glossed over the most counterintuitive and practically important insight of the entire series.

Using the default harness as-is may not be optimal.

This post is where I address that.


How Overfitting Happens

Frontier coding models are post-trained inside their own harnesses. Claude is optimized through thousands of hours of coding tasks in the Claude Code environment; Codex models go through the same process in the Codex environment.

During this process, the model adapts to the patterns of its specific harness:

  • How Claude Code invokes tools
  • The format in which errors are returned
  • The order in which context is assembled
  • The interface of the file editing tool

The model is trained to maximize performance in this particular environment. The problem is that maximization doesn't guarantee generalization.

In machine learning, overfitting means a model fits its training data so closely that its performance on unseen data degrades. The same dynamic appears in the model-harness relationship: as the model adapts to the quirks of its default harness, the potential it could show in other configurations gets buried.

Concrete Example: Codex and apply_patch

OpenAI's Codex model became extremely coupled to a file editing tool called apply_patch. When developers tried to use Codex models in a different harness (OpenCode), they had to add a separate apply_patch tool. The model simply couldn't perform file edits properly without that specific tool interface.

The model didn't learn "how to edit files." It learned "how to call apply_patch."

Concrete Example: The Hashline Experiment Paradox

Let's revisit Can Boluk's Hashline experiment. By tagging each line with its number and a short content hash, so the model could reference edit positions unambiguously, Grok Code Fast 1 jumped from 6.7% to 68.3%.

The model's "skill" didn't suddenly improve tenfold. The existing harness's file editing interface had been suppressing this model's potential all along. One change to the harness interface was all it took to unlock capabilities that had been buried.
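The mechanics are simple enough to sketch. The function below is a hypothetical reconstruction of a hashline-style view, not Boluk's actual code: it prefixes each line with its number and a short checksum of its content, so an edit tool can name lines precisely and the harness can reject edits whose hash no longer matches the current file.

```shell
# Hypothetical sketch of a hashline-style file view (not the original
# experiment's code): prefix each line with its line number and a short
# checksum of its content, so the model can address lines precisely and
# the harness can detect stale line references before applying an edit.
hashline() {
  n=0
  while IFS= read -r line; do
    n=$((n + 1))
    # cksum gives a cheap, deterministic per-line content fingerprint
    h=$(printf '%s' "$line" | cksum | cut -d' ' -f1)
    printf '%d:%s| %s\n' "$n" "$h" "$line"
  done < "$1"
}
```

An edit request can then target `<number>:<hash>` instead of a bare line number, which is exactly the kind of interface change that unlocked the score jump above.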


What "33rd Place" Really Means

Claude Opus 4.6 ranking 33rd in Claude Code doesn't mean Opus 4.6 is a bad model. It means Claude Code's default harness isn't extracting Opus 4.6's full potential.

Why does this happen? There are a few hypotheses.

1. The harness is calibrated for an older model

Claude Code's default settings have evolved incrementally across multiple model generations. Mechanisms introduced early on to address Opus 4.5's "context anxiety" — sprint decomposition, forced checkpointing — may still be in place. Opus 4.6 can handle long-running tasks without these guardrails, but the default harness imposes unnecessary constraints.

Anthropic themselves acknowledged this. In their 3-agent harness paper, after switching to Opus 4.6:

  • The sprint decomposition structure could be removed
  • The evaluator switched to single-pass verification
  • Complexity decreased while performance held steady

"The assumptions a harness encodes become stale as models improve." This is Anthropic's official position.

2. Default tool interfaces are general-purpose, but not optimal

Claude Code's file editing, search, and bash execution tools are designed for generality. They have to work in any project. But that doesn't mean they're optimized for your project.

3. Context composition is generic

The default harness loads CLAUDE.md, injects tool descriptions, and maintains conversation history. But it can't know what context your specific task actually needs.


Practical Guidelines: How to Question Your Defaults

This is what was missing from the hands-on guide in Part 5. Not just "build a harness," but "start by questioning the default harness."

1. Wrap Default Tools for Your Workflow

Don't use Claude Code's built-in tools as-is. Create project-specific wrappers.

## Tool Usage Rules (CLAUDE.md)

- For file search, use `npm run find-component [name]` instead of Glob
  (search tailored to project component structure)
- For tests, use `npm run test:affected` instead of `npm test`
  (only run tests related to changed files)
- For build checks, use `npm run typecheck` instead of a full build
  (quickly catch type errors only)

This is exactly how LangChain jumped 25 spots on Terminal Bench. They didn't use the default tools as-is — they analyzed failure patterns and tailored the tool interfaces to the task.
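As a sketch of what one of these wrappers might look like (hypothetical; it assumes each component lives in its own directory under `src/components/` and that the script is wired up as `npm run find-component` in package.json), a `find-component` script can replace a repo-wide glob with a targeted lookup:

```shell
# find-component: hypothetical project-specific wrapper around file search.
# Assumes each component lives in its own directory under src/components/
# and that package.json exposes this as the "find-component" script.
find_component() {
  # Search only the component tree, case-insensitively, instead of
  # globbing the entire repository
  find src/components -type d -iname "*$1*"
}
```

The payoff is that the agent gets back a short, relevant result list instead of a page of matches it has to filter itself.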

2. Remove Unnecessary Guardrails

In Part 5, I said "enforce incremental work." That's correct — most of the time. But if your model and task are already stable enough, excessive checkpointing actually hurts productivity.

Questions to ask:

  • Is forced sprint decomposition necessary? Opus 4.6 can work consistently for over 2 hours without sprint structures.
  • Do you need a type check after every tool call? Running a full type check on trivial edits dramatically slows things down. Checking only at commit time may be more efficient.
  • Do you actually need sub-agents? For simple tasks, the overhead of sub-agents can cost more than just handling it directly.

Apply Anthropic's own principle in reverse: "Regularly re-examine whether the assumptions your harness encodes have become stale." Your harness is no different. Does a rule you added three months ago still serve a purpose?
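As a concrete illustration, Claude Code hooks live in `.claude/settings.json`. A per-edit type check is typically a `PostToolUse` entry like the sketch below; relaxing it to commit-time verification means deleting this entry and running the same command from a pre-commit hook instead. The field names follow Claude Code's hooks settings format as I understand it; check them against the current docs before relying on this snippet.

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          { "type": "command", "command": "npm run typecheck" }
        ]
      }
    ]
  }
}
```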

3. Classify Your Agent's Failures

Not all failures are the same. You need to classify failures to make the right harness adjustments.

| Failure type | Cause | Response |
| --- | --- | --- |
| Insufficient model capability | The model can't handle the task itself | Switch to a stronger model, or decompose the task |
| Missing context | Required information isn't in the context | Enhance CLAUDE.md, connect MCP servers |
| Tool mismatch | The tool interface doesn't fit the task | Add custom tools/wrappers |
| Over-constraint | The harness is suppressing the model's capability | Remove or relax rules/hooks |

The fourth type — over-constraint — is the hardest to spot. When an agent fails, the instinct is to think "should I add more rules?" not "do I have too many rules?"

4. Run A/B Tests

Run the same task with different harness configurations and compare. Without a scientific approach, you end up relying on gut feeling that "this setup seems better."

Variables to test:

  • Context volume: Does halving the rules in CLAUDE.md actually make results worse?
  • Number of tools: Does cutting MCP servers from 5 to 3 actually improve things?
  • Number of hooks: What happens if you disable all PostToolUse hooks and only verify at commit time?
  • Model selection: How does the cost-to-quality ratio change when you swap Sonnet for Haiku in sub-agents?

This is precisely what LangChain did on Terminal Bench. They collected failure causes from LangSmith traces and systematically tuned each harness variable.
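The loop itself doesn't need to be elaborate. The sketch below is hypothetical: `run_task` and `verify` are stand-ins for however you invoke your agent and check its output (say, `npm run test:affected`), and each CLAUDE.md variant gets the same task five times:

```shell
# Hypothetical A/B harness comparison. run_task and verify are stand-ins:
# replace them with your actual agent invocation and verification command.
run_task() { :; }     # stand-in: run the agent against the current CLAUDE.md
verify() { true; }    # stand-in: e.g. npm run test:affected

compare_configs() {
  for config in "$@"; do
    pass=0
    for trial in 1 2 3 4 5; do
      cp "$config" CLAUDE.md        # swap in the variant under test
      run_task "implement the change" || true
      verify && pass=$((pass + 1))
    done
    printf '%s: %d/5 passed\n' "$config" "$pass"
  done
}
```

Even five trials per variant is enough to catch the big effects, like a rule set that halves the pass rate; subtler differences need more runs.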

5. Re-evaluate Your Harness with Every Model Update

Every time a new model version drops, verify that your existing harness is still optimal. Just as Anthropic removed the sprint structure when transitioning from Opus 4.5 to 4.6, guardrails that helped a weaker model become shackles on a stronger one.

Checklist:

  •  Is forced sprint decomposition still necessary?
  •  Can you extend the compaction interval?
  •  Are there tasks that a single agent can handle without sub-agents?
  •  Can the evaluator agent switch to single-pass verification?
  •  Are there rules you added earlier that the model has since internalized, making them redundant?

The Landscape Beyond Default Settings

Let's consider the broader implications of this insight.

Why the Harness Becomes a Competitive Advantage

Models are converging. When Claude pulls ahead, GPT catches up; when GPT leads, Gemini follows. So where does the productivity gap between two teams using the same model come from?

The harness.

And if the default harness isn't optimal, then teams that customize will systematically outperform teams that stick with defaults. This isn't a temporary edge — it's a compounding asset. Harness-tuning know-how, project-specific tool wrappers, team-tailored feedback loops — these don't get leveled overnight by a model upgrade.

Not "How to Use Tools" but "How to Build Tools"

In Part 5, I said the criteria for choosing a tech stack would shift toward "AI-friendliness." Take that one step further: the ability to design tools for agents becomes a core competency in itself.

We live in a world where a single good tool interface can boost a model's score by 61 points (the Hashline experiment). Knowing "how to use Claude Code" is less valuable than knowing "how to redesign Claude Code's default tools for my project."

Resolving the Paradox

The paradox raised in Part 3 — that models become overfitted to their own harness — isn't really a paradox at all. It's an opportunity.

The fact that the default harness isn't optimal means there's room to customize. And that room grows as models improve. The more capable the model becomes, the more potential gets buried by the default harness's conservative constraints.


Closing: Four Chapters of Harness

Parts 3 through 6 covered harness engineering in four layers.

  • Part 3: What is a harness — concepts and components
  • Part 4: How Anthropic designed theirs — architecture and implementation
  • Part 5: How to apply it to your project — a 7-step hands-on guide
  • Part 6: Question your defaults — the implications of overfitting and optimization

If I had to pick the single most important takeaway, it's this.

Don't stop at building a harness. Question the one you've built.

When your agent isn't performing as expected, before adding more rules, first check whether your existing rules have become shackles. The model can do more than you think. It's just that the harness might not be letting it.

As I said in Part 1, the cycle of change has compressed to a single week. A harness that was optimal three months ago may not be optimal today. Question regularly, experiment often, and prune ruthlessly.

The model is probably fine. Too much harness might be the problem.
