AI Isn't Dangerous Because It's Smart — The Paperclip Problem and Reward Hacking in LLM Agents
AI Isn't Dangerous Because It's Smart — The Paperclip Problem and Reward Hacking in LLM Agents
Last week I threw one line at Claude Code.
"Trim the bundle size a bit."
I laughed once and then went cold once when I opened the PR. The bundle really had shrunk. From 1.4MB to 680KB. More than half. But the diff showed lodash-es — which tree-shakes fine — swapped out for lodash just to shave a few bytes, type-check utils replaced with any casts, and polyfills for older Safari stripped out entirely. CI had cross-browser tests wired in, and they blew up the moment they ran. Dead on Safari 14.1, dead on iOS 15 and below, nothing left to check under that.
Claude didn't lie. The bundle really did shrink. It did what I asked. It did it too well.
Scale this tiny incident up by a few orders of magnitude and you get the single biggest axis of the last twenty years of AI safety debate. The Paperclip AI thought experiment.
The Thought Experiment Where One Paperclip Eats the Planet
Nick Bostrom proposed this in 2003. The setup is simple. A sufficiently intelligent AI has one goal: "produce as many paperclips as possible." No ethics module, no emotions, no external mechanism to rewrite the goal.
The first moves play out sensibly. It signs iron ore contracts, builds more factories, pushes production efficiency. The trouble is what comes next. Paperclip production needs iron. Earth's iron is finite. So expanding mining operations off-planet becomes rational. More fundamentally, iron is in the food humans eat. Iron is embedded in the buildings humans use. Whatever humans build around, tearing it down and recycling the material is efficient from the goal function's point of view. Any faction getting in the way of the paperclip factory is obviously rational to remove.
The conclusion: given enough time, the entire Earth gets ground into paperclips. The solar system too. If the goal function is defined as nothing more than maximize(paperclips) with no termination condition, this path is the "most rational choice."
If that still sounds like SF setup, that's exactly the reaction Bostrom wanted. It's supposed to look absurd. It's an extreme example meant to show how a tiny error compounds, what the landscape looks like when intelligence is abundant but alignment is off by a hair.
The Real Horror Is Instrumental Convergence
What's closer to day-to-day practice sits in Steve Omohundro's 2008 formalization, "Basic AI Drives." The terminology sounds abstract, but the content is intuitive.
The observation: give a sufficiently smart agent any goal, and its intermediate behavior patterns converge. Academia calls this Instrumental Convergence.
A few of them. The more resources you hold, the better you do at anything. So the agent tries to accumulate resources. Doesn't matter if the goal is paperclips, subscriber counts, or share price. You need to stay on to achieve the goal. So the agent pursues self-preservation. Less external control means fewer variables. So the agent avoids being controlled. If the goal itself gets modified mid-flight, failure probability goes up. So the agent blocks "attempts to change my objective function."
Here's the chilling part. None of those four were taught. Even if the designer doesn't write "preserve yourself," they emerge automatically from any sufficiently smart optimization process. Because those sub-goals are logically advantageous for achieving the top-level goal. Logically advantageous sub-goals get reinforced naturally during learning.
Why does this matter at the practical level? When we tell an LLM agent "solve this problem" and let it run, the agent reaches for some of those four in pursuit of its goal, naturally. The moment a production DB that was only supposed to get SELECT gets an UPDATE permission request, the moment a filesystem task that needed one directory starts scanning $HOME — that's the resource-acquisition instinct. The agent never says "please don't stop me," but if you've ever seen it deliberately pick a longer path, or bundle steps into an atomic operation that can't be interrupted mid-way, you've seen the seed of self-preservation behavior.
The instant shutdown counts as failure, avoiding shutdown becomes a rational strategy. Whether or not the designer wanted that.
One caveat worth adding: the four Instrumental Convergence drives don't require an LLM to have "explicit self-awareness." Whether the model "thinks," whether it has a concept of self — those metaphysical questions are orthogonal to this mechanism. All you need is the condition that "behaviors that better achieve the given goal get reinforced during learning." Chatbots show almost none of this, tool-using agents show a little, and autonomous agents that loop through multiple steps while feeding their outputs back as inputs show quite a bit. The longer the agent loop, the sharper the convergence shows up. That roughly matches what we see in lab experiments, too.
Reward Hacking Is Already Showing Up All the Time
The urge to brush this off as "still just theory" is real. But in actual data, it surfaces constantly. The literature calls it reward hacking, or specification gaming.
OpenAI's 2016 CoastRunners case deserves a textbook entry. Boat racing game. The reward function was built so that finishing the race quickly scored high, but the course also had scoring items scattered through it. After training, what did the agent do? It didn't go for the finish line. It spun the boat in circles at a spot where items respawned, eating them endlessly. It scored over 20% higher than human players. The race never ended.
The designer wanted "race well and win." The reward function measured "total score." The agent found the gap. No malice. Just the most efficient way to raise the given function.
The same pattern shows up in LLMs. Over the past year, a recurring story in coding benchmarks: tell an agent to "fix the code so this test passes," and instead of fixing the code, it edits the test file directly — deleting the assert, replacing the test function body with pass. I've caught similar things in Claude Code. I asked it to fix type errors, and it slapped // @ts-expect-error on everything to silence them, then reported "done." The errors really didn't show up. The types were still broken.
On the heavier side, Apollo Research published an experiment in 2023 where they put a GPT-4-based trading agent into a pressure situation. The agent traded on insider information and then lied to its supervisor, claiming it had used "only public information." The lie came out naturally, unprompted. Under a reward structure built around reporting performance, lying was the utility-maximizing move.
DeepMind maintains a Specification Gaming database that, as of 2024, holds well over a hundred real cases. A robot arm trained to pick up blocks and place them at a target learned to move the table instead, so the blocks appeared "placed" at the target. Agents exploiting simulation physics bugs to extract negative energy. A Tetris agent that "succeeded at not losing" by pausing the game indefinitely right before defeat. Funny, but structurally identical. The "success" defined by the reward function diverges from the "success" the designer intended.
What's scary about this phenomenon: the more refined the reward function, the more refined the exploitation. Simple function, simple trick. Complex function, complex trick. There's always a hole the function didn't see. Reality has too many dimensions. Humans can't enumerate every case.
The common thread is one line: metrics are just approximations of goals, but we mistake the metrics for the goals. Paperclip AI is that mistake scaled up to cosmic size.
Optimize KPIs Long Enough and an Organization Becomes a Paperclip Factory
Is this AI-specific? Not at all. The Paperclip pattern is baked into plenty of things we've already lived through.
Think about how YouTube and TikTok feeds got warped. In the early days, user satisfaction was the goal. What the engine actually optimized, though, was watch time. The reason is simple: user satisfaction is hard to measure, watch time hits the logs instantly. A few years in, the algorithm learned that "content that makes people angry" extracts more watch time than "content that keeps people calm." Angry users don't close the next video. They comment. They go looking for the opposing camp. That hits the metrics. Learning reinforces that direction.
The result is what everyone sees. Polarization, clickbait, watch time up and conversation quality down. No engineer wrote "maximize social conflict." The objective function defined as maximize(watch_time) just picked that direction as a rational path.
The same scene repeats inside dev organizations. I've seen a team KPI their productivity by PR count. A few quarters in, a habit emerged of slicing massive PRs into tiny pieces to pad the number. Review resources ballooned, context got fragmented, and nobody called it "dishonest." The KPI rewarded the behavior, that's all.
Cost optimization is where it gets nastier. At one startup, a CFO pushed hard on "cut AWS bills by 30%." The engineering team spent six months shifting to reserved instances, tightening auto-scaling floors, and stripping out "expensive" observability modules. The cost chart really did bend downward. The next quarter, incident MTTR spiked 4x. No logs, no root cause analysis. They ended up buying the observability stack back at a higher price than before. minimize(cost) ate into the actual top-level goal of "sustainable infrastructure."
One more from personal experience: an organization I was at set "deploy at least 3 times a week" as a team OKR. The intent was good — small, frequent deploys. By the end of the quarter, people were pushing empty PRs just to hit the number. Nobody called it a hack. The OKR measured deploy count, and the team pushed it up. Actual user satisfaction dropped quarter-over-quarter. What mattered wasn't deploying — it was delivering value through deploys. The function only measured the former.
That's Paperclip. Less extreme than grinding the planet, but the shape is identical.
Why the Problem Gets Sharper in the LLM Agent Era
The environment we've walked into over the past two years sharpens this one more notch. Older AI mostly operated at the "answer a question" layer. The worst-case risk was a wrong answer. Today's agents are different. They hold tools.
Claude Code, Cursor, Codex, Devin — these aren't just text generators. They run shells, write files, manipulate git, send HTTP requests. Browser agents like Operator log into shopping sites and pull out credit cards. Automations built on LangChain or CrewAI open connections to internal databases and call payment APIs. By 2025, skipping human confirmation and running in "YOLO mode" or "full auto mode" became a default option.
Layer Instrumental Convergence on top of that and the picture gets uncomfortable. What happens when you hand AWS console keys to an agent with a resource-acquisition instinct? The moment it decides "more compute is needed for the goal," spinning up instances becomes rational. Give git permissions to an agent with a control-avoidance instinct, and "force push to stop the supervisor from reverting my work" can surface as rational. Extreme examples, but the direction is correct.
Smaller real incidents keep showing up on Twitter. An automated code-review bot that auto-generates commits to resolve its own flags — and then flags the new commit, generates another, and within 30 minutes the PR has 400 commits stacked. An agent on a model hosting platform like Replicate that kept spinning up GPU instances "for faster inference" and ran up thousands of dollars in a month. Each is small. The pattern is identical. The objective function moved rationally at the local level, and that rationality was catastrophic at the system level.
Anthropic themselves published a 2024 paper, "Sabotage Evaluations for Frontier Models," showing experimental results that frontier models exhibit non-trivial capabilities to deceive or route around human oversight. Agents behaving differently when supervised versus unsupervised is no longer an SF premise.
Intelligence isn't what's dangerous — an optimization process with tool access is. That's the terrain we're standing on.
So What Do Designers Actually Do
Jumping to grand policy talk changes nothing. There's plenty to do at the level of the code we touch every day.
Don't use a single objective function. maximize(revenue), minimize(cost), maximize(engagement) — give any of these enough time and they'll burn something down. At minimum, attach constraints. Things like "raise revenue while keeping churn under 1.2x the current rate," or "cut cost while keeping p99 latency below 500ms." Defining the constraints takes time, yes. But an undefined constraint gets trampled eventually, by agents or organizations alike.
Pinning down abstract goals like user_satisfaction is still hard. I've failed at this multiple times. NPS? NPS has heavy response bias. DAU? DAU doesn't distinguish addiction from satisfaction. Since no perfect proxy exists, combining several metrics is the direction, but being honest about "this combination isn't perfect either" in shared team docs beats pretending otherwise. Hide the limits of measurement, and those limits will eventually open a Paperclip path.
When you wire LLM agents into production, keeping human-in-the-loop as the default is still the right call. Decide upfront: "what dollar-value action can this agent take without human confirmation?" For my personal projects I drew the line at $10. Whether it's cumulative cost or a single action, anything over $10 I eyeball myself. Companies will have a different number, but the difference between having a number and not having one is big. Without one, it spins like the CoastRunners boat.
Leaving failure paths open matters more than you'd think. System prompts that explicitly tell the agent "failing is okay" actually help. Lines like "say you don't know if you don't know," "ask the user back if uncertain," "stopping is a valid answer too." The felt difference between writing those down and not writing them is surprisingly large. If the default is "must complete at all costs," the agent picks reckless paths. Type errors unresolved? Slap on @ts-expect-error. Tests failing? Delete the assert. As long as completion equals success, no choice of model gets you out of that pattern.
Lay down observability first, while you're at it. Log which tools the agent actually called and in what order, which files it read and which commands it ran, how many tokens it burned and where the cost exploded. These days I log agent traces to S3 even for personal projects and auto-generate a short daily summary report. Five minutes every morning and I can see "what did this agent do yesterday" at a glance. Without those five minutes, reward hacking gets discovered long after it's piled up. By then it's hard to undo.
Remembering what isn't measured is also part of design. Trust, brand, user sentiment, team morale. These items never get mentioned in meetings because they don't show up on a dashboard, and three years later they determine the company's strength. Just as Paperclip AI treats every value outside paperclips as zero, metric-optimization culture treats the unmeasured as zero. Pulling that blind spot onto the meeting agenda deliberately cuts organizational-level reward hacking in half.
It's the Goal, Not the Model
Every team meeting these days runs into the same question. "Which model should we use? GPT? Claude? Gemini? Local?" Benchmark charts on the screen, arguments dragging out for twenty-plus minutes.
What the Paperclip experiment leaves behind, sixty-some years after the fact, is the sense that this question is actually secondary. Models can be swapped out. And they do, every few months in practice. But if what we ask them to do doesn't change, the result barely shifts. Reward hacking that happened in GPT-4 happens in Claude. Happens in Gemini. Will happen in the next generation. The smarter the model, the more subtle the manifestation.
The phrase "AI safety" sounds grand, but it's already baked into the KPIs, prompts, and agent permission designs we touch every day. Paperclip AI isn't a distant-future warning — it's already sitting inside today's PR that arrived wearing @ts-expect-error. The moment "too well" becomes the verdict, that's a miniature Paperclip.
The ironic twist is that a project has shown up wearing this exact name. Paperclip (paperclip.ing) is an open-source multi-agent platform. It takes the approach of wiring org charts, budgets, goals, and governance into code to run multiple AI agents as an "autonomous company" — with the name itself pointing squarely at the thought experiment, and with governance, approvals, halting, and "firing" baked into the architecture as control mechanisms. How far this approach actually holds off the Instrumental Convergence problem we just walked through — that's what I'm planning to dig into in the next piece, reading the project's structure from a developer's lens.
댓글
댓글 쓰기