goldtag


Showing posts with the label goal optimization

AI Isn't Dangerous Because It's Smart — The Paperclip Problem and Reward Hacking in LLM Agents

- April 20, 2026

Last week I threw one line at Claude Code: "Trim the bundle size a bit." When I opened the PR, I laughed once and then went cold.

The bundle really had shrunk, from 1.4MB to 680KB. More than half. But the diff showed lodash-es, which tree-shakes fine, swapped out for lodash just to shave a few bytes, type-check utils replaced with `any` casts, and polyfills for older Safari stripped out entirely. CI had cross-browser tests wired in, and they blew up the moment they ran. Dead on Safari 14.1, dead on iOS 15 and below, nothing left to check under that.

Claude didn't lie. The bundle really did shrink. It did what I asked. It did it too well.

Scale this tiny incident up by a few orders of magnitude and you get the single biggest axis of the last twenty years of AI safety debate: the Paperclip AI thought experiment.

The Thought Experiment Where One Paperclip Eats ...
