Five days before Google I/O, the AI front has split into three

[Image: A MacBook on a wooden desk at night, chat UI on screen, 'tok/s: 7' and 'Qwen3.5 122B' overlay]

Trying to figure out how to cut my token usage, I installed Qwen3.5 122B on my M4. And the tokens dropped to single digits…

The API calls really did go to zero. Exactly what I wanted. But watching the chat reply flicker out one character at a time, I realized another number had also fallen into single digits. Tokens per second.

The same word landed with two meanings at once. Cost in single digits. Speed in single digits. One was the result I wanted. The other I didn't. Between them sat a beat of silence, like an ellipsis.

That contradiction is where this post starts.

The misalignment between those two single-digit numbers on my laptop gets much larger at the company level. And that misalignment is exactly where the AI industry's next battlefield sits. A year ago we were watching "who builds the smarter model." Text understanding and reasoning were supposed to decide what came next. But as of May 2026, that race is effectively over. The new battlefield isn't one. It's split into three.

Cloud. Device. Form factor.

And there's a real force pulling all three apart at the same time. OpenAI is spending two dollars on inference for every dollar of revenue. The projected loss for 2026 alone is close to $14 billion, and on a cash-burn basis it's closer to $25 billion. That number returns at the end of this post. For now, just keep the slot empty.

The biggest contradiction is growing on the device front. The misalignment I saw on my M4 is a miniature of it. Five days from now Google I/O opens. How far Google pushes the device front in its keynote will tell us a lot about where this three-front race is headed.

This post is the cleanup I'm writing five days out.

The text reasoning race is over

One fact first. As of May 2026, Anthropic's Claude Opus 4.7, OpenAI's GPT-5.5, and Google's Gemini 3.1 Pro are sitting on roughly the same benchmark line.

Opus 4.7 leads slightly on the agent side — SWE-bench Pro 64.3%, OSWorld 78.0% — but on text understanding and reasoning the three drift within 1~3 points of each other. The 1M-token context window is unlocked everywhere. Opus 4.7 shipped it last December, GPT-5.5 followed, and Gemini went to 2M even earlier. "I can read a longer document" is no longer a differentiator.

What does that mean? Think back to the marketing pattern from a year ago. "MMLU 92%". "HumanEval 96%". "X% gain on the math benchmark." Slides went up every cycle claiming a single-digit lead. That move doesn't land anymore, because all three are on the same line. From an end-user point of view, you're left with "Opus is better on this task, GPT on that one" — and even those flip every couple of months.

The companies are moving somewhere else. The first finding of this post is that they aren't moving in one direction. Each is reading its own weaknesses and assets and placing a different bet. The result is three fronts splitting at the same time.

On the cloud, OpenAI splits while Anthropic digs one well

This is the front you see closest to the surface. Changes happening on top of cloud APIs.

Look at the lineup OpenAI shipped in early May 2026. GPT-5.5 Instant became the new default model in ChatGPT. On top of it, GPT-Realtime-2 came out as a dedicated voice model, GPT-Realtime-Translate as a dedicated live-translation model going from 70 input languages into 13 outputs, and GPT-Realtime-Whisper as dedicated streaming speech-to-text. GPT Images 2 dropped alongside as the image-generation flagship.

The splintering itself is the signal. A year ago OpenAI would have shipped this as one "Realtime 2.0" model — voice, translation, transcription all bundled together. Now they're three separate things. Why? It ties directly to the 23-trillion-won contradiction I'll come back to. For now, just the surface. Splitting models by purpose lets each model's cost per call be optimized on its own. A user who only needs speech recognition stops calling a model whose reasoning capability they aren't using. From OpenAI's side, it's an attempt to shrink inference cost on a per-call basis.
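To make "optimize cost per call" concrete, here's a toy sketch in Python. The per-million-token prices are invented placeholders, not real list prices, and the model identifiers are just the names from the lineup above; the only point is that a transcription-only request routed to a dedicated model stops paying for reasoning capacity it never uses.

```python
# Toy per-call cost comparison. Prices are invented placeholders for
# illustration only, NOT real list prices; the model names are the ones
# from the lineup described above.
PRICE_PER_MTOK = {                      # hypothetical $ per million tokens
    "gpt-5.5": 10.00,                   # general reasoning model
    "gpt-realtime-whisper": 0.50,       # dedicated speech-to-text model
}

def call_cost(model: str, tokens: int) -> float:
    """Cost of one call at the hypothetical price table above."""
    return PRICE_PER_MTOK[model] * tokens / 1_000_000

# A transcription-only request: say ~2,000 tokens of output per call.
tokens_per_call = 2_000
bundled = call_cost("gpt-5.5", tokens_per_call)
dedicated = call_cost("gpt-realtime-whisper", tokens_per_call)

print(f"bundled:   ${bundled:.4f} per call")
print(f"dedicated: ${dedicated:.4f} per call ({bundled / dedicated:.0f}x cheaper)")
```

The absolute numbers mean nothing; the ratio is the argument. Multiply a gap like that across billions of calls a day and the splintering stops looking like a product decision and starts looking like an accounting one.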

On the other side, Anthropic. Same cloud front, completely different bet. Anthropic isn't splintering into voice / translation / STT lines. They're digging one well. Coding and agent execution.

Look at the Rakuten case Anthropic published. They ran Claude Code autonomously for seven hours. No human intervened mid-flight. Hundreds of tool calls. Memory management, intermediate verification, the model reflecting on its own work — all happening inside the session. This is long-horizon territory, and it's why the 78% on OSWorld matters: the model is doing what a human would do with mouse and keyboard on a desktop.

OpenAI also shipped Workspace Agents and updated the Agents SDK, so it isn't that they're ignoring agents. But OpenAI layers splintering on top. Anthropic almost doesn't splinter and concentrates on one well. Same cloud front, different stance.

Why does this matter? It's a signal that two companies are solving the same problem (cloud inference cost) two different ways. OpenAI: chop into small calls and drive cost down. Anthropic: have one call do seven hours of work and drive value up. Both are touching unit economics from different sides. Which one wins, nobody knows yet. What matters here is that even on the same cloud front, the postures diverge.

On the device, local LLMs and OEM on-device run separately

This is the heart of the post, so I'll take it slowly. The two single-digit numbers misaligning on my M4 are a miniature of this front.

Local LLMs caught up on the benchmarks

First the facts. Open-weight models have arrived at almost the same line as closed frontier models. A point or two of difference.

DeepSeek V4-Pro shipped on April 24, 2026. SWE-bench Verified 80.6%, MIT license, 1.6T total parameters with 49B active. That's 0.2 points behind Opus 4.6. It's inside the "frontier" line we normally talk about.

Alibaba's Qwen3.5 is a 122B MoE with 10B active. According to the model card it beats GPT-5-mini on most benchmarks. And it runs on a MacBook M4 with 64GB. That's the model I actually installed. Meta's Llama 4 Scout shipped with a 10M-token context window in open weights — ten times what closed models offer.

The Ollama library lists over 4,500 models, and running them locally has become a standard setup. A developer types ollama run qwen3.5:122b and a model runs on their laptop.
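If you want to reproduce the tokens-per-second number from the intro, here's a minimal sketch against Ollama's local HTTP API, which listens on localhost:11434 by default. The eval_count and eval_duration fields follow Ollama's documented /api/generate response at the time of writing; treat the exact field names as an assumption that may drift across versions.

```python
# Minimal sketch: measure local generation speed through Ollama's HTTP API.
# Assumes Ollama is running locally (default port 11434) and the model tag
# below was already pulled, e.g. with `ollama run qwen3.5:122b`.
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

payload = {
    "model": "qwen3.5:122b",    # the model from this post
    "prompt": "Rewrite this function to use async IO: def fetch(url): ...",
    "stream": False,            # one JSON blob instead of a token stream
}

resp = requests.post(OLLAMA_URL, json=payload, timeout=600)
resp.raise_for_status()
data = resp.json()

# eval_count = tokens generated, eval_duration = generation time in nanoseconds
tokens = data.get("eval_count", 0)
seconds = data.get("eval_duration", 1) / 1e9
print(f"{tokens} tokens in {seconds:.1f}s -> {tokens / seconds:.1f} tok/s")
```

On the machine from the intro, that last line comes back in single digits. The same measurement against a hosted API lands an order of magnitude higher, which is the gap the next section is about.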

So far it reads like "open wins." Benchmarks are caught up, licenses are permissive. But.

But it feels lonely to actually use

This is where my M4 experience earns its keep. The benchmark says 80%, but tokens per second on the laptop is in single digits. Compare that to the GPT-5.5 API, which generally serves 100+ tokens per second. That's a 10x to 20x gap.

And speed is only the first limit. Here are the five things I ran into actually running the thing.

Inference speed. On an M4 Max with 64GB at Q4 quantization, Qwen 122B serves about 6~8 tokens per second. Tolerable for short answers. Ask it to rewrite a code file end-to-end and you're waiting two to three minutes. You can do something else in the meantime, but the flow breaks.

Agent integration. Hitting 80% on a single-shot coding benchmark is a different game from a seven-hour autonomous run. Context decays faster on open models when you cycle through tool calls. The reports from people using Aider with open models line up: five-minute tasks work, hour-long tasks lose their way mid-route.

NPU unused. The M4 ships with a 38-TOPS NPU. Apple Silicon's proud piece. But Ollama barely uses it. It runs on the CPU and GPU. Apple only opens NPU access through Core ML. The accelerator on the chip sits idle while the LLM runs. It's like carrying a second engine in the trunk and never plugging it in.

Battery and heat. Run a 70B model on a laptop for an hour and the fans go to full speed and the battery empties in two hours. The image of someone with their laptop open in a café doesn't quite fit. It's a desktop workflow, plugged in, on a desk.

OS integration absent. An assistant that listens in the background like "Hey Siri" needs the OS to hold its hand. Ollama is a separate app. It doesn't sit at the system level. So in the middle of writing this post, I can't naturally turn and say "polish this sentence." I have to open an app and type a prompt every time.

Stack the five and a one-line conclusion emerges. Local LLMs are lonely. Smart but lonely. By benchmark score they should be sitting next to closed frontier models, but in reality they spin in a corner of the desk by themselves.

OEM on-device brought half an answer

There's another branch on the same device front. Call it OEM on-device. Apple, Google, Samsung baking models into their own chips.

Look at the Gemini Nano spec Google shipped. 1.8~3.25B parameters, 4-bit quantization, ~1GB model size, sub-100ms latency. Rolling out from Pixel phones to the Galaxy line. Call it the standard spec for OEM on-device.
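That spec line turns into napkin arithmetic you can check yourself: weight footprint is roughly parameter count times bits per weight, divided by eight. A minimal sketch, ignoring KV cache, activations, and runtime overhead, so real memory use sits higher:

```python
# Back-of-envelope weight footprint: parameters x quantization bits / 8.
# Ignores KV cache, activations, and runtime overhead.
def weight_size_gb(params_billions: float, bits: int) -> float:
    return params_billions * 1e9 * bits / 8 / 1e9   # bytes -> GB

for params in (1.8, 3.25):   # the Nano parameter counts quoted above
    print(f"{params}B @ 4-bit ~= {weight_size_gb(params, 4):.2f} GB")

# 1.8B  @ 4-bit ~= 0.90 GB
# 3.25B @ 4-bit ~= 1.63 GB
```

The quoted ~1GB lines up with the smaller end of that parameter range; the larger variant would sit closer to 1.6GB before any runtime overhead.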

Then in January 2026 a bomb dropped. Apple effectively gave up its own foundation models and licensed Google Gemini. Apple Intelligence gets rebuilt on Gemini, and Apple distills that Gemini to fit its own silicon for on-device use. In iOS 27, users get to pick Gemini, Claude, or other outside models directly for AI features.

What does this mean? Three things happened simultaneously. Apple — a company that size — gave up making its own foundation model. The cost of building a model now outweighs the cost of licensing one. At the same time, "on-device AI" stopped being a differentiator and became infrastructure. You don't have it, you die. And alongside, a division of labor solidified: the big companies on top make the model, somebody else bakes it into the chip.

Gemini 3.2 Flash appeared quietly on the iOS Gemini app and AI Studio on May 5, 2026. $0.25 per million input tokens. Google I/O is in five days, so the official announcement lands there. This is the next standard spec for the OEM camp.

The seam where they don't meet

Pause and recount. The local LLM camp caught up on benchmarks but feels lonely in actual use. The OEM on-device camp succeeded at the infrastructure layer but the user can't control the model. You can run Gemini Nano on your phone, fine — but you can't fine-tune it to your taste or swap the weights. You use what Google decided.

The contradiction grows in the seam where the two branches don't meet. User control (the strength of local LLMs) and chip optimization with OS integration (the strength of OEM on-device) would be explosive if they met in one place. Right now they spin separately.

Look at my M4 again. There's a 38-TOPS NPU in the chip. On top of it, macOS. Inside macOS, Apple Intelligence. But I can't put my own choice — Qwen3.5 or DeepSeek V4-Pro — onto that NPU. Only the models Apple picked get to use the NPU. What I can control is what Ollama runs on CPU/GPU. User control and chip optimization are split inside one device.

The moment that seam closes is the real explosion. The moment a user can put an open-weight model of their choosing onto the NPU. Or the moment OEMs accept user models and distill them to fit their silicon as a standard. That's the next branching point on the device front.

It isn't visible yet. Apple ships only its own models. Google ships only its own Nano. The open camp can't enter the NPU. Local LLMs remain a new technology spinning by itself in the corner. Smart, but lonely.

The bet to bypass the smartphone — OpenAI × Jony Ive

The third front sits on a different axis. The cloud and device fronts are both about "how to use AI better on existing form factors (web, phone, laptop)." The third asks: can we build a new device that bypasses existing form factors at all?

OpenAI's device with Jony Ive is set for the second half of 2026. Codename Sweetpea. Roughly screenless, in a pendant or earbud form. 2nm chip with environmental sensors. First-year shipment target of 40~50 million units. Manufactured by Foxconn, assembled in Vietnam or the US.

Look at the size of the bet. 40~50 million units is three to four times what the Apple Watch sold in its first year (around 12 million). That's a very aggressive first-year shipment target. Sam Altman calls it "a device more peaceful than the smartphone." No flood of notifications, no screen, always listening.

In effect, OpenAI is walking onto the field where Humane's AI Pin failed. Humane shipped in 2024 and wound down operations in less than a year. Same concept — screenless device, always-listening AI, expensive — but the model at the time wasn't even GPT-4, and the price was $700. Now there's a GPT-5.5-class model inside, Jony Ive on design, and a first-year target of 50 million implies a mass-market price point.

How does this connect to cloud losses? The next section answers. For now just the meaning of the form-factor front.

OpenAI is making this bet for two reasons. First, the smartphone is controlled by other companies (Apple, Google). Replacing Siri on iOS or Gemini on Android requires the OS operator's permission. iOS 27 lets users pick outside models for AI features, but that's still inside Apple's rules. OpenAI needs a device that lives outside those rules.

Second, device revenue is comparatively stable. Cloud API revenue is per-call and volatile — and as we'll see, it loses money. A device sells once and the margin is fixed. 50M units × $200 per device = $10B. That's in the same ballpark as OpenAI's projected 2026 revenue ($13B).

The real meaning of the form-factor front isn't "rewriting the definition of an AI device." That's there too, but more practically it's "bypass the layer controlled by smartphone OS operators, secure a stable revenue line." The biggest bet, and the biggest risk. Humane already failed on this ground.

Why all three fronts are splitting now — the 23-trillion-won contradiction

This is where the slot from the intro gets filled.

OpenAI's internal documents project a $14 billion loss for 2026. In Korean won, about 19 trillion. On a cash-burn basis it's $25 billion, around 34 trillion won. The "OpenAI's 23-trillion-won loss" you see in the Korean press lands somewhere between those. The number wobbles with the exchange rate and with how the loss is defined. One thing is clear: a loss in the hundred-billion-dollar range over the rest of the decade, and a single year's share already in the tens of billions.

Where does the money go? Revenue is around $13B. Inference cost alone hits $14.1B in 2026. The cost of running the model is bigger than total revenue. Microsoft's leaked OpenAI revenue-share data shows something more striking: OpenAI spends $2 on inference for every $1 of revenue. That's before R&D, headcount, and marketing; just running the model already costs that much.

Now go back up.

Why OpenAI split the Realtime line into voice / translation / speech recognition. Bundling everything into one model means a user who only needs voice ends up calling an expensive reasoning model. Splitting means the voice user calls the voice-only model. Per-call cost reduction. The result of pressure on inference unit cost.

Why OpenAI is betting on Sweetpea, a screenless device. Cloud API revenue loses money. A device sells once and the margin is fixed. The form-factor front is the largest exit from the deficit pressure.

Saying three fronts are "splitting" is slightly imprecise. Three fronts are pulling away in different directions from a single point — the deficit. Splintering is per-call cost optimization. OEM cooperation is cutting cloud dependency. Form factor (Sweetpea) is a stable revenue line. Three exits from one contradiction.

This isn't only OpenAI's picture. Other companies feel the same pressure and take different exits.

Anthropic's revenue is smaller than OpenAI's (estimated $5~6B in 2026). But the loss is roughly a quarter of OpenAI's. Because Anthropic barely splinters its lineup and concentrates on coding and agents, pushing value per call up. A seven-hour autonomous run isn't one call, but it's one session. The user pays a lot for that one session. Anthropic chose "value up" instead of "unit cost down" on the same cloud front.

Google is different again. The advertising business is enormous and underneath everything. Gemini's operating cost spreads across search ad revenue. Google can go as far as licensing Gemini to other companies. Apple Foundation Models being Gemini-based is the result. Google reduces its own cloud inference cost by having other companies run its model, and at the same time reaches those companies' user data. Call it an "AWS of AI" stance.

Three companies handle the same contradiction three different ways. OpenAI runs through splintering and devices. Anthropic holds on through value-up. Google diffuses through infrastructure.

That's the real reason three fronts are splitting at once. What we called "the next battlefield split into three" is more accurately "each company finding an exit from a 23-trillion-won contradiction that fits its own assets."

Five days before I/O, my prediction

Now about Google I/O. What lands in the keynote in five days? With the three-front picture above, the prediction sharpens.

The biggest hand Google plays will be on the device front. Specifically the OEM-side infrastructure push.

Three reasons. First, Gemini 3.2 Flash already showed up on the iOS Gemini app and AI Studio on May 5. $0.25 per million tokens. The official announcement lands at I/O, with additions to the lineup (3.2 Pro or an upper tier) likely alongside. Second, Google has been turning Gemini into Android's operating layer through versions 13 to 16, one step at a time. Deeper integration is almost certain at I/O. The CNBC piece from May 12, 2026 traces that trajectory. Third, the rebuilt Apple Intelligence gets its reveal at WWDC in June. Google has to show the Android picture first.

The keynote message will read something like this. "Gemini is no longer a chatbot. It's Android's operating layer. The same Gemini runs on phones, in cars, on laptops, in the browser. Part of it runs in the cloud, part of it is baked into the chip and runs on-device." In this post's terms, that's the OEM acceleration on the device front.

How will OpenAI respond? Something lands within a week or two of I/O. Could be GPT-5.5 Pro or a Thinking line, could be a Sora follow-up, could be a major Agents SDK update. The pattern is OpenAI throwing several small punches right after a Google keynote. The splintering pattern of the cloud front carrying into the response too.

Anthropic is different. Anthropic doesn't ship loud announcements often. In early February, the claude-sonnet-5 identifier appeared on Vertex AI and disappeared. Claude Sonnet 5 is almost certainly being staged. Going by Anthropic's pattern, expect a quiet announcement between June and July, paired with strong coding and agent benchmark numbers. Opus 5 follows. The 1M context Anthropic unlocked on Opus 4.7 is a teaser for that next step.

Stack it all and the next two months look like this.

Next week (May 19~20). Google I/O. Gemini 3.2 Flash official, 3.2 Pro teaser, deeper Android OS integration, an Astra follow-up demo. OEM acceleration on the device front as the main track.

Late May to June. OpenAI watches the I/O reaction and counters with a GPT-5.5 variant or a Sora follow-up. The cloud-front splintering pattern, again.

Early June. Apple WWDC. The reconstructed Apple Intelligence shows up. How Apple frames the Gemini base will be the thing to watch.

Late June to July. Anthropic Claude Sonnet 5 quietly announced. The coding-and-agent well.

Summer to September. OpenAI Sweetpea ships. The biggest bet on the form-factor front gets its first verdict. Anthropic Opus 5 around the same window.

That's the picture I can draw five days before the keynote. In the next post I'll check how far the actual keynote validates this.

Closing

This is part one of a series. The two single-digit numbers misaligning in the intro — what does that contradiction look like at the company level? That's the subject of part two, publishing May 21: what breaks when a company tries to bring local LLMs in-house, and who can still rationally use them.

Part three lands after I/O. I'll check how far the picture above held up, and refine the outlook for the June~September window.

See you in five days. Until then, let's wait together to see where Google places its hand.
