I Put a Local LLM on a Company GPU. The ROI Math Got Stranger.

A few days after Part 1, I dropped the same Qwen3.5 122B onto a company GPU workstation.

Part 1 ended with my M4 laptop falling into single-digit tokens-per-second territory. The natural next question followed me around the office: what happens at company scale? The "lonely" place where my notebook ended up — does it look the same on a workstation with a real GPU plugged in? An NVIDIA RTX Pro 6000 Blackwell 96GB happened to be sitting on a test bench. I loaded the same model.

Speed jumped 5x to 7x. The 6–8 tokens/second I saw on the M4 became 35–50 tokens/second on the workstation. Chat replies stopped flickering letter-by-letter and started flowing as full sentences. One half of the contradiction from Part 1 — the speed half — was gone.

Then I added five teammates to the same box. Average response slid back to single digits.

The same phrase reappeared in a different dimension. In Part 1, single digits meant my laptop's raw inference rate. In Part 2, single digits meant five people queuing against one company GPU. Tokens per second collapsed under contention. If that were the only finding, "buy more GPUs" would be the answer. But the same week I walked an early ROI sheet over to accounting, and another single-digit problem showed up: once you decompose seat-level cost into hardware depreciation, electricity, headcount, and model evaluation, the local setup comes in more expensive than the cloud ZDR plan we were already using. And then security asked, "Do you have logs of which engineer fed which code to which model?" Ollama doesn't ship that by default.
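What security is asking for is an audit trail. Here is a minimal sketch of one, assuming a thin wrapper sits in front of the inference server. The names `audited_generate`, the `generate` callable, and the JSONL record fields are all hypothetical and are not part of Ollama or any other server. The sketch stores a hash of the prompt instead of the prompt itself, so the log does not become a second copy of the code it is tracking.

```python
import hashlib
import json
import time
from typing import Callable


def audited_generate(
    user: str,
    model: str,
    prompt: str,
    generate: Callable[[str, str], str],
    log_path: str = "llm_audit.jsonl",
) -> str:
    """Append one audit record, then call `generate(model, prompt)`.

    `generate` is a stand-in for whatever client your inference server
    uses; nothing here is a real server API.
    """
    record = {
        "ts": time.time(),  # request time, Unix seconds
        "user": user,       # which engineer
        "model": model,     # which model
        # SHA-256 of the prompt: identifies what was sent without storing it
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "prompt_chars": len(prompt),
    }
    # Write the record before the call so failed requests are logged too.
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return generate(model, prompt)
```

Whether a hash is enough, or whether security wants the full prompt in an access-controlled store, is a policy call this sketch leaves open.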

Enterprise adoption isn't one answer.

What Part 1 Called "Lonely" Looks Different at Company Scale

A one-line summary of Part 1: local LLMs caught up on benchmarks but feel lonely on a laptop. Single-digit tokens/sec, plus five other gaps — agent integration, NPU access, battery, thermals, OS integration.

Some of those gaps close on a company GPU. Speed closes. Battery and thermals stop mattering because you're on a desktop-class box. NPU underutilization stops being a story because a workstation-class GPU puts its Tensor Cores to full use. The "lonely" half of Part 1 partially dissolves at company scale.

But new gaps show up.

Concurrent users become a variable. Five people on one workstation collapse 35 tokens/sec back to single digits. Governance logs become another variable: the company needs to track which engineer fed which code to which model, and a vanilla inference server doesn't generate that log. Accounting becomes a third variable. The single KPI of "$0 API spend" fractures into separate lines: per-seat cost (itself four ledger lines), uptime, and autonomy against model deprecation. Add those up and you get the actual cost.
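The concurrent-user variable can be approximated with a deliberately naive model: the workstation's single-user throughput is split evenly across everyone generating at once. That is an assumption, not a measurement. Real inference servers batch requests, so actual per-user rates may come out better or worse. `per_user_tps` is a hypothetical helper.

```python
def per_user_tps(single_user_tps: float, concurrent_users: int) -> float:
    """Naive equal-share model: throughput divided by active users."""
    if concurrent_users < 1:
        raise ValueError("concurrent_users must be at least 1")
    return single_user_tps / concurrent_users


# The range measured above: 35-50 tokens/sec alone, five people sharing.
for tps in (35, 50):
    each = per_user_tps(tps, 5)
    print(f"{tps} tok/s alone -> {each:.0f} tok/s each with 5 users")
```

Under this model five simultaneous users land at 7 to 10 tokens/sec each, which is roughly where the observed average ended up.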

That's how Part 1's "lonely" stops being one answer. Enterprise adoption is a matrix of three axes — security and regulation (F1), cost accounting (F2), and per-role workflow (F3). ROI only lives where the three intersect, and that intersection moves company by company.

"Self-host everything" is wrong. "Cloud-only" is wrong. The right question is which cell of the matrix your company actually sits in.

F1 — Security and Regulation: Some Code Never Leaves the Building

Security has to be the first axis. Once it's decided, the other two fall into line.

What Security Teams Actually Fear

Let me clear up a common misread first. The thing your security team fears isn't "model training data leakage" — the scenario where your code gets folded into a model's training set and surfaces in someone else's autocomplete. ZDR (Zero Data Retention) has mostly closed that door.

The real fear is the inference context itself. When one file of your core codebase hits a cloud API, it has passed through someone else's memory even if it's never used for training. If the request gets flagged for policy review, it can be retained for up to two years. Anthropic's ZDR doc spells this out: "retention may apply where required by law or to address Usage Policy violations."

ZDR is not absolute. The normal path doesn't retain, but exception paths exist.

Where ZDR Stops Being Enough

Those exception paths are why tier-1 assets at banks, hospitals, law firms, and game studios can't fully rely on ZDR. The US OCC's cloud guidance and similar frameworks elsewhere ask companies to separately evaluate "data governance dependencies on third-party policy." Even with a ZDR agreement, the fact that the agreement depends on the vendor's policy is itself a risk line item.

The EU AI Act ratchets this pressure up. On August 2, 2026, Annex III obligations for high-risk AI systems go fully live. Data governance, technical documentation, human oversight design, and lifecycle monitoring all become mandatory.

One precision point worth flagging: on May 7, 2026, EU legislators agreed to push some provisions to December 2027. Biometrics, critical infrastructure, education, employment, and migration/asylum/border control are partially deferred. August 2026 isn't a clean cliff, but it isn't a free pass either. Your legal team has to first figure out which high-risk category your company falls under. Only then does the "cloud ZDR vs. self-host" call become tractable.

But Most Company Work Isn't Tier 1

Here's a trap I see often. A company gets spooked by its tier-1 assets and tries to move every workflow on-prem.

More than 80% of LLM-touched work inside a typical company isn't tier 1. Internal doc summaries, meeting notes, email drafts, README polish, simple refactoring. Pinning all of that to a local GPU breaks the other axes (F2 and F3). You're dragging workloads on-prem that cloud ZDR handles more sensibly.

The F1 conclusion is simple: tier-1 assets local, tier-3 and below on cloud ZDR. If you can't make that split, the whole LLM stack tilts one way and inefficiency piles up on the other side.
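That split can be written down as a routing rule. This is a minimal sketch, assuming your security team has already labeled each workload with a numeric tier. The `route` function and the backend names are hypothetical. The text above does not say where tier 2 goes, so the sketch refuses to guess and raises an error for it.

```python
def route(tier: int) -> str:
    """Map a workload's security tier to a backend, following the F1 split."""
    if tier == 1:
        return "local"      # tier-1 assets stay on the local GPU
    if tier >= 3:
        return "cloud_zdr"  # tier 3 and below go to cloud ZDR
    # Tier 2 (or an invalid tier) is a policy decision this sketch can't make.
    raise ValueError(f"no routing policy defined for tier {tier}")
```

Failing loudly on the undefined tier is a design choice: a silent default in either direction would hide exactly the decision legal and security still have to make.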

F2 — Cost Accounting: "$0 API Spend" on a Company Ledger

Cost is the second axis, and this is where individuals and companies diverge the most.

An individual's cost ledger is one line: the API invoice. End of month, one charge on the card. "Move to local, that line becomes zero" is honest individual math.

A company's ledger has four lines.

What Companies Actually Pay For

Line 1 — hardware depreciation. NVIDIA RTX Pro 6000 Blackwell 96GB lands at $8,000–$9,200 on the market in May 2026. MSRP at launch in March 2025 was $8,565; NVIDIA's marketplace lists it at $8,900 today. Pick $9,000 as a round, slightly conservative figure. Add the rest of the workstation (CPU, RAM, SSD, case, PSU) and the full station lands around $12,500. Amortized over 36 months: $347/month.

A second wrinkle worth a paragraph: the Mac Studio M3 Ultra 512GB was pulled from Apple's catalog in March 2026 because of the global DRAM shortage. It used to be $9,499. You can't really buy one anymore. The NVIDIA DGX Spark went the other way — a $700 price hike in February 2026, from $3,999 to $4,699, for the same memory-supply reason.

The first variable in enterprise adoption has quietly become "can you actually buy the hardware you want?" Your self-host roadmap is now exposed directly to supply chain weather.

Line 2 — electricity. RTX Pro 6000 has a 600W TDP. At 35% average utilization, 24 hours a day, 30 days a month, you're at about 151 kWh. At a US industrial rate of $0.15/kWh, that's $23/month. Korea is cheaper, the EU is more expensive. Either way, this line is small relative to the others.
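The electricity arithmetic, as a sketch. It assumes average power draw equals TDP times utilization, which ignores idle draw and the rest of the workstation. `monthly_electricity_cost` is a hypothetical helper.

```python
def monthly_electricity_cost(tdp_watts, utilization, rate_per_kwh, hours=24 * 30):
    """Return (kWh, cost) assuming average draw = TDP x utilization."""
    kwh = tdp_watts / 1000 * utilization * hours
    return kwh, kwh * rate_per_kwh


kwh, cost = monthly_electricity_cost(600, 0.35, 0.15)
print(f"{kwh:.1f} kWh, ${cost:.2f}/month")  # 151.2 kWh, $22.68/month
```

Swapping in a local tariff for `rate_per_kwh` is the only change needed to redo the line for another region.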

Line 3 — operational headcount. This is the trap. Assume an MLOps or DevOps engineer manages the workstation. Their fully loaded monthly cost is, say, $12,000. If they spend 0.1 FTE on the GPU box, that's $1,200/month. At 0.05 FTE, $600. At 0.2 FTE, $2,400.

This is where seat-level pricing wobbles. Depending on how accounting allocates headcount, this line alone swings 4x across that 0.05–0.2 FTE range, and the per-seat total roughly doubles with it.

Line 4 — model evaluation and update cost. Every quarter, a new model lands. Re-running an internal eval suite costs roughly $2,000 a quarter, or $667 a month. You can't skip this if you want governance.

Real Per-Seat Comparison

Roll those four lines together. Assume a 5-engineer team shares one workstation:

| Line | Monthly | Per seat (5 sharing) |
| --- | --- | --- |
| Hardware depreciation ($12,500 full station / 36 months) | $347 | $69 |
| Electricity (600W × 35% utilization) | $23 | $5 |
| Operational headcount (0.1 FTE × $12,000) | $1,200 | $240 |
| Model evaluation and updates | $667 | $133 |
| Total | $2,237 | $447 |

Drop the FTE allocation to 0.05 and per-seat falls to about $327. Push it to 0.2 and it climbs to $687. The headcount assumption is the dominant lever.
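The four lines and the FTE sensitivity can be reproduced with a small model. This is a sketch using the figures above as defaults. The function and parameter names are hypothetical, and every default is an assumption you should replace with your own ledger numbers.

```python
def per_seat_monthly(
    station_cost=12_500,      # full workstation, USD
    amortization_months=36,
    electricity=23,           # USD/month, from the electricity line
    fte_fraction=0.1,         # share of one engineer's time
    fte_monthly_cost=12_000,  # fully loaded, USD/month
    eval_per_quarter=2_000,   # internal eval suite re-run, USD
    seats=5,
):
    """Return (monthly total, per-seat cost) for one shared workstation."""
    total = (
        station_cost / amortization_months
        + electricity
        + fte_fraction * fte_monthly_cost
        + eval_per_quarter / 3
    )
    return total, total / seats


for fte in (0.05, 0.1, 0.2):
    total, seat = per_seat_monthly(fte_fraction=fte)
    print(f"FTE {fte:.2f}: ${total:,.0f}/month total, ${seat:,.0f}/seat")
```

Run as written, it prints per-seat figures of $327, $447, and $687, matching the numbers above. Changing `seats` shows how quickly the headcount line dilutes as more people share the box.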

Cloud bundles? Cursor Business sits at $40/seat/month. GitHub Copilot Enterprise is $39, plus $21 for GitHub Enterprise Cloud, so $60 total. Cursor Enterprise is custom-quoted, but 50-seat-plus deals land roughly in the $50–80 range. Direct API access, at GPT-5.5's $1.25/M input rate and roughly 12M input tokens per developer per month, costs about $15 in raw model spend. That figure excludes output tokens, and it comes before adding the SSO, audit logging, seat management, and onboarding automation that Cursor and Copilot bundle in.

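On unit cost alone: cloud bundles are $40–$80, local is $327–$687. Cloud is about 8x cheaper when you compare matching ends of the two ranges, and anywhere from roughly 4x to 17x across the extremes.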

So why bother with local at all?

Two Things Unit Cost Misses — Uptime and Autonomy

Two more lines belong on a company ledger.

Uptime. On April 20, 2026, OpenAI had a 90+ minute outage. ChatGPT, Codex, and the API Platform all went down simultaneously. The UK alone logged 8,700+ user reports. If your coding workflow is fully pinned to one cloud API, those 90 minutes are dead time. A 5-engineer team at $80/hour, fully blocked for 90 minutes, costs you $600. That single incident exceeds the same team's entire monthly cloud bill at $40–$80 a seat, so it belongs in the cost comparison.
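The outage figure is simple enough to keep as a helper for your own incident log. This sketch assumes the worst case, where every engineer is fully idle for the whole outage and has no fallback work. `outage_cost` is a hypothetical name.

```python
def outage_cost(engineers: int, hourly_rate: float, outage_minutes: float) -> float:
    """Cost of a full block: everyone idle for the entire outage."""
    return engineers * hourly_rate * outage_minutes / 60


print(outage_cost(5, 80, 90))  # 600.0
```

In practice people switch tasks during an outage, so treat the result as an upper bound.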

Autonomy. Model deprecation is the other variable. Your team spends a year tuning workflows around Claude Sonnet 4.5, and then Anthropic ships Sonnet 5 and announces 4.5 will phase out. Now part of your stack needs a rebuild. That rebuild cost is harder to budget than per-seat spend, and it shows up on whatever quarter the deprecation lands. Local models, by contrast, don't disappear as long as you hold the weights. You're paying for autonomy.

F2 conclusion: cloud wins on unit cost; hybrid wins once you price in uptime and autonomy.

F3 — Per-Role Workflow: Who Uses It Changes the Answer

The third axis is workflow. Even inside one company, the answer changes by role.

Coders — Only Models That Can Run 7 Hours Autonomously Matter

This is the territory Part 1 mapped. Anthropic published the Rakuten case where Claude Code ran autonomously for 7 hours, and Claude Opus 4.7 hit 78% on OSWorld. A person doesn't intervene; the model calls hundreds of tools and pulls a multi-step task through to the end.

Can a local Qwen3.5 122B do that? Five-minute tasks, sure. One-hour autonomous tasks, no — it loses thread. Context decays faster on open models than on the frontier closed ones. The "agent integration" gap from Part 1's five-item list hits coders hardest.

Coding workflows can't really let go of the cloud. Unless your codebase is tier-1, there's no reason to insist on local.

Writers and Researchers — Short, Frequent, Low Context Decay

This is where local LLMs shine.

Meeting notes, doc summaries, email drafts, press releases. Calls are short, context is small, the human is reviewing in real time. There's no need for the model to run autonomously for an hour.

And meeting notes are often packed with internal information. Strategy meetings, HR meetings, legal reviews. Even with cloud ZDR, the security tier may flat-out forbid external transit. If one local GPU workstation can cover your entire writing/research org, that's where ROI shows up the cleanest.

Data and Analytics — Context Length Is the Decider

For data and analytics, a different variable rules: how much data you have to feed in at once.

Meta's Llama 4 Scout shipped a 10M-token context window under an open weight license. That number actually matters here. Quarterly internal data, a year of logs, a full codebase index — workflows that need all of it in a single prompt don't fit in a 1M cloud context. Plus, this kind of data often can't legally leave the internal network anyway. Local is forced, not chosen. The workflow axis and the security axis converge.
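A rough sizing check makes the context argument concrete. This sketch assumes about 4 characters per token, a common rule of thumb for English text and not a tokenizer measurement. The corpus size and helper names are hypothetical.

```python
CHARS_PER_TOKEN = 4  # rough rule of thumb for English text; an assumption


def estimate_tokens(total_chars: int) -> int:
    """Estimate token count from character count."""
    return total_chars // CHARS_PER_TOKEN


def fits(total_chars: int, context_window_tokens: int) -> bool:
    """True if the estimated tokens fit in the given context window."""
    return estimate_tokens(total_chars) <= context_window_tokens


corpus_chars = 20_000_000  # hypothetical: ~20 MB of logs and reports
print(estimate_tokens(corpus_chars))   # 5000000
print(fits(corpus_chars, 1_000_000))   # False: too big for a 1M window
print(fits(corpus_chars, 10_000_000))  # True: fits in a 10M window
```

A real check should use the model's own tokenizer and leave headroom for the response, since logs and code often tokenize less efficiently than prose.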

Quick legal note: Llama 4's license is free for companies under 700M monthly active users (MAU), but above that it's a separate negotiation with Meta. Most Korean IT companies sit in the free band, but how "monthly active users" gets counted across products and affiliates leaves room for interpretation, so legal should look before you commit.

Designers and PMs — Multimodal Decides

The last role is the one leaning into multimodal. Design reviews, wireframe checks, visual polish on product spec decks.

This is where Part 1's lineup fragmentation reappears. Gemini 3.2 Flash bundles vision into the same fast/cheap tier, and OEM lineups are evolving fastest there. Drop a PNG into a chat and ask for hierarchy critique — that's a cloud-side workflow today.

Local multimodal is still lonely. Qwen3.5 VL exists, but workflow integration is thin. The plugins inside design tools mostly call cloud APIs. Designers and PMs can't really let go of the cloud either.

F3 conclusion: coders and designers stay cloud; writers, researchers, and analysts can go local. Different parts of the same company need different answers.

Where the Three Axes Meet — Four Adoption Patterns

With the three axes laid out, the practical adoption patterns boil down to four. Where a company lands depends on size, security tier, and role mix.

Pattern A — All Cloud (ZDR Bundle)

Who: under 50 people, few tier-1 assets. SaaS startups, marketing agencies, early-stage fintech.

Stack: Cursor Business + Claude Code, or GitHub Copilot Enterprise. Direct API only for occasional automation.

Per seat: $80–$100/month.

This is the simplest pattern. No reason to own infrastructure. When tier-1 assets actually appear, migrate to Pattern B or C.

Pattern B — Hybrid (Cloud-First, Local-Assist)

Who: 50–500 person mid-sized IT companies. Some tier-1 assets, but coding/design is the dominant workflow.

Stack: coding and design on cloud ZDR. Internal docs, meeting notes, quarterly reports on one or two local GPU workstations.

Per seat: cloud $50–$120 + infrastructure share $50–$150. Roughly $100–$270 total.

This is where most mid-market Korean (and frankly, most mid-market global) IT companies land. Carve out tier-1 separately, run the rest on cloud for efficiency.

Pattern C — Security-First Hybrid (Local-First, Cloud-Assist)

Who: banks, hospitals, game studios with revenue-critical proprietary code. Tier-1 assets are the core of revenue.

Stack: tier-1 code and analytics on a 4–8-node local GPU cluster. Tier-3-and-below general work on cloud ZDR. Dedicated MLOps team at 0.5–1.0 FTE.

Per seat: $200–$400/month including infrastructure allocation. MLOps headcount is the dominant lever.

The decision itself takes a quarter or two — legal, security, accounting, and engineering all have to sit at the table. ROI doesn't show up in a single quarter; it's an 18-to-24-month horizon.

Pattern D — Fully On-Premise

Who: government, defense, some hospital systems. Almost everything is tier-1.

Stack: 16+ GPU cluster, full internal IT team, external APIs blocked.

Per seat: $500+/month.

This pattern only delivers ROI above ~200 people. Below that, MLOps headcount divided by too few seats blows the math up. Pattern D isn't really "chosen because ROI works" — it's chosen because regulation forces it.

Reading Your Company Against the Four

Stand the four patterns next to each other and the question gets concrete: which cell is your company in? And the same company can run multiple patterns at once across roles — coders on Pattern B's cloud side, the doc-summary team on Pattern B's local side.

The trick is to drop the "local vs. cloud" binary at the start. Map out which cell your company sits in, and whether different roles inside the same company need different cells running in parallel. That's the actual answer.

Three Days After I/O — How Part 1's Predictions Feed Into Adoption Math

This post lands on May 21, 2026 — three days after the Google I/O keynote. Part 1's predictions are now partially verified, partially missed. Either way, the keynote feeds back into adoption math.

Gemini 3.2 Pro's official pricing now plugs directly into the math. When Part 1 went out, Gemini 3.2 Flash had leaked at $0.25/M input. I/O formalized the Pro tier. If the price floor keeps dropping, Pattern A's and Pattern B's cloud-side seat cost gets another round of downward pressure. Same per-seat dollars, more tokens per developer.

Android OS-level Gemini integration doesn't change enterprise math much — yet. Part 1 already flagged this: Android lives on the consumer-device side, while enterprise LLM ops mostly live on workstations and servers. That said, once employees start using Gemini on their phones, OEM-grade models connected to corporate SSO and MDM might enter Pattern B as a support tool.

Apple's WWDC is still ahead in early June. Companies with a heavy Mac footprint will get another shift then. If iOS 27 actually lets users pick external models, the option to delegate model choice to users becomes real. How that lands inside corporate governance is a Part 3 question.

I/O didn't flip the adoption pattern map. It mostly added more downward pressure on cloud unit cost. The real wild cards are still WWDC in June and OpenAI's Sweetpea in the back half of the year.

Closing

The three-axis matrix lands on one conclusion: there is no single-line answer.

Four checklist questions to map your company:

  1. Are tier-1 assets central to your business? If yes, Pattern C or D. If you have none, or only some, Pattern A or B.
  2. Can you actually run GPU infrastructure? Can you commit at least 0.1 FTE of MLOps/DevOps? If not, you stop at Pattern A.
  3. What's your role mix? Coding/design-heavy → Pattern A or B leaning cloud. Writing/research/analytics-heavy → Pattern B leaning local.
  4. How big are you? Under 50 → Pattern A. 50–500 → Pattern B. 500+ with tier-1 driving revenue → Pattern C. 200+ where almost everything is tier-1 → Pattern D.

Answer those four and a shape comes into focus. The per-seat number inside that shape then bends another way depending on how accounting allocates MLOps headcount.
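The checklist can be sketched as a decision function. This is one possible reading, not a definitive rule set: the `Tier1` levels, the order in which the checks run, and the name `suggest_pattern` are all assumptions introduced for illustration. Role mix (question 3) is left out because it decides how far a pattern leans toward cloud or local, not which pattern you are in.

```python
from enum import Enum


class Tier1(Enum):
    NONE = 0        # no tier-1 assets
    SOME = 1        # a few, not the core of the business
    CORE = 2        # tier-1 assets drive revenue
    ALMOST_ALL = 3  # nearly everything is tier-1


def suggest_pattern(headcount: int, tier1: Tier1, can_staff_gpu_ops: bool) -> str:
    """Return 'A', 'B', 'C', or 'D' from size, tier-1 exposure, and ops capacity."""
    if not can_staff_gpu_ops:
        return "A"  # question 2: without ~0.1 FTE of ops you stop at Pattern A
    if tier1 is Tier1.ALMOST_ALL and headcount >= 200:
        return "D"  # almost everything tier-1, 200+ people
    if tier1 is Tier1.CORE and headcount >= 500:
        return "C"  # tier-1 drives revenue, 500+ people
    if headcount < 50:
        return "A"
    return "B"      # the mid-sized default
```

For example, `suggest_pattern(120, Tier1.SOME, True)` returns `"B"`. Cases the checklist does not cover, such as a 300-person company where tier-1 drives revenue, fall through to B here, and that is precisely where the judgment call lives.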

The single-digit number I saw on a company GPU was a different variable than the single-digit number on my laptop in Part 1. Part 1 said: "benchmarks caught up, real use is lonely." Part 2 says: "at company scale, the lonely places and the rational places split."

Part 3 will look at how the I/O keynote actually verified (or didn't) Part 1's predictions, and refine the June–September outlook. Publishing between May 22 and 25.

Which cell your company sits in — that's a question your company has to answer. The next post sketches out how to refine that answer together.
