Five Days of Living with a Local LLM on an M4 MacBook

Last week I wrote Five days before Google I/O, the AI front has split into three, and at the end I dropped one line: "Local LLMs are lonely." That line stayed in my head all week. It read too clean. A snapshot of five limits laid out at one moment in time, then a verdict. So I decided to spend five days actually living inside that snapshot…

May 11 Monday morning to May 15 Friday evening. Exactly five days.

The rule was simple. Run the main workflow on a local LLM. The Anthropic and OpenAI APIs stay on, but every time my hand reaches for one, redirect it to Ollama instead. Log every escape. Then at the end of the five days, look at what hardened and what scattered.

By Friday evening, two things had happened at once. My usual five-day API bill of around $45 stopped at $14. That was the intended result. But every single day I escaped to the cloud once or twice, sometimes three times. The interesting part wasn't that I escaped. It was that the escapes weren't acts of weak will. They had reasons, and the reasons were always the same shape. That pattern is the real finding of this post. "Live only on local" isn't a single rule. It's a branching point that lands differently for different people.

Here is what happened in those five days.

A five-day diary

Before the analysis, I want to lay out how the five days actually moved. The abstract limits were already laid out in part one. This section is where Tuesday afternoon at 3pm and Wednesday at 1am get written down.

Day 1 (Mon): setup and the first single digit

Monday morning I redid the setup. The Qwen3.5 122B from part one, but this time deliberately on two backends — Ollama once, MLX-LM once — so I could compare the same model on the same machine.

In Ollama, qwen3.5:122b-a10b-q4_K_M served short answers at 6~8 tokens per second, long answers (2,000+ tokens out) at 5~6 tok/s. Time to first token (TTFT) was about 3 seconds at a 4K-token context, stretching to 6 seconds at 12K. The model file itself is 70GB. Memory footprint about 55GB. With 64GB unified memory, that leaves 8~9GB for the rest of the system. I closed Chrome.

Running the same model in MLX-LM, tokens per second jumped to 12~14. Almost double. The MLX camp's benchmarks report up to a 3x gap on MoE models, but in my setup (M4 Max 64GB, context under 12K) it was just over 2x. Same model, same quantization, only the backend changed, and it doubled. An interesting place to sit.

And still the Apple NPU goes unused. The M4 Max ships with a 38-TOPS NPU on the die. The MLX-LM speedup comes from Metal driving the GPU, not from the NPU. NPU access lives behind Core ML, and the open camp doesn't get in. The part-one image of "a second engine in the trunk that never gets plugged in" stayed accurate every single day of the five.

Monday evening note: "A single digit doubled is still a single digit." 6 tok/s in Ollama became 14 tok/s in MLX. Compared to the 100+ tok/s a GPT-5.5 API serves, that's still a 7x gap. But I learned over the five days that this 7x gap registers very differently depending on the type of work. On a one-line answer, you barely feel it; the system-level first-token latency is similar on both ends. On long answers the gap compounds. So "tokens per second" as a single number ends up being decisive in some places and almost invisible in others.

Day 2 (Tue): 60 lines of code, 4 minutes

Tuesday around 3pm. I asked the model to refactor a single file end-to-end. A 60-line React component, moving state from useState to useReducer.

It took 5 seconds before the answer began. Then the tokens fell out one at a time. The whole 60 lines took 4 minutes 12 seconds. What did I do during those 4 minutes? I switched tabs. Answered two Slack messages. Wrote down the next task. When the answer finished and I came back, the flow was gone. I had to look at the code and ask myself "why did I want this?" before I could even evaluate the output.

This is the five-day version of "inference is slow" from part one. From a one-time snapshot it reads as a single line: slow. From inside five days it shows up differently. The slowness isn't the problem. The slowness breaks the flow, and the broken flow drains the context out of your head. When the answer arrives, the cognitive resources to evaluate it have already moved somewhere else.

There were two ways out. One was "don't do anything else for those 4 minutes." That turns 4 minutes into truly empty time. You either accept it as loss or build a workflow where you have a short book on the desk to read in those gaps. The other way: stop asking for 60-line refactors. Ask for 20 lines three times. Then even at single-digit tok/s, one cycle wraps inside a minute and the flow survives. The cost is the mental overhead of pre-chunking the 60 lines into three pieces yourself.

By Tuesday evening I had locked in the second option. "Chunk the unit of work smaller" was the fastest habit to settle on my body during the five days.

Day 3 (Wed): the 12K-token wall

Wednesday at 1am. That night I gave the local model a Korean essay to polish — roughly 4,000 characters, with an English translation draft on the side. I pasted the whole essay into Qwen3.5 and ran "polish this sentence's rhythm" iteratively.

The first five passes were sharp. Around the sixth, the answers started to wobble. It recommended the same sentence twice. Once it cited a word as if I had written it, when I hadn't.

When I checked, accumulated context had crept near 12K tokens. Ollama's docs note that the default num_ctx is 2,048 tokens — anything beyond that gets silently dropped unless you set it explicitly. I had it set to 32K, but even inside that window, accumulated context causes attention quality to fall off due to RoPE scaling. The 262K context window on the model card doesn't survive the workflow intact.

The fix was simple. I switched to Gemma 3 27B Q4 (17GB on disk, about 6 tok/s). Smaller model, same 12K context lands less heavily. Near the end of the session I switched back to Qwen. Mid-task model swapping inside a single session. Not something I do with cloud APIs — GPT-5.5 just rides through a session end-to-end.

The piece finished at 1:38am. The swap cost maybe 10 minutes. If I had run the same work on the Sonnet 4.6 API, that hour would have cost roughly $0.40 more. So I came out ahead on cost and behind on clock. That's when I started to evaluate time and money on separate axes.

Day 4 (Thu): the first violation

Thursday morning, 9:47am. A client meeting at 10am. I needed to polish one email fast before the call.

Qwen3.5 takes 4 minutes for that. The Opus 4.7 API takes 8 seconds.

With a hard 10am deadline, I went to Opus without much hesitation. First violation of the "local only" rule. The interesting thing wasn't the violation itself. It was the shape of it. Hard time deadline. Someone else waiting on the other side. Short, fast answer needed. Three conditions stacked, and the moment they overlapped, I escaped.

That happened three times during Thursday. The pre-meeting email. A code-review question a coworker pinged me about in Slack and wanted a fast answer to. And one KakaoTalk message to my mother before bed (a little embarrassing, but I'll log it honestly).

Thursday evening note: "The escape pattern isn't 'quality.' It's 'time.'" I didn't run away because the local answer was bad. I ran away because the time it took to arrive collided with someone else's clock. That's when I first sensed this would differ between people. If your daily life has almost no "pre-meeting answer" moments, the escapes don't even happen.

Day 5 (Fri): the shadow of Sonnet 5.1

Friday afternoon. Scrolling X, someone tweeted that the identifier claude-sonnet-5-1 had appeared on Vertex AI for about an hour and then disappeared. The follow-up to the Sonnet 5 thing I'd flagged at the end of part one.

Seeing that tweet, one sentence settled in my head and it might be the sharpest one of the five days. "My local is always one quarter behind."

The Qwen3.5 I had installed for these five days is a February 2026 model. In May, the cloud camp is staging Sonnet 5.1, and the Anthropic pattern from part one (a quiet June-July announcement) is holding. Opus 5 follows somewhere after that. The model on my laptop tracks that flow at a one-quarter lag. If the lag shortens, it becomes half a quarter. If the open camp slows, it stretches to two quarters.

This lag isn't pure downside. It's also stability. You spend a quarter on a model that's already been tested in the field. But on Friday I became newly conscious of the fact that coding benchmarks gain 4~6 SWE-bench points per quarter on the frontier. One quarter behind translates to roughly 6~10% quality gap on coding work.

Friday evening, the five days ended. One note: "The escape pattern has both 'quality' and 'time' axes. The weight differs by person." And a second one: "There's a point where the one-quarter lag turns into the name 'stability.' That point doesn't sit at the same spot for everyone."

Of the five from part one, three softened and two held

I'll bring back the five limits laid out in part one's device section: inference speed, agent integration, NPU unused, battery and heat, OS integration absent. At the end of five days, three of those softened into "limits you get used to" and two held as "limits that don't soften."

The three that softened: speed, battery, heat

Inference speed softened fastest. After the Tuesday 4-minute incident, the habit of "chunk the work small" landed on me by Wednesday. 20 lines three times, not 60 lines once. At that grain, single-digit tok/s isn't decisive. One cycle wraps in under a minute, and human cognition stays in the loop. At a 4-minute cycle, cognition can't. Same single digit, different work design, completely different experience.

Battery and heat soften the moment you constrain the workflow to the desk. Five days, I never even tried to run a local LLM on a laptop in a café. Didn't occur to me to try. The machine sat on the desk plugged in. CPU at 85~92°C, GPU at 80°C, fans pinned at 6,000 RPM. On the desk it's bearable. I had effectively pre-amputated the "café workflow."

The pattern for getting used to all three is the same. You don't actually remove the limit. You shrink the workflow until the limit doesn't reach. Small units of work. Desk only. Power outlet only. Inside the shrunken zone, the local model does enough.

The two that held: long-horizon, OS integration

Agent long-horizon work didn't soften at all. The Rakuten 7-hour autonomous run territory from part one — I never entered it once in five days. Aider connected to Qwen3.5 handled 5-minute one-shot tasks fine. Past 30 minutes context decay started, and one session burned an extra hour of debugging because the model renamed a variable twice. The model forgets an identifier it picked an hour ago.

This isn't a limit you fix by changing workflow shape. Chunk small and you handle inference speed. But long-horizon work is, by definition, work that has to run long to mean anything. The moment you chunk it, it becomes different work. That's exactly why Anthropic emphasizes seven-hour autonomous sessions. Memory management, intermediate verification, the model reflecting on its own work — all of that has to happen inside a single session. Open models lose context before they reach that territory.

OS integration absent stayed absent. Qwen3.5 can't sit in the always-listening background slot like "Hey Siri." I wanted to write a sentence and just say "polish this," but every single time I opened the Open WebUI tab and typed a prompt. In five days, that gesture never smoothed out.

These two limits won't soften in five months. Probably not in a year either. They'd require one of two things: users being able to load their own open-weight models onto the NPU, or OEMs accepting user-supplied models and distilling them to fit their silicon as a standard. Both need industry alignment, not a single company's decision. Apple opening more of the NPU API. Google adding a user-model slot to Tensor. Those decisions aren't landing next quarter. The "seam where the two branches don't meet" from part one stayed unclosed inside the five days too.

The 60/40 split in workflow

By the end of five days, it was visible which slots in my main workflow had hardened around local and which had stayed in the cloud. Roughly 60/40, on some days 70/30. The shape of that split is the real finding.

What hardened on local — 60~70%

Single-shot code review. A function, a component, a SQL query somewhere between 10 and 50 lines. Qwen3.5 did enough at this grain. Single-digit tok/s wasn't a problem because each cycle wraps in 30 seconds to a minute.

Short writing polish. A paragraph, a tweet, a Slack message. An English email draft. Local hardened here. The output quality was nearly indistinguishable from the cloud.

Regex and quick SQL generation. Surprisingly, local was strong here. A one-line answer drops in a second. Single-digit tok/s doesn't move the experience on a one-line answer.

English translation drafts. I ran one English blog draft through Qwen3.5 during the week. A literal pass, then an idiomatic polish. The result was about the same as the cloud API would have produced. It took 8 more minutes, but in those 8 minutes I made tea.

The shared signature of these slots is that each cycle is short and no one is waiting on the other end. When both conditions hold, local already does enough.

What stayed in the cloud — 30~40%

30-minute-plus autonomous coding. The Aider + Claude Code combo where an hour of self-driving work used to be routine. Not once in five days could I hand this to local. Past an hour, the model forgets what it did.

Pre-meeting fast response. The Thursday-morning first violation lives here. A hard deadline, a short answer needed quickly. The escape was the rational move. Opus 4.7 answers in 8 seconds.

Latest-library API questions. Qwen3.5's training data ends somewhere. Next.js 16 or React 19's newest patterns — the cloud models are fresher. Two escapes during the week for this.

Voice input. Walking outside, dictating notes from the car. I just use the ChatGPT app's voice mode. Wiring voice into a local LLM requires OS integration, which is one of the two that don't soften.

The shared signature of these slots is a hard time deadline, a context that has to be long by definition, or a need to enter at the OS level. The two unsolved limits from part one live precisely in this region.

The cost picture: $45 to $14, but laptop electricity also costs

Typical five-day API bill: about $45. Mostly Sonnet 4.6 calls, occasional Opus 4.7.

Five-day local-experiment API bill: about $14. Only the escapes remain.

On the surface, $31 saved. But two costs got moved into the local side instead. Electricity from running the laptop at full bore for five days, and the wear on the machine from sustained heat. Korean electricity rates would put the five-day full-load cost at roughly $3~5. The heat wear doesn't show up as a number. It shows up two years later in the resale price.

Call the real saving $25. Not a small number for five days. A month is $100~150, a year is $1,200~1,800. On a personal scale, real money. The point isn't that the cost went to zero. It's that the cost changed shape.

At the company scale, the picture complicates. Ten people running local LLMs in one office means ten laptops at 6,000 RPM. The fan noise one person tolerates at their desk becomes audible from the next conference room when there are ten of them. Anthropic's base pricing through Bedrock and Vertex matches the direct API — regional endpoints add a 10% premium, batch API gives a 50% discount. What local saves in cloud bills, the company gives back in heat, noise, machine management, and onboarding overhead. The cost doesn't disappear. It changes shape.

"Local only" is a branching point, not a rule

The line I landed on at the end of five days: some people can already live on local, and some need to wait another quarter.

People who can already live there

Solo side-project work. Someone setting their own pace. Single-shot tasks dominant. Desk-centered environment. English and Korean text work as the main feed, with almost no "pre-meeting answer" moments in daily life.

This profile can live not just for five days, but for five months. 60~70% of the workflow on local means only 30~40% in the cloud, and that 30~40% is likely to shrink over time as open models catch up each quarter.

People who should wait another quarter

Team codebase work. A schedule that interleaves with other people's clocks. 30-minute-plus autonomous sessions routine. Mobile or in-car workflows that matter. Voice and image multimodal use. Frequent pre-meeting answers.

For this profile, waiting one more quarter — maybe two — is the rational call. The two unsoftened limits (long-horizon, OS integration) live exactly in this profile's core work. Better to watch where the acceleration comes from before committing.

If a company is considering it: three slots to answer

At the company scale, the $23-billion contradiction from part one comes back from a different angle. As OpenAI runs through splintering and devices to bury its deficit, every company has to decide how to disperse its cloud cost. Local LLMs are one option for that dispersion. Not the answer for everyone.

There are three slots a company has to fill before adopting.

The first is the average horizon of the work. How long, on average, is one task someone hands to the model? If the average finishes inside 5 minutes, local can take 80% of the slots. If the average crosses 30 minutes, it tops out around 60%. If it's an hour, local becomes the supporting role. Starting adoption without knowing this number leads to a months-later retro that says "I don't understand why everyone keeps escaping to the cloud."

The second is the weight of off-desk workflows. How much of the work happens mobile or in the car? A company heavy in sales or field work has to look at OEM on-device (Apple Intelligence, Gemini Nano) alongside any local plan. Local LLMs don't enter that slot. When the laptop is closed, the model isn't working.

The third is how much model control is actually required. In regulated industries (finance, healthcare, government) where the weights have to live inside the company, local sits close to the answer. Otherwise, attaching 90% prompt caching and 50% batch discount to cloud APIs probably maps more cleanly onto accounting. Anthropic being available at the same base price on both Bedrock and Vertex backs up the simplicity. A company already deep in AWS goes through Bedrock; deep in GCP, Vertex. Procurement gets shorter.

Once these three are answered, neither "the company runs on local" nor "the company runs on cloud" comes out as a single sentence. The answer comes out per-person, per-slot. That's the same reason the two branches from part one (local LLMs and OEM on-device) keep running apart, and at the company level, running both setups side by side will stay the practical norm for a while.

What the I/O keynote answered, and what it didn't

I'm writing this the day after the Google I/O keynote wrapped. May 19 was the main stage, May 20 was the developer track. Five days of local living, then the keynote. Full analysis goes into part three; here just two notes.

The part-one prediction that "Gemini becomes the operating layer of Android" landed deeper than I expected. Gemini 3.2 Pro got the official announcement. Android 17 pulls Gemini further into the system layer. OEM acceleration ran ahead of schedule. The things I tried to do with Qwen3.5 on a laptop for five days (polishing prose, code review, English drafts), OEMs are doing more smoothly on the phone with Gemini Nano. Not one quarter behind. Maybe one quarter ahead.

What the keynote didn't answer: can a user load their own open-weight model onto the NPU? Apple stayed silent on this. Google shipped only its own Nano. The keynote didn't address the seam where the two branches would meet. That tells me the two branches keep running apart next quarter, probably across the whole June-September window. The part-one device-front contradiction is still alive.

Part three goes into the full keynote analysis and the June-September outlook. Publishing late May.

Closing

Part one's intro talked about "the two single digits misaligning on my M4." This whole post is what those two single digits looked like across five days of living.

The cost did fall to single digits. $45 became $14. That was the intended result and it held on day five. Speed stayed in single digits. Tokens per second sat between 7 and 14 for the whole week. That held too. But these two misaligned single digits stopped pulling toward a single verdict. The part-one line "local is lonely" turned, inside five days, into "local is already enough for some people and a quarter behind for others." A branching point, not a verdict.

The reason it branches is simple. Time deadlines. Task horizon. Mobile use. Multimodal need. Four axes, and each person sits on them with different weights. Same laptop, same model, same five days, and one person walks out satisfied at the 60% mark while another grinds at the 30% mark. "Local only" can't be a single rule for that reason.

Part three is the full I/O keynote analysis and the June-September outlook. Whether the part-one hypotheses held up, whether the seam between the two branches becomes visible next quarter.

The notes from five days of escapes are still sitting on my desk. That's the most honest part of this post.

개발자는 코드를 쓰는 사람이 아니다 — AI 시대에 남는 자리는 '책임'에 있다

- 4월 17, 2026

자세한 내용 보기

이 블로그 검색

goldtag