Building an LLM Robot with My Son — EP 3. Local LLM Speed Compared Across Mac M1, M4, and M4 Pro

The first time I ran a local LLM on the Mac mini M1, I watched Qwen2.5-7B output tokens one character at a time and paused for a second.

About 8 tokens per second. Not slow, exactly. But whether that's fast enough for real-time robot control is a different question — how long does it take from the robot sending a camera frame to receiving a command back? That needed a measurement, not a guess.

I had three Macs already: Mac mini M1 16GB, Mac mini M4 24GB, MacBook Pro M4 Pro 14" 24GB. Same prompt, same model, three machines. The comparison made itself.


Test Setup

Model: Qwen2.5-7B-Instruct, Q4_K_M quantization. Measured separately on mlx-lm and on the llama.cpp Metal backend.

Metrics:
- tok/s: tokens generated per second
- TTFT: Time to First Token
- Memory usage: at 32K and 128K context
- Thermals: CPU/GPU temperature after 5 minutes of sustained load

Three prompt types: short code generation (Arduino, ~50 lines), medium analysis (sensor data interpretation), long summarization (partial ROS2 documentation). Ten measurements each, averaged.
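
For reference, here is a minimal sketch of one way to collect these numbers against the llama.cpp server's streaming /completion endpoint. The URL, launch command, and field names are assumptions for a local setup, not the exact script I ran; the mlx-lm side can be timed the same way around its generate call.

```python
# Minimal TTFT / tok-per-sec probe against a running llama.cpp server.
# Assumes something like:  llama-server -m qwen2.5-7b-instruct-q4_k_m.gguf --port 8080
# Streaming field names follow llama.cpp's /completion SSE format; verify on your build.
import json
import time
import requests

URL = "http://localhost:8080/completion"  # assumption: llama-server on this machine

def measure(prompt: str, n_predict: int = 256):
    payload = {"prompt": prompt, "n_predict": n_predict, "stream": True}
    start = time.perf_counter()
    ttft = None
    n_tokens = 0
    with requests.post(URL, json=payload, stream=True) as resp:
        for line in resp.iter_lines():
            if not line.startswith(b"data: "):
                continue  # skip blank keep-alive lines
            chunk = json.loads(line[len(b"data: "):])
            if ttft is None:
                ttft = time.perf_counter() - start  # time to first token
            n_tokens += 1
            if chunk.get("stop"):
                break
    gen_time = time.perf_counter() - start - ttft
    return ttft, n_tokens / gen_time  # TTFT in seconds, generation tok/s
```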


Results

| Device | Memory bandwidth | tok/s (mlx) | tok/s (llama.cpp) | TTFT (32K) |
|---|---|---|---|---|
| Mac mini M1 16GB | 68.25 GB/s | 31 | 28 | 1.2s |
| Mac mini M4 24GB | 120 GB/s | 58 | 52 | 0.7s |
| MacBook Pro M4 Pro 24GB | 273 GB/s | 112 | 98 | 0.4s |

M4 Pro is almost 2x faster than M4. Compared to M1, 3.5x.

Memory bandwidth directly drives token generation speed at this scale. The M4 and M4 Pro both have 24GB of RAM, but bandwidth is 120 GB/s vs 273 GB/s — more than double. The tok/s gap roughly tracks that ratio.
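
A quick sanity check, using only the numbers from the table above:

```python
# Bandwidth ratio vs. measured speedup (mlx tok/s at 32K, from the table above).
bandwidth = {"M1": 68.25, "M4": 120, "M4 Pro": 273}   # GB/s
tok_s     = {"M1": 31,    "M4": 58,  "M4 Pro": 112}   # mlx-lm

for a, b in [("M4", "M4 Pro"), ("M1", "M4 Pro"), ("M1", "M4")]:
    bw_ratio  = bandwidth[b] / bandwidth[a]
    gen_ratio = tok_s[b] / tok_s[a]
    print(f"{a} -> {b}: bandwidth x{bw_ratio:.2f}, tok/s x{gen_ratio:.2f}")

# Approximate output:
# M4 -> M4 Pro: bandwidth x2.27, tok/s x1.93
# M1 -> M4 Pro: bandwidth x4.00, tok/s x3.61
# M1 -> M4:     bandwidth x1.76, tok/s x1.87
```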

mlx-lm running 10-15% faster than llama.cpp Metal was expected. mlx is Apple Silicon-native, which gives it an edge over llama.cpp's Metal backend.


What Changes at 128K Context

At 32K context, all three machines run fine. At 128K, the picture changes.

| Device | 128K context | tok/s (128K) | Memory used |
|---|---|---|---|
| Mac mini M1 16GB | Not possible (OOM) | - | - |
| Mac mini M4 24GB | Possible (slows down) | 21 | 22.4 GB |
| MacBook Pro M4 Pro 24GB | Possible (stable) | 47 | 22.4 GB |

M1 with 16GB couldn't run 128K context at all — out of memory. M4 with 24GB scraped through but tok/s dropped more than 60% from the 32K rate. M4 Pro with 24GB — same memory size — held 47 tok/s at 128K.

Higher bandwidth doesn't just mean faster generation; it means longer context degrades more gracefully. The gap between M4 and M4 Pro was 2x at 32K. At 128K it widened to 2.2x.


How Fast Does Robot Control Actually Need to Be?

That's the real question. Faster is better, obviously — but what's actually required for robot navigation?

Our scenario: robot sends a camera frame (640×480 JPEG) and ultrasonic sensor reading to the LLM server. LLM responds with one of "forward," "left," "right," "stop." Command returns to robot.

Expected token count per command: 10–20. "Forward" is 6 tokens.

At M1's 31 tok/s, generating 6 tokens takes under 0.2 seconds. Add 1.2s TTFT, and the first command arrives in about 1.4 seconds. Subsequent commands stream faster.

Honestly, M1 is sufficient for this use case. 1.4 seconds isn't a problem when the robot moves slowly and reaction time isn't the bottleneck. That changes when we add vision LLM (multimodal adds overhead) or extend the context window — but for now, even M1 can handle it.
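
To make that budget concrete, here is an illustrative sketch of the robot-side request. Everything in it is a placeholder: the endpoint and field names assume a llama.cpp server on the LAN, and the camera frame is left out until the vision model is wired in.

```python
# Illustrative robot-side loop: send a sensor reading, get back one command.
# Assumes a llama.cpp server on the LAN; camera frame omitted (text-only model for now).
import requests

SERVER = "http://192.168.0.10:8080/completion"   # placeholder: Mac running llama-server
COMMANDS = {"forward", "left", "right", "stop"}

def next_command(distance_cm: float) -> str:
    prompt = (
        "You control a small wheeled robot.\n"
        f"Ultrasonic distance ahead: {distance_cm:.0f} cm.\n"
        "Reply with exactly one word: forward, left, right, or stop.\n"
        "Command:"
    )
    resp = requests.post(SERVER, json={
        "prompt": prompt,
        "n_predict": 8,       # a command is only a handful of tokens
        "temperature": 0.0,   # deterministic, parseable answer
    }, timeout=5)
    text = resp.json().get("content", "").strip().lower()
    word = text.split()[0] if text else ""
    return word if word in COMMANDS else "stop"   # fail safe: anything unparseable means stop

# e.g. next_command(35.0) might return "forward"; unknown output falls back to "stop".
```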

Current setup: M4 Pro as the main server, M1 kept running for benchmark comparison, M4 mini still experimental.


mlx-lm vs llama.cpp — Practical Differences

Running both frameworks in real use, there are meaningful differences beyond the speed numbers.

mlx-lm is faster and the Python API is clean. But model support is narrower than llama.cpp's: a new model might land in llama.cpp immediately, while the MLX-converted version can take days to a week to show up.
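
For reference, this is roughly what that API looks like; the model id and argument names are from memory and may differ slightly between mlx-lm versions.

```python
# Sketch of the mlx-lm Python API (argument names may vary by version).
from mlx_lm import load, generate

# 4-bit MLX conversion of the same model; downloads from Hugging Face on first run.
model, tokenizer = load("mlx-community/Qwen2.5-7B-Instruct-4bit")

text = generate(
    model,
    tokenizer,
    prompt="Write a short Arduino sketch that reads an ultrasonic distance sensor.",
    max_tokens=256,
)
print(text)
```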

llama.cpp supports an enormous model range — nearly any GGUF works. Slightly slower, but the built-in HTTP server (llama-server) is rock-solid. For remote calls from the robot, its /completion endpoint is all that's needed.

I'm using llama.cpp server mode as the main setup. We're in an experimental phase with fast model turnover, so breadth of support matters more than the 10-15% speed difference.


My Son's Take

He was watching while I ran the benchmarks.

After staring at the terminal output for a while: "What are those numbers?"

I explained tok/s. "How many words it makes per second."

"So M4 Pro is smarter?"

No. Same brain, different speed. Like the difference between writing by hand and typing — the thinking is the same, but one comes out faster.

He thought about it. "So the brain's the same but the hands are faster."

That's exactly right. He got it.


What Changed After Measuring

Before the test: "M4 Pro is faster, obviously." After: the specific bottleneck became clear.

Memory bandwidth is the constraint. For local LLM inference, how fast the system can read weights from memory matters more than GPU core count. Single-stream decoding is memory-bandwidth-bound rather than compute-bound: every generated token requires streaming the model weights out of memory, so moving the data, not the math, is the bottleneck. Apple Silicon's unified memory architecture, where CPU and GPU share the same pool, turns out to be well-suited for this.

The 3.5x difference between M1 and M4 Pro in tok/s isn't just "newer chip." It's memory bandwidth going from 68 GB/s to 273 GB/s — a 4x increase.

If you're setting up a local LLM server: look at memory bandwidth before memory capacity. Two machines both at 24GB can perform very differently if bandwidth differs by 2x.
