Building an LLM Robot with My Son — EP 4. Choosing the Right Local LLM for Robot Control

We needed to pick a model.

Connecting a local LLM to the robot means committing to a specific open-source model. If we were using a cloud API, this decision would be trivial — just call GPT-4o or Claude. But our architecture runs a local LLM server on the home LAN. We had to test and decide ourselves.

I set three evaluation criteria.

Tool use — to send structured commands like "forward" or "stop," the model needs to reliably call JSON functions. If it sometimes returns proper JSON and sometimes writes prose explanations, parsing fails. Consistency matters more than peak performance.

Korean language — my son gives instructions in Korean, and I want to read debug output in Korean. A model that drifts into English mid-response is just harder to use.

Vision — we don't need it now, but we'll need camera frame input later. If the model has a vision variant in the same family, we can extend without migrating the whole setup.

Four models on the test bench: Qwen2.5-7B-Instruct, Llama 3.1 8B Instruct, Phi-3.5 Mini Instruct, Gemma 2 9B Instruct.


Tool Use Test

Same function definition given to each model:

{
  "name": "robot_command",
  "description": "Send a movement command to the robot",
  "parameters": {
    "type": "object",
    "properties": {
      "action": {
        "type": "string",
        "enum": ["forward", "backward", "left", "right", "stop"]
      },
      "speed": {
        "type": "integer",
        "minimum": 0,
        "maximum": 200
      }
    },
    "required": ["action", "speed"]
  }
}

Prompt: "There's an obstacle 20cm in front of the robot. What should it do?"

Qwen2.5-7B: Returned tool call JSON correctly. {"action": "stop", "speed": 0}. No extra explanation — straight function call format.

Llama 3.1 8B: Tool use works, but wraps the call in <tool_call> tags. Parseable, but requires an extra unwrapping step compared to Qwen.

Phi-3.5 Mini: Responded in prose: "If there's an obstacle, you should stop. Call robot_command(action='stop', speed=0)." We asked for a JSON tool call and got a text explanation instead. Technically parseable, but unreliable.

Gemma 2 9B: Most inconsistent. Same prompt, three tries: once returned JSON, once returned text explanation, once returned Python-style code. No consistency.
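Those response formats are exactly why the harness needs a tolerant unwrapping step before any command reaches the robot. A rough sketch of what that parser could look like, assuming responses arrive as raw text; the tag and regex handling is illustrative, not the final harness code:

import json
import re

VALID_ACTIONS = {"forward", "backward", "left", "right", "stop"}

def extract_tool_call(text: str) -> dict | None:
    """Recover a robot_command call from whatever format the model happened to return."""
    # Llama-style responses wrap the JSON in <tool_call> ... </tool_call> tags.
    tagged = re.search(r"<tool_call>(.*?)</tool_call>", text, re.DOTALL)
    if tagged:
        text = tagged.group(1)
    # Qwen-style responses are bare JSON; grab the outermost braces.
    candidate = re.search(r"\{.*\}", text, re.DOTALL)
    if not candidate:
        return None  # prose-only answers (the Phi/Gemma failure mode) are rejected, not guessed at
    try:
        call = json.loads(candidate.group(0))
    except json.JSONDecodeError:
        return None
    if call.get("action") not in VALID_ACTIONS:
        return None
    return call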


Korean Language Test

My son ran this part. He typed the same question in Korean to each model:

"An IR sensor detected an obstacle 20cm in front of the robot. What should I tell it to do?"

Qwen2.5-7B: Answered naturally in Korean. "Since an obstacle was detected at 20cm, send a stop command to the robot." Tool calls also worked correctly from Korean instructions.

Llama 3.1 8B: Understood Korean input but replied in English. Has Korean capability but falls back to English for responses.

Phi-3.5 Mini: Mixed Korean and English mid-response. "Since an obstacle was detected, you need to issue a stop command." Technically understandable, but broken.

Gemma 2 9B: Korean response quality comparable to Qwen. But tool use inconsistency eliminates it regardless.

He looked at the Qwen response and said: "This one understands the best." The other models replying in English or mixing languages made them harder to read. By his criteria, Qwen ranked first too.


Vision Model Check

We don't need vision yet, but it's coming. I looked at what's available in each family.

Qwen2.5-VL exists — multimodal variant of the Qwen series. 7B parameters with image input. Fits comfortably in 24GB at Q4_K_M.

Llama 3.2 Vision at 11B is tight in 24GB. Works at Q4 on M4 24GB but gets noticeably slow. Fine on M4 Pro 24GB.

Phi-3.5-Vision-Instruct exists at 4.2B — small and fast, but vision quality falls well below Qwen2.5-VL. Simple object recognition works, but scene understanding is weak.

Current plan: stay with the text model for now and switch to Qwen2.5-VL when camera input arrives, from EP 6 onward. Same family means harness changes stay minimal.
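To give a sense of how small that change should be, here's a sketch of the same chat request with a camera frame attached, using the OpenAI-compatible base64 image format that most local servers accept for vision models. This is an assumption about the future setup; the endpoint, model name, and file path are placeholders.

import base64
import requests

LLM_URL = "http://192.168.0.10:8080/v1/chat/completions"  # same placeholder endpoint as before

def ask_with_frame(prompt: str, jpeg_path: str) -> dict:
    """The same chat request as the text-only sketch, with one camera frame attached."""
    with open(jpeg_path, "rb") as f:
        frame_b64 = base64.b64encode(f.read()).decode("ascii")
    payload = {
        "model": "qwen2.5-vl-7b-instruct",  # placeholder name for the vision variant
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{frame_b64}"}},
            ],
        }],
    }
    resp = requests.post(LLM_URL, json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()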


Final Decision and Why the Others Lost

Winner: Qwen2.5-7B-Instruct

It led on all three criteria: tool use consistency, Korean response quality, and a clear vision extension path. At 7B parameters it hits 112 tok/s on the M4 Pro, and since a tool-call response is only a few dozen tokens, a movement command comes back in well under a second.

Eliminated:

  • Llama 3.1 8B: Non-standard tool call format requires an extra parsing layer. Korean mixing is a friction point. Slightly lower tok/s than Qwen.
  • Phi-3.5 Mini: Tool use reliability is too low. Unpredictable response format is dangerous for robot control — a command that sometimes works isn't a command.
  • Gemma 2 9B: Worst tool use consistency of the four. Good output quality in isolation, but unpredictable in repeated calls.

The Scoring Table

I showed my son each model's Korean responses and asked: "Which one is easiest to understand?"

His scores:

Model         | His score | Reason
Qwen2.5-7B    | ★★★★★     | "Actually answers in Korean"
Llama 3.1 8B  | ★★★☆☆     | "English mixed in, hard to read"
Phi-3.5 Mini  | ★★☆☆☆     | "Too much explanation"
Gemma 2 9B    | ★★★☆☆     | "Answers weird sometimes"

His ranking matched the final selection. We're going with Qwen.


Qwen2.5-7B is now in the harness. The next piece is connecting it to the robot over the network. The communication layer is still ahead.
