Building an LLM Robot with My Son — EP 8. My Son Gave the AI Robot Its First Real Command

EP 6 connected the LLM server. EP 7 was the migration to the Pi. This episode: the camera joins.

Qwen2.5-VL-7B is now on the LLM server — the multimodal variant that accepts image input alongside text. Camera frames from the robot get sent with each request, and the model decides what to do based on what it sees.

Camera + sensors + LLM + robot, all connected at once for the first time.


Switching to Qwen2.5-VL

From the text-only Qwen2.5-7B to Qwen2.5-VL-7B. Same family, so the harness barely changed. Three things did:

First, a new section in CLAUDE.md:

## Vision Input
- Camera resolution: 640×480
- Transmission format: JPEG (quality 70)
- Frame timing: sent only at command request time (not continuous streaming)
- Image + sensor data sent together

## LLM input format (vision mode)
{
  "image": "<base64 encoded JPEG>",
  "sensor": "dist:45",
  "instruction": "user command"
}
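
The robot-side packaging for that format is small. A minimal sketch using OpenCV; the function and variable names here are illustrative, not the actual harness code:

import base64
import cv2

def build_vision_request(cap, sensor_data, instruction):
    """Grab one frame and package it in the vision-mode input format."""
    ok, frame = cap.read()
    if not ok:
        raise RuntimeError("camera read failed")
    frame = cv2.resize(frame, (640, 480))  # match the CLAUDE.md resolution
    ok, jpg = cv2.imencode(".jpg", frame, [cv2.IMWRITE_JPEG_QUALITY, 70])
    if not ok:
        raise RuntimeError("JPEG encode failed")
    return {
        "image": base64.b64encode(jpg.tobytes()).decode("ascii"),
        "sensor": sensor_data,        # e.g. "dist:45"
        "instruction": instruction,   # the user command, verbatim
    }

A frame is grabbed only when a command comes in, matching the no-continuous-streaming rule above.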

Second, the server wrapper was updated to accept the base64-encoded image and pass it to llama.cpp in its multimodal format.
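
How the image actually reaches llama.cpp depends on how it's served; this is a hedged sketch assuming llama-server is running with the Qwen2.5-VL projector and exposing its OpenAI-compatible chat endpoint. The URL and field layout are assumptions, not the wrapper as written:

import json
import requests

LLAMA_URL = "http://localhost:8080/v1/chat/completions"  # assumed llama-server address

def forward_to_llm(payload, system_prompt):
    """Repackage the robot's vision-mode payload as a multimodal chat request."""
    # Everything except the image travels as JSON text; the frame goes as a data URI.
    meta = {k: v for k, v in payload.items() if k != "image"}
    body = {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": [
                {"type": "text", "text": json.dumps(meta)},
                {"type": "image_url",
                 "image_url": {"url": "data:image/jpeg;base64," + payload["image"]}},
            ]},
        ],
        "temperature": 0.1,
    }
    resp = requests.post(LLAMA_URL, json=body, timeout=30)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]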

Third, performance dropped. Text-only was 112 tok/s; with image input, 78 tok/s. TTFT increased too, since the image has to be encoded and processed before generation starts. Full response time went from 430ms to 820ms.

0.8 seconds. Slower, but usable.


First Attempt

My son said "go to the kitchen and get water."

What he actually typed: "go toward the kitchen and find the water glass."

The robot set off, moving forward. The sofa appeared as an obstacle. The LLM read the camera frame: "sofa visible, recommend left detour." It turned left.

Then it kept turning, first one way, then the other. Going nowhere.

Log:

[LLM] image: wall close, passage visible on right
[CMD] right, speed:120
[LLM] image: space on left, recommend move left
[CMD] left, speed:120
[LLM] image: space on right
[CMD] right, speed:120

The LLM was alternating left and right commands. Direction oscillation: it couldn't settle, because each frame was being judged in isolation, with no memory of the previous decision.


Second Attempt: Adding Command History

I passed the last three commands as context alongside each new request.

context = {
    "image": image_b64,
    "sensor": sensor_data,
    "instruction": user_instruction,
    "history": last_3_commands  # added
}
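
Where last_3_commands comes from is nothing fancy; a sketch, with the names assumed:

from collections import deque

last_3_commands = deque(maxlen=3)   # oldest command falls off automatically

def record_command(cmd):
    """Remember a command the LLM just issued, e.g. 'left, speed:120'."""
    last_3_commands.append(cmd)

# when building the next request:
#     context["history"] = list(last_3_commands)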

Second attempt. The robot set off. The left-right oscillation was reduced. It navigated around the sofa and started heading toward the kitchen.

Stopped at the kitchen entrance.

The LLM saw the refrigerator and interpreted it as an obstacle. "Obstacle detected, stopping." The refrigerator is an obstacle in the general sense — but in the context of navigating to the kitchen, stopping because of the refrigerator is wrong. The model had no concept of a target destination.

It sat there. Stopped at the threshold.

My son: "Why is it stopping there?" I explained: it thinks the refrigerator is in the way. "Does it need to know it's going for water?" Yes.


Third Attempt: Adding Goal Awareness

I added a goal concept to the system prompt:

## Robot Behavior Guidelines
- When given a goal, continue moving toward it
- Navigate around obstacles but keep progressing toward the goal direction
- Refrigerators and appliances are environment features, not obstacles
- "Arrived" is when the robot has entered the space the user specified

Third attempt. Set off. Navigated around the sofa. Didn't stop at the kitchen entrance — went in.

Inside the kitchen, the robot rotated slowly, scanning. The camera swept across the counter. A water glass sat at the far end. LLM output: "transparent cup detected on counter, approaching."

Robot moved toward the counter. Stopped 20cm from the cabinet leg. Safety distance constraint triggered.

Couldn't retrieve it. No arm.

My son had been watching expectantly. When it stopped at the counter: "Didn't get it?" "No arm." "Let's add an arm." I want to, too.


Why We Called It a Partial Success

The third attempt is logged as a partial success.

It didn't retrieve the glass. But what it did do:
- Crossed the living room to the kitchen (autonomously)
- Navigated around the sofa
- Didn't stop at the kitchen entrance — went through
- Recognized the water glass on the counter
- Approached the glass's location

The LLM was reading camera frames, understanding the environment, and issuing movement commands. No hardcoding. No pre-loaded map. No "the glass is at coordinate X." Just "go to the kitchen and find a water glass" and a camera feed.

The fact that this worked matters.


My Son's Turn

He tried it himself that evening. "I'll do it."

Command he typed: "look for the TV in the living room"

The robot moved around the living room and stopped facing the TV. LLM log: "dark rectangular screen, identified as TV, arrived."

"Found it!" he said.

Next command: "what's next to the TV?"

Robot scanned to the left of the TV. LLM response: "one speaker to the left of the TV, bookshelf with multiple books visible"

He read the response text quietly for a moment. Then: "It's actually seeing."

Yes. The robot is seeing through the camera. That's the moment he understood it.


What's Still Not Working

A lot is working. A lot isn't.

Low light degrades recognition significantly. In evening tests with the lights dimmed, the LLM outputs "insufficient light for identification."

Same command produces different behavior across runs. Even at temperature=0.1, visual perception introduces non-determinism. Same living room, same position — sometimes it goes around the sofa on the left, sometimes the right.

Speed control is still coarse. LLM-issued speed values cluster at 100 or 150 with little in between. Smooth deceleration based on distance is probably better handled in a separate motor control layer than left to the LLM directly.
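
A possible shape for that layer, just to make the idea concrete: the LLM keeps choosing a target speed, and a small motor-side function ramps it down as the obstacle distance shrinks. The thresholds here are illustrative; 20cm matches the safety stop from the kitchen run:

def scale_speed(requested_speed, dist_cm, stop_cm=20, slow_cm=60):
    """Ramp the LLM-requested speed down as the measured distance shrinks."""
    if dist_cm <= stop_cm:
        return 0                      # hard stop, the existing safety constraint
    if dist_cm >= slow_cm:
        return requested_speed        # far away: obey the LLM as-is
    factor = (dist_cm - stop_cm) / (slow_cm - stop_cm)
    return int(requested_speed * factor)

# scale_speed(150, 45) -> 93, scale_speed(150, 25) -> 18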

These are things the next sessions will work through — more conversations, more iterations.
