라벨이 hybrid LLM deployment인 게시물 표시

I Put a Local LLM on a Company GPU. The ROI Math Got Stranger.

A few days after Part 1 , I dropped the same Qwen3.5 122B onto a company GPU workstation. Part 1 ended with my M4 laptop falling into single-digit tokens-per-second territory. The natural next question followed me around the office: what happens at company scale? The "lonely" place where my notebook ended up — does it look the same on a workstation with a real GPU plugged in? An NVIDIA RTX Pro 6000 Blackwell 96GB happened to be sitting on a test bench. I loaded the same model. Speed jumped 5x to 7x. The 6–8 tokens/second I saw on the M4 became 35–50 tokens/second on the workstation. Chat replies stopped flickering letter-by-letter and started flowing as full sentences. One half of the contradiction from Part 1 — the speed half — was gone. Then I added five teammates to the same box. Average response slid back to single digits. The same phrase reappeared in a different dimension. In Part 1, single digits meant my laptop's raw inference rate. In Part 2, single digits me...