Writing — Prasad Khake

Writing

Articles

on-device LLMs · apple silicon · the occasional deep debug

Running Llama 3.1 8B with FP8 on vLLM Cuts Cost from $1.00 to $0.36 per Million Output Tokens

A measured audit of Llama-3.1-8B on a rented L4: what vLLM's real, untouched default actually costs, what FP8 quantization gets you on top of it, and why a plausible-looking concurrency guess cost throughput instead of adding it. Every number here is measured, including a second, larger benchmark pass that fixed a real gap the first one had.

Jul 14, 2026
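The per-million-token figures in the entry above come down to one division: the hourly GPU price over the tokens generated in that hour. A minimal sketch of that arithmetic follows. The function name and the example inputs are hypothetical, chosen for illustration, and are not the post's measurements.

```python
def cost_per_million_tokens(gpu_price_per_hour: float, tokens_per_second: float) -> float:
    """Cost to generate one million output tokens at a sustained throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return gpu_price_per_hour / tokens_per_hour * 1_000_000


# Hypothetical inputs (assumptions, not measured values from the post).
for tps in (250.0, 500.0):
    cost = cost_per_million_tokens(gpu_price_per_hour=1.0, tokens_per_second=tps)
    print(f"{tps:.0f} tok/s -> {cost:.2f} per million output tokens")
```

Because price is fixed per hour, cost scales inversely with throughput: any change that doubles sustained tokens per second halves the cost, and a concurrency setting that lowers throughput raises it.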

Carmack's right about the weights. The KV cache is the part his argument skips.

John Carmack argued AI inference should stream model weights from cheap flash instead of expensive HBM, since weight access is deterministic. That's correct for weights — but the KV cache grows, gets rewritten every token, and reads a shifting range. I measured what happens when you force it to behave like a fixed resource anyway.

Jul 12, 2026

A rotating KV cache saves 36% of your memory and costs 100% of your recall

Capping a model's KV cache makes peak memory plateau instead of growing without bound. But the instant a conversation outgrows the cap, recall of five planted facts collapses from 5/5 to 0/5 in every trial, with no partial credit. The attention-sink trick doesn't help, and here's why.

Jul 12, 2026

At 32,000 tokens, the costliest thing my MacBook did was wait seven minutes to speak

I ran the same long-context test on a 16 GB fanless M3 and a ₹23 rented NVIDIA L4. The laptop fits a 32k context on an 8B model and keeps every planted fact — but prefill balloons to seven minutes and its decode speed can't even be measured, because the fanless chip throttles. A measured, cross-hardware look at the KV-cache tax.

Jul 8, 2026

Learning fine-tuning by building a tool-calling LoRA on an M3

The applied chapter of a from-scratch project: after building tokenization, attention, gradient descent, a tiny GPT, and LoRA by hand, I ran a real QLoRA fine-tune — teaching Llama-3.2-1B to call tools on a MacBook, then measuring honestly what changed and what the adapter costs at inference. A 2.8M-parameter adapter (0.23% of the model) clearly helped on a small test; the debugging taught me the most.

Jul 7, 2026

When to hand-write a GPU kernel on Apple Silicon (and when the compiler already won)

I wrote five GPU kernels from scratch on a 16 GB M3 to learn how LLM inference works at the metal. The most useful result wasn't a kernel but a decision rule: never hand-write elementwise ops (the compiler already fuses them), reach for a kernel the moment a reduction appears, and remember that the famous trick is rarely the hard part.

Jun 30, 2026

Atomic Chat's TurboQuant headline did not survive a chat-generation benchmark on my M3

Atomic Chat advertises TurboQuant as 8x faster inference and 6x less memory. I tested the local MLX TurboQuant KV path on a 16 GB M3. It saved about 3-5% total peak memory and did not speed up generation — a useful reminder that KV-cache microbenchmarks do not automatically become whole-chat product claims.

Jun 18, 2026

I built self-speculative decoding for MLX. On an M3, naive layer-skip never beats baseline — 24 configs, 24 losses

Self-speculative decoding lets a model draft its own tokens by skipping layers — speculative decoding's speedup with no extra memory. I built it for MLX and swept 24 configs on an M3. Every one was slower than baseline, even though all were lossless. Here's why, and the paper that fixes it.

Jun 18, 2026

Three ways to make an LLM read its weights less often on a Mac — and why each one backfires

Single-stream decoding on Apple Silicon is bottlenecked by reading the model's weights out of memory, not by the math. Three techniques attack that directly — speculative decoding, diffusion generation, and self-speculative layer-skipping. I measured all three on a 16 GB M3. Each is right in theory and backfires in its own way: the bottleneck just moves one step further from the arithmetic.

Jun 14, 2026
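The claim in the entry above, that single-stream decoding is limited by reading weights rather than by arithmetic, implies a simple ceiling: if every generated token requires one full pass over the weights, tokens per second cannot exceed memory bandwidth divided by the size of the weights. The sketch below computes that ceiling. The bandwidth and model-size values are assumptions for illustration, not figures from the post, and the estimate ignores KV-cache reads and quantization metadata.

```python
def decode_ceiling_tokens_per_second(weight_bytes: float,
                                     bandwidth_bytes_per_second: float) -> float:
    """Upper bound on single-stream decode speed when each token reads all weights once."""
    if weight_bytes <= 0:
        raise ValueError("weight_bytes must be positive")
    return bandwidth_bytes_per_second / weight_bytes


# Assumed values: an 8B-parameter model at roughly 4 bits (0.5 bytes) per weight,
# and a nominal 100 GB/s of memory bandwidth.
assumed_weight_bytes = 8e9 * 0.5
assumed_bandwidth = 100e9

ceiling = decode_ceiling_tokens_per_second(assumed_weight_bytes, assumed_bandwidth)
print(f"bandwidth-bound ceiling: {ceiling:.1f} tok/s")
```

All three techniques in the post try to raise the number of tokens produced per pass over the weights, which is the only way past this ceiling without a smaller model or more bandwidth.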

I expected a diffusion LLM to be fast on my Mac. It tied the best model on quality instead — and lost on speed.

LLaDA2.0-mini, a diffusion language model, runs on a 16 GB M3 and ties Qwen3-4B for the best answer-quality score I've measured (20/21). But it's slower than the fastest autoregressive model and uses 4× the memory of the lightest — and the exact reason I expected it to be fast on bandwidth-bound hardware turned out to be why it isn't. A measured look at where the bottleneck actually moved.

Jun 13, 2026

Apple's on-device model ties a 4-bit Llama-3.1-8B — and won't name the M1

Apple shipped an official Python SDK for its on-device Foundation Model at WWDC 2026. I put the ~3B model through the same 21-task quality suite I use for MLX models: it ties a 4-bit Llama-3.1-8B (18/21), two questions behind Qwen3-4B (20/21). Quality is the only fair axis to compare, and that limitation is itself the interesting part.

Jun 9, 2026

I turned on MLX's memory-saving flag and ran out of memory

On a 16 GB Mac, MLX's --kv-bits flag — whose entire job is to shrink the KV cache so longer contexts fit — raised peak memory at every context length I tested, and OOM'd at 32K where plain fp16 fit at 9.4 GB. It's also no faster (8-bit decoding ran ~4× slower in my tests) and costs no quality you'd want to keep. Here's the measurement, the code-level cause, and why the flag backfires on this path.

Jun 9, 2026

Speculative decoding on a 16 GB Mac: a 20% win that becomes a 25% loss

A 1B draft model speeds up Llama-3.1-8B by 20% on an M3 — at num_draft_tokens=2. Push that dial to 4 and decoding gets 25% SLOWER than using no draft at all. Here's the measured curve, and why low draft counts win when decode is bound by memory bandwidth.

Jun 9, 2026
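One way to see why a larger draft count can hurt, as in the entry above, is the standard textbook cost model for speculative decoding: each extra draft token adds draft-model cost, while the expected number of accepted tokens grows more and more slowly. The sketch below assumes an independent per-token acceptance rate and a verification pass that costs the same as one normal decode step. Both are simplifying assumptions, the parameter values are hypothetical, and this is not the post's measured curve or its stated explanation.

```python
def expected_tokens_per_step(acceptance_rate: float, num_draft_tokens: int) -> float:
    """Expected tokens emitted per verification pass, assuming i.i.d. acceptance."""
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if num_draft_tokens < 0:
        raise ValueError("num_draft_tokens must be non-negative")
    if acceptance_rate == 1.0:
        return float(num_draft_tokens + 1)
    return (1.0 - acceptance_rate ** (num_draft_tokens + 1)) / (1.0 - acceptance_rate)


def modeled_speedup(acceptance_rate: float, num_draft_tokens: int,
                    draft_cost_ratio: float) -> float:
    """Modeled speedup over plain decoding; draft_cost_ratio = draft step cost / target step cost."""
    step_cost = num_draft_tokens * draft_cost_ratio + 1.0
    return expected_tokens_per_step(acceptance_rate, num_draft_tokens) / step_cost


# Hypothetical acceptance rate and cost ratio, for illustration only.
for k in range(0, 7):
    print(k, round(modeled_speedup(acceptance_rate=0.6, num_draft_tokens=k,
                                   draft_cost_ratio=0.3), 3))
```

A modeled value below 1.0 means drafting costs more than it returns. Real hardware adds effects this model leaves out, such as memory pressure from holding two models at once.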

Gemma 4 on a 16 GB Mac: the E4B matches the 12B at 42% less RAM and 3× the speed

Google's Gemma-4 E4B posts the same math and factual scores as the full 12B on an M3 MacBook Air — in 6.6 GB instead of 11.4, at 8.2 tok/s instead of 2.7 — so on a 16 GB Mac the E4B is the one to run. This is a size win, not a QAT one: the 12B's own QAT build doesn't shrink or speed it up. Honest numbers, measured under a real 2048-token load.

Jun 8, 2026
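The headline ratios in the entry above follow directly from the four numbers quoted in its summary. A short check, using only those quoted figures:

```python
# Figures quoted in the summary: peak RAM in GB and decode speed in tok/s.
e4b_ram_gb, full_ram_gb = 6.6, 11.4
e4b_tok_s, full_tok_s = 8.2, 2.7

ram_saving = 1 - e4b_ram_gb / full_ram_gb   # fraction of RAM saved by the E4B
speed_ratio = e4b_tok_s / full_tok_s        # how many times faster the E4B decodes

print(f"RAM saving: {ram_saving:.0%}, speed ratio: {speed_ratio:.1f}x")
# prints: RAM saving: 42%, speed ratio: 3.0x
```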

My benchmark graded '7! = 5040' as wrong — and two other ways it lied to me

Re-running my own LLM benchmark, I found a bug that had inflated the quality scores in posts I'd already published. Then a second bug. Then a third. Here's how a wrong number looks exactly like a right one — and why you spot-check the failures, not the passes.

Jun 7, 2026

One flag makes Qwen3-4B beat Llama-3.1-8B on a 16 GB Mac — at half the RAM

On an M3 MacBook Air, Qwen3-4B with the thinking trace turned off scores 20/21 on a verifiable suite — beating Llama-3.1-8B's 18/21 at half the memory and nearly double the speed. With thinking on, the same model drops to 7/21. The flag is enable_thinking=False, and here's exactly what it changes and why it matters.

Jun 6, 2026

Gemma 4 12B on a 16 GB Mac: 11 GB RAM, 2.7 tok/s, and what my benchmark got wrong

Google's Gemma 4 12B uses 11.4 GB of RAM and runs at 2.7 tok/s on an M3 MacBook Air — 2.4× the memory of Llama-3.1-8B at well under half the speed. Its math and factual answers are flawless; its coding can't be cleanly scored. Here's the honest picture, the multimodal tax, and the benchmark bug I found correcting this post.

Jun 4, 2026

Attention sinks: the four tokens that stabilize infinite context on a 16 GB Mac

StreamingLLM (2023) found that keeping the first four tokens in the KV cache prevents catastrophic perplexity collapse at long contexts. mlx-lm implements this as RotatingKVCache(keep=4). Here's what that means, why it works, and what my measurements on an M3 actually show.

Jun 3, 2026
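The eviction policy behind the entry above can be illustrated with a toy cache that tracks token positions instead of key/value tensors: the first `keep` positions are pinned as attention sinks, and the rest of the budget is a sliding window over the most recent tokens. `ToyRotatingCache` is a hypothetical class written for this illustration. It is not mlx-lm's RotatingKVCache implementation and shares only the idea of a `keep` parameter.

```python
from collections import deque


class ToyRotatingCache:
    """Toy model of a capped cache: pinned sink positions plus a recent-token window."""

    def __init__(self, max_size: int, keep: int = 4):
        if not 0 <= keep < max_size:
            raise ValueError("keep must satisfy 0 <= keep < max_size")
        self.keep = keep
        self.sinks = []                                # first `keep` positions, never evicted
        self.window = deque(maxlen=max_size - keep)    # oldest entry drops out when full

    def append(self, position: int) -> None:
        if len(self.sinks) < self.keep:
            self.sinks.append(position)
        else:
            self.window.append(position)

    def contents(self) -> list:
        return self.sinks + list(self.window)


cache = ToyRotatingCache(max_size=8, keep=4)
for position in range(12):
    cache.append(position)
print(cache.contents())  # [0, 1, 2, 3, 8, 9, 10, 11]
```

Positions 4 through 7 are gone once the window rolls past them. The sinks stabilize attention, but anything stored in the evicted middle is no longer available to the model.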

Gemma-3-12B QAT vs Qwen3-14B 3-bit: same quality on a 16 GB Mac, but the smaller model runs lighter and faster

Benchmarking Gemma-3-12B, Qwen3-14B, and Llama-3.1-8B on a 16 GB MacBook Air (M3) with MLX. A quantization-aware 3-bit 12B ties a naïve 3-bit 14B on overall answer quality while running faster and in less memory. On a memory-bound Mac, a well-quantized smaller model can match a bigger naïvely-quantized one — so parameter count alone is the wrong thing to shop on.

Jun 2, 2026

What actually runs well on a 16 GB MacBook

Honest local-LLM benchmarks on a base M3, 16 GB — tokens/sec, peak RAM, and exactly where it hits the wall. The numbers nobody publishes because they run on H100s.

Jun 1, 2026

Why Mistral and Devstral models drop their spaces on Apple Silicon

Debugging why tekken-v13 models emit Ġ instead of spaces through mlx-lm's server, and the one-line root cause in MLX's detokenizer routing.

May 30, 2026