
57 tok/s on Screen, 3 tok/s in Practice: MLX vs llama.cpp on Apple Silicon

MLX reports nearly 2x the generation speed of GGUF on Apple Silicon. The truth is more nuanced. I benchmarked both across three real workloads.

Well, it depends.™

I know, worst possible opening for a tech article. But I spent a week benchmarking MLX against llama.cpp on my Mac Studio and that’s the answer. The UI says 57 tok/s. The real experience can be as low as 3 tok/s. The model that reports 57 tok/s is sometimes faster and sometimes slower than the one reporting 29 tok/s. Which one wins depends entirely on what you’re doing with it.

Update (March 13, 2026): I posted this on r/LocalLLaMA and the community helped identify three factors that compound on this specific model and hardware combination: broken prompt caching for Qwen3.5 multimodal, unsupported hybrid attention in MLX, and bf16 weights on M1 hardware that doesn’t do bf16 natively. Prefill still dominates real-world performance. But the gap between MLX and GGUF is likely smaller than these numbers show.



Update 2 (March 14, 2026): I tested oMLX, an MLX inference server with a tiered KV cache that persists to SSD. Same model, same hardware, same benchmark. oMLX matches LM Studio’s generation speed (~55 tok/s) but is up to 10x faster on prefill. At 8K context, prefill drops from 49s to 1.7s. Effective throughput goes from 6 tok/s to 30 tok/s. Full numbers below.

I ran this on an M1 Max. If you have an M2, M3, M4, or M5, your numbers will be different, and I want them. The benchmark tool takes five minutes, needs no dependencies, and outputs JSON. Post your results in an issue and I’ll add your hardware to the comparison table below.

#It started with document classification

I’m experimenting with local AI on a home server. One of the first things I built was a document classifier: scan a PDF, send it to a local LLM, get back a category and tags. Paperless-ngx handles the OCR, the model handles the thinking.

I loaded Qwen3.5-35B-A3B in LM Studio using the MLX engine and saw 57 tok/s generation speed. Nearly twice what the same model gets through llama.cpp (29 tok/s). Every blog post, every Hacker News thread said the same thing: on Apple Silicon, use MLX.

So I did. Classification worked fine. Fast streaming, good results.

Then I started building an ops agent. A bot that manages Docker services, checks backups, calls tools, maintains conversation history with JSON payloads every turn. Context grows with each exchange. And suddenly the “fast” engine felt slow. I’d send a message and wait 15, 20 seconds before any text appeared, even though the tok/s counter still said 57.

I went back to the document classifier and timed it properly. Turns out MLX was slower there too. I just hadn’t noticed because the streaming looked fast.

#The benchmark

Same model. Same quantization. Same hardware. Same inference server. Only the engine differs.

| Property | MLX 4-bit | GGUF Q4_K_M |
|---|---|---|
| Model | Qwen3.5-35B-A3B | Qwen3.5-35B-A3B |
| Engine | MLX v1.3.0 | llama.cpp v2.4.0 |
| Inference server | LM Studio 0.4.5 | LM Studio 0.4.5 |
| Generation speed | 57 tok/s | 29 tok/s |

Mac Studio M1 Max, 64 GB unified memory. Both runs with a warm model (loading time excluded). Temperature 0.6, thinking mode disabled.

I built local-llm-bench and ran three scenarios to cover different workloads.

#Scenario 1: Document classification (short input, short output)

Five documents from a family archive (electricity bill, school report, medical checkup, insurance letter, a multi-page rental lease). The model returns a category, tags, and a two-sentence summary. Each document is processed independently.

Total response time per document (seconds, lower is better)

| Document | MLX | GGUF |
|---|---|---|
| Electricity bill | 7.7s | 4.9s |
| School report | 7.3s | 4.7s |
| Medical checkup | 7.8s | 5.3s |
| Insurance letter | 7.5s | 5.9s |
| Rental lease (6pg) | 10.6s | 8.3s |

The UI says 57 tok/s for MLX and 29 tok/s for GGUF. Measured as output tokens over total response time, the real numbers are 11-16 tok/s for MLX and 16-21 tok/s for GGUF. Full results: MLX, GGUF.

GGUF wins all five. The engine with half the generation speed finishes every document faster. That was the real surprise.

#Scenario 2: Prefill scaling (growing input, short output)

Same short reply (~150 tokens), but I scaled the input from 655 to 8,500 tokens to see how each engine handles growing context.

Total response time by context size (seconds, lower is better)

| Context | MLX | GGUF |
|---|---|---|
| 655 tokens | 8.8s | 7.2s |
| 1.5K tokens | 11.8s | 9.2s |
| 3K tokens | 17.6s | 13.9s |
| 8.5K tokens | 52.3s | 43.4s |

GGUF wins every size. At 8.5K tokens of context, both engines are slow, but MLX is 9 seconds slower. Full results: MLX, GGUF.

#Scenario 3: Agent conversation (growing context, longer output)

An 8-turn conversation with a server ops assistant. Five tool definitions in the system prompt, JSON and log output in every tool result. Context grows each turn. Replies are longer here (300-400 tokens on average).

Total response time per turn (seconds, lower is better)

| Turn (context) | MLX | GGUF |
|---|---|---|
| Turn 1 (~575) | 9.4s | 12.9s |
| Turn 3 (~1.4K) | 13.9s | 14.3s |
| Turn 5 (~2.4K) | 18.7s | 23.2s |
| Turn 6 (~3K) | 20.9s | 18.4s |
| Turn 8 (~3.9K) | 24.0s | 28.2s |

With longer replies, MLX wins most turns. Its 2x generation speed compensates for the slower prefill. Around turn 6, with ~3,000 tokens of context, GGUF starts catching up. Output lengths vary between runs, so individual turns can swing either way. Full results: MLX, GGUF.

#Two phases, one hidden

Every LLM request has two stages. Your interface only shows you one of them.

Prefill is the engine processing your entire input in one batch: system message, conversation history, tool definitions, your new question. Everything. You experience this as the pause before the first word appears. Technical name: Time To First Token (TTFT).

Generation is the engine producing tokens one at a time. This is the tok/s number in your UI.
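You can separate the two phases yourself if your client records when each streamed token arrives. A minimal sketch, with made-up timestamps for illustration (no inference dependencies):

```python
def split_phases(request_start, token_times):
    """Given the request start time and the arrival time of each
    streamed token, separate prefill (TTFT) from generation speed."""
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - request_start      # prefill: the wait before the first token
    n = len(token_times)
    if n > 1:
        gen_seconds = token_times[-1] - token_times[0]
        gen_tps = (n - 1) / gen_seconds        # generation: the tok/s your UI shows
    else:
        gen_tps = float("nan")
    return ttft, gen_tps

# Illustrative: first token after 6.9s, then 150 tokens at ~57 tok/s
times = [6.9 + i / 57 for i in range(150)]
ttft, gen_tps = split_phases(0.0, times)
print(f"TTFT: {ttft:.1f}s, generation: {gen_tps:.0f} tok/s")
```

The UI counter reports only the second return value; the first one is where long-context requests spend their time.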

Generation speed stays roughly constant regardless of context size. Prefill scales with every token. The split from the prefill scaling benchmark tells the full story:

Prefill time only (seconds, lower is better)

| Context | MLX | GGUF |
|---|---|---|
| 655 tokens | 6.9s | 2.9s |
| 1.5K tokens | 9.2s | 4.1s |
| 3K tokens | 15.0s | 8.7s |
| 8.5K tokens | 49.4s | 37.8s |
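Dividing context size by prefill time turns the table above into prefill throughput, which makes the gap easier to compare across sizes:

```python
# Prefill times (seconds) from the benchmark, keyed by input tokens
prefill = {
    "MLX":  {655: 6.9, 1500: 9.2, 3000: 15.0, 8500: 49.4},
    "GGUF": {655: 2.9, 1500: 4.1, 3000: 8.7,  8500: 37.8},
}

for engine, rows in prefill.items():
    for tokens, seconds in rows.items():
        # tokens processed per second of prefill
        print(f"{engine:4} {tokens:>5} tokens: {tokens / seconds:.0f} tok/s prefill")
```

GGUF processes prompts roughly two to three times faster here at every size, which is the whole story behind the total-time tables.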

At 8.5K tokens of context, prefill accounts for 94% of MLX’s total time. You spend 49 seconds waiting for the first word, then 3 seconds watching it stream. The generation speed is irrelevant at that point.

The number on your screen measures the part that doesn’t change.

#Effective throughput

I started calculating what I call “effective throughput.” Output tokens divided by total wall-clock time. What you actually experience, as opposed to what the counter says.
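The calculation is trivial, which is part of the point. A sketch using the 8.5K-context case from the prefill benchmark (~150 output tokens):

```python
def effective_tps(output_tokens, prefill_seconds, gen_tps):
    """Output tokens divided by total wall-clock time (prefill + generation)."""
    total_seconds = prefill_seconds + output_tokens / gen_tps
    return output_tokens / total_seconds

# 49.4s prefill, counter showing 51 tok/s, ~150 tokens of output
print(round(effective_tps(150, 49.4, 51), 1))  # → 2.9
```

With zero prefill the two numbers coincide; the bigger the prompt, the further they drift apart.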

MLX: what the UI says vs what you experience (tok/s, higher is better)

| Context | UI says | Effective |
|---|---|---|
| 655 tokens | 59 tok/s | 13 tok/s |
| 1.5K tokens | 57 tok/s | 12 tok/s |
| 3K tokens | 50 tok/s | 7 tok/s |
| 8.5K tokens | 51 tok/s | 3 tok/s |

At 8.5K tokens of context, MLX’s real throughput is 3 tok/s. The UI still says 51.

#Why the gap

Both engines run on the same Metal GPU with unified memory. The difference is batch prompt processing.

llama.cpp has been grinding on this since December 2022. Metal compute shaders tuned for batch processing, KV cache quantization, flash attention. Three years of contributors optimizing the same pipeline adds up.

MLX launched December 2023. Lower per-token overhead (that’s where the faster generation comes from), but the batch processing path for prefill is younger and still being worked on.

One concrete factor: LM Studio’s MLX engine defaults to a prefill chunk size of 512. That’s conservative. Setting it to 8192 can boost prefill by up to 1.5x. Even with that fix, GGUF still leads in most turns of my benchmark, but the gap narrows.

#When to use which

GGUF/llama.cpp wins when output is short relative to input. Document classification, quick answers, tool-calling agents with short replies, RAG with injected context. Anything where you’re feeding the model a lot and expecting a little back.

MLX wins when output is long relative to input. Summaries, creative writing, brainstorming, explanations. Fresh conversations with short prompts and long replies. Also when streaming smoothness matters visually. Text at 57 tok/s just flows nicer on screen.

The crossover depends on both context size and reply length. At 600 tokens of context, MLX needs roughly 250+ output tokens to overcome its prefill penalty. At 8,500 tokens of context, it needs 650+.
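Those crossover points fall out of a simple model: total time is prefill plus output tokens divided by generation speed. Setting the two engines' totals equal and solving for output length, using my measured constants (yours will differ):

```python
def crossover_tokens(prefill_a, gen_a, prefill_b, gen_b):
    """Output length at which engine A (slower prefill, faster generation)
    catches engine B: prefill_a + n/gen_a == prefill_b + n/gen_b."""
    return (prefill_a - prefill_b) / (1 / gen_b - 1 / gen_a)

# MLX vs GGUF at 655 tokens of context (prefill 6.9s vs 2.9s; gen 57 vs 29 tok/s)
print(round(crossover_tokens(6.9, 57, 2.9, 29)))   # → 236 output tokens
# ...and at 8.5K context (prefill 49.4s vs 37.8s)
print(round(crossover_tokens(49.4, 57, 37.8, 29)))  # → 685 output tokens
```

Below the crossover, the engine with faster prefill wins; above it, raw generation speed starts to pay off.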

For the ops agent I’m building, which processes tool results, Docker status JSON, and backup logs every single turn, GGUF is the better choice. The agent’s replies are typically short (100-200 tokens) but the context grows fast. For the document classifier, GGUF also wins because classification output is short. If I were building a summarizer with long outputs, I’d pick MLX.

#Run it yourself

All my data comes from one model on one machine. One data point is an anecdote. A table full of hardware is useful.

| Hardware | Memory | MLX effective (1.5K ctx) | GGUF effective (1.5K ctx) | Contributor |
|---|---|---|---|---|
| M1 Max | 64 GB | 12 tok/s | 16 tok/s | this article |
| M2 Pro/Max | ? | ? | ? | run the benchmark |
| M3 Pro/Max | ? | ? | ? | run the benchmark |
| M4 Pro/Max | ? | ? | ? | run the benchmark |

I built local-llm-bench to make this reproducible. It auto-detects your hardware, runs four scenarios (agent conversation, document classification, prefill scaling, creative writing), and measures effective tok/s alongside generation tok/s. Five minutes, no dependencies.

I’ll update this table as results come in. Fork the repo, run the benchmark, open a PR.

#Things that might change your results

Prefill chunk size. LM Studio MLX defaults to 512. Bumping it to 8192 can improve MLX prefill by up to 1.5x. Worth checking if your version has this fix already.

Inference server overhead. Both engines run through LM Studio here. Running llama.cpp via Ollama, or MLX via mlx-lm directly, adds or removes overhead. I tested Ollama separately and found its generation speed was 18 tok/s for the same GGUF model (vs 29 tok/s through LM Studio). That’s a 38% penalty from the Go wrapper. Prefill was comparable.

Hardware. M2 through M4 have better memory bandwidth, which directly affects prefill. These M1 Max numbers might not represent the gap on newer silicon. This is the data point I’m most curious about.

Prompt caching. This turned out to be a bigger deal than I initially realized. LM Studio’s MLX runtime has broken prompt caching for Qwen3.5 multimodal models. Every turn reprocesses the full conversation history. llama.cpp’s cache was likely working normally. Using a non-vision variant of Qwen3.5 or an alternative MLX runtime like oMLX should fix this. I tested Ollama’s cache separately and found a 37% prefill reduction on repeated prefixes (like the system prompt).
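To see why caching matters so much for multi-turn agents: on each turn the engine can skip prefill for the longest token prefix it already holds in the KV cache, and a chat prompt is almost entirely a prefix of the next one. A toy sketch of that bookkeeping (not any engine's actual implementation):

```python
def reusable_prefix(cached_tokens, new_tokens):
    """Number of leading tokens shared with the cached prompt;
    only the remainder needs fresh prefill."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

# Turn 2 of a chat: system prompt + turn 1 are unchanged; only the
# previous reply and the new user message are appended
turn1 = ["sys"] * 500 + ["user1"] * 50
turn2 = turn1 + ["reply1"] * 150 + ["user2"] * 40
cached = reusable_prefix(turn1, turn2)
print(f"reuse {cached} tokens, prefill only {len(turn2) - cached}")
```

With a broken cache, that "prefill only 190" becomes "prefill all 740", every single turn, and the cost compounds as the conversation grows.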

Model dtype. If you’re on M1 or M2, check whether your MLX model uses bf16. These chips don’t support bf16 natively, and prefill runs on non-quantized weights regardless of the model’s quant level. Converting to fp16 with mlx_lm.convert --dtype float16 is a one-minute fix. See the bf16 section below for the full command.

Model architecture. Qwen3.5-35B-A3B uses hybrid attention (gated delta-net, sliding window). llama.cpp handles this better than MLX currently. Testing with an older model like Qwen3 or Llama 3.1 (standard attention) would give a cleaner engine-to-engine comparison. Dense models have different prefill characteristics too.


#Community update: what was actually going on with Qwen3.5-A3B

I posted this on r/LocalLLaMA and the post got some traction, even ranked #9 that day. The comments were full of people who actually know what’s going on under the hood.

TL;DR: Multiple factors stacked against MLX for this specific model on this specific hardware. The benchmarks above are real. MLX is just not as mature as GGUF yet. It needs a bit more time. When it works, it’s great. When it doesn’t, you end up here. The community helped figure out why:

  • Prompt caching broken for Qwen3.5 multimodal in LM Studio’s MLX runtime. Every turn reprocesses the full history. GGUF had working caching. (mlx-lm#903, mlx-lm#980)
  • Hybrid attention not optimized in MLX for Qwen3.5. The model uses gated delta-net and sliding window attention. llama.cpp handles it, MLX likely falls back to standard attention.
  • bf16 dtype on M1/M2. MLX models ship bf16. M1 and M2 don’t support bf16 natively. GGUFs use fp16, which M1 runs fine. During prefill, this penalty multiplies across every input token.
  • LM Studio’s MLX runtime specifically. Alternative runtimes like oMLX have proper prompt caching. The problem may not be MLX itself.
  • Most MLX quants are 4-bit only. GGUF has a wider range of quantization options (Q4_K_M, Q5_K_M, Q6_K, Q8_0). More quant levels means better quality/speed tradeoffs.

Or as u/itsjase put it: “You just happened to choose the one model in the world that’s currently slower on MLX.” Fair enough. But it’s also one of the most popular MoE models people run on Macs right now. So here’s what I learned.

#Prompt caching is broken for Qwen3.5 multimodal in MLX

u/Regular-Marketing723 called it: LM Studio’s MLX runtime doesn’t cache prompts for Qwen3.5 multimodal models. Every conversation turn reprocesses the entire history from scratch. Non-vision variants work fine.

I used the multimodal variant for all benchmarks. So MLX was paying full prefill cost every turn, while llama.cpp had working KV cache reuse. The agent conversation scenario (8 turns of growing context) was hit hardest by this.

u/Federal-Effective879 confirmed: “Both prompt processing and token generation with MLX are much faster than llama.cpp, but MLX prompt caching has issues with hybrid models like Qwen 3.5, and this issue is exacerbated by agentic usage that depends on prompt caching for usable performance.”

Both GitHub issues (mlx-lm#903, mlx-lm#980) were active within hours of the post going up. Fixes are in progress.

u/Creepy-Bell-4527 and u/d4mations both recommended oMLX and vMLX as alternative MLX runtimes with proper caching. So the caching issue may be LM Studio-specific, not an MLX problem.

#Qwen3.5 hybrid attention is not fully optimized in MLX

u/rpiguy9907 explained that Qwen3.5 uses a hybrid attention mechanism (gated delta-net, sliding window, mixed attention). llama.cpp supports these newer patterns. MLX likely falls back to standard attention, which gets slower as sequences grow.

Short prompts hide the difference. Long prompts expose it. u/cibernox suggested testing with Qwen3 (older, no hybrid attention) for a cleaner comparison. On their M1 Pro, MLX is consistently ~20% faster than llama.cpp with standard models.

#bf16 dtype tanks prefill on M1 and M2

This one might be the biggest single factor. u/bakawolf123 explained: M1 and M2 chips don’t support bf16 (bfloat16) natively. They support fp16 (float16). Most MLX models on Hugging Face ship as bf16. GGUFs use fp16.

During prefill, the engine processes your entire input using the model’s non-quantized weights. Even a 4-bit quantized MLX model uses the underlying weight dtype for prefill. On M1/M2, that means bf16 gets emulated in software while GGUF runs native fp16. Generation is less affected (less compute per step), but prefill multiplies the penalty across every input token.
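Before converting, you can check what dtype your model actually ships. A minimal sketch that reads the safetensors header directly (the format is 8 bytes of little-endian header length followed by a JSON map of tensors), so no MLX install is needed:

```python
import json
import struct
from collections import Counter

def safetensors_dtypes(path):
    """Count tensor dtypes in a .safetensors file by parsing its header:
    8 bytes little-endian header length, then a JSON map of tensors."""
    with open(path, "rb") as f:
        header_len = struct.unpack("<Q", f.read(8))[0]
        header = json.loads(f.read(header_len))
    return Counter(
        meta["dtype"] for name, meta in header.items() if name != "__metadata__"
    )

# e.g. safetensors_dtypes("Qwen3.5-35B-A3B-4bit/model.safetensors")
# A result dominated by "BF16" on an M1/M2 means the fp16 conversion below should help.
```

The path above is a placeholder; point it at whichever shard files sit in your model directory.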

The fix takes under a minute. Install mlx-lm and convert:

```shell
pip install mlx-lm
mlx_lm.convert --hf-path mlx-community/Qwen3.5-35B-A3B-4bit \
  --mlx-path ./Qwen3.5-35B-A3B-fp16 --dtype float16
```

This rewrites the non-quantized parameters from bf16 to fp16. Quantized weights stay as-is. Point LM Studio or mlx_lm.server at the output directory and prefill should improve on M1/M2. M3 and M4 may have partial bf16 support (unconfirmed). M5 reportedly handles bf16 natively.

Update: I benchmarked the bf16-to-fp16 conversion in Part 2. Significant improvements on both Gemma 12B and Qwen3 30B.

#Where this leaves us

GGUF is the safer bet right now if you’re using LM Studio. It’s more mature, more stable, and has wider quantization options. MLX has raw speed potential, but LM Studio’s runtime holds it back with broken caching and limited hybrid attention support.

However, if you’re open to switching runtimes: oMLX fixes the caching problem entirely and beats both LM Studio MLX and GGUF across all scenarios. See Update 2 for the full numbers.

My updated conclusion: if you’re running MLX models on Apple Silicon, try oMLX. Its KV caching alone changes the equation. For stability or if you prefer a simpler setup, GGUF via Ollama or LM Studio remains solid. And don’t trust synthetic tok/s numbers. That’s exactly why I built the benchmark harness above.

Also, I’m not the only one running into this.


#Part 2: Isolating the variables

The r/LocalLLaMA discussion gave me a clear list of things to test. I tested all of them: two more models (Gemma 12B, Qwen3 30B), five runtimes (LM Studio, Ollama, oMLX, Rapid-MLX, raw llama.cpp), the bf16-to-fp16 conversion, flash attention, quantized KV cache, and the LM Studio chunk size patch.

Read Part 2: Same Engine, 37% Slower →


#Update 2: oMLX changes everything

The community kept pointing to oMLX as an alternative MLX runtime. I finally tested it. oMLX is an inference server for Apple Silicon with one standout feature: a tiered KV cache. Hot blocks stay in RAM, cold blocks spill to SSD in safetensors format, and everything persists across requests — even server restarts.

Same model, same hardware (M1 Max 64GB), same benchmark scenarios. Generation speed is virtually identical to LM Studio MLX. Prefill is a different world.

| Scenario | LM Studio MLX | oMLX | Speedup |
|---|---|---|---|
| Prefill scaling | 90.4s | 18.0s | 5.0x |
| Document classification | 41.0s | 17.5s | 2.3x |
| Agent conversation | 138.6s | 84.3s | 1.6x |
| Creative writing | 38.5s | 27.0s | 1.4x |

The prefill scaling scenario is the most dramatic. At 8K context (turn 4), LM Studio takes 49 seconds before the first token appears. oMLX takes 1.7 seconds. Same MLX engine, same model weights. The difference is entirely in how the KV cache is managed.
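I don't know oMLX's internals, but the tiered idea itself is simple: keep recently used cache blocks in RAM and spill the rest to disk instead of discarding them, so a returning context costs an SSD read rather than a full prefill. A toy sketch of the policy only (real KV blocks are large tensors, not Python lists):

```python
import os
import pickle
from collections import OrderedDict

class TieredCache:
    """Toy two-tier cache: hot blocks in RAM (LRU order), cold blocks on disk."""

    def __init__(self, hot_capacity, spill_dir):
        self.hot = OrderedDict()
        self.capacity = hot_capacity
        self.dir = spill_dir

    def _path(self, key):
        return os.path.join(self.dir, f"{key}.blk")

    def put(self, key, block):
        self.hot[key] = block
        self.hot.move_to_end(key)
        while len(self.hot) > self.capacity:
            # Evict the least-recently-used block to disk instead of dropping it
            old_key, old_block = self.hot.popitem(last=False)
            with open(self._path(old_key), "wb") as f:
                pickle.dump(old_block, f)

    def get(self, key):
        if key in self.hot:
            self.hot.move_to_end(key)
            return self.hot[key]
        path = self._path(key)
        if os.path.exists(path):
            # Cold hit: reload from SSD, far cheaper than recomputing
            with open(path, "rb") as f:
                block = pickle.load(f)
            self.put(key, block)
            return block
        return None  # true miss: this context must be prefilled from scratch
```

In this sketch a cold hit costs one disk read, which is the 1.7s-instead-of-49s effect: the block comes back from storage instead of being recomputed token by token.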

In effective tok/s (what you actually experience):

| Scenario | LM Studio MLX | oMLX | GGUF (LM Studio) |
|---|---|---|---|
| Prefill scaling | 5.9 | 30.0 | 7.8 |
| Document classification | 13.4 | 25.7 | 19.4 |
| Agent conversation | 17.0 | 34.6 | 17.6 |
| Creative writing | 38.3 | 51.5 | 27.7 |

oMLX wins every scenario against both LM Studio MLX and GGUF. It’s not even close. The original article’s conclusion — “use GGUF as your default” — no longer holds if you’re willing to use oMLX instead of LM Studio. The tiered KV cache likely benefits any MLX model, not just Qwen3.5 — though Qwen3.5-35B-A3B was particularly hurt by LM Studio’s broken caching.

Recommendation: If you’re running MLX models on Apple Silicon, try oMLX. Same generation speed as LM Studio, dramatically faster prefill, and it beats GGUF across the board in my tests. The effective throughput advantage ranges from 1.3x (creative writing, short context) to 5x (prefill scaling, long context).

Full benchmark data: oMLX results.


#FAQ

Should I switch from MLX to GGUF? Try oMLX first. It uses the MLX engine with proper KV caching and beats both LM Studio MLX and GGUF in every scenario I tested. If you’re staying on LM Studio, GGUF is the safer choice — LM Studio’s MLX runtime still has caching issues with certain models (Qwen3.5 multimodal in particular). For models with working prompt caching, LM Studio MLX can be competitive or faster, especially for long outputs. Benchmark with the tool above.

Does Ollama use MLX or llama.cpp? llama.cpp. If you’re on Ollama, you’re already on the GGUF engine. Though note that Ollama adds its own overhead. I measured 18 tok/s generation where LM Studio gets 29 tok/s from the same llama.cpp build. Part 2 confirms this: Ollama is consistently 37% slower than LM Studio across multiple models.

Will MLX catch up? It’s already faster per-operation for both prefill and generation, according to multiple commenters on r/LocalLLaMA. The issues are model-specific (hybrid attention support, caching for certain architectures) and runtime-specific (LM Studio’s caching implementation). Fixes are in progress (mlx-lm#980). Regardless of which engine wins in six months: measure what you actually wait for, not what the counter says while text is streaming.


Related: Part 2: Same Engine, 37% Slower → | How Local LLMs Actually Work on Your Mac → | What is Ollama? → | Open WebUI on Mac → | You Bought a Mac Mini. Now What? →

From the Build Log: Local LLMs on a Mac: From Magic to Disappointment →

