
57 tok/s on Screen, 3 tok/s in Practice: MLX vs llama.cpp on Apple Silicon

MLX reports nearly 2x the generation speed of GGUF on Apple Silicon. The truth is more nuanced. I benchmarked both across three real workloads.

Well, it depends.™

I know, worst possible opening for a tech article. But I spent a week benchmarking MLX against llama.cpp on my Mac Studio and that’s the answer. The UI says 57 tok/s. The real experience can be as low as 3 tok/s. The model that reports 57 tok/s is sometimes faster and sometimes slower than the one reporting 29 tok/s. Which one wins depends entirely on what you’re doing with it.

Update (March 13, 2026): I posted this on r/LocalLLaMA and the community helped identify three factors that compound on this specific model and hardware combination: broken prompt caching for Qwen3.5 multimodal, unsupported hybrid attention in MLX, and bf16 weights on M1 hardware that doesn’t do bf16 natively. Prefill still dominates real-world performance. But the gap between MLX and GGUF is likely smaller than these numbers show.



I ran this on an M1 Max. If you have an M2, M3, M4, or M5, your numbers will be different, and I want them. The benchmark tool takes five minutes, needs no dependencies, and outputs JSON. Post your results in an issue and I’ll add your hardware to the comparison table below.

It started with document classification

I’m experimenting with local AI on a home server. One of the first things I built was a document classifier: scan a PDF, send it to a local LLM, get back a category and tags. Paperless-ngx handles the OCR, the model handles the thinking.

I loaded Qwen3.5-35B-A3B in LM Studio using the MLX engine and saw 57 tok/s generation speed. Nearly twice what the same model gets through llama.cpp (29 tok/s). Every blog post, every Hacker News thread said the same thing: on Apple Silicon, use MLX.

So I did. Classification worked fine. Fast streaming, good results.

Then I started building an ops agent. A bot that manages Docker services, checks backups, calls tools, maintains conversation history with JSON payloads every turn. Context grows with each exchange. And suddenly the “fast” engine felt slow. I’d send a message and wait 15, 20 seconds before any text appeared, even though the tok/s counter still said 57.

I went back to the document classifier and timed it properly. Turns out MLX was slower there too. I just hadn’t noticed because the streaming looked fast.

The benchmark

Same model. Same quantization. Same hardware. Same inference server. Only the engine differs.

| Property | MLX 4-bit | GGUF Q4_K_M |
| --- | --- | --- |
| Model | Qwen3.5-35B-A3B | Qwen3.5-35B-A3B |
| Engine | MLX v1.3.0 | llama.cpp v2.4.0 |
| Inference server | LM Studio 0.4.5 | LM Studio 0.4.5 |
| Generation speed | 57 tok/s | 29 tok/s |

Mac Studio M1 Max, 64 GB unified memory. Both runs with a warm model (loading time excluded). Temperature 0.6, thinking mode disabled.

I built local-llm-bench and ran three scenarios to cover different workloads.

Scenario 1: Document classification (short input, short output)

Five documents from a family archive (electricity bill, school report, medical checkup, insurance letter, a multi-page rental lease). The model returns a category, tags, and a two-sentence summary. Each document is processed independently.

Total response time per document (seconds, lower is better)

| Document | MLX | GGUF |
| --- | --- | --- |
| Electricity bill | 7.7 | 4.9 |
| School report | 7.3 | 4.7 |
| Medical checkup | 7.8 | 5.3 |
| Insurance letter | 7.5 | 5.9 |
| Rental lease (6 pg) | 10.6 | 8.3 |

The UI says 57 tok/s for MLX and 29 tok/s for GGUF. Measured wall-clock, the real throughput is 11-16 tok/s for MLX and 16-21 tok/s for GGUF. Full results: MLX, GGUF.

GGUF wins all five. The engine with half the generation speed finishes every document faster. That was the real surprise.

Scenario 2: Prefill scaling (growing input, short output)

Same short reply (~150 tokens), but I scaled the input from 655 to 8,500 tokens to see how each engine handles growing context.

Total response time by context size (seconds, lower is better)

| Context | MLX | GGUF |
| --- | --- | --- |
| 655 tokens | 8.8 | 7.2 |
| 1.5K tokens | 11.8 | 9.2 |
| 3K tokens | 17.6 | 13.9 |
| 8.5K tokens | 52.3 | 43.4 |

GGUF wins every size. At 8.5K tokens of context, both engines are slow, but MLX is 9 seconds slower. Full results: MLX, GGUF.

Scenario 3: Agent conversation (growing context, longer output)

An 8-turn conversation with a server ops assistant. Five tool definitions in the system prompt, JSON and log output in every tool result. Context grows each turn. Replies are longer here (300-400 tokens on average).

Total response time per turn (seconds, lower is better)

| Turn (context tokens) | MLX | GGUF |
| --- | --- | --- |
| Turn 1 (~575) | 9.4 | 12.9 |
| Turn 3 (~1.4K) | 13.9 | 14.3 |
| Turn 5 (~2.4K) | 18.7 | 23.2 |
| Turn 6 (~3K) | 20.9 | 18.4 |
| Turn 8 (~3.9K) | 24.0 | 28.2 |

With longer replies, MLX wins most turns. Its 2x generation speed compensates for the slower prefill. Around turn 6, with ~3,000 tokens of context, GGUF starts catching up. Output lengths vary between runs, so individual turns can swing either way. Full results: MLX, GGUF.

Two phases, one hidden

Every LLM request has two stages. Your interface only shows you one of them.

Prefill is the engine processing your entire input in one batch: system message, conversation history, tool definitions, your new question. Everything. You experience this as the pause before the first word appears. Technical name: Time To First Token (TTFT).

Generation is the engine producing tokens one at a time. This is the tok/s number in your UI.
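Neither phase is hard to measure yourself. Here is a minimal sketch against any OpenAI-compatible streaming endpoint (LM Studio serves one at localhost:1234 by default; the model id is a placeholder, use whatever your server lists). It times the gap before the first SSE chunk (TTFT) and counts chunks per second after it; each chunk is roughly one token, so this approximates the UI's counter:

```python
# Sketch: measure TTFT (prefill) and generation tok/s instead of trusting the UI.
import json
import time
import urllib.request

def split_phases(arrivals):
    """arrivals: seconds-since-request-start for each streamed chunk.
    Returns (ttft_seconds, generation_tok_per_s)."""
    ttft = arrivals[0]                        # the hidden prefill phase
    gen_time = arrivals[-1] - ttft            # the phase the UI counter shows
    tokens_after_first = len(arrivals) - 1
    tps = tokens_after_first / gen_time if gen_time > 0 else float("inf")
    return ttft, tps

def measure(prompt, url="http://localhost:1234/v1/chat/completions"):
    body = json.dumps({
        "model": "qwen3.5-35b-a3b",           # placeholder model id
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }).encode()
    req = urllib.request.Request(url, body, {"Content-Type": "application/json"})
    start = time.monotonic()
    arrivals = []
    with urllib.request.urlopen(req) as resp:
        for line in resp:
            if line.startswith(b"data: {"):   # one SSE event per streamed chunk
                arrivals.append(time.monotonic() - start)
    return split_phases(arrivals)
```

`measure` is network-bound; `split_phases` is the part worth keeping, because it makes the prefill/generation split explicit instead of hidden.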

Generation speed stays roughly constant regardless of context size. Prefill scales with every token. The split from the prefill scaling benchmark tells the full story:

Prefill time only (seconds, lower is better)

| Context | MLX | GGUF |
| --- | --- | --- |
| 655 tokens | 6.9 | 2.9 |
| 1.5K tokens | 9.2 | 4.1 |
| 3K tokens | 15.0 | 8.7 |
| 8.5K tokens | 49.4 | 37.8 |

At 8.5K tokens of context, prefill accounts for 94% of MLX’s total time. You spend 49 seconds waiting for the first word, then 3 seconds watching it stream. The generation speed is irrelevant at that point.

The number on your screen measures the part that doesn’t change.

Effective throughput

I started calculating what I call “effective throughput.” Output tokens divided by total wall-clock time. What you actually experience, as opposed to what the counter says.
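In code, the distinction is a single division. A sketch using this article's 8.5K-context measurements (49.4 s prefill, ~51 tok/s generation, and ~150 output tokens, an assumed round number for the short-reply scenario):

```python
def effective_tps(output_tokens, ttft_s, gen_tps):
    """Throughput you actually experience: output tokens over total wall-clock
    time, prefill included. The UI counter only reports gen_tps."""
    total_s = ttft_s + output_tokens / gen_tps
    return output_tokens / total_s

# 8.5K-token context: 49.4 s prefill, 51 tok/s generation, ~150 output tokens.
print(round(effective_tps(150, 49.4, 51), 1))  # prints 2.9 -- while the UI says 51
```

The formula also explains why short contexts look fine: with a 2.9 s prefill, the same 150-token reply still comes out near 19 tok/s effective.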

MLX: what the UI says vs what you experience (tok/s, higher is better)

| Context | UI says | Effective |
| --- | --- | --- |
| 655 tokens | 59 | 13 |
| 1.5K tokens | 57 | 12 |
| 3K tokens | 50 | 7 |
| 8.5K tokens | 51 | 3 |

At 8.5K tokens of context, MLX’s real throughput is 3 tok/s. The UI still says 51.

Why the gap

Both engines run on the same Metal GPU with unified memory. The difference is batch prompt processing.

llama.cpp has been grinding on this since March 2023. Metal compute shaders tuned for batch processing, KV cache quantization, flash attention. Three years of contributors optimizing the same pipeline adds up.

MLX launched December 2023. Lower per-token overhead (that’s where the faster generation comes from), but the batch processing path for prefill is younger and still being worked on.

One concrete factor: LM Studio’s MLX engine defaults to a prefill chunk size of 512. That’s conservative. Setting it to 8192 can boost prefill by up to 1.5x. Even with that fix, GGUF still leads in most turns of my benchmark, but the gap narrows.

When to use which

GGUF/llama.cpp wins when output is short relative to input. Document classification, quick answers, tool-calling agents with short replies, RAG with injected context. Anything where you’re feeding the model a lot and expecting a little back.

MLX wins when output is long relative to input. Summaries, creative writing, brainstorming, explanations. Fresh conversations with short prompts and long replies. Also when streaming smoothness matters visually. Text at 57 tok/s just flows nicer on screen.

The crossover depends on both context size and reply length. At 600 tokens of context, MLX needs roughly 250+ output tokens to overcome its prefill penalty. At 8,500 tokens of context, it needs 650+.
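The crossover point falls out of simple algebra: MLX wins once the seconds it saves per output token repay its extra prefill time. A sketch plugging in the prefill times and generation speeds measured above:

```python
def crossover_tokens(ttft_mlx, tps_mlx, ttft_gguf, tps_gguf):
    """Output length at which MLX's faster generation has repaid its slower
    prefill. Below this many output tokens, GGUF finishes first."""
    extra_prefill = ttft_mlx - ttft_gguf           # seconds MLX must claw back
    saved_per_token = 1 / tps_gguf - 1 / tps_mlx   # seconds saved per output token
    return extra_prefill / saved_per_token

# Prefill at 655 tokens of context: 6.9 s (MLX) vs 2.9 s (GGUF); generation 57 vs 29.
print(round(crossover_tokens(6.9, 57, 2.9, 29)))    # 236 -> the "250+" rule of thumb
print(round(crossover_tokens(49.4, 57, 37.8, 29)))  # 685 at 8.5K context -> "650+"
```

Swap in your own TTFT and tok/s numbers to find the crossover for your hardware.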

For the ops agent I’m building, which processes tool results, Docker status JSON, and backup logs every single turn, GGUF is the better choice. The agent’s replies are typically short (100-200 tokens) but the context grows fast. For the document classifier, GGUF also wins because classification output is short. If I were building a summarizer with long outputs, I’d pick MLX.

Run it yourself

All my data comes from one model on one machine. One data point is an anecdote. A table full of hardware is useful.

| Hardware | Memory | MLX effective (1.5K ctx) | GGUF effective (1.5K ctx) | Contributor |
| --- | --- | --- | --- | --- |
| M1 Max | 64 GB | 12 tok/s | 16 tok/s | this article |
| M2 Pro/Max | ? | ? | ? | run the benchmark |
| M3 Pro/Max | ? | ? | ? | run the benchmark |
| M4 Pro/Max | ? | ? | ? | run the benchmark |

I built local-llm-bench to make this reproducible. It auto-detects your hardware, runs four scenarios (agent conversation, document classification, prefill scaling, creative writing), and measures effective tok/s alongside generation tok/s. Five minutes, no dependencies.

I’ll update this table as results come in. Fork the repo, run the benchmark, open a PR.

Things that might change your results

Prefill chunk size. LM Studio MLX defaults to 512. Bumping it to 8192 can improve MLX prefill by up to 1.5x. Worth checking if your version has this fix already.

Inference server overhead. Both engines run through LM Studio here. Running llama.cpp via Ollama, or MLX via mlx-lm directly, adds or removes overhead. I tested Ollama separately and found its generation speed was 18 tok/s for the same GGUF model (vs 29 tok/s through LM Studio). That’s a 38% penalty from the Go wrapper. Prefill was comparable.

Hardware. M2 through M4 have better memory bandwidth, which directly affects prefill. These M1 Max numbers might not represent the gap on newer silicon. This is the data point I’m most curious about.

Prompt caching. This turned out to be a bigger deal than I initially realized. LM Studio’s MLX runtime has broken prompt caching for Qwen3.5 multimodal models. Every turn reprocesses the full conversation history. llama.cpp’s cache was likely working normally. Using a non-vision variant of Qwen3.5 or an alternative MLX runtime like oMLX should fix this. I tested Ollama’s cache separately and found a 37% prefill reduction on repeated prefixes (like the system prompt).

Model dtype. If you’re on M1 or M2, check whether your MLX model uses bf16. These chips don’t support bf16 natively, and prefill runs on non-quantized weights regardless of the model’s quant level. Converting to fp16 with mlx_lm.convert --dtype float16 is a one-minute fix. See the bf16 section below for the full command.

Model architecture. Qwen3.5-35B-A3B uses hybrid attention (gated delta-net, sliding window). llama.cpp handles this better than MLX currently. Testing with an older model like Qwen3 or Llama 3.1 (standard attention) would give a cleaner engine-to-engine comparison. Dense models have different prefill characteristics too.
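One of these factors, prompt caching, is cheap to test on your own setup: send the same long prompt twice and compare timings with max_tokens set to 1, which makes total request time roughly equal prefill time. A sketch, assuming an OpenAI-compatible server on LM Studio's default port and a placeholder model id:

```python
# Sketch: detect whether your runtime's prompt cache is actually working.
import json
import time
import urllib.request

def cache_speedup(ttft_cold, ttft_warm):
    """Fraction of prefill saved on the second, identical request.
    Near zero means no prompt caching is happening."""
    return 1 - ttft_warm / ttft_cold

def prefill_time(prompt, url="http://localhost:1234/v1/chat/completions"):
    """Time a non-streaming request with max_tokens=1, so total ~ prefill."""
    body = json.dumps({
        "model": "qwen3.5-35b-a3b",   # placeholder; use whatever your server lists
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 1,
        "stream": False,
    }).encode()
    req = urllib.request.Request(url, body, {"Content-Type": "application/json"})
    start = time.monotonic()
    urllib.request.urlopen(req).read()
    return time.monotonic() - start

# Usage (network-bound, so not run here):
#   prompt = "Summarize:\n" + "GET /health 200\n" * 500   # long shared prefix
#   cold, warm = prefill_time(prompt), prefill_time(prompt)
#   print(f"saved {cache_speedup(cold, warm):.0%}")       # near 0% => cache broken
```

The ~37% prefill reduction I measured for Ollama's cache would show up here as `cache_speedup` around 0.37; LM Studio's MLX runtime with the multimodal Qwen3.5 should show close to zero.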


Community update: what was actually going on with Qwen3.5-A3B

I posted this on r/LocalLLaMA and the post got some traction, even ranked #9 that day. The comments were full of people who actually know what’s going on under the hood.

TL;DR: Multiple factors stacked against MLX for this specific model on this specific hardware. The benchmarks above are real. MLX is just not as mature as GGUF yet. It needs a bit more time. When it works, it’s great. When it doesn’t, you end up here. The community helped figure out why:

  • Prompt caching broken for Qwen3.5 multimodal in LM Studio’s MLX runtime. Every turn reprocesses the full history. GGUF had working caching. (mlx-lm#903, mlx-lm#980)
  • Hybrid attention not optimized in MLX for Qwen3.5. The model uses gated delta-net and sliding window attention. llama.cpp handles it, MLX likely falls back to standard attention.
  • bf16 dtype on M1/M2. MLX models ship bf16. M1 and M2 don’t support bf16 natively. GGUFs use fp16, which M1 runs fine. During prefill, this penalty multiplies across every input token.
  • LM Studio’s MLX runtime specifically. Alternative runtimes like oMLX have proper prompt caching. The problem may not be MLX itself.
  • Most MLX quants are 4-bit only. GGUF has a wider range of quantization options (Q4_K_M, Q5_K_M, Q6_K, Q8_0). More quant levels means better quality/speed tradeoffs.

Or as u/itsjase put it: “You just happened to choose the one model in the world that’s currently slower on MLX.” Fair enough. But it’s also one of the most popular MoE models people run on Macs right now. So here’s what I learned.

Prompt caching is broken for Qwen3.5 multimodal in MLX

u/Regular-Marketing723 called it: LM Studio’s MLX runtime doesn’t cache prompts for Qwen3.5 multimodal models. Every conversation turn reprocesses the entire history from scratch. Non-vision variants work fine.

I used the multimodal variant for all benchmarks. So MLX was paying full prefill cost every turn, while llama.cpp had working KV cache reuse. The agent conversation scenario (8 turns of growing context) was hit hardest by this.

u/Federal-Effective879 confirmed: “Both prompt processing and token generation with MLX are much faster than llama.cpp, but MLX prompt caching has issues with hybrid models like Qwen 3.5, and this issue is exacerbated by agentic usage that depends on prompt caching for usable performance.”

Both GitHub issues (mlx-lm#903, mlx-lm#980) were active within hours of the post going up. Fixes are in progress.

u/Creepy-Bell-4527 and u/d4mations both recommended oMLX and vMLX as alternative MLX runtimes with proper caching. So the caching issue may be LM Studio-specific, not an MLX problem.

Qwen3.5 hybrid attention is not fully optimized in MLX

u/rpiguy9907 explained that Qwen3.5 uses a hybrid attention mechanism (gated delta-net, sliding window, mixed attention). llama.cpp supports these newer patterns. MLX likely falls back to standard attention, which gets slower as sequences grow.

Short prompts hide the difference. Long prompts expose it. u/cibernox suggested testing with Qwen3 (older, no hybrid attention) for a cleaner comparison. On their M1 Pro, MLX is consistently ~20% faster than llama.cpp with standard models.

bf16 dtype tanks prefill on M1 and M2

This one might be the biggest single factor. u/bakawolf123 explained: M1 and M2 chips don’t support bf16 (bfloat16) natively. They support fp16 (float16). Most MLX models on Hugging Face ship as bf16. GGUFs use fp16.

During prefill, the engine processes your entire input using the model’s non-quantized weights. Even a 4-bit quantized MLX model uses the underlying weight dtype for prefill. On M1/M2, that means bf16 gets emulated in software while GGUF runs native fp16. Generation is less affected (less compute per step), but prefill multiplies the penalty across every input token.
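You can check for bf16 weights without loading the model at all: a .safetensors file begins with a little-endian u64 giving the header length, followed by a plain JSON header that lists every tensor's dtype. A small sketch (the filename in the usage comment is hypothetical):

```python
# Sketch: count tensor dtypes in a .safetensors file by reading its JSON header.
import json
import struct
from collections import Counter

def weight_dtypes(path):
    """Return a Counter of tensor dtypes. Lots of "BF16" on an M1/M2 means
    prefill is paying the software-emulation penalty described above."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # u64, little-endian
        header = json.loads(f.read(header_len))
    return Counter(t["dtype"] for name, t in header.items()
                   if name != "__metadata__")

# Usage:
#   weight_dtypes("model-00001-of-00004.safetensors")
#   -> e.g. Counter({'BF16': 310, 'U32': 155}) on a bf16 quantized export
```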

The fix takes under a minute. Install mlx-lm and convert:

pip install mlx-lm
mlx_lm.convert --hf-path mlx-community/Qwen3.5-35B-A3B-4bit --mlx-path ./Qwen3.5-35B-A3B-fp16 --dtype float16

This rewrites the non-quantized parameters from bf16 to fp16. Quantized weights stay as-is. Point LM Studio or mlx_lm.server at the output directory and prefill should improve on M1/M2. M3 and M4 may have partial bf16 support (unconfirmed). M5 reportedly handles bf16 natively. I haven’t rerun the benchmarks yet. That’s first on the list for Part 2.

Where this leaves us

GGUF is the safer bet right now. It’s more mature, more stable, and has wider quantization options. MLX has raw speed potential, but the ecosystem around it (caching, model support, runtime quality) still needs to catch up. Multiple people in the thread confirmed the same experience: MLX can be fast for certain models, but GGUF is more predictable across the board.

My conclusion after all of this: use GGUF as your default. Test specific scenarios with your actual workload before switching to MLX. And don’t trust synthetic tok/s numbers. That’s exactly why I built the benchmark harness above.

Also, I’m not the only one running into this.


Coming in Part 2: Isolating the variables

The r/LocalLLaMA discussion gave me a clear list of things to test. Each one could change the picture on its own.

bf16 to fp16 conversion. Run mlx_lm.convert on the model and rerun the benchmark. If this closes most of the gap, the story changes from “MLX prefill is slow” to “MLX prefill is slow with bf16 on M1/M2.”

Non-vision Qwen3.5 variant. Use a quant with the vision module removed. Prompt caching should work with these. If the agent conversation scenario flips, broken caching was the dominant factor.

Llama 3.1 8B MLX vs GGUF. No hybrid attention, well-supported by both engines. A clean engine-to-engine comparison without the Qwen3.5 complications.

Alternative runtimes. mlx_lm.server directly (bypass LM Studio), oMLX (better caching), and Ollama for the GGUF baseline. How much does the wrapper matter?

Tuning flags. Flash attention, quantized KV cache, and raised GPU wired memory limits:

sudo sysctl iogpu.wired_limit_mb=8192
launchctl setenv OLLAMA_FLASH_ATTENTION 1
launchctl setenv OLLAMA_KV_CACHE_TYPE "q4_0"

The goal for Part 2: figure out how much of the MLX penalty is the engine, how much is the model, and how much is the runtime. Credit to everyone in the Reddit thread who helped narrow it down.


FAQ

Should I switch from MLX to GGUF? Depends on your workload and model. For Qwen3.5 multimodal on M1/M2, GGUF is faster in most scenarios right now due to compounding issues with MLX (broken caching, hybrid attention, bf16 dtype). For other models with working prompt caching, MLX can be competitive or faster, especially for long outputs. MLX can also suffer stability issues that GGUF doesn’t have. Benchmark both with the tool above.

Does Ollama use MLX or llama.cpp? llama.cpp. If you’re on Ollama, you’re already on the GGUF engine. Though note that Ollama adds its own overhead. I measured 18 tok/s generation where LM Studio gets 29 tok/s from the same llama.cpp build. More on that in Part 2.

Will MLX catch up? It’s already faster per-operation for both prefill and generation, according to multiple commenters on r/LocalLLaMA. The issues are model-specific (hybrid attention support, caching for certain architectures) and runtime-specific (LM Studio’s caching implementation). Fixes are in progress (mlx-lm#980). Regardless of which engine wins in six months: measure what you actually wait for, not what the counter says while text is streaming.


Related: How Local LLMs Actually Work on Your Mac —> | What is Ollama? —> | Open WebUI on Mac —>

From the Build Log: Local LLMs on a Mac: From Magic to Disappointment —>
