Same Engine, 37% Slower: MLX vs llama.cpp on Apple Silicon // famstack.dev

Two models, five runtimes, M1 Max 64GB. I benchmarked MLX against llama.cpp to find out which engine gives you better local LLM performance on Apple Silicon.

A week of benchmarks later: it’s closer than the internet would have you believe. Fix a dtype issue on M1/M2, pick the right runtime, and MLX matches GGUF on raw speed. But GGUF currently ships better quantization, needs no workarounds on older chips, and the model ecosystem moves faster. The runtime you pick changes your throughput by up to 37%.


    Effective tok/s on ops agent scenario (M1 Max 64GB) with Qwen3 30B-A3B

    GGUF (llama.cpp engine)              MLX engine
    ┌─────────────────────────────┐      ┌─────────────────────────────┐
    │  //////////////////////     │      │  ///////////////////        │
    │  LM Studio   41.7 eff tok/s │ best │  oMLX        38.0 eff tok/s │ best
    │                             │      │                             │
    │  /////////////////////      │      │  /////////////////          │
    │  llama.cpp  41.4 eff tok/s  │  =   │  Rapid-MLX   35.6 eff tok/s │ =
    │                             │      │                             │
    │  /////////////              |      │  /////////                  │
    │  Ollama      26.0 eff tok/s │ -37% │  LM Studio   17.0 eff tok/s │ -55%
    └─────────────────────────────┘      └─────────────────────────────┘

    The runtime matters more than the engine.

#What I found

You don't know s**t until you start measuring your actual use-case.

With a model that works correctly on both engines, the speed gap almost disappears. Part 1 tested Qwen 3.5, which had broken prompt caching, hybrid attention MLX couldn’t optimize, and bf16 weights on a chip that doesn’t do bf16. This time I tested Gemma 12B and Qwen3 30B-A3B. No bugs. After converting bf16 to fp16 on my M1, the difference is single digits in most scenarios.

GGUF wins on practical grounds, not speed. GGUF’s K-quant quantization does 4.7x less damage to model quality than MLX’s uniform 4-bit, at similar file sizes. And GGUF uses fp16 natively, so M1/M2 users don’t need the bf16 conversion workaround. Better output quality, less hassle.

The runtime matters more than the engine. Ollama is 37% slower than LM Studio on the same llama.cpp engine. LM Studio runs GGUF at full native speed. I compiled llama.cpp from source and verified: 41.7 vs 41.4 eff tok/s. On the MLX side, oMLX is 2.2x faster than LM Studio’s built-in MLX engine on multi-turn conversations.

oMLX fixes caching bugs, not MLX in general. In Part 1, oMLX was 5x faster than LM Studio MLX on Qwen3.5. On Gemma, where caching works fine, they perform the same. oMLX is the best MLX runtime, but its advantage is caching, not raw speed.

#MLX vs GGUF: Speed comparison

Two models tested on M1 Max 64GB. One dense (Gemma 12B), one Mixture-of-Experts (Qwen3 30B-A3B). Both tested with default bf16 weights, fp16-converted weights, and GGUF Q4_K_M.

#Gemma 3 12B

Gemma 3 12B QAT. Dense, 12B params, standard attention, no known caching bugs. QAT means it was trained with quantization in the loop. Fits a 16 GB Mac.

Gemma 3 12B: effective tok/s by scenario (higher is better)

MLX (bf16) MLX (fp16 converted) GGUF

Creative writing

26.7

32.8

32.4

Doc classification

10.9

16.1

16.9

Ops agent (8 turns)

17.2

20.0

22.2

Prefill stress (8K ctx)

2.8

5.2

4.7

bf16 weights make MLX look worse than it is. With default bf16, GGUF wins every scenario by 21-68%. Convert to fp16, and the gap shrinks to single digits. MLX fp16 edges ahead on creative writing (32.8 vs 32.4) and prefill stress (5.2 vs 4.7). GGUF keeps the lead on doc classification and ops agent. Generation speed follows the same pattern: bf16 runs at 28 tok/s, fp16 at 35, GGUF at 33.

Raw results: Browse the Github Repo.

How I measured. Mac Studio M1 Max, 64 GB, 24 GPU cores. local-llm-bench, a tool I built for this series. Real prompts against real backends, timed end-to-end. All numbers are effective tok/s: output tokens divided by total processing time, prefill included. That’s different from the generation tok/s your UI shows, which ignores prefill entirely. Part 1 explains why that distinction matters. Raw data for every number in this article.

#Qwen3 30B-A3B

Qwen3 30B-A3B 2507. MoE, 30B total / 3B active per token. Same Qwen family as the model tested in Part 1, one generation back. No hybrid attention, no broken caching.

Qwen3 30B-A3B: effective tok/s by scenario (higher is better)

MLX (bf16) MLX (fp16 converted) GGUF

Creative writing

53.7

52.7

56.1

Doc classification

26.4

32.8

33.7

Ops agent (8 turns)

35.7

38.4

41.7

Prefill stress (8K ctx)

6.0

8.6

7.6

Tighter than Gemma. With bf16, GGUF wins all four scenarios (4-28%). With fp16, MLX takes the lead on prefill stress (8.6 vs 7.6) and pulls close on doc classification (32.8 vs 33.7). Generation speed is essentially tied: 58 tok/s (GGUF) vs 55-56 tok/s (MLX). Part 1 showed 57 (MLX) vs 29 (GGUF), but that gap was the model, not the engine.

Raw results: Browse the Github Repo

#Runtimes: Ollama vs LM Studio vs oMLX

Same engine, different wrapper, different speed. The runtime you choose matters as much as the engine underneath.

#How much overhead does the wrapper add?

I compiled llama.cpp from source with Metal support and ran llama-server directly. Same GGUF file, no wrapper.

Qwen3 30B-A3B ops agent effective tok/s, GGUF across three wrappers (higher is better)

GGUF

LM Studio

41.7

llama.cpp (source)

41.4

Ollama

26.0

LM Studio and raw llama.cpp are within noise (41.7 vs 41.4). LM Studio adds no measurable overhead. Ollama is 37% slower on the same engine. Its Go wrapper adds overhead on every request, consistent across both articles and both models tested.

Building llama.cpp from source, if you want to try it:

git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j

./build/bin/llama-server \
  -m /path/to/your-model.gguf \
  --port 8090 -ngl 99 -c 16384

Unless you need custom build flags, LM Studio runs the same engine at the same speed.

#MLX runtimes compared to GGUF (llama.cpp): Qwen3.5-35B-A3B

Wrappers matter on the GGUF side. What about MLX? I went back to the Part 1 model (Qwen3.5-35B-A3B-4bit) and tested four MLX runtimes plus LM Studio GGUF for reference: LM Studio, oMLX, Rapid-MLX, and mlx-openai-server. Since publishing, AlexTzk ran the same benchmarks on M3 Max (128GB, 40 GPU cores). Thanks for the contributions!

Qwen3.5-35B-A3B ops agent effective tok/s across runtimes (higher is better)

M1 Max (64GB, 24 GPU) M1 Max (fp16 converted) M3 Max (128GB, 40 GPU) GGUF reference

oMLX (MLX)

38.0

47.3

71.3

Rapid-MLX (MLX)

35.6

mlx-openai-server (MLX)

26.2

LM Studio (MLX)

17.0

37.1

LM Studio (GGUF)

17.6

Qwen3.5-35B-A3B prefill stress (8K context) effective tok/s (higher is better)

M1 Max (64GB, 24 GPU) M1 Max (fp16 converted) M3 Max (128GB, 40 GPU) GGUF reference

oMLX (MLX)

16.4

12.4

22.6

mlx-openai-server (MLX)

8.7

Rapid-MLX (MLX)

8.5

LM Studio (GGUF)

7.8

LM Studio (MLX)

5.9

14.8

oMLX on M3 Max hits 71.3 eff tok/s, 1.9x the M1 Max result, roughly proportional to GPU cores (40 vs 24). LM Studio MLX on M3 Max gets 37.1, about the same as oMLX on a three-year-old M1 Max. The runtime you pick can matter more than the chip. On M1 Max, LM Studio GGUF (17.6) and LM Studio MLX (17.0) land at the same effective speed for opposite reasons: GGUF generates slowly with fast prefill, MLX generates fast with slow prefill.

Note the fp16 tradeoff on the prefill chart: fp16 hurts oMLX on pure prefill stress (12.4 vs 16.4) because larger weights mean more memory reads at high context. On real workloads where generation dominates, fp16 still wins.

I didn’t benchmark Ollama on Qwen3.5. Based on the 37% overhead measured on Qwen3 30B, expect it below LM Studio GGUF.

Raw results: MLX engine comparison.

#The bf16 fix: Free performance for M1/M2 chips on Apple Silicon for MLX engines

u/bakawolf123 pointed this out in the Part 1 thread. M1 and M2 chips don’t support bfloat16 natively. Most MLX models on HuggingFace ship bf16 weights. GGUFs use fp16.

During prefill, the engine processes your entire input using the model’s non-quantized weight dtype. Even a 4-bit model uses the full-precision dtype for this step. On M1/M2, that means bf16 gets emulated in software. fp16 runs at full hardware speed.

Converting takes under a minute with mlx-lm:

python3 -m venv .venv && source .venv/bin/activate && pip install mlx-lm

mlx_lm.convert \
  --hf-path ~/.lmstudio/models/mlx-community/gemma-3-12b-it-qat-4bit \
  --mlx-path ~/.lmstudio/models/mlx-community/gemma-3-12b-it-qat-fp16 \
  --dtype float16

The converted model shows up in LM Studio as a separate entry. The original stays untouched. You can check if your model ships bf16 by looking at torch_dtype in its config.json.

#Prefill improvement on Gemma 12B

bf16 fix on MLX for M1/M2 Processors: Improved prefill time by context size (seconds, lower is better)

MLX bf16 MLX fp16

655 tokens

6.2s

3.7s

1.5K tokens

12.9s

7.5s

3K tokens

29.5s

17.2s

8K tokens

114.4s

68.9s

1.7x prefill improvement at every context size. Generation speed jumps from 28 to 33 tok/s too, matching GGUF. With fp16, MLX’s prefill penalty drops from 70% slower than GGUF to roughly the same ballpark.

#Prefill improvement on Qwen3 30B

Qwen3 30B-A3B bf16 fix: prefill time by context size (seconds, lower is better)

MLX bf16 MLX fp16 GGUF

655 tokens

2.5s

1.5s

1.6s

1.5K tokens

4.8s

2.8s

3.3s

3K tokens

11.1s

6.6s

7.7s

8.5K tokens

62.5s

42.6s

49.1s

On the MoE model, fp16 MLX is actually faster than GGUF at context sizes below 3K (1.5s vs 1.6s at 655 tokens, 6.6s vs 7.7s at 3K). GGUF takes the lead at 8K. The difference is small either way.

fp16 and bf16 have the same precision for the value ranges LLMs operate in. No quality difference. If you’re on M1 or M2 and running MLX models, this conversion recovers 40-70% of the prefill penalty for free.

#”4-bit” is not “4-bit”

Everything above measures speed. During research for this article I learned that MLX and GGUF quantization don’t produce the same output quality, even when both say “4-bit” and start from the same original weights.

Perplexity: Measures how surprised a model is by text it should be able to predict. Lower means the quantized model behaves more like the original full-precision version. KL-divergence measures how far the quantized model’s probability distributions have shifted from the original. Both are standard ways to measure how much damage quantization does.

The difference comes down to how the bits are allocated.

GGUF’s K-quants (Q4_K_M, Q5_K_M, Q6_K) do not apply 4 bits uniformly. They allocate more bits to sensitive tensors (attention value projections, output layers) and fewer to the rest. Q4_K_M averages out to 4.83 bits per weight. MLX’s default 4-bit quantization applies the same bit depth everywhere.

  MLX 4-bit: same depth everywhere
  ┌────┬────┬────┬────┬────┬────┬────┬────┐
  │ 4b │ 4b │ 4b │ 4b │ 4b │ 4b │ 4b │ 4b │  → 3.50 GB
  └────┴────┴────┴────┴────┴────┴────┴────┘

  GGUF Q4_K_M: more bits on sensitive layers
  ┌────┬────┬──────┬────┬────┬────┬──────┬────┐
  │ 4b │ 4b │  6b  │ 4b │ 4b │ 4b │  6b  │ 4b │  → 3.80 GB
  └────┴────┴──────┴────┴────┴────┴──────┴────┘
                ▲                     ▲
           attn values          output layer

The llama.cpp project measured the perplexity impact of each format on a 7B model. Lower numbers mean closer to the original F16 model:

Format	Size	Perplexity increase vs F16
Q4_0 (uniform, structurally similar to MLX 4-bit)	3.50 GB	+0.2499
Q4_1	3.90 GB	+0.1846
Q4_K_S	3.56 GB	+0.1149
Q4_K_M	3.80 GB	+0.0535
Q5_K_M	4.45 GB	+0.0142

Q4_0 and Q4_K_M are within 300 MB of each other. The perplexity difference is 4.7x. That is significant. KL-divergence measurements on Mistral-7B show the same ratio.

When users asked for K-quant support in MLX, the maintainers said they’re not planning to add it. They offer mixed_3_6 and mixed_2_6 recipes as alternatives. The default MLX models on HuggingFace still ship uniform 4-bit.

As of late March 2026, this is an active and fast-moving area for MLX. Follow JANG-Q: adaptive quantization for MLX, the oMLX maintainers’ 1oQ announcement, and Alexandru Vasile’s writeup on integrating JANG with oMLX to fit larger models on Apple Silicon for the latest work on bringing mixed/adaptive quantization to MLX.

After the fp16 fix, MLX and GGUF trade wins depending on the scenario, with differences in single digits. The engine isn’t the deciding factor. Quantization quality, ecosystem speed, and how much setup friction you want to deal with are.

#GGUF

LM Studio with GGUF. Good UI for browsing and configuring models, runs llama.cpp at full native speed. I verified this by compiling llama.cpp from source and benchmarking both: no measurable overhead. Combine that with GGUF’s broader quantization options (Q4_K_M, Q5_K_M, Q6_K vs MLX’s uniform 4-bit) and the faster ecosystem (new models get GGUF quants within hours on bartowski’s and lmstudio-community’s HuggingFace pages, MLX variants take days to weeks). imho LM Studio + GGUF is the best way to start on Apple Silicon right now.

Avoid Ollama for anything latency-sensitive. Same llama.cpp engine underneath, but the Go wrapper costs 19-40% of your throughput. On multi-turn conversations, prefill grows linearly per turn instead of staying flat. Prompt caching doesn’t appear to work properly.

#MLX

oMLX for newer models like Qwen3.5. Anything prefill-heavy or with long, multi-turn conversations. Its caching layers are smart: caches aren’t just held in-memory, they overflow to SSD too. Resuming old conversations is quick.

oMLX runtime is fast. In contrast: LM Studio is great for GGUF, but currently the slowest MLX option I tested, 2.2x behind oMLX on multi-turn. oMLX handles caching correctly and can be 5x faster when LM Studio’s caching is broken.

If you’re on M1 or M2, convert your MLX weights to fp16: Another 1.5-1.7x on prefill for free.

If you have a demanding use case, benchmark both engines. MLX with fp16 conversion can be faster on prefill-heavy workloads (document processing, RAG with large context). The differences are small enough that your specific model and workload matter more than the engine. The benchmark tool takes five minutes.

#The backstory

MLX promises faster LLM inference on Apple Silicon. llama.cpp has three years of Metal optimization and 900+ contributors. If you’re running local models on a Mac, the runtime you pick changes your performance by up to 37%, and your choice of MLX vs GGUF format matters less than you’d think.

In Part 1, I benchmarked MLX against llama.cpp (GGUF) on one model and found that MLX’s reported 57 tok/s dropped to 3 tok/s in real-world use. I posted the results on r/LocalLLaMA. One of the first responses:

“You just happened to choose the one model in the world that’s currently slower on MLX.” (u/itsjase)

Fair point. I picked Qwen 3.5: strongest cutting-edge local model, Mixture of Experts, 3 billion active parameters, vision support included. But it had broken prompt caching, hybrid attention that MLX doesn’t optimize, and bf16 weights on a chip that doesn’t do bf16. Everything that could go wrong for MLX went wrong on that one model.

So I tested two more models, added three runtimes, compiled llama.cpp from source, patched LM Studio’s prefill chunk size, and converted model weights from bf16 to fp16. A week of benchmarking to answer one question: was it the model, the engine, or the runtime?

#The setup

Same machine as Part 1: Mac Studio M1 Max, 64 GB, 24 GPU cores. Same benchmark tool, specifically created for this series to test real-world usage scenarios relevant for famstack.

Currently four scenarios: creative writing (long output), document classification (short output), ops agent (8-turn conversation), prefill stress test (growing context, no caching). In the meantime I added some vision scenarios too. But they are not part of this article. I will compile the vision results into a separate article.

#Why these two models

Gemma 3 12B QAT (dense, 12B, ~8 GB). Google model, completely different family from Qwen. Standard attention, no known caching bugs. QAT means it was trained with 4-bit quantization in the loop, so the weights are optimized for running quantized. Fits a 16 GB Mac.

Qwen3 30B-A3B 2507 (MoE, 30B total / 3B active, ~17 GB). Same Qwen family as Part 1, one generation back. No hybrid attention, no broken caching. If GGUF wins on this model too, the Part 1 result wasn’t Qwen3.5-specific.

#Why these five runtimes

Runtime	Engine	Purpose
LM Studio MLX	MLX	Apple’s engine through a GUI wrapper
LM Studio GGUF	llama.cpp	Same wrapper, different engine
Ollama	llama.cpp	Same engine, Go wrapper instead of LM Studio
oMLX	MLX	Alternative MLX runtime with tiered KV cache. Crushed Part 1’s benchmarks on Qwen3.5.
llama-server	llama.cpp	Raw llama.cpp compiled from source. No wrapper at all.

Testing LM Studio against raw llama.cpp isolates wrapper overhead. Testing LM Studio against Ollama isolates Go overhead. Testing oMLX against LM Studio MLX shows whether oMLX’s Part 1 advantage was the runtime or just better caching for one broken model.

#Gemma: Four runtimes compared

Google’s Gemma 3 is about as far from Qwen3.5 as I could get. Dense model (all 12B parameters active per token, unlike Qwen3.5’s MoE). Standard attention (no hybrid tricks). QAT quantization. No caching issues in any runtime I tested.

Gemma 3 12B ops agent effective tok/s (higher is better)

MLX engine GGUF engine

LM Studio GGUF

22.2

LM Studio MLX

17.2

oMLX

15.3

Ollama

7.0

oMLX and LM Studio MLX produce the same numbers on Gemma. In Part 1, oMLX was 5x faster because it fixed Qwen3.5’s broken caching. When caching already works, no difference. Ollama falls apart on multi-turn: 7.0 vs 22.2 eff tok/s against LM Studio GGUF. Prefill grows linearly per turn (3s, 10s, 13s, 17s, 19s) instead of staying flat. Caching doesn’t seem to work properly.

The prefill stress numbers (2.8 and 4.7 eff tok/s from the speed comparison above) are worst-case: independent prompts, no shared prefix, full reprocessing. At 8K, MLX waits 114 seconds before the first token, GGUF 67. That’s the M1 Max memory bandwidth wall (400 GB/s). In real multi-turn use with caching, response times stay at 3-6 seconds per turn.

#The model tag confusion

The model variants with its tags and non-consistent naming across different platforms gives me serous headaches as an engineer. We desperately need a semver like standard for this mess. Anyone?

I almost published wrong numbers here.

My first GGUF run used qwen3-30b-a3b from LM Studio. The MLX run used qwen3-30b-a3b-instruct-2507-mlx. Similar names. Different models. The older GGUF leaked <think> blocks into every response, burning tokens on reasoning that should have been suppressed. MLX responses were clean. The GGUF responses opened with paragraphs of internal monologue before the actual output.

That made GGUF look slower and MLX look competitive. I caught it by comparing the saved response files side by side.

After downloading the matching 2507 GGUF, the results changed. “Same model” on HuggingFace doesn’t mean same behavior. Check the exact version tag.

#What didn’t help

The r/LocalLLaMA thread suggested several optimization flags. I tested each on the Gemma 12B prefill stress test.

Flash attention (Ollama). Baseline: 3.7 eff tok/s. With flash attention: 3.9. Noise.

Quantized KV cache (q4_0). Baseline: 3.7. With q4_0: 3.8. KV cache quantization reduces memory usage (useful for fitting longer contexts), but doesn’t speed up the actual prefill computation.

Prefill chunk size patch. LM Studio’s MLX engine processes prompts in 512-token chunks. This patch bumps it to 4096. The author measured 2x improvement on M3 Ultra. On my M1 Max: fp16 Gemma at 8K was 72.7s with the patch, 68.9s without. No measurable difference.

These optimizations target compute efficiency. On M1 Max, the bottleneck is memory bandwidth (400 GB/s). Faster math doesn’t help when the GPU is waiting on memory reads. The chunk size patch may help on M3/M4 where bandwidth is higher. On M1, the bf16-to-fp16 conversion is the only optimization that made a real difference.

#Open questions

Does the 4.7x perplexity gap between uniform 4-bit and K-quant show up in real-world output quality? This article measures speed. The quantization quality question needs its own benchmarks across accuracy, categorization, and coding tasks.

MLX stability on long sessions. Independent reports on r/LocalLLaMA describe MLX prefill becoming “unbearably slow” on large context with Qwen3.5 on an M3 Ultra 512GB, MLX generation speed decreasing with context size while llama.cpp stays stable, and LM Studio Metal/memory failures causing full system reboots during MLX sessions on a 48GB Mac. Switching to GGUF resolved both cases. These reports match what I observed on M1 Max at smaller scale.

How do these numbers scale on newer Apple Silicon? AlexTzk submitted M3 Max results, see the runtime comparison above. oMLX scales from 38.0 to 71.3 eff tok/s (1.9x), roughly proportional to GPU cores (40 vs 24). Both chips share 400 GB/s memory bandwidth. M2 and M4 data still missing. Run the benchmark on your hardware and post results.

References:

local-llm-bench (benchmark tool, all raw data in results/)
r/LocalLLaMA Part 1 discussion
llama.cpp
MLX
oMLX
LM Studio
Ollama
mlx-lm (model conversion)
LM Studio MLX prefill patch
Gemma 3 QAT
mlx-lm#903: Qwen3.5 prompt caching
llama.cpp quantization perplexity benchmarks
KL-divergence measurements for GGUF formats (Artefact2, Mistral-7B)
MLX-LM quality benchmarks (MMLU Pro by quantization level)
MLX issue #1934: Q4_K_M support request (declined by maintainers)
JANG-Q: adaptive quantization for MLX
Integrating JANG with oMLX (Alexandru Vasile)
llama.cpp Apple Silicon performance baselines