You ran ollama pull, typed a question, and text appeared. Cool. But what happened between your keyboard and that answer? What’s a “7B model” and why is it 4 GB instead of 14? Why do long conversations feel slower? Why does your Mac run models that choke a gaming PC with twice the specs?
This guide answers all of that. No math, no ML degree. After reading, you’ll understand:
- What’s inside a model file and why sizes vary so much
- How tokens, quantization, and context windows work
- The two inference phases and why the speed number on your screen lies
- Why Apple Silicon’s memory architecture is unusually good for this
- What all the jargon means (MoE, GGUF, KV cache, TTFT, tok/s)
Who this is for: You can handle a terminal. You’ve run a local model or you’re about to. You want to understand what’s happening, not just copy-paste commands.
The big picture
Before we get into the details, here’s everything that happens between your keyboard and the answer on screen. Each box is a concept explained in its own section below.
YOU TYPE A PROMPT
│
▼
┌─────────────┐
│ Tokenizer │ "What is..." → [3923, 374, ...]
└──────┬──────┘
│
▼
╔════════════════════════════════════════════════╗
║ INFERENCE ENGINE (Ollama, llama.cpp, MLX) ║
║ Model weights loaded in memory (quantized) ║
║ ║
║ ┌───────────────┐ ┌─────────────────────┐ ║
║ │ 1. PREFILL │ │ 2. GENERATE │ ║
║ │ │ │ │ ║
║ │ Process all │───▶│ One token at a time │ ║
║ │ input at once │ │ │ ║
║ │ │ │ "Paris" → "is" │ ║
║ │ (you wait) │ │ → "the" → ... │ ║
║ └───────────────┘ └──────────┬──────────┘ ║
╚═════════════════════════════════╪══════════════╝
│
▼
┌─────────────┐
│ Tokenizer │ → "Paris is the..."
└──────┬──────┘
│
▼
YOUR SCREEN:
"Paris is the capital of France."
That’s the whole pipeline. Every section below unpacks one piece of it.
Billions of numbers in a trenchcoat
When you run ollama pull llama3:8b, you download a file somewhere between 4 and 50 GB. That file contains billions of numbers called weights (also called parameters). These numbers are everything the model learned during training: language patterns, facts, reasoning strategies, and a lot of internet arguments about tabs versus spaces.
A “7B model” has 7 billion parameters. A “70B model” has 70 billion. More parameters generally means a smarter model, but also a bigger file and more memory needed to run it.
The weights are fixed. They don’t change when you chat with the model. Your conversation doesn’t teach it anything new. Think of weights as long-term memory: everything you’ve learned over your life that shapes how you think and respond. You don’t form new long-term memories during a single conversation. Neither does the model.
How big are these files?
Each parameter is a number. At full precision (16-bit floating point), each number takes 2 bytes. So a 7B model at full precision would be:
7,000,000,000 × 2 bytes = 14 GB
A 70B model? 140 GB. That’s more RAM than most computers have.
This is where quantization comes in.
Quantization: the JPEG of AI
Quantization compresses model weights by storing each number with fewer bits. Instead of 16 bits per number, you use 8, 4, or even 3.
Think of it like JPEG compression for images. The original photo might be 20 MB. A high-quality JPEG is 2 MB and you can barely tell the difference. Crank the compression too far and you start seeing artifacts. Same idea with model weights. Maarten Grootendorst wrote a visual guide to quantization that covers this in depth if you want the full picture.
| Precision | Bits per weight | 7B model size | 70B model size | Quality |
|---|---|---|---|---|
| Full (bf16) | 16 | 14 GB | 140 GB | Baseline |
| 8-bit (Q8) | 8 | 7 GB | 70 GB | Nearly identical |
| 4-bit (Q4) | 4 | 4 GB | 40 GB | Small loss, sweet spot |
| 3-bit (Q3) | 3 | 3 GB | 30 GB | Noticeable degradation |
4-bit quantization is the sweet spot. The quality loss is small enough that most people can’t tell the difference in conversation, which is why it’s the standard choice for local setups and the default in tools like Ollama.
You’ll see format names like Q4_K_M or 4bit in model downloads. Those refer to specific quantization methods. The details differ, but the principle is the same: fewer bits per weight, smaller file, fits in less memory.
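The arithmetic behind the size table is simple enough to sketch. This minimal estimator counts weights only; real GGUF files run slightly larger because of tokenizer data, metadata, and some layers kept at higher precision:

```python
def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Weights only: parameter count x bits per weight, in gigabytes.
    Real files carry extra overhead (metadata, mixed-precision layers)."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

print(model_size_gb(7, 16))   # 14.0 -- full precision
print(model_size_gb(7, 4))    # 3.5  -- 4-bit, before overhead
print(model_size_gb(70, 4))   # 35.0 -- why the table says ~40 GB with overhead
```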
The “B” in model names stands for billion parameters, not bytes. A “35B model” has 35 billion parameters. At 4-bit quantization, that’s roughly 19 GB on disk. At full precision, it would be 70 GB.
Tokens: models don’t read English
You type “What is the capital of France?” and see “Paris” appear on screen. But the model never sees your words. It works with tokens.
A token is a chunk of text, usually a word or part of a word. The model’s vocabulary is a fixed dictionary of these chunks, typically 30,000 to 150,000 entries. Before your message reaches the model, a tokenizer splits it into tokens from this dictionary.
"What is the capital of France?"
Tokenized: ["What", " is", " the", " capital", " of", " France", "?"]
Token IDs: [3923, 374, 279, 6864, 315, 9822, 30]
Some words become multiple tokens. Uncommon words, technical terms, and non-English text get split into smaller pieces:
"tokenization" → ["token", "ization"] (2 tokens)
"macOS" → ["mac", "OS"] (2 tokens)
"Qwen" → ["Q", "wen"] (2 tokens)
Rule of thumb: 1 token is roughly 3/4 of an English word. A page of text is about 400 tokens. A 10-page document is roughly 4,000 tokens.
The model processes these token IDs as numbers. It outputs a token ID. The tokenizer converts that back to text. Everything you see on screen went through this round trip.
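You can see the mechanics with a toy greedy tokenizer. The vocabulary and IDs below are lifted from the example above (they are illustrative, not any real model’s); real tokenizers use BPE merges learned from data, but the longest-match idea is similar:

```python
# Toy vocabulary: a handful of chunks mapped to made-up token IDs.
VOCAB = {"What": 3923, " is": 374, " the": 279, " capital": 6864,
         " of": 315, " France": 9822, "?": 30, "token": 1, "ization": 2}

def tokenize(text: str) -> list[int]:
    """Greedy longest-match: repeatedly take the longest vocabulary
    entry that prefixes the remaining text."""
    ids = []
    while text:
        match = max((t for t in VOCAB if text.startswith(t)),
                    key=len, default=None)
        if match is None:
            raise ValueError(f"no token for: {text[:10]!r}")
        ids.append(VOCAB[match])
        text = text[len(match):]
    return ids

print(tokenize("What is the capital of France?"))
# [3923, 374, 279, 6864, 315, 9822, 30]
print(tokenize("tokenization"))   # [1, 2] -- two chunks, one word
```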
Why your Mac punches above its weight
This is where Apple Silicon gets interesting.
A traditional PC has separate memory for the CPU (RAM) and GPU (VRAM). Running an LLM on a GPU means the entire model has to fit in VRAM. An NVIDIA RTX 4090 has 24 GB of VRAM. A 70B model at 4-bit needs ~40 GB. Doesn’t fit. You either buy multiple GPUs or use a smaller model.
Apple Silicon uses unified memory. CPU and GPU share the same pool of RAM. A Mac with 64 GB of unified memory can use all of it for the model, the operating system, and your other apps, without copying data between chips.
TYPICAL PC
┌────────────────────┐ ┌────────────────────┐
│ System RAM │ │ GPU VRAM │
│ 32 GB │ │ 24 GB │
│ │ │ │
│ OS, apps, browser │◄─────►│ Model must fit │
│ │ copy │ entirely in here │
└────────────────────┘ └────────────────────┘
Two separate pools. Model limited to VRAM (24 GB).
A 40 GB model? Buy another GPU.
MAC WITH APPLE SILICON
╔════════════════════════════════════════════════╗
║ 64 GB Unified Memory ║
║ ║
║ ┌──────────┐ ┌──────────────┐ ┌──────────┐ ║
║ │ macOS │ │ Model │ │ Headroom │ ║
║ │ ~6 GB │ │ ~19 GB │ │ ~39 GB │ ║
║ │ │ │ │ │ │ ║
║ │ CPU+GPU │ │ CPU + GPU │ │ Apps, KV │ ║
║ │ share │ │ share │ │ cache │ ║
║ └──────────┘ └──────────────┘ └──────────┘ ║
║ ║
╚════════════════════════════════════════════════╝
One shared pool. A 40 GB model? It fits.
This is why a Mac Studio with 64 GB can run models that would need a multi-GPU setup on a PC. The memory bandwidth is lower than a dedicated GPU (the M1 Max does about 400 GB/s versus an RTX 4090’s 1,000 GB/s), so generation is slower. But the model fits, and “slower but works” beats “fast but doesn’t fit.”
Practical sizing: 16 GB runs 7-8B models comfortably. 32 GB handles 13-14B models with room to spare. 64 GB unlocks the 30-70B range, where models start handling complex reasoning and multi-step tasks well.
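A back-of-envelope fit check captures the sizing logic. The ~10% format overhead and the 6 GB reserved for macOS are assumptions for illustration, and this ignores the KV cache, which also eats memory as conversations grow:

```python
def fits(params_b: float, unified_gb: float, bits: int = 4,
         os_gb: float = 6.0) -> bool:
    """Rough check: do the quantized weights plus the OS leave any room?
    Overhead factor and OS figure are assumptions, not measurements."""
    weights_gb = params_b * bits / 8 * 1.1   # +10% for metadata
    return weights_gb + os_gb <= unified_gb

print(fits(8, 16))    # True:  ~4.4 GB of weights in 16 GB
print(fits(70, 16))   # False: ~38.5 GB of weights will not fit
print(fits(70, 64))   # True:  the same model fits in unified memory
```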
The pause and the flow: two phases of inference
Inference is the technical term for “asking the model a question and getting an answer.” Training is when the model learns. Inference is when it uses what it learned. Every time you type a message and get a response, that’s one inference.
And here’s the part that surprises people: inference is guessing. That’s it. The model predicts the most probable next token based on patterns it absorbed during training. It doesn’t “understand” your question. It doesn’t “know” the answer. It’s pattern-matching at a scale so large that the output looks like understanding. When it writes a correct Python function, it’s not programming. It’s predicting what tokens are most likely to follow “def sort_list(” based on millions of code examples it trained on.
Your brain does something similar when you finish someone’s sentence. You’re not doing it because you truly know what they’ll say. You’re pattern-matching based on context, tone, and a lifetime of conversations. The model does this, minus the lifetime, minus the understanding, but with a lot more training data.
Each inference has two distinct phases. Understanding this split is the key to understanding everything else about local LLM performance.
Phase 1: Prefill
The model reads your entire input at once: the system prompt, conversation history, and your new message. It processes all of it in a single batch to build up an internal representation of the context. Similar to how you’d re-read the last few messages in a group chat before replying. The more you have to catch up on, the longer it takes before you start typing.
You experience this as the pause before the first word appears. The technical name is Time To First Token (TTFT).
Prefill speed depends on how much text the model has to read. A short “hello” takes a fraction of a second. A conversation that’s been going for 20 messages with a big system prompt? Several seconds.
Phase 2: Generation
The model produces its response one token at a time. Each new token depends on everything that came before it (the full context from prefill plus all the tokens it has generated so far). This is sequential. There’s no shortcut.
You experience this as text appearing on screen. The speed is measured in tokens per second (tok/s).
You type: "What services are running on my server?"
┌─ PHASE 1: PREFILL ───────────┐ ┌─ PHASE 2: GENERATE ────────────┐
│ │ │ │
│ Process all input: │ │ Produce response: │
│ │ │ │
│ • System prompt │ │ "There" ▸ "are" ▸ "5" │
│ • Chat history │──▶ ▸ "services" ▸ "running" │
│ • Your new question │ │ ▸ ":" ▸ ... │
│ • Tool definitions │ │ │
│ │ │ Speed: ~30 tok/s │
│ Nothing on screen yet. │ │ Mostly constant. │
│ Time grows with input. │ │ │
└──────────────────────────────┘ └────────────────────────────────┘
Seconds to minutes Seconds
(depends on context size) (depends on reply length)
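Sketching that loop in code makes the sequential nature of Phase 2 obvious. The “model” below is a stand-in that replays a canned reply (all token IDs are made up); a real engine runs the full network at every step:

```python
EOS = -1  # sentinel "end of sequence" token (made-up ID)

def next_token(context_ids, prompt_len,
               answer=(40451, 374, 279, 6722, 315, 9822, 13)):
    """Stand-in for the network: returns the next token of a canned reply.
    A real model scores the entire vocabulary given the full context."""
    pos = len(context_ids) - prompt_len
    return answer[pos] if pos < len(answer) else EOS

prompt = [3923, 374, 279, 6864, 315, 9822, 30]   # "What is the capital of France?"

# Phase 1 (prefill): a real engine processes the whole prompt in one
# batch here, filling the KV cache. Nothing is on screen yet.
context = list(prompt)

# Phase 2 (generation): strictly one token at a time; each step sees
# everything that came before, including its own earlier output.
while (tok := next_token(context, len(prompt))) != EOS:
    context.append(tok)

print(context[len(prompt):])   # the reply, token by token
```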
Why this matters: The tok/s number your chat UI shows you only measures Phase 2. If prefill takes 10 seconds and generation takes 5 seconds, the UI proudly displays “30 tok/s” while you actually waited 15 seconds total. The MLX vs llama.cpp benchmark digs deep into this gap.
Context window: the model’s working memory
If model weights are long-term memory, the context window is working memory: everything you can hold in your head right now. Humans can juggle about 4-7 items in working memory at once. Models get thousands of tokens, but the principle is the same. There’s a hard limit, and when it’s full, old stuff falls out.
The context window is the maximum number of tokens a model can “see” at once. Everything in the current conversation (system prompt, chat history, your latest message, and the model’s response so far) has to fit inside this window.
╔═══ CONTEXT WINDOW (e.g. 32,768 tokens) ══════════════════╗
║ ║
║ ┌──────────┐ ┌────────────────────┐ ┌──────┐ ┌───────┐ ║
║ │ System │ │ Chat History │ │ Your │ │Model's│ ║
║ │ Prompt │ │ │ │ New │ │ Re- │ ║
║ │ │ │ Msg 1, 2, 3 ... │ │ Msg │ │sponse │ ║
║ │ ~500 tok │ │ ◄── grows ──► │ │ │ │ │ ║
║ └──────────┘ └────────────────────┘ └──────┘ └───────┘ ║
║ ║
║ ◄──── everything must fit inside this box ───────────► ║
╚══════════════════════════════════════════════════════════╝
Exceed it and older messages simply vanish.
Common context window sizes:
| Context size | Rough equivalent | Typical models |
|---|---|---|
| 4,096 tokens | ~5 pages of text | Older models (GPT-3 era) |
| 8,192 tokens | ~10 pages | Llama 2, many fine-tunes |
| 32,768 tokens | ~40 pages | Llama 3, Qwen 2.5 |
| 128,000 tokens | ~160 pages | GPT-4, Claude, Gemini |
| 262,144 tokens | ~330 pages | Qwen 3.5 (maximum) |
What happens when you exceed it? The model cannot see anything outside its context window. Older messages get pushed out. The model doesn’t “forget” them gradually. They just vanish. One moment it can reference something you said, the next it can’t.
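Chat front-ends typically handle this with a trimming step something like the sketch below. Token counting is faked with a word count here (a real implementation uses the model’s tokenizer), and keeping the system prompt pinned is one common policy, not the only one:

```python
def trim_to_window(messages, max_tokens, count=lambda m: len(m.split())):
    """Drop the oldest messages (after the system prompt) until the
    rest fit. Dropped messages are gone: the model never sees them."""
    system, history = messages[0], list(messages[1:])
    while history and sum(map(count, [system] + history)) > max_tokens:
        history.pop(0)    # the oldest message simply vanishes
    return [system] + history

chat = ["You are a helpful assistant.",
        "msg1 about RAID", "msg2 about backups", "msg3 about ZFS"]
print(trim_to_window(chat, max_tokens=12))
# msg1 is pushed out; the system prompt and newer messages survive
```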
The KV cache: the model’s mental scratch pad
As the model processes tokens during prefill, it builds a data structure called the KV cache (Key-Value cache). It’s the model’s equivalent of the notes you keep in your head during a conversation: who said what, what topics came up, what was asked. For every token in the conversation, each layer of the model stores two vectors:
- Key: “What kind of information is this token?”
- Value: “What is the actual content?”
When generating the next token, the model scans all the Keys to figure out which earlier tokens are relevant, then reads those Values. This is the attention mechanism, the core idea that makes modern language models work.
The KV cache grows with every token in the conversation. More chat history means a bigger cache, which means more memory used and slower prefill.
Turn 1: You say hello, model responds
KV cache: ~200 tokens worth
Turn 5: Back and forth, context building
KV cache: ~2,000 tokens worth
Turn 15: Long conversation with tool results
KV cache: ~8,000 tokens worth
Prefill now takes several seconds
This is why long conversations feel slower than fresh ones. It’s not your imagination.
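The growth is easy to estimate: per token, the cache stores a Key and a Value vector in every layer. The defaults below are roughly Llama-3-8B-shaped (32 layers, 8 KV heads via grouped-query attention, head dimension 128, fp16 cache entries); treat them as an illustration, since every model differs:

```python
def kv_cache_mb(tokens, layers=32, kv_heads=8, head_dim=128, bytes_per=2):
    """Per-token cost: 2 vectors (K and V) x layers x heads x head_dim,
    each entry bytes_per wide (2 bytes assumes an fp16 cache)."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per
    return tokens * per_token / 1e6

for n in (200, 2_000, 8_000):
    print(f"{n:>5} tokens -> ~{kv_cache_mb(n):.0f} MB")
# At ~0.13 MB per token, an 8,000-token chat costs about a gigabyte.
```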
One brain vs a team of specialists
Your brain does something similar to what MoE models do: different regions activate for different tasks. You don’t use your visual cortex when doing mental math. Models figured out the same trick.
Most models come in two architectural flavors.
Dense models
Every parameter in the model activates for every token. A 32B dense model uses all 32 billion parameters for each word it generates.
Pros: consistent and reliable, with well-understood behavior. Cons: slower, because every parameter participates in every step.
Mixture of Experts (MoE)
The model is split into groups of specialist “experts.” For each token, a router picks which experts to activate. Only a fraction of the total parameters are used per token. Hugging Face has a detailed explainer on MoE architecture if you want the deeper mechanics.
Dense Model (32B): MoE Model (35B total, 3B active):
┌──────────────────────┐ ┌──────────────────────┐
│ │ │ Router │
│ All 32B parameters │ │ "Which experts │
│ active for every │ │ for this token?" │
│ single token │ └──────┬───────────────┘
│ │ │
│ Slower per token │ ┌──────┴───────┐
│ More memory used │ │ │
│ during inference │ ┌──┴──┐ ┌───┐ ┌──┴──┐
│ │ │Exp 1│ │...│ │Exp 8│
└──────────────────────┘ │ ON │ │off│ │ ON │
│ 3B │ │ │ │ │
└─────┘ └───┘ └─────┘
Only ~3B active per token
Faster generation
Same total knowledge
Example: Qwen 3.5-35B-A3B has 35 billion total parameters but only activates 3 billion per token. It generates text at 30-50 tok/s on Apple Silicon. A comparable dense model might do 10-20 tok/s.
The catch: MoE models are faster at generation, but the full model still needs to fit in memory. A 35B MoE model takes as much RAM as a 35B dense model, even though it only uses a fraction of the parameters per token.
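The routing step itself is a small piece of machinery. In a real MoE model the router is a learned layer producing a score per expert; the toy scores below just illustrate the top-k selection and weight mixing:

```python
import math

def route(token_scores, k=2):
    """Pick the top-k experts for one token and softmax their scores
    into mixing weights. Only the chosen experts run; the rest idle."""
    top = sorted(range(len(token_scores)),
                 key=lambda i: -token_scores[i])[:k]
    exp = [math.exp(token_scores[i]) for i in top]
    weights = [e / sum(exp) for e in exp]
    return list(zip(top, weights))

# 8 experts, only 2 activate for this token.
scores = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.2]
print(route(scores))   # experts 1 and 4 win, with mixing weights
```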
The middleman: inference engines
A model file is just numbers. You need software to load those numbers, feed in your tokens, and run the math. That software is called an inference engine.
| Engine | What it is | Model format | Notes |
|---|---|---|---|
| llama.cpp | C/C++ inference engine | GGUF | Runs on everything. Metal GPU on Mac. Mature, fast prefill. |
| MLX | Apple’s ML framework | MLX (safetensors) | Built for Apple Silicon. Fast generation, but younger project. |
| Ollama | User-friendly wrapper around llama.cpp | GGUF | One-command install. ollama pull and go. |
| LM Studio | Desktop app, supports both engines | GGUF + MLX | GUI for browsing and running models. |
| Open WebUI | Chat interface (connects to Ollama or others) | Any (via backend) | The ChatGPT-like UI for your local models. |
Ollama is how most people start. It wraps llama.cpp, handles model downloads, and gives you a simple API. One command gets you running:
ollama pull llama3:8b
ollama run llama3:8b
The engine you choose affects performance. In the MLX vs llama.cpp benchmark, the same model on the same hardware showed very different total response times depending on the engine, because each engine handles the prefill and generation phases differently.
Model formats
Models are distributed in different file formats depending on which engine will run them.
GGUF (GPT-Generated Unified Format) is the format used by llama.cpp and Ollama. It bundles the quantized weights, tokenizer, and metadata into a single file. When you ollama pull a model, you’re downloading a GGUF file. The GGUF specification is maintained by the ggml project.
MLX format uses safetensors files optimized for Apple’s MLX framework. You’ll find these on Hugging Face under the mlx-community organization.
Safetensors is a general-purpose format from Hugging Face for storing model weights safely. MLX models use this format. PyTorch models often do too.
You don’t usually need to worry about formats. Ollama handles GGUF automatically. LM Studio picks the right format for each engine. If you’re comparing engines directly (like in the MLX vs llama.cpp benchmark), you’ll want the same model in both formats to get a fair comparison.
The speed number that lies
Chat UIs show various performance metrics. Most of them only tell part of the story.
Tokens per second (tok/s) measures generation speed only. It’s the Phase 2 number. A model generating at 30 tok/s produces about 22 words per second. That’s fast enough to read comfortably. 50+ tok/s feels nearly instant.
Time To First Token (TTFT) is how long you wait before the first word appears. This is the Phase 1 prefill time. Short prompts: under a second. Long conversations with tool definitions: potentially 10-20 seconds.
Total response time is what you actually experience. Prefill time plus generation time. This is the only number that tells you how long you waited for a complete answer.
What the UI shows: What you experience:
"53 tok/s" 12 seconds of nothing
then 5 seconds of text
= 17 seconds total
Versus:
"28 tok/s" 5 seconds of nothing
then 8 seconds of text
= 13 seconds total
The "slower" model answered 4 seconds faster.
This happens because the 53 tok/s model had slower prefill. The tok/s metric only captures part of the picture.
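If you run Ollama, you can compute the honest total yourself: the response from `/api/generate` (with `"stream": false`) reports prefill and generation durations separately, in nanoseconds. The sample values below are hypothetical and mirror the 53 tok/s example above:

```python
def timing_from_ollama(resp: dict) -> dict:
    """Derive TTFT, generation speed, and total wait from an Ollama
    /api/generate response. Ollama reports durations in nanoseconds."""
    prefill_s = resp["prompt_eval_duration"] / 1e9
    gen_s = resp["eval_duration"] / 1e9
    return {"ttft_s": prefill_s,
            "tok_s": resp["eval_count"] / gen_s,
            "total_s": prefill_s + gen_s}

# Hypothetical numbers: 12 s of prefill, then 265 tokens in 5 s.
sample = {"prompt_eval_duration": 12_000_000_000,
          "eval_duration": 5_000_000_000,
          "eval_count": 265}
print(timing_from_ollama(sample))
# {'ttft_s': 12.0, 'tok_s': 53.0, 'total_s': 17.0}
```

The UI would show 53 tok/s; the only number you feel is the 17 seconds.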
Models that argue with themselves first
You know that moment when someone asks you a tricky question and you pause, thinking through the answer before you open your mouth? Some newer models do the same thing, literally. They include a thinking step. Before answering, the model generates a hidden chain of reasoning inside <think> tags:
You: "Should I use RAID for my backup drives?"
Model (internal, hidden from you):
<think>
The user is asking about RAID for backups. RAID is not a backup
strategy. RAID protects against drive failure but not against
accidental deletion, ransomware, or corruption. I should explain
this distinction.
</think>
Model (visible to you):
"RAID protects against a single drive dying. It doesn't protect
against deleting the wrong file, ransomware encrypting everything,
or a software bug corrupting your data. Those events happen to all
drives in the RAID simultaneously. Use RAID for uptime, not for
backup. A proper backup lives on a separate device."
The thinking step uses extra tokens (and therefore extra time) but can improve answer quality for complex questions. For simple queries, it’s overhead you don’t need. Most inference tools let you toggle this on or off.
Qwen 3.5, DeepSeek R1, and similar “reasoning models” use this approach.
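Chat front-ends separate the two parts with a little string surgery. A minimal sketch, assuming the `<think>...</think>` convention shown above (some models or templates use different markers):

```python
import re

def split_thinking(raw: str) -> tuple[str, str]:
    """Separate hidden reasoning from the visible answer."""
    m = re.search(r"<think>(.*?)</think>", raw, flags=re.DOTALL)
    thinking = m.group(1).strip() if m else ""
    visible = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()
    return thinking, visible

raw = ("<think>RAID is not a backup strategy.</think>"
       "RAID protects against a single drive dying.")
thinking, answer = split_thinking(raw)
print(answer)   # only the visible part reaches the screen
```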
Teaching old models new tricks (RAG)
The model’s weights are frozen. It only “knows” what was in its training data. But you can stuff extra information into the prompt at runtime. Same way you’d look something up in a book before answering a question: you didn’t memorize the content, but with the page open in front of you, you can give a pretty good answer. This is called Retrieval-Augmented Generation (RAG).
Without RAG:
┌────────────────────────────────────┐
│ System prompt + your question │ → Model answers from training data
└────────────────────────────────────┘
With RAG:
┌────────────────────────────────────┐
│ System prompt │
│ + retrieved document chunks │ → Model answers using the documents
│ + your question │
└────────────────────────────────────┘
A RAG system searches your documents (PDFs, notes, emails, whatever), finds relevant chunks, and inserts them into the prompt before the model sees it. The model can then answer questions about content it was never trained on.
The trade-off: those document chunks eat into your context window and increase prefill time. A 10,000-token document pasted into the prompt means 10,000 extra tokens the model has to process before it starts answering. On a local model, that can mean an extra 10-30 seconds of wait time depending on your hardware.
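The assembly step is straightforward to sketch. Real RAG systems rank chunks with embedding similarity; this toy version uses word overlap, and all the document text is invented for illustration:

```python
def build_rag_prompt(system, question, docs, top_k=2):
    """Naive retrieval: rank chunks by word overlap with the question,
    then paste the best ones into the prompt ahead of the question."""
    q_words = set(question.lower().split())
    ranked = sorted(docs,
                    key=lambda d: -len(q_words & set(d.lower().split())))
    context = "\n".join(f"- {d}" for d in ranked[:top_k])
    return f"{system}\n\nRelevant documents:\n{context}\n\nQuestion: {question}"

docs = ["The NAS backup job runs nightly at 02:00.",
        "Coffee machine descaling is due in March.",
        "Backups are replicated to an offsite disk weekly."]
print(build_rag_prompt("Answer from the documents.",
                       "When does the backup job run?", docs))
```

Everything returned here lands in the context window, which is exactly where the prefill cost comes from.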
How every piece connects
Back to that pipeline diagram from the top. Every component affects what you experience:
| Component | Affects | Example impact |
|---|---|---|
| Model size (parameters) | Quality of answers | 8B gives decent chat. 32B+ handles complex reasoning. |
| Quantization | Memory usage, subtle quality | 4-bit: 19 GB for a 35B model instead of 70 GB |
| Unified memory | Which models fit | 64 GB Mac runs 35B models. 16 GB Mac runs 8B models. |
| Context window | How much the model can “see” | 32K = ~40 pages. Older messages vanish beyond this. |
| KV cache | Memory usage, prefill speed | Grows with conversation length. Long chats get slower. |
| Inference engine | Prefill and generation speed | Same model, different engine = different total wait time |
| Architecture (Dense/MoE) | Generation speed, memory | MoE: faster per token, same total memory |
Glossary
| Term | Plain English |
|---|---|
| Token | A chunk of text (roughly 3/4 of a word). The unit models work in. |
| Parameters / Weights | The billions of numbers a model learned during training. Its “knowledge.” |
| Quantization | Compressing weights to use less memory. 4-bit is the sweet spot. |
| Inference | Running a model to get an answer. Technically: guessing the next token, repeatedly. |
| Prefill (TTFT) | Processing your input. The pause before text appears. |
| Generation (tok/s) | Producing the response, one token at a time. |
| Context window | Maximum tokens the model can see at once. |
| KV cache | The model’s working memory of the current conversation. Grows per token. |
| Unified memory | Apple Silicon shares RAM between CPU and GPU. No separate VRAM. |
| Dense model | All parameters active for every token. Slower, consistent. |
| MoE | Mixture of Experts. Only a subset of parameters active per token. Faster. |
| GGUF | Model file format for llama.cpp and Ollama. |
| MLX | Apple’s ML framework. Optimized for Apple Silicon. |
| RAG | Retrieval-Augmented Generation. Feeding documents into the prompt at runtime. |
| Attention | How the model decides which earlier tokens matter for the next word. |
Checklist
- Understand what model parameters are and why size matters
- Know that 4-bit quantization is the practical sweet spot for local models
- Understand tokens as the unit models work in (~3/4 word each)
- Know the two inference phases: prefill (waiting) and generation (text flowing)
- Understand why context window size affects memory and speed
- Know the difference between Dense and MoE architectures
- Understand why tok/s alone doesn’t tell you total wait time
- Know what Ollama, llama.cpp, and MLX are and how they relate
Next steps:
- Install Ollama on your Mac →
- Set up Open WebUI for a ChatGPT-like interface →
- MLX vs llama.cpp: which engine is actually faster? → (now you’ll understand every concept in that benchmark)
From the Build Log: Local LLMs on a Mac: From Magic to Disappointment →