
What is Ollama and How Do You Run It on a Mac?

Ollama explained from scratch for Apple Silicon Macs: what it is, how to install it, and how to run your first local AI model. No cloud, no account, no data leaving your machine.

What you’ll build: Ollama installed on your Mac with a local AI model you can chat with from the terminal.

End state: Type a question, get an answer, completely offline. No account, no API key, no data leaving your machine.

What you’ll understand: What Ollama actually is, how AI models work in plain terms, what those cryptic model names mean, and how to have your first conversation with a local AI.

You need a Mac with Apple Silicon (M1 or later) and at least 16GB of unified memory. 8GB can work for very small models, but 16GB gives you a comfortable experience.
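Not sure how much unified memory your Mac has? One way to check from the terminal (macOS only; hw.memsize reports bytes):

```shell
# Print this Mac's total unified memory in GB (macOS only).
# hw.memsize reports bytes; 1073741824 bytes = 1 GiB.
sysctl -n hw.memsize | awk '{ printf "%.0f GB\n", $1 / 1073741824 }'
```

The same number is under Apple menu → About This Mac.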

Why run AI locally at all?

Because it’s fun. There’s something satisfying about running a language model entirely on hardware sitting under your desk, no cloud, no subscription, no one else’s server involved. It’s the best kind of toy since plastic bricks.

This isn’t about replacing ChatGPT or Claude. The big cloud models are better for plenty of things and that’s fine. What this is about is having the tech running in your own house, understanding how it works, and being set up for what comes next. Models improve fast. What feels limited today will feel obvious in two years. The people who know this stack now will be ahead when the capabilities catch up.

What is Ollama?

Ollama is an open-source tool that makes running AI language models on your own hardware as simple as installing an app. Before Ollama existed, getting a model running locally meant compiling software from source, managing Python environments, and debugging cryptic dependency errors for an afternoon. Ollama wraps all of that complexity in a single binary with a clean command-line interface.

Under the hood it uses llama.cpp, a C++ runtime that runs AI models efficiently on consumer hardware, including Apple Silicon. Ollama adds model management on top: download a model by name, switch between them, expose a local API that other tools can talk to.

It’s free, open source, and has no telemetry or usage tracking. The project is at ollama.com.

Why the name? The project never gave an official explanation. The most likely reading: it’s a play on “llama”, both the animal and Meta’s Llama model family that put open-weight AI on the map. The extra “ol” gives it a friendlier sound. Whatever the origin, the llama theme runs through the whole ecosystem: the models, the tools, the community.

What is an AI model?

A model is a large file, typically several gigabytes, containing billions of numerical weights learned from training on text. When you ask it a question, it uses those weights to predict the most likely next words, one token at a time. That’s the whole mechanism. It’s pattern matching at a scale that produces surprisingly useful results.

Models don’t think, don’t remember previous conversations by default, and don’t have access to the internet. What they do well: understanding and generating text, summarizing, translating, explaining, drafting, answering questions on topics covered in their training data.
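One practical consequence of the no-memory point: any tool that chats with a model resends the entire conversation on every request. A sketch of the message list Ollama's /api/chat endpoint expects (the content strings here are invented for illustration):

```shell
# Each request carries the full history; the model sees only what you send.
messages='[
  {"role": "user", "content": "What is unified memory?"},
  {"role": "assistant", "content": "Memory shared between the CPU and GPU..."},
  {"role": "user", "content": "Why does that help AI models?"}
]'

# Count the turns to confirm the JSON is well-formed:
echo "$messages" | python3 -c 'import json,sys; print(len(json.load(sys.stdin)))'
```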

Different models have different strengths. Some are tuned for code, some for conversation, some for specific languages. Ollama gives you access to dozens of them.

How to read a model name

Model names look cryptic until you know the pattern. Take llama3.1:8b-instruct-q4_K_M as an example. It breaks into four parts.

llama3.1 is the model family and version. Llama is Meta’s open-weight model series, meaning the weights are publicly available for anyone to download and run. Other families you’ll encounter: Qwen (Alibaba), Mistral (French startup), Gemma (Google), Phi (Microsoft). Each family has different strengths and release cadences.

:8b is the parameter count. Parameters are the numerical values the model learned during training. 8 billion of them. Think of parameters like the knobs and dials of a very complex machine: more knobs means the machine can model more nuance, but also takes up more space and runs slower.

A rough rule of thumb for RAM: the model needs about 1GB of RAM per billion parameters at 4-bit quantization. An 8B model needs roughly 8GB. A 32B model needs roughly 24-32GB. Common sizes and what they’re good for:

Size    RAM needed    Good for
3B      ~3 GB         Quick tasks, low-end hardware
7-8B    ~6-8 GB       Daily driver, good quality
32B     ~24 GB        High quality, needs a 64GB Mac
70B+    ~48 GB+       Research-grade, most home servers can’t run these
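The rule of thumb can be written as a tiny helper, assuming the ~1GB per billion parameters at 4-bit figure and linear scaling with bit width (a rough sketch, not an exact formula; real usage also depends on context length):

```shell
# Rough RAM estimate in GB: params (billions) x quant bits / 4.
# At q4 that works out to ~1 GB per billion parameters; q8 doubles it.
estimate_ram_gb() {
  local params_b=$1 bits=$2
  echo $(( params_b * bits / 4 ))
}

estimate_ram_gb 8 4    # 8B at q4  -> ~8 GB
estimate_ram_gb 8 8    # 8B at q8  -> ~16 GB
estimate_ram_gb 32 4   # 32B at q4 -> ~32 GB
```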

-instruct means the model was fine-tuned to follow instructions and have conversations. The alternative is a base model, which just predicts the next word without understanding that you’re asking it something. For chat, always pick the instruct variant. If you don’t see -instruct in the name, look for -chat or -it as equivalents depending on the family.

-q4_K_M is the quantization. This is where it gets a bit technical, but the concept is simple: the original model weights are stored as high-precision numbers. Quantization compresses them to lower precision to save RAM and run faster, at a small cost to quality. q4 means 4-bit precision. q8 is closer to the original but uses roughly twice the RAM. q4_K_M is a specific variant that balances quality and size well.

You almost never need to think about this. When you run ollama pull llama3.1:8b, Ollama picks a good quantization automatically. The full tag q4_K_M shows up when you browse models manually or run ollama list.

When in doubt: pull the 8b variant of whatever family you want to try. Ollama handles the rest.
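If you want to see the pattern mechanically, the pieces of a tag like llama3.1:8b-instruct-q4_K_M (a real tag in Ollama's library) split apart with plain bash string operations. A sketch only; real tags vary in how many dash-separated fields they carry:

```shell
tag="llama3.1:8b-instruct-q4_K_M"

family=${tag%%:*}                      # before the ':'  -> llama3.1
rest=${tag#*:}                         # after the ':'   -> 8b-instruct-q4_K_M
size=$(echo "$rest" | cut -d- -f1)     # parameter count -> 8b
variant=$(echo "$rest" | cut -d- -f2)  # fine-tune style -> instruct
quant=$(echo "$rest" | cut -d- -f3-)   # quantization    -> q4_K_M

echo "family=$family size=$size variant=$variant quant=$quant"
```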

Install Homebrew

Homebrew is a package manager for macOS. If you already have it, skip this step.

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

It will ask for your password and walk you through the setup. Takes a couple of minutes.

Install Ollama

brew install ollama

Homebrew downloads the Ollama binary, puts it on your PATH, and registers a background service so Ollama starts automatically every time you log in. After the install finishes, Ollama is already running.

Verify it’s up:

curl http://localhost:11434/api/version

You should see something like:

{"version":"0.6.2"}

If you get “connection refused”, start Ollama manually:

brew services start ollama

Then try the curl again.

Download your first model

Ollama has no models yet. You need to download one. Start with Llama 3.1 8B, a solid general-purpose model that runs well on most Apple Silicon Macs:

ollama pull llama3.1:8b

This downloads about 4.7GB from Ollama’s model registry. You’ll see a progress bar:

pulling manifest
pulling 8eeb52dfb3bb... 100% ▕████████████████▏ 4.7 GB
success

Start a conversation

ollama run llama3.1:8b

Ollama loads the model into memory (a few seconds on first load) and drops you into a chat:

>>> 

Type anything and press enter. Ask it to explain something, draft a message, summarize text you paste in. Type /bye to exit.

The model runs entirely on your Mac. Nothing is sent anywhere.

Manage your models

ollama list               # see what you have downloaded
ollama ps                 # see what's loaded in memory right now
ollama pull qwen2.5:7b    # download another model
ollama rm llama3.1:8b     # delete a model to free up disk space

Ollama loads models on demand and unloads them automatically after a few minutes of inactivity. Pull several, experiment, delete the ones you don’t use.

What else can you run?

One model is just the start. There are specialist models for coding, translation, legal documents, SQL generation, math, image description, and dozens of other niches. New ones appear weekly. Half the fun is pulling a model you’ve never heard of and seeing what it can do.

ollama.com/library is the simplest starting point. Models are tagged by category and sorted by popularity. Search for a task like “code” or “vision” and you’ll find something tuned for it. Each model page shows sizes, RAM requirements, and a one-line pull command.

huggingface.co/models is the bigger catalog. Filter by task and by format (GGUF for Ollama), then sort by trending or most downloaded. Most models worth running are already in Ollama’s library, but HuggingFace is where you find the long tail.

A few categories worth trying once you’re comfortable: coding assistants, vision models that describe images, translation models, and SQL generators.

The models are free. Disk space is the only cost. Pull ten, delete the ones that don’t click, keep the ones that surprise you. ollama rm cleans up in seconds.

Useful API endpoints

Ollama exposes a local API at http://localhost:11434 that other tools use to talk to it. You don’t need the API for chatting in the terminal, but it’s good to know it exists.

# Check Ollama is running
curl http://localhost:11434/api/version

# List downloaded models
curl http://localhost:11434/api/tags

# Send a prompt and get a response
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "What is Docker in one sentence?",
  "stream": false
}'

This is the API that Open WebUI, n8n, Continue (VS Code), LangChain, and most local AI tools connect to. If a tool says “Ollama-compatible”, it talks to this endpoint.
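The generate call returns a JSON object with the model’s answer in its response field. A sketch of pulling that field out with nothing but Python’s stdlib (the sample payload below is invented; real responses include extra timing fields):

```shell
# Sample of the response shape (fabricated content, real field names):
sample='{"model":"llama3.1:8b","response":"Docker packages applications into portable containers.","done":true}'

# Extract just the answer text:
echo "$sample" | python3 -c 'import json,sys; print(json.load(sys.stdin)["response"])'
```

If you have jq installed (brew install jq), piping through jq -r '.response' does the same thing.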


Frequently asked questions

Can I run AI on a Mac without internet? Once you’ve pulled a model with ollama pull, it lives on your disk. Unplug the Ethernet, turn off Wi-Fi, it doesn’t care. No internet, no account, no API key. I’ve used it on a train with no signal.

How much RAM do I need to run Ollama on a Mac? 16GB gets you a 7-8B model running comfortably. That’s enough for automations, Whisper, quick Q&A. 32GB gives you room to experiment without watching Activity Monitor. 64GB is where you run the 32B+ models that are starting to get seriously good.

Is Ollama free? Completely. Open source, no telemetry, no usage limits. The models are free too, released under open licenses by Meta, Alibaba, Mistral, Google, Microsoft, and others. Your electricity bill is the only cost.

What’s the difference between Ollama and ChatGPT? ChatGPT runs on OpenAI’s servers. You send your data to them, they process it, they keep logs. Ollama runs models on your Mac. Nothing leaves your machine. ChatGPT is still better at complex reasoning, but for private questions, first drafts, and home automations, local models do the job.

Does Ollama use the GPU on Apple Silicon? The native macOS install uses Metal for GPU acceleration through unified memory. That’s why you install Ollama directly, not in Docker. The containerized version runs inside OrbStack’s Linux VM and has zero GPU access. Night and day difference in speed.


What’s next

The terminal chat is a good start. The next guide covers setting up Open WebUI, a browser-based chat interface that looks and works like ChatGPT, model recommendations for different hardware configurations, and serving everything to every device on your home network.

Open WebUI and serving AI to your home network →

We invested the time to perfect the setup. So you don't have to.

Check out famstack.dev →

Try it with your local LLM

Copy this guide and paste it into Open WebUI or any local chat interface as a new conversation. Your local model becomes a setup assistant that walks you through each step, explains commands, and helps troubleshoot errors.
