Ollama Setup Guide: Run Powerful Local LLMs on Your Own Machine
You don't need an OpenAI account or a GPU cluster to run serious language models anymore. Ollama is the fastest way to pull, run, and serve open-source LLMs locally — Llama 3, Mistral, Gemma, Phi, Qwen, and dozens more. It handles model downloading, quantization, hardware acceleration, and a local REST API out of the box. This Ollama setup guide walks you through everything: from installation to calling models from your own code.
Prerequisites
Ollama runs on macOS, Linux, and Windows. Here's what you need before starting:
- macOS: macOS 11 Big Sur or later (Apple Silicon or Intel)
- Linux: Any modern distro with glibc 2.17+ (Ubuntu 20.04+ is ideal)
- Windows: Windows 10 or 11 with WSL2 recommended for best performance
- RAM: Minimum 8GB — 16GB+ for 13B models, 32GB+ for 70B models
- Disk space: Models range from 2GB (small 3B) to 40GB+ (70B). Have at least 20GB free to start
- GPU (optional but fast): NVIDIA GPU with CUDA, AMD GPU with ROCm, or Apple Silicon with Metal — Ollama auto-detects and uses them
Check available RAM and disk before pulling large models:
free -h
df -h ~
nvidia-smi # If you have an NVIDIA GPU
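If you script model downloads, the same disk check can be automated. A minimal sketch using only Python's standard library; the 20% headroom margin is an arbitrary safety choice, not an Ollama requirement:

```python
import os
import shutil

def has_room_for(model_gb: float, path: str = ".", margin: float = 1.2) -> bool:
    """Return True if the filesystem holding `path` has room for a model
    of `model_gb` gigabytes, with a safety margin for temporary files."""
    free_gb = shutil.disk_usage(path).free / 2**30
    return free_gb >= model_gb * margin

if __name__ == "__main__":
    # Would a ~40GB 70B model fit in the home directory?
    print(has_room_for(40, os.path.expanduser("~")))
```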
Installing Ollama
Linux (One-Liner)
The official install script detects your OS, installs the Ollama binary, and sets it up as a systemd service that starts automatically:
curl -fsSL https://ollama.com/install.sh | sh
After installation, verify it's running:
ollama --version
sudo systemctl status ollama
curl http://localhost:11434
The last command should return the text "Ollama is running". That's your confirmation the local API server is up.
macOS
Download the macOS app from ollama.com, open it, and Ollama runs as a menu bar app. The CLI and API are available immediately after launch. Apple Silicon Macs get Metal GPU acceleration automatically — performance is excellent even on M1.
Windows
Download the Windows installer from ollama.com. It installs the CLI and starts the server. For development, WSL2 gives you a better experience — install Ollama inside the WSL2 environment using the Linux one-liner above, then access it from both Windows and WSL.
Docker (Server Deployments)
If you're running Ollama on a headless server or want it containerized alongside other services:
# CPU only
docker run -d \
-v ollama_data:/root/.ollama \
-p 11434:11434 \
--name ollama \
--restart unless-stopped \
ollama/ollama
# With NVIDIA GPU
docker run -d \
--gpus all \
-v ollama_data:/root/.ollama \
-p 11434:11434 \
--name ollama \
--restart unless-stopped \
ollama/ollama
Pulling and Running Models
Finding Models
Browse available models at ollama.com/library. Each model page lists available parameter sizes and quantization variants. Bigger isn't always better — a well-quantized 7B model often beats a poorly-prompted 70B for focused tasks, and it runs on consumer hardware.
Pulling Your First Model
Pull Llama 3.2 (3B — fast, runs on almost anything) to get started:
# Pull a model (downloads to ~/.ollama/models)
ollama pull llama3.2
# Pull specific parameter sizes (llama3.2 defaults to the 3B variant)
ollama pull llama3.2:3b
ollama pull llama3.1:8b
ollama pull llama3.1:70b # Needs 40GB+ RAM
# Other popular models
ollama pull mistral
ollama pull gemma2
ollama pull qwen2.5
ollama pull phi4
ollama pull codellama # Code-focused
Running a Model Interactively
Start an interactive chat session directly in the terminal:
ollama run llama3.2
# Single prompt (non-interactive)
ollama run llama3.2 "Explain how RAG works in 3 sentences"
# Pipe input
cat myfile.txt | ollama run mistral "Summarize this"
Managing Local Models
# List downloaded models
ollama list
# Show model details (parameters, quantization, size)
ollama show llama3.2
# Remove a model to free disk space
ollama rm llama3.1:70b
# Copy a model under a new name (for custom Modelfiles)
ollama cp llama3.2 my-custom-model
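If you want to work with the local model inventory from code, `ollama list` output can be parsed. The NAME / ID / SIZE / MODIFIED column layout below matches recent Ollama versions but isn't a stable contract, and the sample IDs are illustrative — treat this as a sketch:

```python
def parse_ollama_list(output: str) -> list[dict]:
    """Parse the tabular output of `ollama list` into dicts.
    Assumes the NAME / ID / SIZE / MODIFIED columns printed by
    recent Ollama versions."""
    models = []
    for line in output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split()
        if len(parts) < 5:
            continue
        models.append({
            "name": parts[0],
            "id": parts[1],
            "size": " ".join(parts[2:4]),      # e.g. "2.0 GB"
            "modified": " ".join(parts[4:]),   # e.g. "2 days ago"
        })
    return models

# Illustrative sample of `ollama list` output (IDs are made up)
sample = """NAME            ID              SIZE      MODIFIED
llama3.2:latest a80c4f17acd5    2.0 GB    2 days ago
mistral:latest  f974a74358d6    4.1 GB    5 weeks ago"""

print(parse_ollama_list(sample))
```

In practice you'd feed it real output, e.g. `subprocess.run(["ollama", "list"], capture_output=True, text=True).stdout`.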
Using the Ollama REST API
Ollama exposes a local REST API on port 11434. This is what makes it genuinely useful for development — any app that can make an HTTP request can use it.
Generate Endpoint (Single Turn)
curl http://localhost:11434/api/generate \
-H 'Content-Type: application/json' \
-d '{
"model": "llama3.2",
"prompt": "What is the difference between RAG and fine-tuning?",
"stream": false
}' | jq .response
Chat Endpoint (Multi-Turn)
curl http://localhost:11434/api/chat \
-H 'Content-Type: application/json' \
-d '{
"model": "llama3.2",
"messages": [
{
"role": "system",
"content": "You are a concise technical assistant. Answer in bullet points."
},
{
"role": "user",
"content": "What are the key differences between Docker and a VM?"
}
],
"stream": false
}' | jq .message.content
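The same chat request can be made from Python with nothing but the standard library. This sketch assumes the default server on localhost:11434; the network call is kept behind the main guard so it only fires when a server is actually running:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"

def build_chat_payload(model: str, system: str, user: str) -> dict:
    """Assemble a non-streaming /api/chat request body."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "stream": False,
    }

def chat(model: str, system: str, user: str) -> str:
    """POST the payload to the local Ollama server, return the reply text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_chat_payload(model, system, user)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]

if __name__ == "__main__":
    print(chat("llama3.2", "You are a concise technical assistant.",
               "What are the key differences between Docker and a VM?"))
```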
OpenAI-Compatible API
Ollama also exposes an OpenAI-compatible endpoint at /v1/chat/completions. This means you can drop Ollama in as a local replacement for any tool or library that already uses the OpenAI SDK — no code changes needed beyond updating the base URL:
curl http://localhost:11434/v1/chat/completions \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer ollama' \
-d '{
"model": "llama3.2",
"messages": [
{"role": "user", "content": "Write a Python function to flatten a nested list"}
]
}' | jq '.choices[0].message.content'
In Python, you just change the base_url:
from openai import OpenAI
client = OpenAI(
base_url='http://localhost:11434/v1',
api_key='ollama' # Required by the SDK but not validated
)
response = client.chat.completions.create(
model='llama3.2',
messages=[
{"role": "system", "content": "You are a helpful coding assistant."},
{"role": "user", "content": "Explain async/await in Python with an example"}
]
)
print(response.choices[0].message.content)
Customizing Models with Modelfiles
A Modelfile lets you create a custom model variant with a baked-in system prompt, adjusted parameters, or a fine-tuned base. Think of it like a Dockerfile but for LLMs.
Creating a Custom Model
Create a file named Modelfile:
# Modelfile
FROM llama3.2
# Set system prompt
SYSTEM """
You are a senior DevOps engineer. You give precise, actionable answers.
You prefer shell commands and config examples over long explanations.
When in doubt, show the code first.
"""
# Tune generation parameters
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 8192
Build and run it:
ollama create devops-assistant -f ./Modelfile
ollama run devops-assistant
# Verify it's listed
ollama list
Custom models created from Modelfiles are stored locally and behave like any other Ollama model — you can reference them in the API by name.
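If you maintain several variants, generating the Modelfile from code keeps prompts and parameters in one place. A sketch covering only the directives used above (FROM, SYSTEM, PARAMETER); the example values mirror the devops-assistant above:

```python
def render_modelfile(base: str, system: str, parameters: dict) -> str:
    """Render a minimal Modelfile from the FROM / SYSTEM / PARAMETER
    directives. Covers only the subset of Modelfile syntax shown
    in this guide."""
    lines = [f"FROM {base}", "", 'SYSTEM """', system.strip(), '"""', ""]
    for name, value in parameters.items():
        lines.append(f"PARAMETER {name} {value}")
    return "\n".join(lines) + "\n"

modelfile = render_modelfile(
    "llama3.2",
    "You are a senior DevOps engineer. Show the code first.",
    {"temperature": 0.3, "top_p": 0.9, "num_ctx": 8192},
)
print(modelfile)
```

Write the result to a file and build it as before: `ollama create devops-assistant -f ./Modelfile`.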
Exposing Ollama for Network Access
By default, Ollama only listens on 127.0.0.1. If you want to access it from other machines on your network — or from Docker containers on the same host — you need to bind it to 0.0.0.0.
Linux (systemd)
# Edit the systemd service override
sudo systemctl edit ollama
# Add these lines in the editor:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
# Save, then reload and restart
sudo systemctl daemon-reload
sudo systemctl restart ollama
# Verify
curl http://YOUR_SERVER_IP:11434
Accessing Ollama from Docker Containers
When Ollama is installed on the host and you want Docker containers (like Dify or Open WebUI) to reach it, use the Docker bridge IP:
# Get the host IP on the docker bridge
ip addr show docker0 | grep 'inet ' | awk '{print $2}' | cut -d/ -f1
# Usually 172.17.0.1
# Use this in container config:
OLLAMA_BASE_URL=http://172.17.0.1:11434
Tips, Gotchas, and Troubleshooting
Model Runs Slowly or Uses Only CPU
Ollama auto-detects GPU support but can fall back to CPU silently. Check whether GPU is being used:
# While a model is running in another terminal:
nvidia-smi # Check GPU utilization (NVIDIA)
watch -n1 nvidia-smi
# Or check Ollama logs for GPU detection
journalctl -u ollama -n 50 | grep -i gpu
If Ollama isn't using your NVIDIA GPU, make sure the NVIDIA drivers are installed and recent enough for your Ollama version. On AMD GPUs, Ollama uses ROCm; check the logs above to confirm your card was actually detected before digging further.
Out of Memory Errors
Each model requires enough RAM (or VRAM) to load all its layers. A rough guide:
- 3B model (Q4): ~2GB RAM
- 7B model (Q4): ~4–5GB RAM
- 13B model (Q4): ~8–9GB RAM
- 70B model (Q4): ~40–45GB RAM
If you're hitting out-of-memory errors, try a smaller quantization variant or a smaller model. Tags with q4_K_M quantization are the sweet spot for quality vs. size on most models.
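The table above follows a rule of thumb you can compute yourself: parameters × bits-per-weight / 8, plus overhead for the KV cache and runtime. The 4.5 effective bits for Q4 and the ~10% overhead below are ballpark assumptions, not measured values:

```python
def estimate_ram_gb(params_billion: float, bits_per_weight: float = 4.5,
                    overhead: float = 1.1) -> float:
    """Rough RAM estimate for a quantized model: weight size plus ~10%
    for KV cache and runtime. Ballpark only; actual usage varies with
    context length and quantization details."""
    weights_gb = params_billion * bits_per_weight / 8
    return round(weights_gb * overhead, 1)

for size in (3, 7, 13, 70):
    print(f"{size}B @ Q4: ~{estimate_ram_gb(size)} GB")
```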
Model Takes Forever to Load
The first run of a model loads it into memory; this can take 10–30 seconds from a spinning disk. Subsequent calls to the same model are near-instant because Ollama keeps it resident (for five minutes by default). To control how long models stay loaded:
# Keep model loaded indefinitely (useful for API servers)
curl http://localhost:11434/api/generate \
-d '{"model": "llama3.2", "keep_alive": -1}'
# Unload model immediately to free memory
curl http://localhost:11434/api/generate \
-d '{"model": "llama3.2", "keep_alive": 0}'
Ollama Service Won't Start
# Check full service logs
journalctl -u ollama --no-pager -n 100
# Check if port is already in use
sudo ss -tlnp | grep 11434
# Restart cleanly
sudo systemctl restart ollama
Pro Tips
- Use num_ctx intentionally — the default context window is often 2048 tokens. For document processing or long conversations, set it higher in your Modelfile or API call. Larger context = more RAM used.
- Try Open WebUI for a ChatGPT-like interface over your local models. It's a single Docker container that connects to Ollama out of the box:
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway ghcr.io/open-webui/open-webui:main
- Use embedding models for RAG pipelines — ollama pull nomic-embed-text gives you a fast local embedding model you can use with any vector store, keeping your entire RAG stack offline.
- Temperature 0 for determinism — when using Ollama in automated pipelines or evals, set temperature: 0 to get reproducible outputs.
- Store models on a fast drive — model load time is almost entirely disk I/O. An SSD cuts load time by 5–10x compared to spinning disk for large models.
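The embedding tip pairs naturally with a few lines of similarity math. The /api/embeddings call below assumes a local server with nomic-embed-text already pulled, so it's guarded; the cosine function itself is plain Python:

```python
import json
import math
import urllib.request

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def embed(text: str, model: str = "nomic-embed-text") -> list[float]:
    """Fetch an embedding from the local Ollama server."""
    req = urllib.request.Request(
        "http://localhost:11434/api/embeddings",
        data=json.dumps({"model": model, "prompt": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embedding"]

if __name__ == "__main__":
    docs = ["Ollama runs LLMs locally", "The weather is nice today"]
    query = embed("How do I run a model on my own machine?")
    for doc in docs:
        print(doc, round(cosine(query, embed(doc)), 3))
```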
Wrapping Up
This Ollama setup guide covers what you need to go from zero to a fully functional local LLM stack: installation on any platform, pulling and running models, calling the REST API, building custom model variants with Modelfiles, and exposing Ollama for use by other services on your network.
The real unlock with Ollama is the OpenAI-compatible API. Every tool in your stack that talks to OpenAI can talk to Ollama with a one-line config change — no data leaves your infrastructure, no per-token costs, and no rate limits. For development workflows, private document processing, and internal AI tools, it's the right default.
Building a Local AI Stack for Your Team or Business?
Running Ollama in production — with high availability, GPU infrastructure, model routing, and integration into your existing apps — takes more than a single server setup. The sysbrix team designs and deploys private AI infrastructure for teams that need performance, privacy, and reliability without the cloud vendor lock-in.
Talk to Us →