Why Run LLMs Locally with Ollama?
Every prompt you send to a cloud AI service leaves your machine. That's fine for casual use. It's a serious problem when you're working with proprietary code, customer data, internal documentation, or anything you'd rather not route through a third-party API.
Ollama solves this cleanly. It's a local model runtime that installs in one command, manages model downloads automatically, handles GPU acceleration without manual CUDA wrangling, and exposes a clean REST API on localhost:11434. Your prompts never leave the machine. No API keys, no usage limits, no billing surprises.
This Ollama setup guide gets you from zero to a fully configured local LLM stack — GPU-accelerated where available, with a custom Modelfile, REST API integration, and Python client wired up. If you want to take things further after this and add a full chat UI on top, our companion guide covers exactly that: Ollama Setup Guide: Run Powerful Local LLMs on Your Own Machine.
Prerequisites
Check these before installing. Missing hardware is the most common reason people get stuck.
Hardware Requirements
| VRAM / RAM | What You Can Run Comfortably |
|---|---|
| 8 GB VRAM | 7B–8B models at Q4 quantization (e.g. llama3.1:8b, mistral:7b) |
| 16 GB VRAM | 13B models at Q4; 8B at Q8; some 32B models with partial offloading |
| 24 GB VRAM | 32B models at Q4; 70B models with partial CPU offloading |
| CPU only (16 GB RAM) | 3B–7B models — slow but functional; expect 2–5 tok/s |
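The figures in the table follow from simple arithmetic: weight memory is roughly parameter count times bits per weight. Here is a rough sketch of that calculation. The bit widths (about 4.5 for Q4, 8.5 for Q8) are illustrative assumptions, since real GGUF files vary by quantization variant, and the estimate ignores the KV cache and runtime overhead.

```python
# Rough weight-memory estimate. The bits-per-weight values are assumptions
# for illustration, not figures published by Ollama.
ASSUMED_BITS_PER_WEIGHT = {"q4": 4.5, "q8": 8.5, "fp16": 16.0}


def estimate_weight_gb(params_billion: float, quant: str) -> float:
    """Approximate size of the model weights in GB (10^9 bytes)."""
    bits = ASSUMED_BITS_PER_WEIGHT[quant]
    # params (billions) * bits per weight / 8 bits per byte = gigabytes
    return params_billion * bits / 8


for size, quant in [(8, "q4"), (8, "q8"), (32, "q4"), (70, "q4")]:
    print(f"{size}B at {quant}: ~{estimate_weight_gb(size, quant):.1f} GB of weights")
```

Under these assumptions a 70B model at Q4 needs about 39 GB for weights alone, which is why it only runs on a 24 GB card with partial CPU offloading.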
GPU support: NVIDIA (CUDA 11.3+), AMD (ROCm 5.7+), Apple Silicon (Metal — built in). Ollama's installer auto-detects all three.
OS Support
- Linux: Any modern distro with systemd (Ubuntu 22.04+ recommended)
- macOS: 12 Monterey or later; Apple Silicon gets native Metal acceleration
- Windows: Windows 10/11 via native installer or WSL2
Before You Start on Linux
# Check if NVIDIA GPU is detected
nvidia-smi
# Check CUDA toolkit version (optional; nvcc exists only if the toolkit is installed)
nvcc --version
# Check available disk space (models are large)
df -h ~
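You need at least 10–20 GB of free disk for model storage. Models are downloaded to ~/.ollama/models by default; when Ollama runs as the Linux systemd service, they go to /usr/share/ollama/.ollama/models instead.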
Step 1 — Install Ollama
Installation takes one command on Linux and macOS. Windows gets a native installer.
Linux and macOS
curl -fsSL https://ollama.com/install.sh | sh
The script detects your OS and GPU, installs the correct runtime (CUDA, ROCm, or Metal), and on Linux registers Ollama as a systemd service that starts automatically on boot.
Windows
Download and run the installer from ollama.com/download. It installs Ollama as a background service and adds ollama to your PATH.
Verify the Installation
# Check the installed version
ollama --version
# On Linux — confirm the service is running
systemctl status ollama
# Hit the API directly to confirm it's alive
curl http://localhost:11434/api/version
If curl returns a JSON version object, Ollama is up and accepting requests. If the service isn't running on Linux, start it:
sudo systemctl enable --now ollama
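The same check is easy to script. Here is a minimal sketch using only the Python standard library; the helper name `ollama_version` is illustrative and not part of any Ollama client. It returns the version string, or None when the server cannot be reached.

```python
import json
import urllib.request
from typing import Optional


def ollama_version(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> Optional[str]:
    """Return the server's version string, or None if it cannot be reached."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
            return json.loads(resp.read()).get("version")
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts; ValueError covers bad JSON
        return None


version = ollama_version()
print(f"Ollama {version} is up" if version else "Ollama is not reachable on localhost:11434")
```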
Step 2 — Pull and Run Your First Model
With Ollama running, pull a model from the Ollama library. Models are stored locally and versioned by tag.
Pull a Model
# Pull Llama 3.2 3B — fast, low VRAM, good starting point
ollama pull llama3.2
# Pull Llama 3.1 8B — stronger reasoning, needs ~8 GB VRAM
ollama pull llama3.1:8b
# Pull Mistral 7B — excellent for code and structured tasks
ollama pull mistral:7b
# Pull DeepSeek-R1 8B — reasoning model with visible thinking traces
ollama pull deepseek-r1:8b
# See everything you've downloaded
ollama list
Run a Model Interactively
# Start an interactive chat session
ollama run llama3.2
# Type your message and press Enter
# Use /bye to exit, /help to see CLI commands
# Use /set verbose to print timing and token-rate stats after each response
The first run after a pull takes a few seconds to load the model into VRAM. Subsequent calls skip that load step, because Ollama keeps the model in memory for the OLLAMA_KEEP_ALIVE duration (5 minutes by default).
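The keep-alive window can also be set per request through the API's `keep_alive` field. The sketch below assumes a server on localhost:11434, and the helper names are illustrative. A generate request with no prompt only loads the model, and a `keep_alive` of 0 unloads it.

```python
import json
import urllib.request


def keep_alive_payload(model: str, keep_alive) -> dict:
    """Build a /api/generate body with no prompt, which only loads (or unloads) the model."""
    return {"model": model, "keep_alive": keep_alive, "stream": False}


def send(payload: dict, base_url: str = "http://localhost:11434") -> dict:
    """POST the payload to /api/generate (needs a running Ollama server)."""
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


preload = keep_alive_payload("llama3.2", "30m")  # keep loaded for 30 minutes
unload = keep_alive_payload("llama3.2", 0)       # unload immediately
print(json.dumps(preload))
print(json.dumps(unload))
```

Calling `send(preload)` before a batch job warms the model, so the first real prompt does not pay the load cost.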
One-Shot Prompts from the Shell
# Pipe a prompt directly without entering interactive mode
echo "Explain TCP three-way handshake in two sentences" | ollama run llama3.2
# Useful for scripting and pipeline integration
cat error.log | ollama run mistral:7b "Summarise the errors in this log:"
Step 3 — GPU Configuration and Performance Tuning
Out of the box, Ollama detects your GPU and loads models into VRAM automatically. But there are several environment variables that give you meaningful control over performance — especially when running multiple models or serving concurrent users.
Confirm GPU Is Being Used
After starting a model, watch GPU VRAM consumption in a second terminal:
# NVIDIA — watch VRAM consumption live
watch -n 1 nvidia-smi
# AMD
watch -n 1 rocm-smi
# Ollama's own process log also shows GPU info
journalctl -u ollama -f | grep -i gpu
If VRAM usage doesn't increase when you load a model, Ollama isn't using your GPU — see the troubleshooting section below.
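Ollama also reports this itself: `GET /api/ps` lists loaded models with their total `size` and the portion held in VRAM (`size_vram`). The sketch below turns that into a percentage. The sample entries are hypothetical values shaped like the response's `models` list.

```python
def gpu_share(model_entry: dict) -> float:
    """Fraction of a loaded model that resides in VRAM (0.0 = CPU only, 1.0 = fully on GPU)."""
    size = model_entry.get("size", 0)
    return model_entry.get("size_vram", 0) / size if size else 0.0


# Hypothetical entries shaped like the "models" list returned by GET /api/ps
sample = [
    {"name": "llama3.1:8b", "size": 6_000_000_000, "size_vram": 6_000_000_000},
    {"name": "big-model", "size": 40_000_000_000, "size_vram": 20_000_000_000},
]
for entry in sample:
    print(f"{entry['name']}: {gpu_share(entry):.0%} in VRAM")
```

Against a live server, fetch http://localhost:11434/api/ps and pass each element of its `models` list to `gpu_share`. Anything below 100% means part of the model is running on the CPU.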
Key Environment Variables
On Linux, set these in the Ollama systemd override so they persist across reboots:
sudo systemctl edit ollama
Add the variables you want under [Service]:
[Service]
# Keep models loaded in VRAM for 30 minutes after last use (default: 5m)
Environment="OLLAMA_KEEP_ALIVE=30m"
# Limit to 1 loaded model — prevents VRAM exhaustion on smaller GPUs
Environment="OLLAMA_MAX_LOADED_MODELS=1"
# Limit concurrent inference requests (useful for single-GPU shared servers)
Environment="OLLAMA_NUM_PARALLEL=1"
# Change where models are stored (e.g. a larger data disk)
Environment="OLLAMA_MODELS=/data/ollama/models"
# Bind to all interfaces so other machines on your network can reach it
Environment="OLLAMA_HOST=0.0.0.0"
sudo systemctl daemon-reload && sudo systemctl restart ollama
Note on OLLAMA_HOST: Setting this to 0.0.0.0 makes the Ollama API reachable from other machines on your network. Only do this on a trusted LAN, because the Ollama API has no authentication by default. If you're exposing it beyond localhost, put it behind a reverse proxy with auth.
Step 4 — Custom Models with Modelfiles
A Modelfile is Ollama's equivalent of a Dockerfile — a plain text blueprint that defines a model's base, system prompt, inference parameters, and conversation seed. It lets you create named, reusable model configurations that can be shared like any other file.
Basic Modelfile Structure
# Modelfile — save as ./Modelfile
FROM llama3.1:8b
# Inference parameters (keep comments on their own lines, not after a value)
# Lower temperature = more deterministic (good for code)
PARAMETER temperature 0.3
# Context window: how many tokens of history the model sees
PARAMETER num_ctx 8192
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.1
# System prompt — sets the model's persona and constraints
SYSTEM """
You are a senior backend engineer. You write clean, idiomatic code with no unnecessary explanation.
When asked for code, return only the code block. No preamble, no "here is the code".
When asked a question, answer directly and concisely.
"""
# Build the custom model from the Modelfile
ollama create backend-dev -f ./Modelfile
# Run it
ollama run backend-dev
# List all models including custom ones
ollama list
# Inspect the Modelfile of any model
ollama show --modelfile backend-dev
Modelfiles are particularly useful for teams — you can commit them to a repository and everyone runs the same model configuration with the same system prompt, parameters, and behaviour.
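If a team maintains several variants of the same configuration, generating the Modelfile from code keeps them consistent. Here is a minimal sketch; `render_modelfile` is a hypothetical helper that only emits the FROM, PARAMETER, and SYSTEM instructions used above.

```python
def render_modelfile(base: str, system: str, params: dict) -> str:
    """Render a Modelfile with FROM, PARAMETER, and SYSTEM instructions."""
    lines = [f"FROM {base}"]
    # One PARAMETER instruction per setting, each on its own line
    lines += [f"PARAMETER {name} {value}" for name, value in params.items()]
    lines.append(f'SYSTEM """\n{system.strip()}\n"""')
    return "\n".join(lines) + "\n"


text = render_modelfile(
    base="llama3.1:8b",
    system="You are a senior backend engineer. Answer directly and concisely.",
    params={"temperature": 0.3, "num_ctx": 8192},
)
print(text)
```

Write the result to ./Modelfile and build it with `ollama create` as shown earlier.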
Load a Local GGUF File
If you have a custom GGUF model file that isn't in the Ollama library, point your Modelfile's FROM at it directly:
# Modelfile pointing at a local GGUF
FROM /models/my-fine-tuned-model.Q4_K_M.gguf
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
SYSTEM "You are a helpful assistant specialised in internal company documentation."
ollama create my-internal-model -f ./Modelfile
ollama run my-internal-model
Step 5 — REST API and Python Integration
The interactive CLI is convenient for exploration. For applications, you want the REST API or one of the client libraries. Alongside its native API, Ollama exposes an OpenAI-compatible endpoint under /v1, so it works as a drop-in replacement in most frameworks that support OpenAI.
REST API with curl
# Non-streaming chat completion ("stream": false returns a single JSON object)
curl http://localhost:11434/api/chat \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.1:8b",
"messages": [
{ "role": "user", "content": "What is the difference between a process and a thread?" }
],
"stream": false
}'
# Non-streaming generate endpoint
curl http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "mistral:7b",
"prompt": "Write a Python function to flatten a nested list.",
"stream": false
}'
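When `"stream"` is true, or omitted since streaming is the default, /api/chat returns newline-delimited JSON: one object per chunk, with the final one marked `"done": true`. The sketch below reassembles such a stream. The sample chunks are hypothetical and trimmed to the fields the code reads.

```python
import json
from typing import Iterable


def collect_stream(lines: Iterable[bytes]) -> str:
    """Concatenate message.content from newline-delimited JSON chunks until done is true."""
    parts = []
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines
        chunk = json.loads(raw)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


# Hypothetical chunks shaped like a streamed /api/chat response
sample = [
    b'{"message": {"role": "assistant", "content": "A process "}, "done": false}\n',
    b'{"message": {"role": "assistant", "content": "owns its memory."}, "done": false}\n',
    b'{"message": {"role": "assistant", "content": ""}, "done": true}\n',
]
print(collect_stream(sample))
```

The response object returned by `urllib.request.urlopen` iterates line by line, so it can be passed to `collect_stream` directly.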
Python with the Ollama Library
pip install ollama
import ollama
# Simple chat call
response = ollama.chat(
model="llama3.1:8b",
messages=[
{"role": "system", "content": "You are a concise technical assistant."},
{"role": "user", "content": "Explain the CAP theorem in three bullet points."},
],
)
print(response["message"]["content"])
# Streaming response — useful for long outputs
for chunk in ollama.chat(
model="llama3.1:8b",
messages=[{"role": "user", "content": "Write a FastAPI hello world endpoint."}],
stream=True,
):
print(chunk["message"]["content"], end="", flush=True)
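The chat endpoint is stateless: the model only sees the messages you send, so a multi-turn conversation means resending the history each time. Here is a minimal sketch of that bookkeeping. `ask` and `fake_chat` are illustrative names, and the stand-in model function lets the example run without a server.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def ask(history: List[Message], user_text: str, chat_fn: Callable[[List[Message]], str]) -> str:
    """Append the user turn, call the model with the full history, and record its reply."""
    history.append({"role": "user", "content": user_text})
    reply = chat_fn(history)
    history.append({"role": "assistant", "content": reply})
    return reply


def fake_chat(messages: List[Message]) -> str:
    # Stand-in for a real model call so the sketch runs without a server
    return f"(reply to turn {sum(m['role'] == 'user' for m in messages)})"


history: List[Message] = [{"role": "system", "content": "You are a concise technical assistant."}]
ask(history, "What is the CAP theorem?", fake_chat)
ask(history, "Which property do most SQL databases give up?", fake_chat)
print(len(history), history[-1]["content"])
```

With the ollama library, `chat_fn` could be `lambda msgs: ollama.chat(model="llama3.1:8b", messages=msgs)["message"]["content"]`.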
OpenAI-Compatible Endpoint
If your project already uses the OpenAI Python client or any OpenAI-compatible library, point it at Ollama's local server — no code changes required beyond the base URL:
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama", # required by the client, value is ignored by Ollama
)
completion = client.chat.completions.create(
model="mistral:7b",
messages=[{"role": "user", "content": "What is a Bloom filter?"}],
)
print(completion.choices[0].message.content)
Step 6 — Troubleshooting and Production Tips
The most common problems are all fixable. Here's what to check and in what order.
Problem: Ollama Isn't Using the GPU
Symptoms: nvidia-smi shows zero VRAM consumption when a model is loaded. Inference is extremely slow (under 3 tok/s on a modern GPU).
Diagnose:
# Check what Ollama sees at startup
journalctl -u ollama | grep -i "gpu\|cuda\|rocm\|metal"
# Show loaded models and whether each is running on GPU or CPU
ollama ps
Common causes and fixes:
- NVIDIA drivers not installed or outdated: Run nvidia-smi. If it fails, install drivers first. Ollama requires CUDA 11.3+ drivers.
- CUDA libraries missing: Install the CUDA toolkit or just the runtime libraries: sudo apt install nvidia-cuda-toolkit
- Reinstall Ollama after installing drivers: The installer detects the GPU runtime at install time. If you installed Ollama before the drivers, reinstall it.
- Model too large for VRAM: If the model exceeds available VRAM, Ollama offloads some or all layers to the CPU, which is much slower. Switch to a smaller quantization (e.g. llama3.1:8b-instruct-q4_0) or a smaller model.
Problem: "Connection Refused" on Port 11434
Ollama isn't running. Start it:
# Linux
sudo systemctl start ollama
journalctl -u ollama -n 50 # read startup logs if it fails
# macOS / Windows — relaunch the Ollama app
# Or start the server manually:
ollama serve
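Scripts that launch right after boot can hit the same error simply because the service is still starting. One option is to retry with a growing delay before giving up. The sketch below takes the probe as a parameter, so the example runs with a simulated one. In practice the probe would request /api/version, and the helper name is illustrative.

```python
import time
from typing import Callable


def wait_until_ready(probe: Callable[[], bool], attempts: int = 5, delay: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call probe() up to `attempts` times, doubling the delay after each failure."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
            delay *= 2  # simple exponential backoff
    return False


# Simulated probe that succeeds on the third call
calls = {"n": 0}


def flaky_probe() -> bool:
    calls["n"] += 1
    return calls["n"] >= 3


waits = []  # record the delays instead of actually sleeping
print(wait_until_ready(flaky_probe, sleep=waits.append), waits)
```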
Problem: Model Download Stalls or Fails
Large models (8B+ at FP16) can be 10–30 GB. Downloads can time out on slow connections.
# Ollama resumes interrupted downloads — just re-run the pull
ollama pull llama3.1:8b
# If a download is corrupted, remove and re-pull
ollama rm llama3.1:8b
ollama pull llama3.1:8b
Problem: Out of Memory During Inference
The model fits in VRAM at load time but crashes when a long prompt exhausts the context buffer. Options:
- Reduce num_ctx in your Modelfile (e.g. from 8192 to 4096)
- Pull a more aggressively quantized version: llama3.1:8b-instruct-q4_0 uses less memory than the default q4_K_M
- Set OLLAMA_MAX_LOADED_MODELS=1 to prevent multiple models competing for VRAM
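Why reducing num_ctx helps: the key/value cache grows linearly with context length. A back-of-the-envelope sketch follows. The model shape (32 layers, 8 KV heads, head size 128) is an assumption modelled on Llama 3.1 8B with a 16-bit cache. Real usage also depends on compute buffers, parallel request slots, and any cache quantization.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int, num_ctx: int,
                   bytes_per_value: int = 2) -> int:
    """Memory for keys and values across all layers at a given context length."""
    # 2 tensors (K and V) per layer, each kv_heads * head_dim values per token
    return 2 * layers * kv_heads * head_dim * num_ctx * bytes_per_value


# Assumed Llama-3.1-8B-style shape: 32 layers, 8 KV heads, head size 128, 16-bit cache
for ctx in (4096, 8192):
    gib = kv_cache_bytes(32, 8, 128, ctx) / 2**30
    print(f"num_ctx={ctx}: ~{gib:.2f} GiB of KV cache")
```

Under these assumptions, halving the context from 8192 to 4096 frees roughly half a GiB.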
Tip: Choose the Right Quantization
Quantization trades a small amount of quality for a large reduction in memory use and, usually, faster inference. The default tag for most library models gives you Q4_K_M, a good balance. Here's a quick reference:
- Q8_0: Near-full quality, ~2× the VRAM of Q4. Use when VRAM is plentiful.
- Q4_K_M: Default. Good quality-to-size ratio. Best starting point.
- Q4_0: Smaller than Q4_K_M, slightly lower quality. Good for tight VRAM budgets.
- Q2_K: Very small, noticeable quality loss. Last resort for CPU-only machines.
Tip: Use Embedding Models for RAG Pipelines
Ollama also serves embedding models — useful for building RAG systems where you embed documents and query them with semantic search:
ollama pull nomic-embed-text
curl http://localhost:11434/api/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "nomic-embed-text",
"prompt": "The user wants to know about database indexing strategies."
}'
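The response carries the vector in its `embedding` field. Semantic search then comes down to comparing vectors, most commonly by cosine similarity. The sketch below ranks documents against a query. The three-dimensional vectors are toy stand-ins for real embeddings, and the file names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query: Sequence[float], docs: List[Tuple[str, Sequence[float]]]) -> List[str]:
    """Return document names ordered from most to least similar to the query."""
    ordered = sorted(docs, key=lambda d: cosine(query, d[1]), reverse=True)
    return [name for name, _ in ordered]


# Toy 3-dimensional vectors standing in for real embeddings
docs = [
    ("indexing.md", [0.9, 0.1, 0.0]),
    ("onboarding.md", [0.0, 0.2, 0.9]),
    ("queries.md", [0.6, 0.6, 0.1]),
]
print(rank([1.0, 0.0, 0.0], docs))
```

For more than a few thousand documents, a vector database or an approximate nearest-neighbour index usually replaces this linear scan.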
Tip: Back Up Your Models Directory
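Models live in ~/.ollama/models (or /usr/share/ollama/.ollama/models when Ollama runs as the Linux systemd service, unless OLLAMA_MODELS points elsewhere). The blobs are large but content-addressed, so when you move to a new machine you can copy the directory across and Ollama will recognise all previously downloaded models without re-downloading: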
# Rsync models to a backup or new machine
rsync -avz --progress ~/.ollama/models/ user@new-server:/home/user/.ollama/models/
What You've Built
At the end of this Ollama setup guide, you have a fully operational local LLM stack:
- Ollama installed and running as a persistent service with GPU acceleration
- Models pulled and tested from the CLI
- GPU utilisation confirmed — you're getting hardware-accelerated inference
- Custom Modelfile defining a named model with a tailored system prompt and tuned parameters
- REST API and Python integration ready for application development
- OpenAI-compatible endpoint for drop-in replacement in existing projects
From here, the natural next step is adding a chat UI. Open WebUI gives you a full ChatGPT-style frontend that connects directly to Ollama — our companion post walks through the exact setup: Ollama Setup Guide: Run Powerful Local LLMs on Your Own Machine.
Need This at Scale?
A single-machine Ollama setup covers a lot of ground. When you're ready to go further — multi-GPU nodes, load-balanced inference across a fleet, private model registries, integration with internal tooling, or compliance-grade infrastructure — the architecture gets more involved.
The Sysbrix team designs and deploys production-grade local AI infrastructure. If you're building something that needs to scale beyond a single server, we're happy to talk through the options.