Ollama Setup Guide: Run Local LLMs on Your Machine

Why Run LLMs Locally with Ollama?

Every prompt you send to a cloud AI service leaves your machine. That's fine for casual use. It's a serious problem when you're working with proprietary code, customer data, internal documentation, or anything you'd rather not route through a third-party API.

Ollama solves this cleanly. It's a local model runtime that installs in one command, manages model downloads automatically, handles GPU acceleration without manual CUDA wrangling, and exposes a clean REST API on localhost:11434. Your prompts never leave the machine. No API keys, no usage limits, no billing surprises.

This Ollama setup guide gets you from zero to a fully configured local LLM stack — GPU-accelerated where available, with a custom Modelfile, REST API integration, and Python client wired up. If you want to take things further after this and add a full chat UI on top, our companion guide covers exactly that: Ollama Setup Guide: Run Powerful Local LLMs on Your Own Machine.

Prerequisites

Check these before installing. Missing hardware is the most common reason people get stuck.

Hardware Requirements

VRAM / RAM	What You Can Run Comfortably
8 GB VRAM	7B–8B models at Q4 quantization (e.g. llama3.1:8b, mistral:7b)
16 GB VRAM	13B models at Q4; 8B at Q8; some 32B models with partial offloading
24 GB VRAM	32B models at Q4; 70B models with partial CPU offloading
CPU only (16 GB RAM)	3B–7B models — slow but functional; expect 2–5 tok/s

GPU support: NVIDIA (CUDA 11.3+), AMD (ROCm 5.7+), Apple Silicon (Metal — built in). Ollama's installer auto-detects all three.

OS Support

Linux: Any modern distro with systemd (Ubuntu 22.04+ recommended)
macOS: 12 Monterey or later; Apple Silicon gets native Metal acceleration
Windows: Windows 10/11 via native installer or WSL2

Before You Start on Linux

# Check if NVIDIA GPU is detected
nvidia-smi

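# Check CUDA toolkit version (nvcc exists only if the toolkit is installed; nvidia-smi above also reports the driver's CUDA version)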
nvcc --version

# Check available disk space (models are large)
df -h ~

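You need at least 10–20 GB of free disk for model storage. Models are downloaded to ~/.ollama/models by default. When Ollama runs as the Linux systemd service, that directory belongs to the service user, typically /usr/share/ollama/.ollama/models.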
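
If you'd rather script the disk check, here is a minimal Python sketch using only the standard library. The function names and the 20 GB default are illustrative assumptions based on the 10–20 GB guideline above, not anything Ollama itself provides.

```python
import shutil
from pathlib import Path


def free_gb(path: str) -> float:
    """Return free space, in GB (10**9 bytes), on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 1e9


def has_room_for_models(path: str, required_gb: float = 20.0) -> bool:
    """True if the filesystem holding `path` has at least `required_gb` free."""
    return free_gb(path) >= required_gb


if __name__ == "__main__":
    # Assumption: models will be stored under your home directory.
    home = str(Path.home())
    print(f"{free_gb(home):.1f} GB free under {home}")
    print("Enough room for models:", has_room_for_models(home))
```

Pass a different path if you plan to store models on another disk.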

Step 1 — Install Ollama

Installation takes one command on Linux and macOS. Windows gets a native installer.

Linux and macOS

curl -fsSL https://ollama.com/install.sh | sh

The script detects your OS and GPU, installs the correct runtime (CUDA, ROCm, or Metal), and on Linux registers Ollama as a systemd service that starts automatically on boot.

Windows

Download and run the installer from ollama.com/download. It installs Ollama as a background service and adds ollama to your PATH.

Verify the Installation

# Check the installed version
ollama --version

# On Linux — confirm the service is running
systemctl status ollama

# Hit the API directly to confirm it's alive
curl http://localhost:11434/api/version

If curl returns a JSON version object, Ollama is up and accepting requests. If the service isn't running on Linux, start it:

sudo systemctl enable --now ollama
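
The same health check can be done from Python, which is handy in setup scripts. This is a minimal sketch using only the standard library; it assumes the default address http://localhost:11434, and the helper names are hypothetical.

```python
import json
import urllib.request


def parse_version(body: bytes) -> str:
    """Extract the version string from an /api/version response body."""
    return json.loads(body)["version"]


def get_version(base_url: str = "http://localhost:11434", timeout: float = 3.0) -> str:
    """Ask a running Ollama server for its version."""
    with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
        return parse_version(resp.read())


if __name__ == "__main__":
    try:
        print("Ollama is up, version", get_version())
    except (OSError, ValueError, KeyError) as exc:
        # Connection refused, timeout, or an unexpected response body
        print("Ollama is not reachable:", exc)
```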

Step 2 — Pull and Run Your First Model

With Ollama running, pull a model from the Ollama library. Models are stored locally and versioned by tag.

Pull a Model

# Pull Llama 3.2 3B — fast, low VRAM, good starting point
ollama pull llama3.2

# Pull Llama 3.1 8B — stronger reasoning, needs ~8 GB VRAM
ollama pull llama3.1:8b

# Pull Mistral 7B — excellent for code and structured tasks
ollama pull mistral:7b

# Pull DeepSeek-R1 8B — reasoning model with visible thinking traces
ollama pull deepseek-r1:8b

# See everything you've downloaded
ollama list

Run a Model Interactively

# Start an interactive chat session
ollama run llama3.2

# Type your message and press Enter
# Use /bye to exit, /help to see CLI commands
# Use /set verbose to see timing and token-rate stats after each reply

The first run after a pull takes a few seconds to load the model into VRAM. Subsequent calls skip that load step and start responding almost immediately, because Ollama keeps the model loaded in memory for the OLLAMA_KEEP_ALIVE duration (5 minutes by default).
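
The keep-alive window can also be set per request through the API's keep_alive field. The sketch below sends a /api/generate request with no prompt, which loads the model without generating anything. It assumes the default address, and the helper names are hypothetical.

```python
import json
import urllib.request


def keep_alive_payload(model: str, keep_alive) -> bytes:
    """Build an /api/generate body with no prompt: it loads the model and sets its keep-alive."""
    body = {"model": model, "keep_alive": keep_alive, "stream": False}
    return json.dumps(body).encode("utf-8")


def set_keep_alive(model: str, keep_alive, base_url: str = "http://localhost:11434",
                   timeout: float = 120.0) -> dict:
    """Load `model` and keep it resident for `keep_alive` (e.g. "30m"; 0 unloads it)."""
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=keep_alive_payload(model, keep_alive),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())


if __name__ == "__main__":
    try:
        set_keep_alive("llama3.2", "30m")
        print("llama3.2 loaded and kept alive for 30 minutes")
    except (OSError, ValueError) as exc:
        print("Request failed:", exc)
```

A per-request value only affects the model named in that request, so it is a reasonable way to warm up one model before a demo without changing the server-wide default.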

One-Shot Prompts from the Shell

# Pipe a prompt directly without entering interactive mode
echo "Explain TCP three-way handshake in two sentences" | ollama run llama3.2

# Useful for scripting and pipeline integration
cat error.log | ollama run mistral:7b "Summarise the errors in this log:"

Step 3 — GPU Configuration and Performance Tuning

Out of the box, Ollama detects your GPU and loads models into VRAM automatically. But there are several environment variables that give you meaningful control over performance — especially when running multiple models or serving concurrent users.

Confirm GPU Is Being Used

After starting a model, watch GPU VRAM consumption in a second terminal:

# NVIDIA — watch VRAM consumption live
watch -n 1 nvidia-smi

# AMD
watch -n 1 rocm-smi

# Ollama's own process log also shows GPU info
journalctl -u ollama -f | grep -i gpu

If VRAM usage doesn't increase when you load a model, Ollama isn't using your GPU — see the troubleshooting section below.
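
You can also ask Ollama directly how much of a loaded model sits in VRAM. The /api/ps endpoint lists running models with a total size and a size_vram field. The sketch below turns those two numbers into a fraction; the helper names are hypothetical and the default address is assumed.

```python
import json
import urllib.request


def gpu_fraction(entry: dict) -> float:
    """Fraction of a loaded model's memory that resides in VRAM (0.0 to 1.0)."""
    size = entry.get("size", 0)
    if size <= 0:
        return 0.0
    return entry.get("size_vram", 0) / size


def running_models(base_url: str = "http://localhost:11434", timeout: float = 3.0) -> list:
    """Return the list of currently loaded models from /api/ps."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=timeout) as resp:
        return json.loads(resp.read()).get("models", [])


if __name__ == "__main__":
    try:
        for entry in running_models():
            print(f"{entry.get('name')}: {gpu_fraction(entry):.0%} in VRAM")
    except (OSError, ValueError) as exc:
        print("Could not query Ollama:", exc)
```

A value near 100% means the model is fully on the GPU. Anything lower means part of it is running on the CPU.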

Key Environment Variables

On Linux, set these in the Ollama systemd override so they persist across reboots:

sudo systemctl edit ollama

Add the variables you want under [Service]:

[Service]
# Keep models loaded in VRAM for 30 minutes after last use (default: 5m)
Environment="OLLAMA_KEEP_ALIVE=30m"

# Limit to 1 loaded model — prevents VRAM exhaustion on smaller GPUs
Environment="OLLAMA_MAX_LOADED_MODELS=1"

# Limit concurrent inference requests (useful for single-GPU shared servers)
Environment="OLLAMA_NUM_PARALLEL=1"

# Change where models are stored (e.g. a larger data disk)
Environment="OLLAMA_MODELS=/data/ollama/models"

# Bind to all interfaces so other machines on your network can reach it
Environment="OLLAMA_HOST=0.0.0.0"

sudo systemctl daemon-reload && sudo systemctl restart ollama

Note on OLLAMA_HOST: Setting this to 0.0.0.0 makes the Ollama API reachable from other machines on your network. Only do this on a trusted LAN — there is no authentication on the Ollama API by default. If you're exposing it beyond localhost, put it behind a reverse proxy with auth.

Step 4 — Custom Models with Modelfiles

A Modelfile is Ollama's equivalent of a Dockerfile — a plain text blueprint that defines a model's base, system prompt, inference parameters, and conversation seed. It lets you create named, reusable model configurations that can be shared like any other file.

Basic Modelfile Structure

# Modelfile — save as ./Modelfile
FROM llama3.1:8b

# Inference parameters (keep comments on their own lines; a trailing comment may be read as part of the value)
# Lower temperature = more deterministic (good for code)
PARAMETER temperature 0.3
# Context window: how much history the model sees
PARAMETER num_ctx 8192
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.1

# System prompt — sets the model's persona and constraints
SYSTEM """
You are a senior backend engineer. You write clean, idiomatic code with no unnecessary explanation.
When asked for code, return only the code block. No preamble, no "here is the code".
When asked a question, answer directly and concisely.
"""

# Build the custom model from the Modelfile
ollama create backend-dev -f ./Modelfile

# Run it
ollama run backend-dev

# List all models including custom ones
ollama list

# Inspect the Modelfile of any model
ollama show --modelfile backend-dev

Modelfiles are particularly useful for teams — you can commit them to a repository and everyone runs the same model configuration with the same system prompt, parameters, and behaviour.
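
If your team keeps model settings in code or config, generating the Modelfile can keep everyone in sync. Here is a minimal Python sketch; render_modelfile is a hypothetical helper, and it only covers the FROM, PARAMETER, and SYSTEM instructions shown above.

```python
def render_modelfile(base: str, params: dict, system: str) -> str:
    """Render a Modelfile from a base model, inference parameters, and a system prompt.

    Assumption: `system` does not itself contain triple double quotes.
    """
    lines = [f"FROM {base}"]
    lines += [f"PARAMETER {name} {value}" for name, value in params.items()]
    lines.append(f'SYSTEM """\n{system.strip()}\n"""')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = render_modelfile(
        "llama3.1:8b",
        {"temperature": 0.3, "num_ctx": 8192},
        "You are a senior backend engineer. Answer directly and concisely.",
    )
    print(text)
```

Write the result to ./Modelfile and build it with ollama create as shown earlier.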

Load a Local GGUF File

If you have a custom GGUF model file that isn't in the Ollama library, point your Modelfile's FROM at it directly:

# Modelfile pointing at a local GGUF
FROM /models/my-fine-tuned-model.Q4_K_M.gguf

PARAMETER temperature 0.7
PARAMETER num_ctx 4096

SYSTEM "You are a helpful assistant specialised in internal company documentation."

ollama create my-internal-model -f ./Modelfile
ollama run my-internal-model

Step 5 — REST API and Python Integration

The interactive CLI is convenient for exploration. For applications, you want the REST API or one of the client libraries. Alongside its native API under /api, Ollama serves an OpenAI-compatible API under /v1, so it works as a drop-in replacement in most frameworks that support OpenAI.

REST API with curl

# Non-streaming chat completion (set "stream": true to receive tokens as they are generated)
curl http://localhost:11434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [
      { "role": "user", "content": "What is the difference between a process and a thread?" }
    ],
    "stream": false
  }'

# Non-streaming generate endpoint
curl http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral:7b",
    "prompt": "Write a Python function to flatten a nested list.",
    "stream": false
  }'
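
With "stream": true, /api/chat returns one JSON object per line, each carrying a fragment of the reply in message.content, until an object with "done": true arrives. The sketch below shows one way to consume that stream with only the standard library. The helper names are hypothetical and the default address is assumed.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def iter_chat_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield the text fragments from a newline-delimited /api/chat stream."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        chunk = json.loads(raw)
        yield chunk.get("message", {}).get("content", "")
        if chunk.get("done"):
            break


def stream_chat(model: str, prompt: str, base_url: str = "http://localhost:11434",
                timeout: float = 120.0) -> Iterator[str]:
    """Send one user message and yield the reply as it is generated."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        # Iterating over the response yields one line (one JSON object) at a time
        yield from iter_chat_text(resp)


if __name__ == "__main__":
    try:
        for piece in stream_chat("llama3.1:8b", "Name one difference between a process and a thread."):
            print(piece, end="", flush=True)
        print()
    except (OSError, ValueError) as exc:
        print("Request failed:", exc)
```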

Python with the Ollama Library

pip install ollama

import ollama

# Simple chat call
response = ollama.chat(
    model="llama3.1:8b",
    messages=[
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user", "content": "Explain the CAP theorem in three bullet points."},
    ],
)
print(response["message"]["content"])

# Streaming response — useful for long outputs
for chunk in ollama.chat(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Write a FastAPI hello world endpoint."}],
    stream=True,
):
    print(chunk["message"]["content"], end="", flush=True)

OpenAI-Compatible Endpoint

If your project already uses the OpenAI Python client or any OpenAI-compatible library, point it at Ollama's local server — no code changes required beyond the base URL:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the client, value is ignored by Ollama
)

completion = client.chat.completions.create(
    model="mistral:7b",
    messages=[{"role": "user", "content": "What is a Bloom filter?"}],
)
print(completion.choices[0].message.content)

Step 6 — Troubleshooting and Production Tips

The most common problems are all fixable. Here's what to check and in what order.

Problem: Ollama Isn't Using the GPU

Symptoms: nvidia-smi shows zero VRAM consumption when a model is loaded. Inference is extremely slow (under 3 tok/s on a modern GPU).

Diagnose:

# Check what Ollama sees at startup
journalctl -u ollama | grep -i "gpu\|cuda\|rocm\|metal"

# Show loaded models and whether each is running on GPU or CPU
ollama ps

Common causes and fixes:

NVIDIA drivers not installed or outdated: Run nvidia-smi — if it fails, install drivers first. Ollama requires CUDA 11.3+ drivers.
CUDA libraries missing: Install the CUDA toolkit or just the runtime libraries: sudo apt install nvidia-cuda-toolkit
Reinstall Ollama after installing drivers: The installer bakes in runtime detection at install time. If you installed Ollama before the drivers, reinstall it.
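Model too large for VRAM: If the model exceeds available VRAM, Ollama keeps as many layers as fit on the GPU and runs the rest on the CPU, which slows inference sharply. A model that barely fits at all may run entirely on CPU. Switch to a smaller quantization (e.g. llama3.1:8b-instruct-q4_0) or a smaller model.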

Problem: "Connection Refused" on Port 11434

Ollama isn't running. Start it:

# Linux
sudo systemctl start ollama
journalctl -u ollama -n 50   # read startup logs if it fails

# macOS / Windows — relaunch the Ollama app
# Or start the server manually:
ollama serve

Problem: Model Download Stalls or Fails

Large models (8B+ at FP16) can be 10–30 GB. Downloads can time out on slow connections.

# Ollama resumes interrupted downloads — just re-run the pull
ollama pull llama3.1:8b

# If a download is corrupted, remove and re-pull
ollama rm llama3.1:8b
ollama pull llama3.1:8b

Problem: Out of Memory During Inference

The model fits in VRAM at load time but crashes when a long prompt exhausts the context buffer. Options:

Reduce num_ctx in your Modelfile (e.g. from 8192 to 4096)
Pull a more aggressively quantized version: llama3.1:8b-instruct-q4_0 uses less memory than the default q4_K_M
Set OLLAMA_MAX_LOADED_MODELS=1 to prevent multiple models competing for VRAM
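
To see why lowering num_ctx helps, it's useful to estimate the key/value cache, which grows linearly with context length. The sketch below uses the standard formula for a transformer with grouped-query attention. The architecture figures (32 layers, 8 KV heads, head dimension 128, 16-bit cache) are assumptions for an 8B Llama-style model, and real usage will be higher because of compute buffers and other overhead.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   num_ctx: int, bytes_per_value: int = 2) -> int:
    """Approximate KV cache size: one key and one value vector per layer per token."""
    return 2 * n_layers * n_kv_heads * head_dim * num_ctx * bytes_per_value


if __name__ == "__main__":
    # Assumed architecture for an 8B Llama-style model (not read from Ollama)
    for ctx in (4096, 8192):
        gib = kv_cache_bytes(32, 8, 128, ctx) / 2**30
        print(f"num_ctx={ctx}: about {gib:.2f} GiB of KV cache")
```

Under these assumptions, halving num_ctx from 8192 to 4096 halves the cache from about 1 GiB to about 0.5 GiB.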

Tip: Choose the Right Quantization

Quantization trades a small amount of quality for a large reduction in memory use and, usually, faster inference. The default Ollama pull gives you Q4_K_M, which is a good balance. Here's a quick reference:

Q8_0 — Near-full quality, ~2× the VRAM of Q4. Use when VRAM is plentiful.
Q4_K_M — Default. Good quality-to-size ratio. Best starting point.
Q4_0 — Smaller than Q4_K_M, slightly lower quality. Good for tight VRAM budgets.
Q2_K — Very small, noticeable quality loss. Last resort for CPU-only machines.
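
For a back-of-the-envelope size estimate, multiply the parameter count by the bits stored per weight. The sketch below does that for three formats. The bits-per-weight values are approximations (Q8_0 and Q4_0 store a small scale factor per block, hence 8.5 and 4.5 rather than 8 and 4), and real model files differ somewhat because not every tensor uses the same format.

```python
# Approximate bits stored per weight, including per-block scale overhead
BITS_PER_WEIGHT = {"F16": 16.0, "Q8_0": 8.5, "Q4_0": 4.5}


def weights_gb(params_billion: float, quant: str) -> float:
    """Rough size of the model weights in GB (10**9 bytes)."""
    # billions of parameters * bits per weight / 8 bits per byte = GB
    return params_billion * BITS_PER_WEIGHT[quant] / 8


if __name__ == "__main__":
    for quant in ("F16", "Q8_0", "Q4_0"):
        print(f"8B model at {quant}: about {weights_gb(8, quant):.1f} GB of weights")
```

The estimate covers weights only. Leave headroom for the context cache and runtime buffers when matching a model to your VRAM.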

Tip: Use Embedding Models for RAG Pipelines

Ollama also serves embedding models — useful for building RAG systems where you embed documents and query them with semantic search:

ollama pull nomic-embed-text

curl http://localhost:11434/api/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nomic-embed-text",
    "prompt": "The user wants to know about database indexing strategies."
  }'
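
The response contains an "embedding" array of floats. In a RAG pipeline you typically compare the query's vector against each document's vector with cosine similarity and keep the closest matches. Here is a minimal pure-Python sketch of that ranking step. The function names are hypothetical, and a real system would normally use a vector store instead.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query_vec: list, doc_vecs: dict) -> list:
    """Return document ids ordered from most to least similar to the query."""
    return sorted(
        doc_vecs,
        key=lambda doc_id: cosine_similarity(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )


if __name__ == "__main__":
    # Toy 2-dimensional vectors for illustration; real embeddings have hundreds of dimensions
    docs = {"indexing": [0.9, 0.1], "networking": [0.1, 0.9]}
    print(rank_documents([1.0, 0.0], docs))
```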

Tip: Back Up Your Models Directory

Models live in ~/.ollama/models. The blobs are large but content-addressed — if you move to a new machine, copy the directory across and Ollama will recognise all previously downloaded models without re-downloading:

# Rsync models to a backup or new machine
rsync -avz --progress ~/.ollama/models/ user@new-server:/home/user/.ollama/models/

What You've Built

At the end of this Ollama setup guide, you have a fully operational local LLM stack:

Ollama installed and running as a persistent service with GPU acceleration
Models pulled and tested from the CLI
GPU utilisation confirmed — you're getting hardware-accelerated inference
Custom Modelfile defining a named model with a tailored system prompt and tuned parameters
REST API and Python integration ready for application development
OpenAI-compatible endpoint for drop-in replacement in existing projects

From here, the natural next step is adding a chat UI. Open WebUI gives you a full ChatGPT-style frontend that connects directly to Ollama — our companion post walks through the exact setup: Ollama Setup Guide: Run Powerful Local LLMs on Your Own Machine.

Need This at Scale?

A single-machine Ollama setup covers a lot of ground. When you're ready to go further — multi-GPU nodes, load-balanced inference across a fleet, private model registries, integration with internal tooling, or compliance-grade infrastructure — the architecture gets more involved.

The Sysbrix team designs and deploys production-grade local AI infrastructure. If you're building something that needs to scale beyond a single server, we're happy to talk through the options.

Talk to Us About Enterprise LLM Infrastructure →
