You don't need to send your prompts to OpenAI to get a polished chat interface. Open WebUI is a self-hosted, ChatGPT-style frontend that connects to local models via Ollama, remote APIs via LiteLLM, or any OpenAI-compatible endpoint. This guide walks you through deploying it with Docker Compose, wiring up your first local model, and configuring it for team use. As long as you stay on local models, your data never leaves your server.
Prerequisites
Before you start, make sure you have:
- A Linux server (Ubuntu 22.04/24.04 recommended) with Docker and Docker Compose v2 installed
- At least 4 CPU cores and 8GB RAM (16GB+ recommended for running larger models locally)
- Optional but recommended: an NVIDIA GPU with CUDA drivers for faster inference
- A domain or subdomain pointed at your server if you want HTTPS
- Basic familiarity with Docker and environment variables
Confirm Docker is ready:
docker --version
docker compose version
Check if you have GPU support available:
# Check NVIDIA GPU and drivers (optional)
nvidia-smi
# Check if Docker can see the GPU (requires the NVIDIA Container Toolkit,
# which injects nvidia-smi into the container)
docker run --rm --gpus all ubuntu nvidia-smi
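The CPU and RAM minimums listed above can also be checked from a script. The following is a minimal sketch, not an official tool: `check_resources` is a hypothetical helper name, and the memory threshold is an assumption set slightly below 8 GiB because the kernel reports a little less than the installed amount.

```shell
# Hypothetical helper: returns 0 if the given core count and memory (in kB)
# meet the minimums from the prerequisites list, non-zero otherwise.
check_resources() {
  cores=$1
  mem_kb=$2
  min_cores=4
  # Assumption: 7500000 kB, a little under 8 GiB, because MemTotal excludes
  # memory reserved by the kernel and firmware.
  min_mem_kb=7500000
  status=0
  if [ "$cores" -lt "$min_cores" ]; then
    echo "WARN: $cores CPU cores found, $min_cores recommended"
    status=1
  fi
  if [ "$mem_kb" -lt "$min_mem_kb" ]; then
    echo "WARN: $mem_kb kB RAM found, about 8 GB recommended"
    status=1
  fi
  return "$status"
}

# Usage on a Linux host (reads live values):
#   check_resources "$(nproc)" "$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)" && echo "OK"
```

A non-zero exit status only means the host is below the recommended size; small models may still run, just slowly.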
What Is Open WebUI and Why Self-Host It?
Open WebUI is an open-source web interface for interacting with large language models. It looks and feels like ChatGPT — threaded conversations, markdown rendering, code highlighting, file uploads, and RAG (Retrieval-Augmented Generation) for chatting with your documents. But unlike ChatGPT, everything runs on your hardware.
What You Get
- Local model support — run Llama, Mistral, Gemma, or any Ollama-compatible model entirely offline.
- Remote API support — connect to OpenAI, Anthropic, or a LiteLLM proxy for commercial models when needed.
- Multi-user support — create accounts, roles, and permissions for your team.
- RAG and document chat — upload PDFs, Word docs, or text files and ask questions about them.
- Customizable prompts and models — save system prompts, create model presets, and tune temperature and context.
- Full data privacy — prompts, documents, and conversation history never leave your infrastructure.
The tradeoff is hardware. Small models (3B–7B parameters) run fine on CPU. Larger models need GPU acceleration and more RAM. But for most internal use cases, a 7B or 8B model on a mid-range server is surprisingly capable.
Deploy Open WebUI with Docker Compose
The simplest setup runs Open WebUI alongside Ollama in a single Compose stack. Ollama handles model downloads and inference; Open WebUI provides the interface.
Create the Project Directory
sudo mkdir -p /opt/openwebui && cd /opt/openwebui
sudo chown -R $USER:$USER /opt/openwebui
The Docker Compose File
This stack includes Ollama and Open WebUI with persistent storage. For GPU support, uncomment the deploy section.
# The top-level "version" key is obsolete in Compose v2 and has been omitted.
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    volumes:
      - ./ollama:/root/.ollama
    ports:
      # Publishes the Ollama API on all host interfaces. Remove this mapping
      # or bind it to 127.0.0.1 if only Open WebUI needs to reach Ollama.
      - "11434:11434"
    # Uncomment for NVIDIA GPU support
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: all
    #           capabilities: [gpu]

  openwebui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: openwebui
    restart: unless-stopped
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      # Replace this placeholder with a long random value before first start
      - WEBUI_SECRET_KEY=your-secret-key-change-me
      - WEBUI_AUTH=True
      - WEBUI_NAME=Open WebUI
    volumes:
      - ./open-webui:/app/backend/data
    ports:
      - "3000:8080"
    depends_on:
      - ollama
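The WEBUI_SECRET_KEY value in the Compose file is a placeholder and should be replaced with a random string. One possible way to produce it is sketched here; `generate_secret` is a hypothetical helper name, and storing the value in a `.env` file next to the Compose file is an assumption about how you choose to manage secrets.

```shell
# Generate a 64-character hex secret. Uses openssl when it is installed and
# falls back to /dev/urandom with od otherwise.
generate_secret() {
  if command -v openssl >/dev/null 2>&1; then
    openssl rand -hex 32
  else
    head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n'
    echo
  fi
}

# Usage (run from /opt/openwebui; Docker Compose reads .env for interpolation):
#   echo "WEBUI_SECRET_KEY=$(generate_secret)" >> .env
#   chmod 600 .env
# Then reference it in the Compose file instead of the literal placeholder:
#   - WEBUI_SECRET_KEY=${WEBUI_SECRET_KEY}
```

Keeping the key out of the Compose file makes it easier to commit the file to version control without leaking the secret.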
Start the stack:
docker compose up -d
docker compose logs -f openwebui
Wait for the logs to show Application startup complete, then visit http://your-server-ip:3000. The first user to register becomes the admin automatically.
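Instead of watching the logs by hand, a script can poll the service until it answers. This is a rough sketch: `wait_for_url` is a hypothetical helper, and the `/health` path is assumed to be reachable on the mapped port (it is the path the container's own health check is understood to use).

```shell
# Poll a URL until it returns a successful HTTP status.
# $1 = URL, $2 = maximum attempts, $3 = seconds to sleep between attempts
wait_for_url() {
  url=$1
  attempts=$2
  delay=$3
  i=1
  while [ "$i" -le "$attempts" ]; do
    if curl --silent --fail --max-time 5 --output /dev/null "$url"; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "gave up after $attempts attempt(s)" >&2
  return 1
}

# Usage, assuming the 3000:8080 port mapping from the Compose file:
#   wait_for_url http://localhost:3000/health 30 2
```

The first start can take a while because the image initializes its data directory, so a generous attempt count is reasonable.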
Pull Your First Model
Ollama downloads models on demand. Start with a small, capable model like Llama 3.2 or Mistral:
# Pull a lightweight but capable model
docker exec -it ollama ollama pull llama3.2
# Or pull a larger model if you have the RAM/GPU
docker exec -it ollama ollama pull mistral
# List available models
docker exec -it ollama ollama list
Return to the Open WebUI interface, click the model selector at the top, and choose your downloaded model. You're now chatting with a local LLM.
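If the model does not show up in the selector, it can help to query Ollama directly through its HTTP API and rule out the model itself. The sketch below assumes the 11434 port mapping from the Compose file; `ollama_payload` and `ask_ollama` are hypothetical helper names, and the payload builder does no JSON escaping.

```shell
# Build the JSON body for Ollama's /api/generate endpoint.
# $1 = model name, $2 = prompt (assumed to contain no double quotes or
# backslashes, because this sketch does not JSON-escape its input).
ollama_payload() {
  printf '{"model":"%s","prompt":"%s","stream":false}' "$1" "$2"
}

# Send one non-streaming prompt to the Ollama port published on the host.
ask_ollama() {
  curl --silent http://localhost:11434/api/generate -d "$(ollama_payload "$1" "$2")"
}

# Usage:
#   ask_ollama llama3.2 "Say hello in five words."
```

With `"stream": false`, Ollama returns a single JSON object whose `response` field holds the generated text.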
Connect Remote APIs for Hybrid Use
Local models are great for privacy and offline work, but sometimes you need GPT-4o or Claude 3.5 Sonnet. Open WebUI supports multiple backends simultaneously.
Add an OpenAI API Key
Go to Settings → Admin Settings → Connections in the Open WebUI interface. Add your OpenAI API key under the OpenAI section. Open WebUI will then list OpenAI models alongside your local ones.
Connect via LiteLLM Proxy
If you're already running a LiteLLM proxy (recommended for teams), point Open WebUI at it instead of managing individual provider keys:
# In Open WebUI Admin Settings → Connections → OpenAI API
# Set the API Base URL to your LiteLLM proxy:
http://your-litellm-host:4000/v1
# Use any virtual key from LiteLLM as the API key
sk-lit…-your-virtual-key
This gives you the best of both worlds: local models for sensitive work, commercial models for complex tasks, and unified cost tracking through LiteLLM.
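Before entering the proxy URL in Open WebUI, you may want to confirm that the endpoint and key work from the server itself. A minimal sketch, assuming the proxy exposes the standard OpenAI-compatible `/models` route; `models_url`, `list_models`, and the `LITELLM_KEY` variable are hypothetical names introduced for illustration.

```shell
# Build the model-listing URL for an OpenAI-compatible base URL.
# Trailing slashes are stripped so ".../v1" and ".../v1/" behave the same.
models_url() {
  base=$1
  while [ "${base%/}" != "$base" ]; do
    base=${base%/}
  done
  printf '%s/models\n' "$base"
}

# List the models the endpoint exposes for a given key.
# $1 = base URL, $2 = API key
list_models() {
  curl --silent --fail -H "Authorization: Bearer $2" "$(models_url "$1")"
}

# Usage (host and key are placeholders, as in the text above):
#   list_models http://your-litellm-host:4000/v1 "$LITELLM_KEY"
```

If this call fails, Open WebUI will not be able to list the proxy's models either, so it is a quick way to separate network or key problems from UI configuration problems.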
Enable Multi-User Mode and Permissions
By default, the first registered user becomes admin. After that, you control who can access what.
User Roles
- Admin — full access to settings, model management, user administration, and system configuration.
- User — can chat with allowed models, upload documents, and manage their own conversations.
- Pending — registered but awaiting admin approval (if enabled).
Restrict Model Access
In Admin Settings → Models, you can hide specific models from non-admin users. This is useful if you want only local models visible to most users while reserving GPT-4o for a specific team.
Enable Sign-Up Controls
In Admin Settings → General, toggle:
- Enable New User Sign-Ups — turn off to make the instance invite-only.
- Enable User Approval — require admin approval before new accounts are active.
For a team deployment, disable open sign-ups and create accounts manually. This prevents unauthorized access to your GPU resources and API keys.
Set Up RAG for Document Chat
One of Open WebUI's most powerful features is RAG: you upload documents and the model answers questions based on their content. This runs entirely locally — your documents never leave the server.
Upload Documents
In the chat interface, click the + button next to the input field and upload PDFs, Word docs, or text files. Open WebUI automatically chunks, embeds, and indexes them.
Configure the Embedding Model
By default, Open WebUI computes embeddings with a small built-in SentenceTransformers model that runs inside its own container. To run embeddings through Ollama instead, pull a dedicated embedding model and select it in Admin Settings → Documents:
# Pull a small embedding model in Ollama
docker exec -it ollama ollama pull nomic-embed-text
# In Open WebUI Admin Settings → Documents, set:
# Embedding Model Engine: Ollama
# Embedding Model: nomic-embed-text:latest
# Chunk Size: 1000
# Chunk Overlap: 100
Smaller embedding models are faster and use less memory. nomic-embed-text is a solid choice for most document sets.
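To get a feel for what the chunk settings mean, the arithmetic can be sketched as a plain sliding window. This is an approximation under stated assumptions: Open WebUI's real splitter tries to break on text boundaries, so actual counts will differ, and `chunk_count` is a hypothetical helper that requires the overlap to be smaller than the chunk size.

```shell
# Estimate how many chunks a document yields under a plain sliding window.
# $1 = document length, $2 = chunk size, $3 = chunk overlap (all in the same unit)
chunk_count() {
  length=$1
  size=$2
  overlap=$3
  # Each new chunk advances by the chunk size minus the overlap.
  stride=$((size - overlap))
  if [ "$length" -le "$size" ]; then
    echo 1
    return 0
  fi
  # Ceiling of (length - overlap) / stride using integer arithmetic.
  echo $(( (length - overlap + stride - 1) / stride ))
}

chunk_count 10000 1000 100   # prints 11
```

With a chunk size of 1000 and an overlap of 100, each chunk advances 900 units, so a 10000-unit document needs 11 chunks to cover it. Larger overlap means more chunks to embed and store.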
Query Your Documents
Once uploaded, ask questions naturally:
- "What does our SLA say about incident response times?"
- "Summarize the Q3 budget proposal."
- "Find all mentions of GDPR compliance in the policy document."
Open WebUI retrieves relevant chunks and feeds them to the LLM as context, so answers draw on your actual documents rather than only on the model's training data.
Tips, Gotchas, and Troubleshooting
Model Responses Are Slow
CPU inference slows down noticeably once models grow past roughly 3B parameters. If responses take 30+ seconds:
- Switch to a smaller model (for example llama3.2:1b instead of llama3.2, whose default tag is the 3B variant).
- Enable GPU passthrough in the Compose file (uncomment the deploy section).
- Reduce the context window in the model settings. Less context means faster generation.
Ollama Container Won't Start
Check that the ollama data directory has correct permissions:
sudo chown -R $USER:$USER /opt/openwebui/ollama
docker compose restart ollama
Open WebUI Can't Reach Ollama
Both containers must be on the same Docker network (they are in the Compose file above). If you separated them, verify connectivity:
# From the openwebui container, test Ollama connectivity
docker exec -it openwebui curl http://ollama:11434/api/tags
GPU Not Detected
Ensure NVIDIA Container Toolkit is installed:
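# Install NVIDIA Container Toolkit
# apt-key is deprecated, so the signing key goes into a dedicated keyring
# and the repository entry is pinned to it with signed-by.
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list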
sudo apt update
sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Out of Memory Errors
Large models need significant RAM. Monitor usage:
docker stats ollama --no-stream
# Or watch in real time
watch -n 2 docker stats ollama --no-stream
If you're hitting memory limits, use a more heavily quantized variant of the model (tags whose name includes a suffix such as q4_0 or q5_0), which trades minor quality loss for major memory savings.
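The memory savings from quantization follow from simple arithmetic: weight size is roughly the parameter count times the bits per weight, divided by 8. The sketch below is a back-of-the-envelope estimate only; `weights_gb` is a hypothetical helper, and it ignores the KV cache, runtime overhead, and the extra scale data that real quantization formats store.

```shell
# Rough size of model weights in GB: parameters (in billions) x bits per weight / 8.
# Integer math in tenths of a GB keeps this compatible with plain POSIX sh.
weights_gb() {
  params_b=$1
  bits=$2
  tenths=$(( params_b * bits * 10 / 8 ))
  echo "$((tenths / 10)).$((tenths % 10))"
}

weights_gb 7 16   # prints 14.0
weights_gb 7 4    # prints 3.5
```

By this estimate a 7B model drops from about 14 GB at 16 bits to about 3.5 GB at 4 bits, which is why a quantized variant can fit on a server where the full-precision model cannot. Leave headroom beyond these numbers for context and the rest of the system.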
Pro Tips
- Start with small models: llama3.2:3b or phi3:mini are surprisingly capable for most tasks and run fast on CPU.
- Use persistent volumes: the ./ollama and ./open-webui directories in the Compose file ensure models and data survive container restarts.
- Enable HTTPS: put Open WebUI behind a reverse proxy (Traefik, Caddy, or Nginx) with TLS. The interface handles sensitive prompts and documents.
- Back up your data directory: conversation history, uploaded documents, and user accounts live in ./open-webui. Include it in your backup strategy.
- Monitor GPU utilization: if you have a GPU, run nvidia-smi dmon to watch utilization and temperature during heavy use.
- Test RAG before trusting it: verify that document answers are actually grounded in the text, not hallucinated. RAG reduces but doesn't eliminate hallucinations.
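The backup tip above can be turned into a small script. This is one possible sketch, not a complete backup strategy: `backup_dir` is a hypothetical helper, `/opt/backups` is an assumed destination, and stopping the container first is a precaution so the data files are not copied mid-write.

```shell
# Archive a data directory into a timestamped tarball and print its path.
# $1 = directory to back up, $2 = destination directory
backup_dir() {
  src=$1
  dest=$2
  stamp=$(date +%Y%m%d-%H%M%S)
  archive="$dest/$(basename "$src")-$stamp.tar.gz"
  mkdir -p "$dest" || return 1
  # -C keeps paths inside the archive relative to the parent of $src.
  tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" || return 1
  echo "$archive"
}

# Usage, assuming the layout from this guide (run from /opt/openwebui):
#   docker compose stop openwebui
#   backup_dir /opt/openwebui/open-webui /opt/backups
#   docker compose start openwebui
```

Files in the data directory may be owned by root because the container writes them, so the script might need to run with sudo. Copy the resulting archive off the server for it to count as a real backup.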
Wrapping Up
Setting up Open WebUI isn't just about installing a chat interface; it's about owning your AI stack. With Open WebUI and Ollama, you get a private, self-hosted alternative to ChatGPT that runs on your hardware, respects your data, and scales from a single user to a team.
The setup is straightforward: Docker Compose with Ollama and Open WebUI, pull a model, and start chatting. From there, add remote APIs for hybrid use, enable multi-user mode for teams, and configure RAG to chat with your documents. The whole stack deploys in under 15 minutes and improves with every model release.
Start with the Compose file in this guide, pull Llama 3.2, and explore the interface. Once you see local AI responding to your prompts with no round trip to a third-party API and zero API bills, you'll understand why self-hosted LLMs are becoming the default for privacy-conscious teams.