
LiteLLM Setup Proxy: One Gateway to Rule Every LLM in Your Stack

Learn how to deploy LiteLLM as a self-hosted proxy that gives your entire team a single OpenAI-compatible endpoint for every LLM provider — with cost tracking, rate limiting, and model fallbacks built in.

The moment you start using more than one LLM provider, things get messy fast. Different SDKs, different auth patterns, different rate limits, no unified view of what anything costs. LiteLLM fixes this by sitting between your apps and every major LLM provider — OpenAI, Anthropic, Google Gemini, Mistral, Cohere, Azure, Bedrock, and local models via Ollama — behind a single OpenAI-compatible API. One endpoint, one API key format, full cost tracking, model fallbacks, and per-team budgets. This guide walks you through a complete LiteLLM proxy setup from scratch.


Prerequisites

  • A Linux server or local machine (Ubuntu 20.04+ recommended)
  • Docker Engine and Docker Compose v2 installed
  • API keys for at least one LLM provider (OpenAI, Anthropic, etc.) or a local Ollama instance
  • At least 512MB RAM free — LiteLLM is lightweight
  • Port 4000 available (or any custom port you prefer)
  • Basic familiarity with YAML config files

Confirm Docker is ready:

docker --version
docker compose version
# Verify your OpenAI key works before proxying it
curl https://api.openai.com/v1/models \
  -H "Authorization: Bearer $OPENAI_API_KEY" | jq '.data[0].id'

What Is LiteLLM and Why Run It as a Proxy?

LiteLLM has two modes: a Python library you import directly, and a proxy server you deploy and point your apps at. This guide focuses on the proxy — which is almost always the right choice for teams.

What the Proxy Gives You

  • Unified OpenAI-compatible API — any app or library using the OpenAI SDK can point at LiteLLM with zero code changes. Just change the base URL.
  • Provider abstraction — swap GPT-4 for Claude 3.5 Sonnet for Gemini Pro with a one-line config change. Your app code never changes.
  • Virtual API keys — issue keys to teams or apps. Revoke them without rotating provider credentials. Track spend per key.
  • Budget limits — set hard spend caps per key, per team, or globally. LiteLLM blocks requests when budgets are hit.
  • Model fallbacks — if GPT-4 rate-limits, automatically retry with Claude or a local model. Zero downtime from provider outages.
  • Load balancing — spread requests across multiple deployments of the same model (useful with Azure OpenAI regional endpoints).
  • Logging and observability — built-in cost tracking per model, with integrations for Langfuse, Helicone, and more.

The bottom line: one LiteLLM proxy instance lets your entire organization talk to every LLM through a single, controlled, observable gateway.
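The fallback behavior is configured server-side in the proxy's config file, but the core idea is simple enough to sketch client-side. A minimal illustration of the pattern — `call` here is a stand-in for any function that sends a request to the proxy, not a LiteLLM API:

```python
# Client-side sketch of the fallback idea LiteLLM implements server-side.
# `call` is any function that sends (model, prompt) to the proxy and
# raises on failure -- a stand-in for illustration, not a LiteLLM API.
def call_with_fallback(call, models, prompt):
    """Try each model in order; return the first successful response."""
    last_err = None
    for model in models:
        try:
            return call(model, prompt)
        except Exception as err:  # rate limit, outage, timeout...
            last_err = err
    raise RuntimeError(f"all models in {models} failed") from last_err
```

With the proxy, you don't write this loop yourself — the router runs it for you, with retries and cooldowns on top.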


Quick Start: Running LiteLLM Locally

Install and Run with pip (Fastest)

If you want to test LiteLLM before committing to a Docker deployment, run it directly with pip:

# Install LiteLLM with proxy extras
pip install 'litellm[proxy]'

# Run with an inline model config
# (export OPENAI_API_KEY first so the proxy can call the provider)
litellm --model gpt-4o --port 4000

# Or with multiple models via config file
litellm --config config.yml --port 4000

The proxy is now running at http://localhost:4000. Test it immediately:

curl http://localhost:4000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer sk-anything' \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Say hello in one sentence"}]
  }' | jq .choices[0].message.content

That sk-anything key works because authentication is disabled until you set a master key. You'll lock that down shortly.


Production Deployment with Docker Compose

The LiteLLM Config File

The config file is the heart of LiteLLM. It defines which models are available, how they map to providers, fallback chains, and global settings. Create config.yml:

# config.yml
model_list:

  # OpenAI models
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY

  - model_name: gpt-4o-mini
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY

  # Anthropic
  - model_name: claude-3-5-sonnet
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20241022
      api_key: os.environ/ANTHROPIC_API_KEY

  # Local model via Ollama
  - model_name: llama3.2
    litellm_params:
      model: ollama/llama3.2
      api_base: http://host.docker.internal:11434

  # Fallback group — try gpt-4o first, fall back to claude
  - model_name: best-available
    litellm_params:
      model: gpt-4o
      api_key: os.environ/OPENAI_API_KEY
    model_info:
      id: best-available-primary

router_settings:
  fallbacks:
    - {"best-available": ["claude-3-5-sonnet", "llama3.2"]}
  retry_after: 5
  num_retries: 3

litellm_settings:
  success_callback: []
  failure_callback: []
  request_timeout: 600
  set_verbose: false

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  database_url: os.environ/DATABASE_URL
  store_model_in_db: true

Docker Compose with PostgreSQL

LiteLLM uses a database to persist virtual keys, usage data, and budgets. PostgreSQL is the right choice for production:

# docker-compose.yml

services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    container_name: litellm
    restart: unless-stopped
    ports:
      - "4000:4000"
    volumes:
      - ./config.yml:/app/config.yaml:ro
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - LITELLM_MASTER_KEY=${LITELLM_MASTER_KEY}
      - DATABASE_URL=postgresql://litellm:${POSTGRES_PASSWORD}@postgres:5432/litellm
      - STORE_MODEL_IN_DB=true
    command: ["--config", "/app/config.yaml", "--port", "4000", "--num_workers", "4"]
    depends_on:
      postgres:
        condition: service_healthy
    networks:
      - litellm_net

  postgres:
    image: postgres:15-alpine
    container_name: litellm_db
    restart: unless-stopped
    environment:
      - POSTGRES_DB=litellm
      - POSTGRES_USER=litellm
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U litellm"]
      interval: 5s
      timeout: 5s
      retries: 5
    networks:
      - litellm_net

volumes:
  postgres_data:

networks:
  litellm_net:

Create your .env file with real values — never commit this:

# .env
OPENAI_API_KEY=sk-proj-your-openai-key
ANTHROPIC_API_KEY=sk-ant-your-anthropic-key
POSTGRES_PASSWORD=a-strong-db-password
LITELLM_MASTER_KEY=sk-litellm-your-master-key-here

# Generate a strong master key:
# openssl rand -hex 20 | sed 's/^/sk-litellm-/'

Start the stack:

docker compose up -d
docker compose logs -f litellm

Wait until the logs show the proxy listening on port 4000. Then verify the health endpoint:

curl http://localhost:4000/health/liveliness
# Returns a liveness confirmation when the proxy is up
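In deployment scripts it helps to block until the proxy is actually up rather than sleeping a fixed time. A stdlib-only sketch — the default URL matches the compose stack above; adjust if you changed the port:

```python
# Poll the proxy's liveness endpoint until it answers, with a timeout.
# A deployment-script sketch; URL matches the compose stack in this guide.
import time
import urllib.error
import urllib.request

def wait_for_proxy(url: str = "http://localhost:4000/health/liveliness",
                   timeout: float = 60.0, interval: float = 2.0) -> bool:
    """Return True once the endpoint answers 200, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # not up yet -- keep polling
        time.sleep(interval)
    return False
```

Call `wait_for_proxy()` after `docker compose up -d` and only proceed with key creation once it returns True.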

Managing Virtual Keys, Budgets, and Teams

Creating Virtual API Keys

The master key lets you create virtual keys for teams and applications. These virtual keys are what you hand out — your real provider API keys never leave the proxy:

# Create a key for a specific team with a monthly budget
curl -X POST http://localhost:4000/key/generate \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer sk-litellm-your-master-key-here' \
  -d '{
    "key_alias": "team-backend",
    "models": ["gpt-4o", "gpt-4o-mini", "claude-3-5-sonnet"],
    "max_budget": 50.00,
    "budget_duration": "1mo",
    "metadata": {"team": "backend", "project": "search-api"}
  }' | jq .key

The response is a virtual key like sk-litellm-xyz123.... That's what the backend team puts in their app. If they exceed $50 in a month, requests are automatically blocked until the budget resets. Your OpenAI bill doesn't accumulate unchecked.
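When onboarding several teams at once, scripting this beats hand-editing curl commands. A stdlib-only sketch using the same /key/generate fields as the curl example above — the helper names are our own, not LiteLLM's:

```python
# Script virtual-key creation against /key/generate. Helper names are
# our own; the request fields mirror the curl example above.
import json
import urllib.request

def key_payload(alias: str, models: list[str], max_budget: float,
                budget_duration: str = "1mo", **metadata) -> dict:
    """Build the JSON body for LiteLLM's /key/generate endpoint."""
    return {
        "key_alias": alias,
        "models": models,
        "max_budget": max_budget,
        "budget_duration": budget_duration,
        "metadata": metadata,
    }

def create_key(base_url: str, master_key: str, payload: dict) -> str:
    """POST the payload and return the new virtual key."""
    req = urllib.request.Request(
        f"{base_url}/key/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {master_key}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["key"]
```

Loop over a list of team definitions and you have repeatable key provisioning you can keep in version control.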

Restricting Models Per Key

# Create a restricted key — only cheap models, low budget
curl -X POST http://localhost:4000/key/generate \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer sk-litellm-your-master-key-here' \
  -d '{
    "key_alias": "intern-dev",
    "models": ["gpt-4o-mini", "llama3.2"],
    "max_budget": 5.00,
    "budget_duration": "1mo",
    "tpm_limit": 10000,
    "rpm_limit": 60
  }' | jq .key

Viewing Usage and Spend

# Get spend data for all keys
curl http://localhost:4000/key/info \
  -H 'Authorization: Bearer sk-litellm-your-master-key-here' | jq '.info[] | {alias: .key_alias, spend: .spend, budget: .max_budget}'

# Get model-level spend breakdown
curl http://localhost:4000/spend/logs \
  -H 'Authorization: Bearer sk-litellm-your-master-key-here' | jq '.[-5:]'

LiteLLM also ships with a built-in UI at http://localhost:4000/ui. Log in with your master key to get a dashboard showing spend by model, key, and team — useful for monthly cost reviews without writing queries.
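That /key/info data feeds naturally into a budget alert. A sketch that flags keys past a spend threshold, assuming the fields shown in the jq filter above (key_alias, spend, max_budget):

```python
# Flag virtual keys approaching their budget, from /key/info data.
# Field names match the jq filter above; threshold is a fraction of budget.
def keys_near_budget(key_infos: list[dict],
                     threshold: float = 0.8) -> list[str]:
    """Aliases of keys whose spend is at or above threshold * max_budget."""
    flagged = []
    for info in key_infos:
        budget = info.get("max_budget")
        if budget and info.get("spend", 0.0) >= threshold * budget:
            flagged.append(info.get("key_alias", "<no alias>"))
    return flagged
```

Run it on a cron and post the result to Slack, and teams hear about an exhausted budget before their requests start getting blocked.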


Connecting Apps to LiteLLM

Drop-in OpenAI SDK Replacement

Any app using the OpenAI Python SDK works with LiteLLM immediately — just change the base URL:

from openai import OpenAI

# Before: direct to OpenAI
# client = OpenAI(api_key="sk-proj-...")

# After: through LiteLLM proxy
client = OpenAI(
    base_url="http://localhost:4000/v1",
    api_key="sk-litellm-team-backend-key"
)

# Everything else stays exactly the same
response = client.chat.completions.create(
    model="gpt-4o",           # Or "claude-3-5-sonnet", "llama3.2"
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize this in bullet points: ..."}
    ],
    temperature=0.3,
    max_tokens=500
)

print(response.choices[0].message.content)

Connecting Dify, n8n, or Open WebUI

Any tool with an OpenAI-compatible endpoint setting works. In Dify: go to Settings → Model Provider → OpenAI-Compatible → Custom and set:

  • API Base URL: http://litellm:4000/v1 (or your server IP)
  • API Key: your virtual key
  • Model Name: gpt-4o, claude-3-5-sonnet, or whatever you configured

From that point, Dify routes every LLM call through LiteLLM — you get full cost tracking and can swap models without touching Dify's config.


Tips, Gotchas, and Troubleshooting

Requests Failing with 401 Unauthorized

Once a master key is set, every request needs a valid key — admin endpoints included. Confirm the key is in the Authorization: Bearer header, not a custom header:

# Correct format
curl http://localhost:4000/v1/models \
  -H 'Authorization: Bearer sk-litellm-your-master-key-here'

# Check that the key exists in the DB
curl 'http://localhost:4000/key/info?key=sk-litellm-team-key' \
  -H 'Authorization: Bearer sk-litellm-your-master-key-here'

Model Not Found Error

The model name in your API call must exactly match a model_name entry in config.yml. LiteLLM doesn't pass unknown model names through — it rejects them. List available models to confirm:

curl http://localhost:4000/v1/models \
  -H 'Authorization: Bearer sk-litellm-your-master-key-here' | jq '[.data[].id]'
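A client-side guard can fail fast with a helpful message instead. A sketch: fetch the id list from /v1/models once (as in the curl above), then validate before each call — a pure helper with our own naming:

```python
# Fail fast on typo'd model names, with nearest-match suggestions.
# `available` would come from GET /v1/models; here it's passed in.
import difflib

def resolve_model(requested: str, available: list[str]) -> str:
    """Return `requested` if the proxy knows it, else raise with hints."""
    if requested in available:
        return requested
    hints = difflib.get_close_matches(requested, available, n=3)
    raise ValueError(
        f"model {requested!r} is not in the proxy config; "
        f"close matches: {hints}")
```

A ValueError naming the three closest configured models is a much faster debugging loop than a generic 400 from the proxy.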

Ollama Not Reachable from Container

On Linux, host.docker.internal doesn't resolve by default. Either add extra_hosts: ["host.docker.internal:host-gateway"] to the litellm service in docker-compose.yml (Docker 20.10+), or find the Docker bridge IP and use it instead:

# Get Docker bridge IP
ip addr show docker0 | grep 'inet ' | awk '{print $2}' | cut -d/ -f1
# Usually 172.17.0.1

# Use in config.yml:
# api_base: http://172.17.0.1:11434

High Latency on First Request

LiteLLM loads model configs and validates provider connectivity on startup. If a provider is unreachable at startup (e.g., Ollama isn't running), it logs a warning but continues. First requests to a cold model may be slower as the connection is established. Check startup logs:

docker logs litellm 2>&1 | grep -E 'ERROR|WARNING|model'

# Run a health check per model
curl http://localhost:4000/health \
  -H 'Authorization: Bearer sk-litellm-your-master-key-here' | jq .

Updating LiteLLM

docker compose pull litellm
docker compose up -d litellm
docker compose logs -f litellm

Your PostgreSQL volume persists — all virtual keys, usage history, and budgets survive the update. LiteLLM runs database migrations automatically on startup.

Pro Tips

  • Set fallbacks on every production model — provider outages happen. A fallback chain of [gpt-4o, claude-3-5-sonnet, llama3.2] means your app keeps working even when OpenAI has an incident.
  • Use tpm_limit and rpm_limit on keys to enforce rate limits per team before they hit provider limits — it's a better signal than raw 429 errors from upstream.
  • Add Langfuse for full observability — set success_callback: ["langfuse"] in litellm_settings and add your Langfuse keys as environment variables. Every LLM call gets logged with input, output, latency, and cost.
  • Use model aliases for portability — name your models fast, smart, and local instead of provider-specific names. Swap the underlying model in config without touching any app code.
  • Put LiteLLM behind Traefik for HTTPS and domain routing — the same pattern used for any other service in your stack applies here. Treat it as a first-class internal API.
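The alias tip in practice looks like this — an example config fragment (the alias names and provider mappings here are illustrative; point them at whatever you actually run):

```yaml
# Portable aliases: apps only ever see "fast", "smart", "local"
model_list:
  - model_name: fast
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY

  - model_name: smart
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20241022
      api_key: os.environ/ANTHROPIC_API_KEY

  - model_name: local
    litellm_params:
      model: ollama/llama3.2
      api_base: http://172.17.0.1:11434
```

When a better "smart" model ships, you change one line here and every app picks it up on the next request.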

Wrapping Up

A complete LiteLLM proxy setup transforms how your team consumes LLMs. Instead of every developer managing their own API keys, every app hardcoding a specific provider, and no one knowing what anything costs — you get one controlled gateway, full cost visibility, automatic fallbacks, and the freedom to swap models without touching application code.

Start with the Docker Compose stack in this guide, wire up your first two providers, and issue virtual keys to your apps. Once the proxy is running and you can see spend data flowing in, add fallback chains and budget limits. The whole setup takes an afternoon and pays off every month when you actually know what your LLM spend looks like — and can do something about it.


Need an Enterprise-Grade LLM Gateway?

If you're rolling out LiteLLM across a larger team — with SSO, audit logging, multi-region failover, or integration into existing infrastructure — the sysbrix team can design and deploy it. We build AI infrastructure that's production-ready, not just proof-of-concept.

Talk to Us →