
Production Guide: Deploy Grafana Loki with Docker Compose + Traefik + S3 on Ubuntu

A production-focused Loki deployment with TLS ingress, durable object storage, retention controls, and operational validation.

Running centralized logs in production sounds simple until incidents hit: one node fills disk, noisy services bury critical errors, and security teams need retention guarantees you cannot prove. This guide shows a pragmatic deployment pattern for Grafana Loki on Ubuntu using Docker Compose, Traefik, and S3-compatible object storage. The goal is not a demo stack; it is a maintainable, recoverable logging platform with clear operational checks.

Scenario: you operate 10–80 services across VMs and containers, and engineering needs fast troubleshooting while compliance requires predictable retention. We will deploy Loki in single-binary mode with boltdb-shipper and object storage, put it behind Traefik with TLS, lock down secrets handling, and validate ingestion/query behavior before go-live.

Architecture and flow overview

This architecture separates control and durability concerns while staying lightweight:

  • Promtail/agents on workloads ship logs to Loki’s push API.
  • Loki handles indexing and query execution.
  • S3-compatible bucket stores index/chunks for durable retention.
  • Local persistent volume stores write-ahead and cache state.
  • Traefik terminates TLS and exposes a stable HTTPS endpoint.

Flow: application logs → Promtail labels/ships → Loki receives and compacts → chunks/index to S3 → engineers query via Grafana or logcli. Keeping object storage external to container lifecycle makes upgrades and node replacement safer.
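For reference, the agent side of this flow can be as small as a Promtail config pointing at the push API. This is a minimal sketch: the URL assumes the logs.example.com domain used later in this guide, and the job/env labels are illustrative.

```yaml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  # Push through the Traefik TLS endpoint configured later in this guide.
  - url: https://logs.example.com/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets: [localhost]
        labels:
          job: varlogs
          env: prod
          __path__: /var/log/*.log
```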

Prerequisites

  • Ubuntu 22.04/24.04 host with 4 vCPU, 8 GB RAM minimum.
  • Docker Engine + Docker Compose plugin installed.
  • A domain like logs.example.com pointed to your Traefik host.
  • Traefik already running on a shared proxy Docker network.
  • S3-compatible bucket (AWS S3, MinIO, or Wasabi) and access credentials.
  • NTP time sync enabled (clock skew causes painful query behavior).

Update the host and confirm the Docker toolchain is in place:

sudo apt update && sudo apt -y upgrade
sudo apt -y install ca-certificates curl gnupg lsb-release jq unzip
docker --version
docker compose version


Step-by-step deployment

1) Create project layout with least-privilege permissions

We isolate configuration and runtime data so backups and rotations are predictable. Keep environment variables outside compose YAML.

sudo mkdir -p /opt/loki/{config,data,backups}
sudo chown -R $USER:$USER /opt/loki
cd /opt/loki
umask 027
touch .env
chmod 600 .env


2) Define secrets in an environment file

Never hardcode keys in docker-compose.yml or scripts. Rotate credentials through your secret manager and update this file atomically.

cat > /opt/loki/.env <<'EOF'
LOKI_DOMAIN=logs.example.com
S3_ENDPOINT=s3.us-east-1.amazonaws.com
S3_REGION=us-east-1
S3_BUCKET=prod-loki-logs
S3_ACCESS_KEY=REPLACE_ME
S3_SECRET_KEY=REPLACE_ME
TRAEFIK_CERTRESOLVER=letsencrypt
EOF
chmod 600 /opt/loki/.env
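A quick pre-flight check catches placeholder values before they ever reach Loki. The helper below is hypothetical (not part of the stack); the variable names and the REPLACE_ME sentinel match the template above.

```shell
#!/bin/sh
# Hypothetical pre-flight check: fail if any required variable is missing
# or still set to the REPLACE_ME placeholder from the template above.
check_env() {
  file="$1"
  for var in LOKI_DOMAIN S3_ENDPOINT S3_REGION S3_BUCKET S3_ACCESS_KEY S3_SECRET_KEY; do
    val="$(grep -E "^${var}=" "$file" | head -n1 | cut -d= -f2-)"
    if [ -z "$val" ] || [ "$val" = "REPLACE_ME" ]; then
      echo "FAIL: $var"
      return 1
    fi
  done
  echo "env ok"
}

# Demo against a throwaway file; point at /opt/loki/.env in real use.
demo="$(mktemp)"
printf '%s\n' \
  'LOKI_DOMAIN=logs.example.com' \
  'S3_ENDPOINT=s3.us-east-1.amazonaws.com' \
  'S3_REGION=us-east-1' \
  'S3_BUCKET=prod-loki-logs' \
  'S3_ACCESS_KEY=AKIAEXAMPLE' \
  'S3_SECRET_KEY=examplesecret' > "$demo"
check_env "$demo"
```

Wire `check_env /opt/loki/.env` into your deploy hook; a non-zero exit blocks the rollout.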


For MinIO or other S3-compatible APIs, keep endpoint explicit (for example https://minio.internal:9000) and confirm TLS trust chain from the Loki host.

3) Write production Loki configuration

This config keeps ingestion simple while using object storage for durable chunks/index. Retention is set explicitly so storage growth remains predictable. Save it as /opt/loki/config/loki.yaml (the path the Compose file mounts). Because the file references ${VAR} placeholders from the .env file, Loki must be started with -config.expand-env=true so they are expanded at runtime.

auth_enabled: false

server:
  http_listen_port: 3100
  log_level: info

common:
  path_prefix: /loki
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2024-01-01
      store: boltdb-shipper
      object_store: aws
      schema: v12
      index:
        prefix: loki_index_
        period: 24h

storage_config:
  aws:
    endpoint: ${S3_ENDPOINT}
    region: ${S3_REGION}
    bucketnames: ${S3_BUCKET}
    access_key_id: ${S3_ACCESS_KEY}
    secret_access_key: ${S3_SECRET_KEY}
    s3forcepathstyle: false
  boltdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/boltdb-cache
    cache_ttl: 24h

limits_config:
  retention_period: 30d
  max_query_lookback: 30d
  ingestion_rate_mb: 8
  ingestion_burst_size_mb: 16
  max_query_parallelism: 16

compactor:
  working_directory: /loki/compactor
  compaction_interval: 10m
  retention_enabled: true
  delete_request_store: s3


4) Create Docker Compose stack with Traefik labels

Pin image versions in production to avoid surprise changes. Save this as /opt/loki/docker-compose.yml, attach it to the existing external Traefik network, and expose the service only through the reverse proxy. Note that with auth_enabled: false Loki itself performs no authentication, so in production put a Traefik basic-auth or forward-auth middleware (or an IP allow-list) in front of this router.

services:
  loki:
    image: grafana/loki:3.0.0
    container_name: loki
    restart: unless-stopped
    env_file:
      - /opt/loki/.env
    command: -config.file=/etc/loki/loki.yaml -config.expand-env=true
    volumes:
      - /opt/loki/config/loki.yaml:/etc/loki/loki.yaml:ro
      - /opt/loki/data:/loki
    healthcheck:
      test: ["CMD", "wget", "-qO-", "http://127.0.0.1:3100/ready"]
      interval: 30s
      timeout: 5s
      retries: 5
    labels:
      - traefik.enable=true
      - traefik.docker.network=proxy
      - traefik.http.routers.loki.rule=Host(`${LOKI_DOMAIN}`)
      - traefik.http.routers.loki.entrypoints=websecure
      - traefik.http.routers.loki.tls=true
      - traefik.http.routers.loki.tls.certresolver=${TRAEFIK_CERTRESOLVER}
      - traefik.http.services.loki.loadbalancer.server.port=3100
      - traefik.http.middlewares.loki-sec.headers.browserXssFilter=true
      - traefik.http.middlewares.loki-sec.headers.contentTypeNosniff=true
      - traefik.http.routers.loki.middlewares=loki-sec
    networks:
      - proxy

networks:
  proxy:
    external: true


5) Start, validate health, and smoke test writes

Bring up the service, confirm readiness, then push a known line and query it. This closes the loop from ingest to retrieval before any dashboard work.

cp /opt/loki/config/loki.yaml /opt/loki/backups/loki.yaml.$(date +%F-%H%M%S)
docker compose -f /opt/loki/docker-compose.yml config -q
docker compose -f /opt/loki/docker-compose.yml up -d
docker ps --filter name=loki
curl -fsS http://127.0.0.1:3100/ready
curl -fsS https://logs.example.com/ready


ts_ns=$(date +%s%N)
curl -sS -H "Content-Type: application/json" \
  -X POST "http://127.0.0.1:3100/loki/api/v1/push" \
  --data-raw "{\"streams\":[{\"stream\":{\"app\":\"smoke\",\"env\":\"prod\"},\"values\":[[\"$ts_ns\",\"loki smoke test ok\"]]}]}"

start=$(date -u -d '5 minutes ago' +%s)000000000
end=$(date -u +%s)000000000
curl -G -s "http://127.0.0.1:3100/loki/api/v1/query_range" \
  --data-urlencode 'query={app="smoke",env="prod"}' \
  --data-urlencode "start=$start" \
  --data-urlencode "end=$end" | jq .status
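The epoch arithmetic above is easy to fumble. This small helper (a sketch assuming GNU date, as shipped with Ubuntu) builds the range and sanity-checks it before querying:

```shell
#!/bin/sh
# Build a [start, end] nanosecond range for query_range (GNU date assumed).
range_ns() {
  minutes_back="$1"
  start="$(date -u -d "${minutes_back} minutes ago" +%s)000000000"
  end="$(date -u +%s)000000000"
  echo "$start $end"
}

set -- $(range_ns 5)
start="$1"; end="$2"
# Loki expects nanosecond epochs: 19 digits today, and start must precede end.
if [ "${#start}" -eq 19 ] && [ "${#end}" -eq 19 ] && [ "$start" -lt "$end" ]; then
  echo "range ok"
fi
```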


Configuration and secrets handling best practices

First, keep your S3 key policy minimal: only list/get/put/delete on the single Loki bucket path. Second, rotate access keys quarterly or when team composition changes. Third, do not mix Loki and backup artifacts in one bucket prefix without lifecycle rules.
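As a sketch, a least-privilege policy for the bucket named in the env file might look like the following (AWS IAM syntax; the ARNs are illustrative and should match your bucket):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::prod-loki-logs"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::prod-loki-logs/*"
    }
  ]
}
```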

Enable lifecycle policies at the bucket layer aligned with retention_period. If your compliance policy says 30 days, enforce 35 days at storage for safety, then periodically audit effective object age. For regulated environments, archive audit logs for policy changes in the object store itself.
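A lifecycle rule implementing the 35-day safety margin could be sketched like this (AWS S3 syntax; apply it with `aws s3api put-bucket-lifecycle-configuration`, and adjust the prefix if you partition the bucket):

```json
{
  "Rules": [
    {
      "ID": "loki-expire-35d",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Expiration": { "Days": 35 },
      "AbortIncompleteMultipartUpload": { "DaysAfterInitiation": 7 }
    }
  ]
}
```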

On host security: restrict shell access, keep Docker socket access limited, and treat compose files as controlled configuration. Where possible, integrate with external secret stores (Vault, SOPS, or cloud secret managers) and render .env during deploy from CI/CD with short-lived credentials.

Verification checklist

  • Service health: /ready returns HTTP 200 locally and via Traefik route.
  • TLS: certificate chain is valid and auto-renew policy is tested.
  • Object writes: new chunk/index objects appear in bucket within minutes.
  • Query latency: common queries return under expected SLO during peak periods.
  • Retention: old data ages out as expected in both Loki and bucket lifecycle.
  • Backup/recovery: configuration snapshots and restore procedure are documented.

Common issues and fixes

Loki starts but no logs arrive

Usually this is an agent-side label or endpoint mismatch. Verify the Promtail target URL and check tenant/auth assumptions. Start with a direct local push test to isolate whether the ingestion path is alive.

S3 errors during compaction

Look for permission gaps (missing delete/list) or region mismatch. Also confirm endpoint format: many S3-compatible providers require path-style addressing or custom CA bundles.
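If your provider requires path-style addressing (common with MinIO), flip the flag in storage_config — shown here as an isolated fragment:

```yaml
storage_config:
  aws:
    # Path-style URLs (endpoint/bucket/key) instead of virtual-hosted style.
    s3forcepathstyle: true
```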

Traefik route returns 404/502

Common causes include wrong Docker network attachment, bad router rule host, or container healthcheck failing. Inspect Traefik dashboard and ensure traefik.docker.network=proxy matches your runtime network name.

Queries time out on large ranges

Tune query parallelism and narrow label selectors. Avoid high-cardinality labels (request IDs, UUIDs) in stream labels; keep those in log body. Review dashboard queries for accidental broad scans.
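As an illustration, this Promtail sketch parses a request ID without promoting it to a stream label — the value stays in the log body, where it remains queryable with a line filter such as `|= "req-123"` (label and field names here are illustrative):

```yaml
scrape_configs:
  - job_name: app
    static_configs:
      - targets: [localhost]
        labels:
          # Bounded, low-cardinality stream labels only.
          job: app
          env: prod
          __path__: /var/log/app/*.log
    pipeline_stages:
      # Extract request_id for later stages, but deliberately omit a labels
      # stage for it: high-cardinality values belong in the log line, not the index.
      - json:
          expressions:
            request_id: request_id
```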

Disk usage grows unexpectedly on host

Even with S3 storage, local cache/WAL can grow if compactor falls behind or cleanup jobs are blocked. Check compactor logs, verify retention is enabled, and ensure data volume has alerting thresholds.

FAQ

Do I need microservices mode for production?

Not always. For small-to-mid workloads, single-binary Loki with durable storage and clear limits is often enough. Move to distributed mode when ingestion/query scale or HA requirements exceed what a single node can provide.

Can I run Loki without S3-compatible storage?

You can, but durability and long-term retention become fragile on node failures. Object storage is strongly recommended for production resilience and maintenance flexibility.

How should I choose retention settings?

Start from compliance and incident-response needs, then map to storage cost. Many teams keep 30 days hot logs and export longer-term archives separately for legal hold workflows.

What labels should I avoid?

Avoid high-cardinality labels such as user IDs, session IDs, and request UUIDs. Keep labels bounded (service, environment, region, cluster) so index cardinality remains healthy.

How do I upgrade Loki safely?

Pin the current version, snapshot configuration, read release notes for schema/storage changes, then roll in staging with replayed query smoke tests before production rollout.

How can I validate disaster recovery quickly?

Rebuild on a fresh host from versioned compose/config, restore environment secrets securely, and run the push/query smoke tests. Document measured RTO/RPO after each drill.

Talk to us

If you want this implemented with hardened defaults, observability, and tested recovery playbooks, our team can help.

Contact Us
