Running centralized logs in production sounds simple until incidents hit: one node fills disk, noisy services bury critical errors, and security teams need retention guarantees you cannot prove. This guide shows a pragmatic deployment pattern for Grafana Loki on Ubuntu using Docker Compose, Traefik, and S3-compatible object storage. The goal is not a demo stack; it is a maintainable, recoverable logging platform with clear operational checks.
Scenario: you operate 10–80 services across VMs and containers, and engineering needs fast troubleshooting while compliance requires predictable retention. We will deploy Loki in single-binary mode with boltdb-shipper and object storage, put it behind Traefik with TLS, lock down secrets handling, and validate ingestion/query behavior before go-live.
Architecture and flow overview
This architecture separates control and durability concerns while staying lightweight:
- Promtail/agents on workloads ship logs to Loki’s push API.
- Loki handles indexing and query execution.
- S3-compatible bucket stores index/chunks for durable retention.
- Local persistent volume stores write-ahead and cache state.
- Traefik terminates TLS and exposes a stable HTTPS endpoint.
Flow: application logs → Promtail labels/ships → Loki receives and compacts → chunks/index to S3 → engineers query via Grafana or logcli. Keeping object storage external to container lifecycle makes upgrades and node replacement safer.
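The deployment steps below focus on the Loki side, so here is a minimal sketch of the agent side for context. The push URL assumes the Traefik hostname used throughout this guide; the scrape path, job name, and labels are illustrative and should be adapted to your workloads.

```shell
# Minimal Promtail config sketch. The clients.url assumes the Traefik
# endpoint from this guide; paths and labels below are examples only.
PROMTAIL_CONFIG="${PROMTAIL_CONFIG:-/tmp/promtail-config.yaml}"
cat > "$PROMTAIL_CONFIG" <<'EOF'
server:
  http_listen_port: 9080
positions:
  filename: /tmp/positions.yaml
clients:
  - url: https://logs.example.com/loki/api/v1/push
scrape_configs:
  - job_name: system
    static_configs:
      - targets: [localhost]
        labels:
          job: varlogs
          env: prod
          __path__: /var/log/*.log
EOF
echo "wrote $PROMTAIL_CONFIG"
```

Keep the label set small and bounded here; per-request identifiers belong in the log line, not in stream labels.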
Prerequisites
- Ubuntu 22.04/24.04 host with 4 vCPU, 8 GB RAM minimum.
- Docker Engine + Docker Compose plugin installed.
- A domain like logs.example.com pointed to your Traefik host.
- Traefik already running on a shared proxy Docker network.
- S3-compatible bucket (AWS S3, MinIO, or Wasabi) and access credentials.
- NTP time sync enabled (clock skew causes painful query behavior).
sudo apt update && sudo apt -y upgrade
sudo apt -y install ca-certificates curl gnupg lsb-release jq unzip
docker --version
docker compose version
Step-by-step deployment
1) Create project layout with least-privilege permissions
We isolate configuration and runtime data so backups and rotations are predictable. Keep environment variables outside compose YAML.
sudo mkdir -p /opt/loki/{config,data,backups}
sudo chown -R $USER:$USER /opt/loki
cd /opt/loki
umask 027
touch .env
chmod 600 .env
2) Define secrets in an environment file
Never hardcode keys in docker-compose.yml or scripts. Rotate credentials through your secret manager and update this file atomically.
cat > /opt/loki/.env <<'EOF'
LOKI_DOMAIN=logs.example.com
S3_ENDPOINT=s3.us-east-1.amazonaws.com
S3_REGION=us-east-1
S3_BUCKET=prod-loki-logs
S3_ACCESS_KEY=REPLACE_ME
S3_SECRET_KEY=REPLACE_ME
TRAEFIK_CERTRESOLVER=letsencrypt
EOF
chmod 600 /opt/loki/.env
For MinIO or other S3-compatible APIs, keep endpoint explicit (for example https://minio.internal:9000) and confirm TLS trust chain from the Loki host.
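A startup failure caused by placeholder credentials can look like an S3 or TLS problem and waste debugging time. A small pre-deploy guard, sketched here as a shell function (the function name is ours, and the path matches the layout above):

```shell
# Fail fast if an env file still contains REPLACE_ME placeholder values.
check_env_placeholders() {
  # usage: check_env_placeholders <env-file>
  if grep -q 'REPLACE_ME' "$1" 2>/dev/null; then
    echo "ERROR: $1 still contains placeholder credentials" >&2
    return 1
  fi
  echo "OK: no placeholders in $1"
}

# Check the env file created in the previous step.
check_env_placeholders /opt/loki/.env || true
```

Wiring this into your deploy script (without the trailing `|| true`) stops a rollout before Loki ever starts with bad keys.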
3) Write production Loki configuration
This config keeps ingestion simple while using object storage for durable chunks/index. Retention is set explicitly so storage growth remains predictable.
auth_enabled: false

server:
  http_listen_port: 3100
  log_level: info

common:
  path_prefix: /loki
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: "2024-01-01"
      store: boltdb-shipper
      object_store: aws
      # boltdb-shipper supports up to schema v12; v13 requires the tsdb store.
      schema: v12
      index:
        prefix: loki_index_
        period: 24h

storage_config:
  aws:
    # URL-encode the keys if they contain special characters.
    s3: s3://${S3_ACCESS_KEY}:${S3_SECRET_KEY}@${S3_ENDPOINT}/${S3_BUCKET}
    s3forcepathstyle: false
    region: ${S3_REGION}
  boltdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/boltdb-cache
    cache_ttl: 24h

limits_config:
  retention_period: 30d
  max_query_lookback: 30d
  ingestion_rate_mb: 8
  ingestion_burst_size_mb: 16
  max_query_parallelism: 16
  # Required on Loki 3.x when staying on boltdb-shipper: structured metadata
  # needs tsdb + schema v13, so it must be disabled here.
  allow_structured_metadata: false

compactor:
  working_directory: /opt_placeholder_never_used
4) Create Docker Compose stack with Traefik labels
Pin image versions in production to avoid surprise changes. Attach to an existing external Traefik network and expose only through the reverse proxy.
services:
  loki:
    image: grafana/loki:3.0.0
    container_name: loki
    restart: unless-stopped
    env_file:
      - /opt/loki/.env
    # -config.expand-env is required so ${S3_*} references in loki.yaml resolve.
    command: -config.file=/etc/loki/loki.yaml -config.expand-env=true
    volumes:
      - /opt/loki/config/loki.yaml:/etc/loki/loki.yaml:ro
      - /opt/loki/data:/loki
    healthcheck:
      test: ["CMD", "wget", "-qO-", "http://127.0.0.1:3100/ready"]
      interval: 30s
      timeout: 5s
      retries: 5
    labels:
      - traefik.enable=true
      - traefik.docker.network=proxy
      - traefik.http.routers.loki.rule=Host(`${LOKI_DOMAIN}`)
      - traefik.http.routers.loki.entrypoints=websecure
      - traefik.http.routers.loki.tls=true
      - traefik.http.routers.loki.tls.certresolver=${TRAEFIK_CERTRESOLVER}
      - traefik.http.services.loki.loadbalancer.server.port=3100
      - traefik.http.middlewares.loki-sec.headers.browserXssFilter=true
      - traefik.http.middlewares.loki-sec.headers.contentTypeNosniff=true
      - traefik.http.routers.loki.middlewares=loki-sec
    networks:
      - proxy

networks:
  proxy:
    external: true
5) Start, validate health, and smoke test writes
Bring up the service, confirm readiness, then push a known line and query it. This closes the loop from ingest to retrieval before any dashboard work.
cp /opt/loki/config/loki.yaml /opt/loki/backups/loki.yaml.$(date +%F-%H%M%S)
docker compose -f /opt/loki/docker-compose.yml up -d
docker ps --filter name=loki
curl -fsS http://127.0.0.1:3100/ready
curl -fsS https://logs.example.com/ready  # no -k: the TLS chain should validate; no auth is configured in this setup
ts_ns=$(date +%s%N)
curl -sS -H "Content-Type: application/json" \
-X POST "http://127.0.0.1:3100/loki/api/v1/push" \
--data-raw "{\"streams\":[{\"stream\":{\"app\":\"smoke\",\"env\":\"prod\"},\"values\":[[\"$ts_ns\",\"loki smoke test ok\"]]}]}"
start=$(date -u -d '5 minutes ago' +%s)000000000
end=$(date -u +%s)000000000
curl -G -s "http://127.0.0.1:3100/loki/api/v1/query_range" \
--data-urlencode 'query={app="smoke",env="prod"}' \
--data-urlencode "start=$start" \
--data-urlencode "end=$end" | jq .status
Configuration and secrets handling best practices
First, keep your S3 key policy minimal: only list/get/put/delete on the single Loki bucket path. Second, rotate access keys quarterly or when team composition changes. Third, do not mix Loki and backup artifacts in one bucket prefix without lifecycle rules.
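A minimal key policy in practice means bucket-level list plus object-level get/put/delete, and nothing else. A sketch of such a policy document, assuming the bucket name from the .env example above (adjust the ARNs for your account):

```shell
# Hypothetical least-privilege policy for the Loki bucket. The bucket name
# matches the S3_BUCKET example from this guide; replace it with yours.
cat > /tmp/loki-s3-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": ["arn:aws:s3:::prod-loki-logs"]
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": ["arn:aws:s3:::prod-loki-logs/*"]
    }
  ]
}
EOF
echo "wrote /tmp/loki-s3-policy.json"
```

Note that delete permission is not optional: the compactor needs it to enforce retention.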
Enable lifecycle policies at the bucket layer aligned with retention_period. If your compliance policy says 30 days, enforce 35 days at storage for safety, then periodically audit effective object age. For regulated environments, archive audit logs for policy changes in the object store itself.
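The 30-day-in-Loki, 35-day-at-storage pattern can be expressed as a standard S3 lifecycle rule. A sketch, assuming the example bucket name; the `aws s3api` invocation in the comment applies to AWS, while MinIO and others have equivalent tooling:

```shell
# Safety-margin lifecycle rule: objects expire at 35 days, 5 days after
# Loki's own 30d retention_period. Apply with, e.g. (AWS):
#   aws s3api put-bucket-lifecycle-configuration --bucket prod-loki-logs \
#     --lifecycle-configuration file:///tmp/loki-lifecycle.json
cat > /tmp/loki-lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "expire-loki-objects",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "Expiration": {"Days": 35}
    }
  ]
}
EOF
echo "wrote /tmp/loki-lifecycle.json"
```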
On host security: restrict shell access, keep Docker socket access limited, and treat compose files as controlled configuration. Where possible, integrate with external secret stores (Vault, SOPS, or cloud secret managers) and render .env during deploy from CI/CD with short-lived credentials.
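Rendering .env atomically matters because a half-written file can be picked up by a restart mid-deploy. One way to sketch it, with placeholder values standing in for secrets fetched from your secret store (ENV_TARGET here points at a temp path purely for illustration):

```shell
# Atomic env-file render: create a temp file in the same directory, lock
# down permissions before writing secrets, then rename over the target.
# The values are placeholders; CI/CD would inject short-lived credentials.
ENV_TARGET="${ENV_TARGET:-/tmp/loki.env}"
tmpfile=$(mktemp "${ENV_TARGET}.XXXXXX")
chmod 600 "$tmpfile"
{
  echo "S3_ACCESS_KEY=example-key"
  echo "S3_SECRET_KEY=example-secret"
} > "$tmpfile"
mv -f "$tmpfile" "$ENV_TARGET"   # rename is atomic on the same filesystem
echo "rendered $ENV_TARGET"
```

Because `mv` within one filesystem is an atomic rename, readers only ever see the old file or the complete new one.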
Verification checklist
- Service health: /ready returns HTTP 200 locally and via Traefik route.
- TLS: certificate chain is valid and auto-renew policy is tested.
- Object writes: new chunk/index objects appear in bucket within minutes.
- Query latency: common queries return under expected SLO during peak periods.
- Retention: old data ages out as expected in both Loki and bucket lifecycle.
- Backup/recovery: configuration snapshots and restore procedure are documented.
Common issues and fixes
Loki starts but no logs arrive
Usually this is an agent-side label or endpoint mismatch. Verify Promtail target URL and check tenant/auth assumptions. Start with a direct local push test to isolate whether ingestion path is alive.
S3 errors during compaction
Look for permission gaps (missing delete/list) or region mismatch. Also confirm endpoint format: many S3-compatible providers require path-style addressing or custom CA bundles.
Traefik route returns 404/502
Common causes include wrong Docker network attachment, bad router rule host, or container healthcheck failing. Inspect Traefik dashboard and ensure traefik.docker.network=proxy matches your runtime network name.
Queries time out on large ranges
Tune query parallelism and narrow label selectors. Avoid high-cardinality labels (request IDs, UUIDs) in stream labels; keep those in log body. Review dashboard queries for accidental broad scans.
Disk usage grows unexpectedly on host
Even with S3 storage, local cache/WAL can grow if compactor falls behind or cleanup jobs are blocked. Check compactor logs, verify retention is enabled, and ensure data volume has alerting thresholds.
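The alerting threshold mentioned above can start as something as simple as a cron-run shell check. A sketch (the function name is ours; point it at the host path backing the /loki volume):

```shell
# Warn when a filesystem backing the Loki data volume crosses a usage
# threshold, expressed in percent.
check_disk() {
  # usage: check_disk <path> <threshold-percent>
  used=$(df --output=pcent "$1" 2>/dev/null | tail -1 | tr -dc '0-9')
  if [ -n "$used" ] && [ "$used" -ge "$2" ]; then
    echo "WARN: $1 at ${used}% (threshold $2%)"
    return 1
  fi
  echo "OK: $1 at ${used:-unknown}%"
}

# Example: alert when the data volume from this guide passes 80% full.
check_disk /opt/loki/data 80 || true
```

In a real setup the WARN branch would page or post to chat rather than just print.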
FAQ
Do I need microservices mode for production?
Not always. For small-to-mid workloads, single-binary Loki with strong storage and clear limits is often enough. Move to distributed mode when ingestion/query scale or HA requirements exceed one node profile.
Can I run Loki without S3-compatible storage?
You can, but durability and long-term retention become fragile on node failures. Object storage is strongly recommended for production resilience and maintenance flexibility.
How should I choose retention settings?
Start from compliance and incident-response needs, then map to storage cost. Many teams keep 30 days hot logs and export longer-term archives separately for legal hold workflows.
What labels should I avoid?
Avoid high-cardinality labels such as user IDs, session IDs, and request UUIDs. Keep labels bounded (service, environment, region, cluster) so index cardinality remains healthy.
How do I upgrade Loki safely?
Pin the current version, snapshot configuration, read release notes for schema/storage changes, then roll in staging with replayed query smoke tests before production rollout.
How can I validate disaster recovery quickly?
Rebuild on a fresh host from versioned compose/config, restore environment secrets securely, and run the push/query smoke tests. Document measured RTO/RPO after each drill.
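Measuring RTO during a drill is easier when each recovery step is timed uniformly. A small wrapper sketch (the function name and the placeholder step are ours; real steps would be the compose bring-up, secret restore, and smoke tests above):

```shell
# Time one recovery-drill step so measured durations can be recorded
# toward an RTO figure after each drill.
drill_step() {
  # usage: drill_step <description> <command...>
  desc="$1"; shift
  start=$(date +%s)
  "$@"; rc=$?
  end=$(date +%s)
  echo "drill '$desc' took $((end - start))s (exit $rc)"
  return $rc
}

# Illustrative usage only; substitute real recovery commands.
drill_step "placeholder wait" sleep 1
```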
Related guides
- Production Guide: Deploy NetBird with Kubernetes + Helm + cert-manager on Ubuntu
- Production Guide: Deploy Outline with Docker Compose + Nginx + PostgreSQL on Ubuntu
- Production Guide: Deploy n8n with Docker Compose + Traefik + PostgreSQL on Ubuntu
Talk to us
If you want this implemented with hardened defaults, observability, and tested recovery playbooks, our team can help.