Centralized logs are often the first thing teams realize they needed yesterday. A failed deploy, a flaky dependency, or a suspicious authentication pattern can burn hours when application logs are scattered across nodes and containers. In production, those delays quickly become revenue-impacting incidents.
This guide walks through a practical, production-oriented deployment of Grafana Loki + Promtail on Ubuntu using Docker Compose for orchestration and Traefik as the edge reverse proxy with automatic Let's Encrypt certificates. The goal is not just to make it run, but to make it operable: predictable upgrades, durable data paths, explicit retention, and clear validation checks.
We assume you are running this for a real team environment where auditability, rollback safety, and outage recovery matter. Along the way, we will include hardened defaults, practical troubleshooting patterns, and repeatable verification steps you can hand to operations staff.
Architecture and flow overview
The stack uses Loki as the log database, Promtail as the log shipper, and Grafana as the query/UI layer. Traefik terminates TLS and routes traffic to Grafana and Loki endpoints over an internal Docker network. Promtail tails container and host logs, enriches streams with labels, and pushes batches into Loki.
At a high level, the flow is:
- Applications and containers emit logs to files or stdout.
- Promtail reads and labels logs based on job and container metadata.
- Loki stores indexed labels plus compressed log chunks.
- Grafana queries Loki for dashboards, exploration, and incident timelines.
- Traefik secures public access with automatic certificate lifecycle management.
This design keeps ingestion lightweight, supports horizontally scalable patterns later, and avoids the operational overhead of a full-text indexing cluster for teams that mainly need fast filtering by labels, service, namespace, and severity.
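To make the label-first model concrete, here is a hedged sketch of querying Loki's HTTP API directly. The label selector and error filter are illustrative, and DOMAIN_LOGS is the endpoint configured later in this guide:

```shell
# Select streams by label first, then filter text within them (LogQL).
QUERY='{job="docker"} |= "error"'
# Run against the deployed stack from this guide (fails harmlessly if not yet up):
curl -sG "https://${DOMAIN_LOGS:-logs.example.com}/loki/api/v1/query_range" \
  --data-urlencode "query=${QUERY}" \
  --data-urlencode "limit=20" || true
echo "$QUERY"
```

The key property: Loki only indexes the labels in the selector, so the text filter scans a small, pre-narrowed set of chunks instead of a global full-text index.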
Prerequisites
Before deployment, confirm the host is ready for sustained log ingestion and retention:
- Ubuntu 22.04+ with sudo access
- Public DNS A record pointing to your server (for TLS issuance)
- Ports 80/443 open on your firewall and cloud security group
- At least 4 vCPU, 8 GB RAM, and fast SSD storage
- Docker Engine and Docker Compose plugin installed
Run this baseline host preparation first:
sudo apt-get update && sudo apt-get -y upgrade
sudo apt-get install -y curl ca-certificates gnupg ufw jq
# Docker (if not already installed)
curl -fsSL https://get.docker.com | sudo sh
sudo usermod -aG docker $USER  # log out and back in (or run newgrp docker) for this to take effect
# Firewall profile
sudo ufw allow OpenSSH
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
sudo ufw --force enable
# Verify runtime
systemctl is-active docker
docker version
docker compose version
Step-by-step deployment
1) Create project layout and persistent directories
Separate config, data, and backups from compose definitions. This keeps upgrades and rollback operations clean and avoids accidental data loss when changing compose files.
sudo mkdir -p /opt/observability/{traefik,loki,promtail,grafana,backups}
sudo mkdir -p /opt/observability/traefik/{letsencrypt,dynamic}
sudo mkdir -p /opt/observability/loki/{config,data}
sudo mkdir -p /opt/observability/promtail/config
sudo mkdir -p /opt/observability/grafana/data
sudo chown -R $USER:$USER /opt/observability
touch /opt/observability/traefik/letsencrypt/acme.json
chmod 600 /opt/observability/traefik/letsencrypt/acme.json
cd /opt/observability
2) Define environment variables and secrets
Use a dedicated environment file for non-public settings and deployment-specific values. For production teams, this file should be managed in a secret manager or encrypted repo workflow; avoid committing plaintext secrets in Git.
cat > /opt/observability/.env << 'EOF'
DOMAIN_LOGS=logs.example.com
GF_DOMAIN=grafana.example.com
[email protected]
GF_ADMIN_USER=admin
GF_ADMIN_PASSWORD=REPLACE_WITH_STRONG_PASSWORD
TZ=UTC
EOF
3) Create Docker Compose stack
The compose file below pins stable image tags, places all services on a shared network, and sets Traefik labels explicitly for deterministic routing. Loki and Grafana mount persistent volumes; Promtail mounts host logs read-only.
cat > /opt/observability/docker-compose.yml << 'EOF'
services:
  traefik:
    image: traefik:v3.1
    command:
      - --api.dashboard=true
      - --providers.docker=true
      - --providers.docker.exposedbydefault=false
      - --entrypoints.web.address=:80
      - --entrypoints.websecure.address=:443
      - --certificatesresolvers.le.acme.tlschallenge=true
      - --certificatesresolvers.le.acme.email=${LETSENCRYPT_EMAIL}
      - --certificatesresolvers.le.acme.storage=/letsencrypt/acme.json
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /opt/observability/traefik/letsencrypt:/letsencrypt
    restart: unless-stopped

  loki:
    image: grafana/loki:3.1.1
    command: ["-config.file=/etc/loki/loki.yaml"]
    volumes:
      - /opt/observability/loki/config/loki.yaml:/etc/loki/loki.yaml:ro
      - /opt/observability/loki/data:/loki
    labels:
      - traefik.enable=true
      - traefik.http.routers.loki.rule=Host(`${DOMAIN_LOGS}`)
      - traefik.http.routers.loki.entrypoints=websecure
      - traefik.http.routers.loki.tls.certresolver=le
      - traefik.http.services.loki.loadbalancer.server.port=3100
    restart: unless-stopped

  promtail:
    image: grafana/promtail:3.1.1
    # expand-env lets promtail.yaml reference ${HOSTNAME} and similar variables
    command: ["-config.file=/etc/promtail/promtail.yaml", "-config.expand-env=true"]
    volumes:
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /opt/observability/promtail/config/promtail.yaml:/etc/promtail/promtail.yaml:ro
    restart: unless-stopped

  grafana:
    image: grafana/grafana:11.1.0
    env_file: /opt/observability/.env
    environment:
      - GF_SERVER_DOMAIN=${GF_DOMAIN}
      - GF_SERVER_ROOT_URL=https://${GF_DOMAIN}
      - GF_SECURITY_ADMIN_USER=${GF_ADMIN_USER}
      - GF_SECURITY_ADMIN_PASSWORD=${GF_ADMIN_PASSWORD}
      - TZ=${TZ}
    volumes:
      - /opt/observability/grafana/data:/var/lib/grafana
    labels:
      - traefik.enable=true
      - traefik.http.routers.grafana.rule=Host(`${GF_DOMAIN}`)
      - traefik.http.routers.grafana.entrypoints=websecure
      - traefik.http.routers.grafana.tls.certresolver=le
      - traefik.http.services.grafana.loadbalancer.server.port=3000
    depends_on:
      - loki
    restart: unless-stopped
EOF
4) Add Loki and Promtail configuration
Loki retention and compaction settings are where many teams under-invest. Define retention intentionally, align with compliance, and estimate disk growth before go-live.
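A back-of-envelope sizing sketch helps before committing to a retention window. The ingest volume and compression ratio below are assumptions for illustration; substitute measurements from your own environment:

```shell
# Rough disk estimate: daily raw ingest (GB) x retention (days) / compression ratio.
INGEST_GB_PER_DAY=5      # assumed raw log volume; measure yours
RETENTION_DAYS=14        # matches the 336h retention_period used in this guide
COMPRESSION_RATIO=8      # assumed chunk compression; varies with log content
NEED_GB=$(( INGEST_GB_PER_DAY * RETENTION_DAYS / COMPRESSION_RATIO ))
echo "Estimated steady-state chunk storage: ~${NEED_GB} GB (plus index and headroom)"
```

Leave generous headroom on top of the estimate: compaction runs, bursts, and index growth all consume disk beyond the steady-state chunk footprint.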
cat > /opt/observability/loki/config/loki.yaml << 'EOF'
auth_enabled: false
server:
  http_listen_port: 3100
common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: inmemory
schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h
limits_config:
  retention_period: 336h # 14 days
  ingestion_rate_mb: 8
  ingestion_burst_size_mb: 16
compactor:
  working_directory: /loki/compactor
  compaction_interval: 10m
  retention_enabled: true
  delete_request_store: filesystem
EOF
cat > /opt/observability/promtail/config/promtail.yaml << 'EOF'
server:
  http_listen_port: 9080
positions:
  filename: /tmp/positions.yaml  # consider a persistent path so restarts do not re-ship old logs
clients:
  - url: http://loki:3100/loki/api/v1/push
scrape_configs:
  - job_name: varlogs
    static_configs:
      - targets: [localhost]
        labels:
          job: varlogs
          host: ${HOSTNAME}  # requires -config.expand-env=true on the promtail command
          __path__: /var/log/*.log
  - job_name: docker
    static_configs:
      - targets: [localhost]
        labels:
          job: docker
          host: ${HOSTNAME}
          __path__: /var/lib/docker/containers/*/*-json.log
    pipeline_stages:
      - docker: {}
EOF
5) Launch stack and validate certificate issuance
Bring up services in order, then confirm Traefik issued valid certificates for both endpoints. First certificate generation can take a few minutes, depending on DNS propagation and rate-limit history.
cd /opt/observability
set -a && source .env && set +a
docker compose pull
docker compose up -d
docker compose ps
docker compose logs --tail=80 traefik
docker compose logs --tail=80 loki
curl -I https://${GF_DOMAIN}
curl -I https://${DOMAIN_LOGS}/ready
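Because first issuance can take a few minutes, a small polling helper beats re-running curl by hand. `wait_for` below is a local helper name, not a Docker or Traefik command:

```shell
# Retry a command until it succeeds or attempts run out (5s between tries).
wait_for() {
  local attempts=$1; shift
  local i
  for ((i = 1; i <= attempts; i++)); do
    "$@" >/dev/null 2>&1 && return 0
    sleep 5
  done
  return 1
}
# Example: wait up to ~5 minutes per endpoint (assumes .env is sourced).
# wait_for 60 curl -sf "https://${DOMAIN_LOGS}/ready" && echo "Loki ready"
# wait_for 60 curl -sf "https://${GF_DOMAIN}/login"  && echo "Grafana ready"
```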
6) Configure Grafana datasource and baseline dashboard checks
After login, add Loki as a data source using the internal URL http://loki:3100; since Grafana and Loki share the compose network, queries never need to leave the host. Build at least one team-facing dashboard that filters by job, host, and level labels so incident responders can move from symptom to scope quickly.
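Rather than clicking through the UI, the data source can also be provisioned as code using Grafana's standard provisioning format. This sketch assumes you mount an extra volume into the Grafana container at /etc/grafana/provisioning/datasources/:

```yaml
# Example: /opt/observability/grafana/provisioning/datasources/loki.yaml (assumed path)
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100   # internal compose network address
    isDefault: true
```

Provisioned data sources survive container rebuilds and keep staging and production configured identically.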
For production readiness, create alert rules for ingestion drop, high error-rate patterns, and missing logs from critical services. A common anti-pattern is having dashboards without no-data alerts: outages then appear as calm charts instead of incidents.
Configuration and secrets handling best practices
Operational reliability comes from controlled change. Keep image tags pinned and use an explicit promotion process from staging to production. Store your compose file in Git, but keep secret values externalized in your environment manager.
Recommended controls:
- Rotate Grafana admin credentials and move to SSO where possible.
- Restrict inbound access by source IP or VPN for admin endpoints.
- Back up Loki data and Grafana state on a schedule; test restore monthly.
- Set retention by policy, not by disk panic, and monitor growth trends.
- Use separate tenants or label boundaries for regulated workloads.
If your compliance baseline requires immutable archives, forward selected streams to object storage with lifecycle policies. Even when using local filesystem in early phases, plan migration checkpoints before volume growth forces emergency architecture changes.
# Example nightly backup script
cat > /opt/observability/backups/backup.sh << 'EOF'
#!/usr/bin/env bash
set -euo pipefail
STAMP=$(date +%F-%H%M)
mkdir -p /opt/observability/backups/$STAMP
tar -czf /opt/observability/backups/$STAMP/loki-data.tgz -C /opt/observability/loki data
tar -czf /opt/observability/backups/$STAMP/grafana-data.tgz -C /opt/observability/grafana data
cp /opt/observability/.env /opt/observability/backups/$STAMP/env.backup
echo "Backup complete: $STAMP"
EOF
chmod +x /opt/observability/backups/backup.sh
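The nightly script accumulates snapshots indefinitely, so pair it with a prune step to keep the backup volume bounded. The 14-day window below is an assumption chosen to match the Loki retention in this guide:

```shell
# Remove backup snapshot directories older than 14 days (directory mtime based).
BACKUP_ROOT="${BACKUP_ROOT:-/opt/observability/backups}"
find "$BACKUP_ROOT" -mindepth 1 -maxdepth 1 -type d -mtime +14 \
  -exec rm -rf {} +
echo "Pruned backups older than 14 days in $BACKUP_ROOT"
```

Run both scripts from cron or a systemd timer, and alert when the backup job exits non-zero; a silently failing backup is worse than none.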
Verification checklist
Use this checklist after deployment and after every upgrade. The objective is to verify data ingestion, query path, and external access, not just container uptime.
cd /opt/observability
set -a && source .env && set +a
docker compose ps
curl -sf https://$DOMAIN_LOGS/ready
curl -sf https://$GF_DOMAIN/login >/dev/null && echo "Grafana reachable"
docker compose logs --tail=100 promtail | grep -Ei "client|batch|error"
docker exec "$(docker ps --filter name=loki -q)" wget -qO- http://localhost:3100/ready
docker exec "$(docker ps --filter name=grafana -q)" grafana-cli plugins ls
Also perform a synthetic end-to-end check: generate a controlled log line from a known service, then query it in Grafana Explore using expected labels. This confirms parser stages and label mappings are behaving correctly.
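A minimal version of that synthetic check, assuming the varlogs job from this guide (writing under /var/log typically requires root, and the query is shown against the deployed endpoint):

```shell
# Emit a unique marker into a file the varlogs job tails (/var/log/*.log).
LOGFILE="${LOGFILE:-/var/log/synthetic.log}"
TOKEN="e2e-$(date +%s)-$RANDOM"
echo "$(date -Is) synthetic-check token=${TOKEN}" >> "$LOGFILE" 2>/dev/null || true
echo "marker: $TOKEN"
# After a few seconds, query it back through Loki (stack from this guide):
#   curl -sG "https://${DOMAIN_LOGS}/loki/api/v1/query_range" \
#     --data-urlencode "query={job=\"varlogs\"} |= \"${TOKEN}\""
```

If the marker never appears, work backwards: file exists on host, file readable inside the promtail container, promtail positions advancing, Loki push endpoint reachable.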
Common issues and fixes
Certificates not issuing from Let's Encrypt
Most failures come from DNS mismatch, blocked port 80, or stale ACME challenge attempts. Revalidate A records and firewall policy, then inspect Traefik logs for challenge-specific errors.
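A quick sanity script for the DNS half of that checklist; ifconfig.me is an assumed public-IP echo service, so substitute whatever your environment allows:

```shell
# Compare the domain's A record with this server's public IP.
DOMAIN="${GF_DOMAIN:-grafana.example.com}"
RESOLVED=$(getent hosts "$DOMAIN" | awk '{print $1; exit}')
PUBLIC=$(curl -fsS --max-time 5 https://ifconfig.me 2>/dev/null || echo unknown)
echo "A record for ${DOMAIN}: ${RESOLVED:-none}; server public IP: ${PUBLIC}"
[ "${RESOLVED:-none}" = "$PUBLIC" ] && echo "DNS matches" || echo "DNS mismatch or unresolved"
```

If DNS matches but issuance still fails, check that nothing else is bound to port 80/443 and review Traefik's log lines mentioning the `le` resolver.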
No logs visible in Grafana Explore
Check Promtail path globs and permissions first. Container JSON logs are often mounted incorrectly or blocked by host security profiles. Confirm the files exist on host and are readable inside the promtail container.
High disk growth in Loki
Retention defaults may be too high for your ingest rate. Tighten retention windows, reduce noisy debug streams, and enforce label discipline so teams can stop shipping low-value logs.
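One concrete lever is dropping debug-level noise at the agent, before it ever reaches Loki. A sketch of a Promtail drop stage for the docker job (the `level=debug` pattern assumes your applications emit that literal field; adjust to your log format):

```yaml
# Sketch: extend the docker job's pipeline_stages in promtail.yaml.
pipeline_stages:
  - docker: {}
  - drop:
      expression: '.*level=debug.*'   # assumed log format; tune per service
```

Dropping at the agent saves ingest quota, disk, and query time at once, and keeps retention decisions focused on logs that actually matter.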
Query performance slows during incidents
Unbounded queries across large time ranges hurt responsiveness. Train responders to start with narrow windows and strict labels, then widen only when needed. Prebuilt dashboards for high-risk services also reduce ad-hoc query load.
FAQ
1) When should we choose Loki over a full-text search stack?
Choose Loki when your team primarily filters by labels and time ranges, and wants lower operational overhead. If you require deep text analytics at massive scale, evaluate complementing Loki with a separate search pipeline.
2) Can we run this stack without public exposure?
Yes. Keep Grafana and Loki behind private networking or VPN, and use internal DNS with private CA certificates. For internet-facing deployments, enforce MFA/SSO and IP restrictions.
3) How much retention should we configure initially?
Start with 14 to 30 days based on incident response needs and compliance. Track growth for two weeks, then tune retention and storage classes before expanding historical windows.
4) What is the safest upgrade path?
Promote image tags through staging, validate queries and dashboards, then upgrade production during a maintenance window. Snapshot data and config before changing compose or Loki schema settings.
5) How do we prevent sensitive data from entering logs?
Apply redaction in application middleware, avoid logging raw tokens/PII, and add pipeline stages that drop high-risk patterns. Security review of logging standards should be part of release readiness.
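As a defense-in-depth sketch, Promtail's replace stage can mask obvious token patterns before logs are shipped. The regex below is illustrative only, not a complete PII policy:

```yaml
# Sketch: mask bearer tokens in the docker job before pushing to Loki.
pipeline_stages:
  - docker: {}
  - replace:
      expression: 'Bearer (?P<secret>[A-Za-z0-9._-]+)'
      replace: '[REDACTED]'
```

Treat agent-side redaction as a safety net: the primary control remains keeping secrets out of application log statements in the first place.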
6) Can this design scale later without replatforming immediately?
Yes. You can add remote object storage, split read/write paths, and eventually move to Kubernetes while preserving Loki query conventions and dashboard workflows.
7) Should Promtail run on every host?
In multi-node environments, yes. Running a local agent per node improves resilience and source labeling. Centralized scraping from one node usually misses host-specific logs and complicates failure isolation.
Related guides
- Production Guide: Grafana + Prometheus with Docker Compose + Nginx
- Production Guide: OpenObserve with Docker Compose + Traefik + ClickHouse
- Production Guide: Redash with Docker Compose + Nginx + PostgreSQL
Use these guides to align your metrics, logs, and BI workflows into a single operational runbook strategy across your environment.
Talk to us
If you want this deployed with production hardening, monitoring, and backup automation tailored to your environment, our team can help.