
Production Guide: Deploy Grafana Loki + Promtail with Docker Compose + Traefik + Let's Encrypt on Ubuntu

A production-ready Loki stack with secure ingress, retention controls, and operator-focused verification steps.

Centralized logs are often the first thing teams realize they needed yesterday. A failed deploy, a flaky dependency, or a suspicious authentication pattern can burn hours when application logs are scattered across nodes and containers. In production, those delays quickly become revenue-impacting incidents.

This guide walks through a practical, production-oriented deployment of Grafana Loki + Promtail on Ubuntu using Docker Compose for orchestration and Traefik as the edge reverse proxy with automatic Let's Encrypt certificates. The goal is not just to make it run, but to make it operable: predictable upgrades, durable data paths, explicit retention, and clear validation checks.

We assume you are running this for a real team environment where auditability, rollback safety, and outage recovery matter. Along the way, we will include hardened defaults, practical troubleshooting patterns, and repeatable verification steps you can hand to operations staff.

Architecture and flow overview

The stack uses Loki as the log database, Promtail as the log shipper, and Grafana as the query/UI layer. Traefik terminates TLS and routes traffic to Grafana and Loki endpoints over an internal Docker network. Promtail tails container and host logs, enriches streams with labels, and pushes batches into Loki.

At a high level, the flow is:

  • Applications and containers emit logs to files or stdout.
  • Promtail reads and labels logs based on job and container metadata.
  • Loki stores indexed labels plus compressed log chunks.
  • Grafana queries Loki for dashboards, exploration, and incident timelines.
  • Traefik secures public access with automatic certificate lifecycle management.

This design keeps ingestion lightweight, supports horizontally scalable patterns later, and avoids the operational overhead of a full-text indexing cluster for teams that mainly need fast filtering by labels, service, namespace, and severity.
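The flow above can be exercised directly against Loki's HTTP push and query APIs. A minimal sketch, assuming LOKI_URL points at an endpoint that can reach your Loki instance (inside the compose network that is http://loki:3100; from outside, the Traefik HTTPS route configured later):

```shell
# LOKI_URL is an assumption for this sketch; substitute your real endpoint.
LOKI_URL="http://loki:3100"

# 1) Build a push payload: one stream keyed by labels, one [timestamp, line]
#    entry. Loki expects nanosecond-precision timestamps as strings.
NOW_NS=$(date +%s%N)
PAYLOAD=$(jq -n --arg ts "$NOW_NS" \
  '{streams: [{stream: {job: "demo", host: "node1"},
               values: [[$ts, "hello from the flow demo"]]}]}')

# 2) Push it (this is what Promtail does for you in the full stack).
curl -s -H "Content-Type: application/json" \
  -d "$PAYLOAD" "${LOKI_URL}/loki/api/v1/push"

# 3) Query it back by label selector, the same way Grafana does.
curl -s -G "${LOKI_URL}/loki/api/v1/query_range" \
  --data-urlencode 'query={job="demo"}' | jq '.data.result'
```

This round trip is also a useful smoke test once the stack is up: if the push succeeds but the query returns nothing, the problem is in Loki; if the push fails, the problem is in routing or ingestion limits.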

Prerequisites

Before deployment, confirm the host is ready for sustained log ingestion and retention:

  • Ubuntu 22.04+ with sudo access
  • Public DNS A record pointing to your server (for TLS issuance)
  • Ports 80/443 open on your firewall and cloud security group
  • At least 4 vCPU, 8 GB RAM, and fast SSD storage
  • Docker Engine and Docker Compose plugin installed

Run this baseline host preparation first:


sudo apt-get update && sudo apt-get -y upgrade
sudo apt-get install -y curl ca-certificates gnupg ufw jq

# Docker (if not already installed)
curl -fsSL https://get.docker.com | sudo sh
sudo usermod -aG docker $USER

# Firewall profile
sudo ufw allow OpenSSH
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
sudo ufw --force enable

# Verify runtime
systemctl is-active docker
docker version
docker compose version


Step-by-step deployment

1) Create project layout and persistent directories

Separate config, data, and backups from compose definitions. This keeps upgrades and rollback operations clean and avoids accidental data loss when changing compose files.


sudo mkdir -p /opt/observability/{traefik,loki,promtail,grafana,backups}
sudo mkdir -p /opt/observability/traefik/{letsencrypt,dynamic}
sudo mkdir -p /opt/observability/loki/{config,data}
sudo mkdir -p /opt/observability/promtail/config
sudo mkdir -p /opt/observability/grafana/data

sudo chown -R $USER:$USER /opt/observability
touch /opt/observability/traefik/letsencrypt/acme.json
chmod 600 /opt/observability/traefik/letsencrypt/acme.json
# Grafana runs as uid 472 inside the container and must own its data directory
sudo chown -R 472:472 /opt/observability/grafana/data
cd /opt/observability


2) Define environment variables and secrets

Use a dedicated environment file for non-public settings and deployment-specific values. For production teams, this file should be managed in a secret manager or encrypted repo workflow; avoid committing plaintext secrets in Git.


cat > /opt/observability/.env << 'EOF'
DOMAIN_LOGS=logs.example.com
GF_DOMAIN=grafana.example.com
[email protected]
GF_ADMIN_USER=admin
GF_ADMIN_PASSWORD=REPLACE_WITH_STRONG_PASSWORD
TZ=UTC
EOF

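As an optional hardening step, generate the admin password instead of choosing one, and lock the environment file down to its owner. A sketch, assuming the .env file above has already been written:

```shell
# Restrict the env file to the owning user.
chmod 600 /opt/observability/.env

# 24 random bytes encode to 32 base64 characters with no padding.
GENERATED=$(openssl rand -base64 24)

# Swap the placeholder for the generated value ('|' as the sed delimiter,
# since base64 output never contains it).
sed -i "s|REPLACE_WITH_STRONG_PASSWORD|${GENERATED}|" /opt/observability/.env

grep -c GF_ADMIN_PASSWORD /opt/observability/.env   # expect 1
```

Store the generated value in your password manager or secret store immediately; once Grafana first starts, rotating it requires either the UI or grafana-cli.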

3) Create Docker Compose stack

The compose file below pins stable image tags, places all services on a shared network, and sets Traefik labels explicitly for deterministic routing. Loki and Grafana mount persistent volumes; Promtail mounts host logs read-only. Note that Loki runs with auth_enabled: false, so the HTTPS route Traefik publishes for it is unauthenticated: restrict it by source IP or attach an auth middleware before exposing it to the internet.


cat > /opt/observability/docker-compose.yml << 'EOF'
services:
  traefik:
    image: traefik:v3.1
    command:
      - --api.dashboard=true
      - --providers.docker=true
      - --providers.docker.exposedbydefault=false
      - --entrypoints.web.address=:80
      - --entrypoints.web.http.redirections.entrypoint.to=websecure
      - --entrypoints.web.http.redirections.entrypoint.scheme=https
      - --entrypoints.websecure.address=:443
      - --certificatesresolvers.le.acme.tlschallenge=true
      - --certificatesresolvers.le.acme.email=${LETSENCRYPT_EMAIL}
      - --certificatesresolvers.le.acme.storage=/letsencrypt/acme.json
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /opt/observability/traefik/letsencrypt:/letsencrypt
    restart: unless-stopped

  loki:
    image: grafana/loki:3.1.1
    command: ["-config.file=/etc/loki/loki.yaml"]
    volumes:
      - /opt/observability/loki/config/loki.yaml:/etc/loki/loki.yaml:ro
      - /opt/observability/loki/data:/loki
    labels:
      - traefik.enable=true
      - traefik.http.routers.loki.rule=Host(`${DOMAIN_LOGS}`)
      - traefik.http.routers.loki.entrypoints=websecure
      - traefik.http.routers.loki.tls.certresolver=le
      - traefik.http.services.loki.loadbalancer.server.port=3100
    restart: unless-stopped

  promtail:
    image: grafana/promtail:3.1.1
    # -config.expand-env lets ${HOSTNAME} in promtail.yaml resolve; note that
    # inside the container this is the container hostname unless you set one.
    command: ["-config.file=/etc/promtail/promtail.yaml", "-config.expand-env=true"]
    volumes:
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /opt/observability/promtail/config/promtail.yaml:/etc/promtail/promtail.yaml:ro
    restart: unless-stopped

  grafana:
    image: grafana/grafana:11.1.0
    env_file: /opt/observability/.env
    environment:
      - GF_SERVER_DOMAIN=${GF_DOMAIN}
      - GF_SERVER_ROOT_URL=https://${GF_DOMAIN}
      - GF_SECURITY_ADMIN_USER=${GF_ADMIN_USER}
      - GF_SECURITY_ADMIN_PASSWORD=${GF_ADMIN_PASSWORD}
      - TZ=${TZ}
    volumes:
      - /opt/observability/grafana/data:/var/lib/grafana
    labels:
      - traefik.enable=true
      - traefik.http.routers.grafana.rule=Host(`${GF_DOMAIN}`)
      - traefik.http.routers.grafana.entrypoints=websecure
      - traefik.http.routers.grafana.tls.certresolver=le
      - traefik.http.services.grafana.loadbalancer.server.port=3000
    depends_on:
      - loki
    restart: unless-stopped
EOF

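Before starting anything, render the fully interpolated configuration. This catches YAML mistakes and unset variables early, while they are still cheap to fix:

```shell
cd /opt/observability
set -a && source .env && set +a

# 'docker compose config' interpolates variables and validates the file;
# --quiet suppresses the rendered output and only reports errors.
docker compose config --quiet && echo "compose file OK"
```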

4) Add Loki and Promtail configuration

Loki retention and compaction settings are where many teams under-invest. Define retention intentionally, align with compliance, and estimate disk growth before go-live.


cat > /opt/observability/loki/config/loki.yaml << 'EOF'
auth_enabled: false
server:
  http_listen_port: 3100
common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: inmemory
schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h
limits_config:
  retention_period: 336h   # 14 days
  ingestion_rate_mb: 8
  ingestion_burst_size_mb: 16
compactor:
  working_directory: /loki/compactor
  compaction_interval: 10m
  retention_enabled: true
  delete_request_store: filesystem
EOF

cat > /opt/observability/promtail/config/promtail.yaml << 'EOF'
server:
  http_listen_port: 9080
positions:
  # NOTE: /tmp is lost on container recreation; mount a small volume here if
  # positions must persist (otherwise logs may be re-read after a restart).
  filename: /tmp/positions.yaml
clients:
  - url: http://loki:3100/loki/api/v1/push
scrape_configs:
  - job_name: varlogs
    static_configs:
      - targets: [localhost]
        labels:
          job: varlogs
          host: ${HOSTNAME}
          __path__: /var/log/*.log
  - job_name: docker
    static_configs:
      - targets: [localhost]
        labels:
          job: docker
          host: ${HOSTNAME}
          __path__: /var/lib/docker/containers/*/*-json.log
    pipeline_stages:
      - docker: {}
EOF


5) Launch stack and validate certificate issuance

Bring up services in order, then confirm Traefik issued valid certificates for both endpoints. First certificate generation can take a few minutes, depending on DNS propagation and rate-limit history.


cd /opt/observability
set -a && source .env && set +a

docker compose pull
docker compose up -d

docker compose ps
docker compose logs --tail=80 traefik
docker compose logs --tail=80 loki

curl -I https://${GF_DOMAIN}
curl -I https://${DOMAIN_LOGS}/ready


6) Configure Grafana datasource and baseline dashboard checks

After logging in, add Loki as a datasource using the internal URL http://loki:3100 (Grafana and Loki share the compose network, so this hop needs no TLS or public hostname). Build at least one team-facing dashboard that filters by job, host, and level labels so incident responders can move from symptom to scope quickly.

For production readiness, create alert rules for ingestion drop, high error-rate patterns, and missing logs from critical services. A common anti-pattern is having dashboards without no-data alerts: outages then appear as calm charts instead of incidents.
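The same label-scoped queries that back those dashboards can be run from the command line. A sketch against the public endpoint, assuming the stack is up and the env file has been sourced (Loki timestamps are nanoseconds: epoch seconds plus nine zeros):

```shell
set -a && source /opt/observability/.env && set +a
START="$(date -d '1 hour ago' +%s)000000000"

# Error lines shipped by the docker job in the last hour, one file per line.
curl -s -G "https://${DOMAIN_LOGS}/loki/api/v1/query_range" \
  --data-urlencode 'query={job="docker"} |= "error"' \
  --data-urlencode "start=${START}" \
  | jq -r '.data.result[].stream.filename'
```

Queries like this also make good synthetic monitors: run them from cron and alert when a stream that should always exist returns empty.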

Configuration and secrets handling best practices

Operational reliability comes from controlled change. Keep image tags pinned and use an explicit promotion process from staging to production. Store your compose file in Git, but keep secret values externalized in your environment manager.

Recommended controls:

  • Rotate Grafana admin credentials and move to SSO where possible.
  • Restrict inbound access by source IP or VPN for admin endpoints.
  • Back up Loki data and Grafana state on a schedule; test restore monthly.
  • Set retention by policy, not by disk panic, and monitor growth trends.
  • Use separate tenants or label boundaries for regulated workloads.

If your compliance baseline requires immutable archives, forward selected streams to object storage with lifecycle policies. Even when using local filesystem in early phases, plan migration checkpoints before volume growth forces emergency architecture changes.


# Example nightly backup script
cat > /opt/observability/backups/backup.sh << 'EOF'
#!/usr/bin/env bash
set -euo pipefail
STAMP=$(date +%F-%H%M)
mkdir -p /opt/observability/backups/$STAMP

# NOTE: a tar of a running stack is only crash-consistent; for a clean
# snapshot, stop loki and grafana first (docker compose stop loki grafana).
tar -czf /opt/observability/backups/$STAMP/loki-data.tgz -C /opt/observability/loki data
tar -czf /opt/observability/backups/$STAMP/grafana-data.tgz -C /opt/observability/grafana data
cp /opt/observability/.env /opt/observability/backups/$STAMP/env.backup

echo "Backup complete: $STAMP"
EOF
chmod +x /opt/observability/backups/backup.sh
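To make the script actually nightly, schedule it from root's crontab so it can read every data directory. A sketch (the 02:15 slot and log path are assumptions; pick a window outside your compaction interval):

```shell
CRON_LINE='15 2 * * * /opt/observability/backups/backup.sh >> /var/log/loki-backup.log 2>&1'

# Re-emit the existing crontab minus any old backup entry, append the new one,
# then load the combined result.
( sudo crontab -l 2>/dev/null | grep -v backup.sh; echo "$CRON_LINE" ) | sudo crontab -

sudo crontab -l | grep backup.sh   # confirm the entry landed
```

Alert on the log file's age as well: a backup job that silently stops running is the classic failure mode.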


Verification checklist

Use this checklist after deployment and after every upgrade. The objective is to verify data ingestion, the query path, and external access, not just container uptime.


cd /opt/observability
set -a && source .env && set +a

docker compose ps
curl -sf https://$DOMAIN_LOGS/ready
curl -sf https://$GF_DOMAIN/login >/dev/null && echo "Grafana reachable"

docker compose logs --tail=100 promtail | grep -Ei "client|batch|error"

docker exec $(docker ps --filter name=loki -q) wget -qO- http://localhost:3100/ready

docker exec $(docker ps --filter name=grafana -q) grafana-cli plugins ls


Also perform a synthetic end-to-end check: generate a controlled log line from a known service, then query it in Grafana Explore using expected labels. This confirms parser stages and label mappings are behaving correctly.
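One way to run that synthetic check from the host, assuming the stack is up. MARKER is a throwaway value used only to find this exact line again; the 10-second sleep is a guess at Promtail's tail-batch-push latency:

```shell
MARKER="e2e-$(date +%s)"

# logger writes to syslog, which lands under /var/log and matches the
# varlogs job's path glob.
logger "loki-e2e-check ${MARKER}"
sleep 10

set -a && source /opt/observability/.env && set +a
curl -s -G "https://${DOMAIN_LOGS}/loki/api/v1/query_range" \
  --data-urlencode "query={job=\"varlogs\"} |= \"${MARKER}\"" \
  | jq '.data.result | length'   # non-zero once the line has landed
```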

Common issues and fixes

Certificates not issuing from Let's Encrypt

Most failures come from DNS mismatch, a blocked challenge port, or stale ACME attempts. This configuration uses the TLS-ALPN-01 challenge, which requires port 443 to be reachable from the public internet. Revalidate A records and firewall policy, then inspect Traefik logs for challenge-specific errors.
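A quick diagnostic pass, assuming the env file is in place:

```shell
set -a && source /opt/observability/.env && set +a

# 1) Does public DNS resolve to this host?
dig +short "${GF_DOMAIN}" A

# 2) What certificate is actually being served right now?
curl -vkI "https://${GF_DOMAIN}" 2>&1 | grep -iE 'issuer|expire|subject'

# 3) What does Traefik say about the ACME exchange?
docker compose -f /opt/observability/docker-compose.yml logs traefik 2>&1 \
  | grep -i acme | tail -20
```

If acme.json contains failed attempts for a wrong domain, fix DNS first, then remove the stale entries and restart Traefik so issuance retries cleanly.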

No logs visible in Grafana Explore

Check Promtail path globs and permissions first. Container JSON logs are often mounted incorrectly or blocked by host security profiles. Confirm the files exist on host and are readable inside the promtail container.
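Check the same paths from the host and from inside the container; a mismatch usually means a wrong mount or an AppArmor/SELinux denial. A sketch:

```shell
# Host view: do the container JSON logs exist at the expected glob?
sudo ls /var/lib/docker/containers/*/*-json.log | head -5

# Container view: can Promtail see the same directory?
PROMTAIL=$(docker ps --filter name=promtail -q)
docker exec "$PROMTAIL" ls /var/lib/docker/containers | head -5

# Promtail's own target page on its internal HTTP port. If wget is not
# present in the image, publish port 9080 temporarily and curl from the host.
docker exec "$PROMTAIL" wget -qO- http://localhost:9080/targets | head -40
```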

High disk growth in Loki

Retention defaults may be too high for your ingest rate. Tighten retention windows, reduce noisy debug streams, and enforce label discipline so teams can stop shipping low-value logs.
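Before tightening anything, measure where the space is going and which streams are the heaviest writers. A sketch, assuming the stack layout from this guide:

```shell
# Chunks dominate on disk; the TSDB index stays comparatively small.
sudo du -sh /opt/observability/loki/data/* | sort -rh

# Top jobs by bytes ingested over the last hour (instant metric query).
set -a && source /opt/observability/.env && set +a
curl -s -G "https://${DOMAIN_LOGS}/loki/api/v1/query" \
  --data-urlencode 'query=topk(10, sum by (job) (bytes_over_time({job=~".+"}[1h])))' \
  | jq -r '.data.result[] | "\(.metric.job)\t\(.value[1])"'
```

The second query names the offenders directly, which turns "reduce noisy debug streams" from a guess into a targeted conversation with one team.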

Query performance slows during incidents

Unbounded queries across large time ranges hurt responsiveness. Train responders to start with narrow windows and strict labels, then widen only when needed. Prebuilt dashboards for high-risk services also reduce ad-hoc query load.

FAQ

1) When should we choose Loki over a full-text search stack?

Choose Loki when your team primarily filters by labels and time ranges, and wants lower operational overhead. If you require deep text analytics at massive scale, evaluate complementing Loki with a separate search pipeline.

2) Can we run this stack without public exposure?

Yes. Keep Grafana and Loki behind private networking or VPN, and use internal DNS with private CA certificates. For internet-facing deployments, enforce MFA/SSO and IP restrictions.

3) How much retention should we configure initially?

Start with 14–30 days based on incident response needs and compliance. Track growth for two weeks, then tune retention and storage classes before expanding historical windows.

4) What is the safest upgrade path?

Promote image tags through staging, validate queries and dashboards, then upgrade production during a maintenance window. Snapshot data and config before changing compose or Loki schema settings.

5) How do we prevent sensitive data from entering logs?

Apply redaction in application middleware, avoid logging raw tokens/PII, and add pipeline stages that drop high-risk patterns. Security review of logging standards should be part of release readiness.
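As a sketch of what those pipeline stages might look like (the patterns here are hypothetical; adapt them to your own token and PII formats), these would sit under a scrape_config's pipeline_stages in promtail.yaml:

```shell
cat << 'EOF'
pipeline_stages:
  # Mask bearer tokens: with no capture groups, the whole match is replaced.
  - replace:
      expression: 'Bearer \S+'
      replace: 'Bearer REDACTED'
  # Drop entire lines that look like raw card numbers.
  - drop:
      expression: '\b[0-9]{13,16}\b'
EOF
```

Treat redaction at the shipper as a backstop, not the primary control: the application should never emit the secret in the first place.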

6) Can this design scale later without replatforming immediately?

Yes. You can add remote object storage, split read/write paths, and eventually move to Kubernetes while preserving Loki query conventions and dashboard workflows.

7) Should Promtail run on every host?

In multi-node environments, yes. Running a local agent per node improves resilience and source labeling. Centralized scraping from one node usually misses host-specific logs and complicates failure isolation.
