
Production Guide: Deploy Prometheus with Docker Compose + Caddy + Alertmanager on Ubuntu

A production-focused walkthrough with secure secrets handling, alert routing, verification commands, and operational troubleshooting.

When infrastructure grows beyond a few services, teams often discover that ad-hoc checks and disconnected dashboards are not enough to catch outages early. A production monitoring stack needs predictable ingestion, clear alert routing, and maintenance procedures that can survive incidents and handoffs between engineers. This guide walks through a practical deployment of Prometheus with Alertmanager and Node Exporter on Ubuntu using Docker Compose, fronted by Caddy for TLS and clean ingress handling.

The target reader is an operator or platform engineer who needs a repeatable setup for a real environment: private host metrics collection, hardened service boundaries, durable storage, and a verification flow that can be executed quickly during change windows. The approach below is intentionally explicit so you can adapt it for staging and production with minimal ambiguity.

Architecture and flow overview

This deployment separates concerns into four components: Prometheus for scrape and query, Alertmanager for deduplication and routing, Node Exporter for host-level telemetry, and Caddy for HTTPS ingress. Prometheus scrapes Node Exporter on an internal Docker network and sends alerts to Alertmanager. Caddy terminates TLS and proxies requests to Prometheus and Alertmanager UI paths as required.

Operationally, this model gives you clear fault domains. If Caddy fails, scraping still continues internally. If Alertmanager is unavailable, data ingestion still works while notifications are queued and retried after recovery. Data is stored on persistent volumes under /opt/prometheus, allowing controlled upgrades and straightforward backup jobs.

In real-world operations, monitoring quality depends on disciplined ownership: alerts must map to on-call responders, runbooks must include exact diagnostic commands, and dashboard panels should focus on decision-making rather than vanity metrics. A noisy stack creates alert fatigue; a curated stack improves incident response speed.

This guide assumes you want a pragmatic baseline that can be extended later with remote-write, long-term storage, and service-level objective dashboards. The immediate objective is stable collection and actionable alerts, not maximal complexity on day one.

Prerequisites

  • Ubuntu 22.04/24.04 server with sudo access
  • DNS record for monitor.example.com
  • Docker Engine + Compose plugin
  • Ports 80/443 open
  • SMTP relay or webhook for notifications
  • 2 vCPU, 4GB RAM, 30GB disk minimum
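A quick preflight sweep catches missing pieces before provisioning starts. This sketch only warns rather than failing, so it is safe to run on any host; the 30GB figure mirrors the minimum above:

```shell
# preflight: warn (without failing) about missing prerequisites
for bin in docker curl; do
  command -v "$bin" >/dev/null 2>&1 || echo "MISSING: $bin"
done
docker compose version >/dev/null 2>&1 || echo "MISSING: docker compose plugin"
# roughly 30GB free under /opt (df -Pk reports kilobytes)
avail_kb=$(df -Pk /opt 2>/dev/null | awk 'NR==2 {print $4}')
[ "${avail_kb:-0}" -ge 31457280 ] || echo "WARN: less than 30GB free under /opt"
echo "preflight complete"
```

Run it again after installing Docker to confirm a clean pass before continuing.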

Step-by-step deployment

Create deterministic directories for config, state, and secrets:

sudo mkdir -p /opt/prometheus/{caddy/data,caddy/config,prometheus,alertmanager,data/prometheus,data/alertmanager,secrets}
sudo chown -R $USER:$USER /opt/prometheus
# the prometheus and alertmanager images run as UID 65534 (nobody) and must
# be able to write their data directories
sudo chown -R 65534:65534 /opt/prometheus/data
cd /opt/prometheus


cat > /opt/prometheus/docker-compose.yml <<'YAML'
services:
  prometheus:
    image: prom/prometheus:v2.53.0
    restart: unless-stopped
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --storage.tsdb.retention.time=15d
      - --web.enable-lifecycle
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/rules.yml:/etc/prometheus/rules.yml:ro
      - ./data/prometheus:/prometheus
    networks: [monitoring]

  alertmanager:
    image: prom/alertmanager:v0.27.0
    restart: unless-stopped
    command:
      - --config.file=/etc/alertmanager/alertmanager.yml
      - --storage.path=/alertmanager
      # serve under /alertmanager to match the Caddy route
      - --web.route-prefix=/alertmanager
      - --web.external-url=https://monitor.example.com/alertmanager
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
      - ./data/alertmanager:/alertmanager
    networks: [monitoring]

  node-exporter:
    image: prom/node-exporter:v1.8.1
    restart: unless-stopped
    command: ["--path.rootfs=/host"]
    pid: host
    volumes:
      - /:/host:ro,rslave
    networks: [monitoring]

  caddy:
    image: caddy:2.8
    restart: unless-stopped
    ports: ["80:80","443:443"]
    volumes:
      - ./caddy/Caddyfile:/etc/caddy/Caddyfile:ro
      - ./caddy/data:/data
      - ./caddy/config:/config
    depends_on: [prometheus]
    networks: [monitoring]

networks:
  monitoring:
    driver: bridge
YAML


cat > /opt/prometheus/prometheus/prometheus.yml <<'YAML'
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - /etc/prometheus/rules.yml

alerting:
  alertmanagers:
    - path_prefix: /alertmanager
      static_configs:
        - targets: ["alertmanager:9093"]

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["prometheus:9090"]
  - job_name: node
    static_configs:
      - targets: ["node-exporter:9100"]
YAML

cat > /opt/prometheus/prometheus/rules.yml <<'YAML'
groups:
- name: host-health
  rules:
  - alert: HostHighCPU
    expr: 100 - (avg by(instance)(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High CPU on {{ $labels.instance }}"
      description: "CPU usage above 90% for 5m"
YAML
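The HostHighCPU expression inverts idle time: busy% = 100 − (idle rate × 100). The same arithmetic can be sanity-checked on any Linux host straight from /proc/stat; this is a rough two-sample version of what the rate calculation does, useful for confirming the rule's math during a load test:

```shell
# sample idle and total jiffies twice, 1s apart, then compute busy percent,
# mirroring 100 - (idle rate * 100) from the alert expression
read_stat() { awk '/^cpu /{idle=$5; t=0; for(i=2;i<=NF;i++) t+=$i; print idle, t}' /proc/stat; }
set -- $(read_stat); idle1=$1; total1=$2
sleep 1
set -- $(read_stat); idle2=$1; total2=$2
dt=$(( total2 - total1 )); [ "$dt" -gt 0 ] || dt=1   # guard against a zero delta
busy=$(( 100 * (dt - (idle2 - idle1)) / dt ))
echo "CPU busy: ${busy}%"
```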


cat > /opt/prometheus/alertmanager/alertmanager.yml <<'YAML'
route:
  receiver: default
  group_by: [alertname, instance]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
receivers:
  - name: default
    webhook_configs:
      - url: http://example-webhook.local/alerts
        send_resolved: true
YAML


cat > /opt/prometheus/caddy/Caddyfile <<'CADDY'
monitor.example.com {
  encode gzip
  @alert path /alertmanager*
  handle @alert {
    reverse_proxy alertmanager:9093
  }
  handle {
    reverse_proxy prometheus:9090
  }
}
CADDY

docker compose -f /opt/prometheus/docker-compose.yml up -d


Configuration and secrets handling best practices

Keep alert receiver credentials out of repository-tracked files. Use root-owned files under /opt/prometheus/secrets, with strict file permissions and explicit rotation cadence.
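As a sketch of the pattern (written to a temporary directory here so the example is safe to copy-paste; point it at /opt/prometheus/secrets in production, with root ownership):

```shell
# create a credential file readable only by its owner; the token value is a
# placeholder, not a real secret
SECRETS_DIR="$(mktemp -d)"   # use /opt/prometheus/secrets in production
umask 077                    # files created below default to mode 600
printf '%s\n' 'replace-with-real-token' > "$SECRETS_DIR/webhook_token"
chmod 600 "$SECRETS_DIR/webhook_token"
stat -c '%a' "$SECRETS_DIR/webhook_token"   # prints 600
```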

Define ownership for each alert class. Warning alerts should route to operating teams; critical alerts should escalate to primary on-call with backup coverage. Document silence policies to avoid hidden outages during maintenance windows.

Control retention based on storage and investigation needs. A 15-day local retention is practical for single-node stacks; larger estates often pair local retention with remote long-term storage.

Harden network exposure: only Caddy should bind public ports. Keep Node Exporter private and never expose metrics endpoints directly to the internet. Note that this setup does not configure authentication, and --web.enable-lifecycle exposes the /-/reload endpoint through the proxy; add an authentication layer at Caddy (for example, basic_auth) or restrict access by source IP before going live.

Implement change discipline: validate config syntax before every deploy, and keep rollback paths simple by pinning image tags and preserving previous config revisions.
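A small wrapper makes that validation one command. This sketch uses each image's own checker (promtool and amtool ship inside the Prometheus and Alertmanager images), so no extra tooling is needed on the host:

```shell
# write a validation script next to the compose file and mark it executable
cat > validate.sh <<'SH'
#!/bin/sh
# validate every config before deploying; exits non-zero on the first failure
set -e
cd /opt/prometheus
docker compose config -q
docker compose run --rm --no-deps --entrypoint promtool prometheus \
  check config /etc/prometheus/prometheus.yml
docker compose run --rm --no-deps --entrypoint amtool alertmanager \
  check-config /etc/alertmanager/alertmanager.yml
docker compose run --rm --no-deps caddy \
  caddy validate --config /etc/caddy/Caddyfile
echo "all configs valid"
SH
chmod +x validate.sh
```

Run ./validate.sh before every deploy and wire it into any CI pipeline that touches these files.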

Finally, treat observability as a product: define service-level dashboards with clear operational intent, and retire low-value alerts that do not produce actionable responses.

Verification checklist

Use a repeatable validation sequence after deploys and upgrades:

cd /opt/prometheus
docker compose ps
docker compose logs --tail=100 prometheus
docker compose logs --tail=100 alertmanager
docker compose exec prometheus wget -qO- http://node-exporter:9100/metrics | head
curl -I https://monitor.example.com
curl -s https://monitor.example.com/-/healthy


# trigger synthetic load to test HostHighCPU: stress all cores for 6 minutes,
# long enough to exceed the rule's 5m "for" duration (a --cpus limit on the
# container would cap usage below the 90% threshold on multi-core hosts)
docker run --rm alpine sh -c 'apk add --no-cache stress-ng >/dev/null && stress-ng --cpu 0 --timeout 360s'

# reload config without restart; port 9090 is not published on the host,
# so signal the container instead of calling /-/reload from the host
docker compose kill --signal=SIGHUP prometheus


Common issues and fixes

1) Targets show DOWN

Usually a service-name mismatch or network boundary issue. Verify scrape target names and shared network attachment.
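To see exactly which target is failing and why, query Prometheus's targets API from inside the compose network. This helper (a sketch saved alongside the compose file; the crude tr/grep parse avoids requiring jq on the host) prints each endpoint with its health and last error:

```shell
# write a target-health helper next to the compose file
cat > check-targets.sh <<'SH'
#!/bin/sh
# print scrape URL, health, and last error for every target
cd /opt/prometheus
docker compose exec prometheus wget -qO- http://localhost:9090/api/v1/targets \
  | tr ',' '\n' | grep -E '"(scrapeUrl|health|lastError)"'
SH
chmod +x check-targets.sh
```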

2) HTTPS returns 502

Inspect Caddy logs and Prometheus startup logs. Malformed Prometheus config is a common root cause.

3) No alert notifications

Validate webhook/SMTP reachability from Alertmanager container and review route matching labels.
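Firing a synthetic alert directly into Alertmanager isolates the receiver path from rule evaluation. This sketch uses amtool, which ships in the Alertmanager image; the label names here are arbitrary examples, and the URL needs your route prefix appended if Alertmanager serves under a subpath:

```shell
# write a synthetic-alert helper next to the compose file
cat > test-alert.sh <<'SH'
#!/bin/sh
# fire a synthetic alert straight into Alertmanager to exercise the receiver;
# append your route prefix to the URL if Alertmanager serves under a subpath
cd /opt/prometheus
docker compose exec alertmanager amtool alert add \
  --alertmanager.url=http://localhost:9093 \
  alertname=SyntheticTest severity=warning instance=manual-test
SH
chmod +x test-alert.sh
```

If the webhook receives SyntheticTest but not real alerts, the problem is in route matching, not connectivity.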

4) Unexpected disk growth

Review cardinality, scrape intervals, and retention. High-cardinality labels can inflate storage rapidly.
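Cardinality can be inspected directly from the TSDB status endpoint, which reports head-block stats including the highest-cardinality metric and label names. A sketch, again parsing crudely so the host needs no extra tools:

```shell
# write a cardinality inspector next to the compose file
cat > check-cardinality.sh <<'SH'
#!/bin/sh
# show TSDB head stats, including top metric names by series count
cd /opt/prometheus
docker compose exec prometheus wget -qO- \
  http://localhost:9090/api/v1/status/tsdb | tr ',' '\n' | head -40
SH
chmod +x check-cardinality.sh
```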

5) Config changes not applied

Use explicit reload or restart procedures and validate that mounted config paths match expected files.

6) Alert fatigue from noisy rules

Promote only actionable rules, add inhibition, and tune repeat intervals to reduce duplicate pages.

FAQ

Can I split Prometheus and Alertmanager onto different hosts?

Yes, with private networking and strict ACLs.

Is Caddy mandatory?

No. Nginx or Traefik can be used if you preserve equivalent security and routing controls.

How frequently should I back up TSDB and alert state?

At least daily for small environments; more often for strict recovery objectives.

Should I expose Node Exporter publicly?

No. Keep exporter endpoints private and scrape through controlled internal networks.

How do I perform low-risk upgrades?

Pin versions, test in staging, back up state, then roll forward with post-upgrade checks.

What is the first scaling step when metrics volume grows?

Tune cardinality and retention first, then evaluate remote-write and long-term storage backends.



Before closing deployment, add explicit backup and upgrade commands to your runbook so on-call engineers can execute recovery without guesswork.

mkdir -p /opt/prometheus/backups
# note: archiving a live TSDB can capture an inconsistent state; for strict
# consistency, stop the stack first or snapshot via the admin API
# (requires --web.enable-admin-api)
tar -czf /opt/prometheus/backups/prometheus-$(date +%F).tgz /opt/prometheus/data/prometheus
tar -czf /opt/prometheus/backups/alertmanager-$(date +%F).tgz /opt/prometheus/data/alertmanager
cd /opt/prometheus
docker compose pull
docker compose up -d --remove-orphans
docker compose ps
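Daily archives accumulate quickly, so pair the backup step with rotation. This sketch keeps 14 days of archives (adjust the window to your recovery objectives) and is intended to run from cron or a systemd timer:

```shell
# write a backup-rotation helper next to the compose file
cat > prune-backups.sh <<'SH'
#!/bin/sh
# delete backup archives older than 14 days; schedule via cron or a systemd timer
find /opt/prometheus/backups -name '*.tgz' -mtime +14 -delete
SH
chmod +x prune-backups.sh
```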


# quick rollback pattern (assumes the config files are tracked in git)
cd /opt/prometheus
git checkout -- docker-compose.yml prometheus/prometheus.yml prometheus/rules.yml alertmanager/alertmanager.yml caddy/Caddyfile
docker compose up -d --remove-orphans
docker compose logs --tail=100

