Production Guide: Deploy Netdata with Docker Compose + Caddy on Ubuntu

A production-focused deployment pattern for Netdata with secure proxying, sane retention, alerting, and operational runbooks.

Most teams do not fail because they lack metrics; they fail because their monitoring stack is hard to operate under pressure. During an incident, you need fast host visibility, low-friction dashboards, predictable alert behavior, and a deployment model any on-call engineer can recover at 2 a.m. Netdata is strong for that use case because it gives near real-time telemetry with minimal setup, while Docker Compose keeps operational overhead low for single-node and small-cluster environments.

This guide shows how to deploy Netdata on Ubuntu with Docker Compose + Caddy in a production-oriented way. We will focus on practical concerns: secure exposure, persistent storage, safe secret handling, health checks, backup/restore discipline, and verification steps that reduce guesswork. The target outcome is a monitoring deployment that is reliable enough for real services, not just a quick lab demo.

Architecture and flow overview

The deployment uses one Netdata container for collection and dashboard/API serving, with Caddy as the public HTTPS entry point. Netdata listens internally on port 19999, while Caddy terminates TLS and applies response headers before forwarding traffic. Persistent volumes keep Netdata state across container restarts and host reboots.

  • Edge: Caddy handles TLS, redirects, and HTTP security headers.
  • Monitoring core: Netdata collects host metrics and serves the dashboard/API.
  • Persistence: Dedicated config/lib/cache paths under /srv/netdata.
  • Security posture: public access only through Caddy; direct 19999 blocked.
  • Operations: health checks, smoke tests, and scheduled backup script.

For larger environments, you can later extend this baseline by adding child nodes, streaming, and external auth. Start simple, validate operational routines, then scale the architecture with confidence.

Prerequisites

  • Ubuntu 22.04 or 24.04 with sudo access
  • DNS record such as metrics.example.com pointing to your host
  • Docker Engine 24+ and Docker Compose plugin
  • Caddy available either as a host service or as a Docker-based reverse proxy
  • Firewall policy permitting 80/443 but denying public access to 19999

Before proceeding, patch the host, confirm time sync, and ensure only SSH key-based administration is allowed. Monitoring infrastructure is a privileged visibility surface and deserves production hardening from day one.

Step-by-step deployment

1) Install Docker and create persistent directories

Use a predictable filesystem layout so operational handoffs remain simple and audit-friendly.

sudo apt-get update
sudo apt-get install -y ca-certificates curl gnupg lsb-release
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo $VERSION_CODENAME) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo mkdir -p /opt/netdata /srv/netdata/config /srv/netdata/lib /srv/netdata/cache /srv/netdata/backups
sudo chown -R $USER:$USER /opt/netdata /srv/netdata

2) Define environment values and protect secret files

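Include every value explicitly, even those that are optional in the first deployment, so future rotations and automation changes stay deterministic. If you are not using Netdata Cloud, leave NETDATA_CLAIM_TOKEN empty instead of keeping the placeholder.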

cd /opt/netdata
cat > .env << 'EOF'
NETDATA_CLAIM_TOKEN=replace_with_claim_token_if_using_netdata_cloud
NETDATA_CLAIM_ROOMS=
NETDATA_CLAIM_URL=https://app.netdata.cloud
NETDATA_MEMORY_MODE=dbengine
TZ=UTC
NETDATA_HOSTNAME=monitoring-prod-1
EOF
chmod 600 .env

Secret-handling policy: never commit .env to source control, restrict file access to deployment operators, and rotate credentials whenever team membership or CI access changes.
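The file-permission part of this policy can be checked automatically before each deploy. The following is a minimal sketch, assuming GNU coreutils `stat` as shipped with Ubuntu; `check_env_file` is a hypothetical helper name, not part of Docker or Netdata.

```shell
#!/usr/bin/env bash
# Sketch: refuse to deploy when the env file is missing or readable by others.
# check_env_file is a hypothetical helper introduced for illustration.
check_env_file() (
  file=$1
  [ -f "$file" ] || { echo "missing: $file" >&2; exit 1; }
  # stat -c '%a' prints the octal permission bits (GNU coreutils form).
  mode=$(stat -c '%a' "$file")
  if [ "$mode" != "600" ]; then
    echo "insecure mode $mode on $file (expected 600)" >&2
    exit 1
  fi
  echo "ok: $file"
)
```

A deploy wrapper could call `check_env_file /opt/netdata/.env` and abort on a non-zero exit status before running `docker compose up -d`.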

3) Create the Netdata Compose stack

This stack binds host telemetry sources read-only, persists Netdata state, and adds labels for Caddy-based routing where applicable.

cd /opt/netdata
cat > compose.yml << 'EOF'
services:
  netdata:
    image: netdata/netdata:stable
    container_name: netdata
    hostname: ${NETDATA_HOSTNAME}
    env_file: .env
    cap_add: [SYS_PTRACE]
    security_opt: [apparmor=unconfined]
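    # Loopback only: lets a host-level Caddy and local checks reach Netdata
    # without exposing 19999 publicly.
    ports:
      - "127.0.0.1:19999:19999"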
    volumes:
      - /srv/netdata/config:/etc/netdata
      - /srv/netdata/lib:/var/lib/netdata
      - /srv/netdata/cache:/var/cache/netdata
      - /etc/passwd:/host/etc/passwd:ro
      - /etc/group:/host/etc/group:ro
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /etc/os-release:/host/etc/os-release:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
    labels:
      - caddy=metrics.example.com
      - caddy.reverse_proxy={{upstreams 19999}}
      - caddy.header.Strict-Transport-Security=max-age=31536000
      - caddy.header.X-Content-Type-Options=nosniff
      - caddy.header.X-Frame-Options=SAMEORIGIN
    restart: unless-stopped
    healthcheck:
      # curl is used here because the image's built-in health script relies on it.
      test: ["CMD", "curl", "-fsS", "-o", "/dev/null", "http://127.0.0.1:19999/api/v1/info"]
      interval: 30s
      timeout: 10s
      retries: 5
EOF

If your Caddy setup is file-based instead of label-driven, the Caddy labels above are unused: keep Netdata reachable only on the loopback interface and route to it with local reverse proxy rules. Do not expose 19999 directly to the internet.

4) Configure Caddy edge routing and access controls

Use either Docker label automation or a traditional Caddyfile. In both patterns, enforce HTTPS and add an authentication layer (for example basic auth or SSO) before exposing dashboards publicly.

docker network create edge || true
sudo tee /etc/caddy/Caddyfile > /dev/null << 'EOF'
metrics.example.com {
    reverse_proxy 127.0.0.1:19999
    encode gzip
    header {
      Strict-Transport-Security "max-age=31536000"
      X-Content-Type-Options "nosniff"
      X-Frame-Options "SAMEORIGIN"
    }
    basicauth {
      # Generate your own hash with: caddy hash-password
      ops REPLACE_WITH_OUTPUT_OF_caddy_hash-password
    }
}
EOF
sudo systemctl reload caddy

For enterprise environments, prefer SSO in front of Netdata via your identity provider and retain basic auth only as emergency break-glass access.

5) Launch and validate service health

Pull images, deploy, then verify container state and logs before handing over to monitoring consumers.

cd /opt/netdata
docker compose pull
docker compose up -d
docker compose ps
docker logs --tail=100 netdata

Healthy startup signals include plugins initializing without errors, the dbengine storage mode loading cleanly, and successful API responses from /api/v1/info.
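The API may take a few seconds to answer after the container starts, so a single probe can give a false negative. The sketch below is one way to poll until a check passes; `retry` is a hypothetical helper, and the endpoint in the usage comment assumes Netdata is reachable on 127.0.0.1:19999 as in the verification checks of this guide.

```shell
#!/usr/bin/env bash
# retry TRIES DELAY CMD...: run CMD until it succeeds or TRIES attempts are used.
# Hypothetical helper for illustration; runs in a subshell so variables stay local.
retry() (
  tries=$1
  delay=$2
  shift 2
  n=1
  while :; do
    "$@" && exit 0                    # success: stop immediately
    [ "$n" -ge "$tries" ] && exit 1   # attempts exhausted: report failure
    n=$((n + 1))
    sleep "$delay"
  done
)

# Usage (assumed endpoint):
#   retry 10 3 curl -fsS -o /dev/null http://127.0.0.1:19999/api/v1/info
```

A non-zero exit after all attempts is the signal to read `docker logs --tail=100 netdata` before handing the service over.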

Configuration and secrets management

Production monitoring is not only about collection; it is about controlled change. Treat Netdata and proxy configuration as managed artifacts with explicit versioning and rollback notes. Keep a short runbook that captures:

  • Who can edit Netdata config and where approvals are recorded
  • How Caddy auth credentials are rotated and tested
  • What data retention window is expected for incident forensics
  • How to restore from backup tarballs during host recovery

When possible, move sensitive values from .env into a secret manager and template them into runtime at deploy time. Even small environments benefit from reducing static plaintext secrets.

Verification checklist

Run these checks immediately after deployment and after every significant configuration change:

curl -sS http://127.0.0.1:19999/api/v1/info | jq '.version, .hostname'
curl -sS http://127.0.0.1:19999/api/v1/charts | jq '.charts | keys | length'
curl -I https://metrics.example.com
time curl -sS https://metrics.example.com/api/v1/info > /dev/null

  • Dashboard loads over HTTPS with valid certificate chain
  • Host CPU, memory, disk, and network charts update in near real time
  • No public listener on port 19999
  • Container restarts cleanly without data loss
  • On-call team can execute checks without tribal knowledge
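The "no public listener on port 19999" item can be checked locally by filtering the socket list. This is a sketch under the assumption that the input has the column layout of `ss -Htln` (local address in the fourth field); `public_listeners` is a hypothetical helper, and a scan from a remote host remains the more authoritative test.

```shell
#!/usr/bin/env bash
# public_listeners PORT: read `ss -Htln` output on stdin and print every
# listener on PORT that is not bound to a loopback address.
# Hypothetical helper; prints nothing when the port is loopback-only or closed.
public_listeners() {
  awk -v port="$1" '
    {
      addr = $4                               # e.g. 0.0.0.0:19999 or [::]:19999
      n = split(addr, parts, ":")
      if (parts[n] != port) next              # different port: ignore
      host = substr(addr, 1, length(addr) - length(port) - 1)
      if (host ~ /^127\./ || host == "[::1]") next   # loopback: acceptable
      print addr
    }'
}

# Usage:
#   ss -Htln | public_listeners 19999
```

Any output line is a listener reachable beyond loopback and should be treated as a failed check.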

Backup and recovery operations

Backups are useful only if they are tested. Add a lightweight backup script, retain a sensible number of archives, and perform quarterly restore drills to a staging host.

cat > /opt/netdata/backup-netdata.sh << 'EOF'
#!/usr/bin/env bash
set -euo pipefail
TS=$(date +%F-%H%M%S)
DEST=/srv/netdata/backups/$TS
mkdir -p "$DEST"
cp -a /srv/netdata/config "$DEST/config"
cp -a /srv/netdata/lib "$DEST/lib"
tar -C /srv/netdata/backups -czf /srv/netdata/backups/netdata-$TS.tgz "$TS"
rm -rf "$DEST"
echo "Backup created: /srv/netdata/backups/netdata-$TS.tgz"
EOF
chmod +x /opt/netdata/backup-netdata.sh
/opt/netdata/backup-netdata.sh
ls -1t /srv/netdata/backups/netdata-*.tgz | tail -n +15 | xargs -r rm -f

During restore testing, validate not only that files unpack correctly, but that dashboard continuity and alert behavior match expectations. Capture timing and bottlenecks in your runbook.
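A restore drill can start with an automated sanity check of the archive itself. The sketch below assumes the layout produced by the backup script in this guide (one timestamp directory containing `config` and `lib`); `verify_backup` is a hypothetical helper, and it checks structure only, not dashboard or alert behavior.

```shell
#!/usr/bin/env bash
# verify_backup ARCHIVE: confirm the tarball is readable and contains the
# config/ and lib/ directories under a single top-level folder.
# Hypothetical helper for illustration.
verify_backup() (
  archive=$1
  [ -f "$archive" ] || { echo "missing archive: $archive" >&2; exit 1; }
  # tar -tzf lists members and fails on a truncated or corrupt archive.
  listing=$(tar -tzf "$archive") || { echo "unreadable archive: $archive" >&2; exit 1; }
  for dir in config lib; do
    if ! printf '%s\n' "$listing" | grep -Eq "^[^/]+/${dir}(/|\$)"; then
      echo "archive lacks ${dir}/: $archive" >&2
      exit 1
    fi
  done
  echo "ok: $archive"
)
```

After this passes, unpack into a scratch directory on the staging host (for example with `tar -xzf ARCHIVE -C /tmp/restore-test`) and continue with the dashboard and alert checks described above.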

Common issues and practical fixes

Issue: Dashboard loads slowly through proxy

Check CPU saturation on the host, verify that compression settings are not oversized, and confirm upstream keepalive behavior in Caddy. In constrained VMs, disable collectors you do not need to reduce overhead.

Issue: Missing host-level metrics in containerized deployment

Review bind mounts for /proc, /sys, and host identity files. Ensure mounts are read-only and present exactly as expected by Netdata collectors.
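The mount review can be scripted by comparing the container's actual mounts with the list from the Compose file. This is a sketch: `check_mounts` is a hypothetical helper that expects one "destination rw-flag" pair per line, which the `docker inspect` template in the usage comment should produce for a container named `netdata`.

```shell
#!/usr/bin/env bash
# check_mounts: read "DESTINATION RW" lines on stdin (RW is true or false) and
# report host telemetry mounts that are missing or writable.
# Hypothetical helper; the expected paths mirror the Compose file in this guide.
check_mounts() {
  awk '
    BEGIN {
      split("/host/proc /host/sys /host/etc/passwd /host/etc/group /host/etc/os-release", want, " ")
    }
    { rw[$1] = $2 }
    END {
      bad = 0
      for (i = 1; i in want; i++) {
        d = want[i]
        if (!(d in rw))            { print "missing: " d; bad = 1 }
        else if (rw[d] != "false") { print "not read-only: " d; bad = 1 }
      }
      exit bad
    }'
}

# Usage:
#   docker inspect --format '{{range .Mounts}}{{println .Destination .RW}}{{end}}' netdata | check_mounts
```

An empty result with exit status 0 means every expected host path is mounted read-only.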

Issue: TLS works, but access control is weak

Add authentication in Caddy immediately. For long-term posture, integrate SSO and group-based authorization rather than shared static credentials.

Issue: Data disappears after restart

Confirm volume paths map to persistent disk and are not accidentally pointed to ephemeral container storage. Verify ownership/permissions under /srv/netdata.

Issue: Security scan reports exposed local port

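Apply a host firewall deny rule on 19999 and re-check with ss locally and nmap from a remote vantage point. Ports published by Docker are handled by Docker's own iptables rules and can bypass ufw, so if the port is published in Compose, bind it to 127.0.0.1 rather than relying on the firewall alone.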

sudo ufw allow 22/tcp
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
sudo ufw deny 19999/tcp
sudo ufw --force enable
sudo ss -tulpen | grep 19999 || true

FAQ

1) Is Netdata suitable for production, or only for quick diagnostics?

It is suitable for production when deployed with proper access controls, persistence, and operational procedures. The biggest risk is not Netdata itself, but weak proxy/auth and no runbook discipline.

2) Should I expose Netdata directly on 19999?

No. Keep 19999 internal and publish through Caddy on 443 with authentication and TLS. This reduces attack surface and gives a single ingress control point.

3) Do I need Netdata Cloud for this architecture?

No, this guide works without Netdata Cloud. Cloud features can be added later for fleet visibility. Start with a stable local baseline and add dependencies intentionally.

4) How much retention should I keep?

Retention depends on disk budget and investigation needs. Many teams start with a modest local window and export long-term metrics to another system for extended analytics.
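The relationship between disk budget and retention window can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope calculation only: `estimate_retention_days` is a hypothetical helper, and the bytes-per-sample figure is an assumption you should replace with a value measured on your own host, since real on-disk usage depends on compression and storage tiers.

```shell
#!/usr/bin/env bash
# estimate_retention_days DISK_MIB METRICS BYTES_PER_SAMPLE [UPDATE_EVERY_SECONDS]
# Rough estimate: days = disk budget / (metrics * bytes per sample * samples per day).
# BYTES_PER_SAMPLE is an assumed input, not a published Netdata figure.
estimate_retention_days() (
  disk_mib=$1
  metrics=$2
  bytes_per_sample=$3
  update_every=${4:-1}
  bytes_per_day=$(( metrics * bytes_per_sample * 86400 / update_every ))
  echo $(( disk_mib * 1024 * 1024 / bytes_per_day ))
)

# Example: 1 GiB budget, 2000 metrics, an assumed 1 byte per sample, 1-second collection
estimate_retention_days 1024 2000 1   # prints 6
```

Doubling the collection interval roughly doubles the window, which is one of the cheaper levers when disk is tight.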

5) What is the safest way to rotate credentials?

Rotate in stages: create new credentials, update proxy/secret source, reload services, validate access, then revoke old values. Always perform rotation during a staffed window.

6) Can I run this with rootless containers?

It is possible, but host telemetry access and capability requirements can differ. Validate collector coverage and performance in staging before changing the runtime model in production.
