Most teams do not fail because they lack metrics; they fail because their monitoring stack is hard to operate under pressure. During an incident, you need fast host visibility, low-friction dashboards, predictable alert behavior, and a deployment model any on-call engineer can recover at 2 a.m. Netdata is strong for that use case because it gives near real-time telemetry with minimal setup, while Docker Compose keeps operational overhead low for single-node and small-cluster environments.
This guide shows how to deploy Netdata on Ubuntu with Docker Compose + Caddy in a production-oriented way. We will focus on practical concerns: secure exposure, persistent storage, safe secret handling, health checks, backup/restore discipline, and verification steps that reduce guesswork. The target outcome is a monitoring deployment that is reliable enough for real services, not just a quick lab demo.
Architecture and flow overview
The deployment uses one Netdata container for collection and dashboard/API serving, with Caddy as the public HTTPS entry point. Netdata listens internally on port 19999, while Caddy terminates TLS and applies response headers before forwarding traffic. Persistent volumes keep Netdata state across container restarts and host reboots.
- Edge: Caddy handles TLS, redirects, and HTTP security headers.
- Monitoring core: Netdata scrapes host metrics and exposes dashboard/API.
- Persistence: Dedicated config/lib/cache paths under /srv/netdata.
- Security posture: public access only through Caddy; direct access to port 19999 blocked.
- Operations: health checks, smoke tests, and scheduled backup script.
For larger environments, you can later extend this baseline by adding child nodes, streaming, and external auth. Start simple, validate operational routines, then scale architecture with confidence.
Prerequisites
- Ubuntu 22.04 or 24.04 with sudo access
- DNS record such as metrics.example.com pointing to your host
- Docker Engine 24+ and the Docker Compose plugin
- Caddy available either as host service or Docker-based reverse proxy
- Firewall policy permitting 80/443 but denying public access to 19999
Before proceeding, patch the host, confirm time sync, and ensure only SSH key-based administration is allowed. Monitoring infrastructure is a privileged visibility surface and deserves production hardening from day one.
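The pre-flight items above can be spot-checked from the shell. This is a sketch, not a hardening audit: it assumes systemd time synchronization and OpenSSH, so treat the specific commands as assumptions and adapt them to your base image.

```shell
# Pre-flight sketch: each probe is guarded because tooling differs per host.
preflight() {
  problems=0
  # Time sync (assumes systemd-timesyncd or chrony reporting via timedatectl)
  if ! timedatectl show -p NTPSynchronized --value 2>/dev/null | grep -qx yes; then
    echo "WARN: clock is not NTP-synchronized"
    problems=$((problems + 1))
  fi
  # SSH policy (assumes OpenSSH; sudo -n avoids hanging on a password prompt)
  if sudo -n sshd -T 2>/dev/null | grep -qi '^passwordauthentication yes'; then
    echo "WARN: SSH password authentication is still enabled"
    problems=$((problems + 1))
  fi
  echo "$problems potential issue(s) found"
}
preflight
```

Run it once before deployment and again after any base-image change; a non-zero count is a prompt to investigate, not an automatic failure.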
Step-by-step deployment
1) Install Docker and create persistent directories
Use a predictable filesystem layout so operational handoffs remain simple and audit-friendly.
sudo apt-get update
sudo apt-get install -y ca-certificates curl gnupg lsb-release
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo $VERSION_CODENAME) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo mkdir -p /opt/netdata /srv/netdata/config /srv/netdata/lib /srv/netdata/cache /srv/netdata/backups
sudo chown -R $USER:$USER /opt/netdata /srv/netdata
2) Define environment values and protect secret files
Even when a value is optional for the first deployment, set it explicitly so future rotations and automation changes stay deterministic.
cd /opt/netdata
cat > .env << 'EOF'
NETDATA_CLAIM_TOKEN=replace_with_claim_token_if_using_netdata_cloud
NETDATA_CLAIM_ROOMS=
NETDATA_CLAIM_URL=https://app.netdata.cloud
NETDATA_MEMORY_MODE=dbengine
TZ=UTC
NETDATA_HOSTNAME=monitoring-prod-1
EOF
chmod 600 .env
Secret-handling policy: never commit .env to source control, restrict file access to deployment operators, and rotate credentials whenever team membership or CI access changes.
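That policy can be enforced cheaply in deployment scripts. The sketch below only assumes GNU `stat` on Linux; the guard function name is ours, not part of any tool.

```shell
# Refuse to proceed when the env file is readable beyond its owner.
env_perms_ok() {
  mode=$(stat -c '%a' "$1" 2>/dev/null) || return 1   # missing file also fails
  [ "$mode" = "600" ] || [ "$mode" = "400" ]
}

env_perms_ok /opt/netdata/.env || echo "tighten permissions: chmod 600 /opt/netdata/.env"
```

Call it at the top of any script that sources or templates .env, so a loosened permission bit surfaces before secrets are read.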
3) Create the Netdata Compose stack
This stack binds host telemetry sources read-only, persists Netdata state, and adds labels for Caddy-based routing where applicable.
cd /opt/netdata
cat > compose.yml << 'EOF'
services:
  netdata:
    image: netdata/netdata:stable
    container_name: netdata
    hostname: ${NETDATA_HOSTNAME}
    env_file: .env
    cap_add: [SYS_PTRACE]
    security_opt:
      - apparmor:unconfined
    # No published ports: traffic enters via Caddy only. If Caddy runs on the
    # host instead of in Docker, publish on loopback: - "127.0.0.1:19999:19999"
    ports: []
    volumes:
      - /srv/netdata/config:/etc/netdata
      - /srv/netdata/lib:/var/lib/netdata
      - /srv/netdata/cache:/var/cache/netdata
      - /etc/passwd:/host/etc/passwd:ro
      - /etc/group:/host/etc/group:ro
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /etc/os-release:/host/etc/os-release:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
    labels:
      - caddy=metrics.example.com
      - caddy.reverse_proxy={{upstreams 19999}}
      - caddy.header.Strict-Transport-Security=max-age=31536000
      - caddy.header.X-Content-Type-Options=nosniff
      - caddy.header.X-Frame-Options=SAMEORIGIN
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "wget", "-q", "-O", "-", "http://127.0.0.1:19999/api/v1/info"]
      interval: 30s
      timeout: 10s
      retries: 5
EOF
If your Caddy setup is file-based instead of label-driven, keep Netdata internal and route via local reverse proxy rules. Do not expose 19999 directly to the internet.
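For the file-based pattern, one low-risk way to make Netdata reachable by a host-level proxy without touching the main stack file is a Compose override that publishes the port on loopback only. The filename and mapping below are conventional Compose practice, not something mandated by Netdata:

```yaml
# /opt/netdata/compose.override.yml — merged automatically by `docker compose up`
services:
  netdata:
    ports:
      - "127.0.0.1:19999:19999"   # reachable by host Caddy only, never publicly
```

Because overrides merge at runtime, removing the file later returns you to the label-driven, no-published-ports posture without editing compose.yml.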
4) Configure Caddy edge routing and access controls
Use either Docker label automation or a traditional Caddyfile. In both patterns, enforce HTTPS and add an authentication layer (for example basic auth or SSO) before exposing dashboards publicly.
# Label-driven setups (e.g. caddy-docker-proxy) need a shared Docker network:
docker network create edge || true

# File-based alternative: a host-level Caddy with an explicit Caddyfile.
sudo tee /etc/caddy/Caddyfile > /dev/null << 'EOF'
metrics.example.com {
    reverse_proxy 127.0.0.1:19999
    encode gzip
    header {
        Strict-Transport-Security "max-age=31536000"
        X-Content-Type-Options "nosniff"
        X-Frame-Options "SAMEORIGIN"
    }
    # Replace the hash with your own: caddy hash-password --plaintext '...'
    basicauth {
        ops JDJhJDE0JHlLd3BvWE56aW5FSjQxQ3R2R2xMVS5xVjVWNW1iRG9LQmxkQXhKZlB3dW9DTHNQOGRaRGo2
    }
}
EOF
sudo systemctl reload caddy
For enterprise environments, prefer SSO in front of Netdata via your identity provider and retain basic auth only as emergency break-glass access.
5) Launch and validate service health
Pull images, deploy, then verify container state and logs before handing over to monitoring consumers.
cd /opt/netdata
docker compose pull
docker compose up -d
docker compose ps
docker logs --tail=100 netdata
Healthy startup signals include successful plugin initialization, stable memory mode, and API responses from /api/v1/info.
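Those signals can be turned into a scriptable gate. The sketch below assumes jq is installed and that the `version` and `uid` fields appear in the v1 `/api/v1/info` payload; adjust the fields if your Netdata version differs.

```shell
# Sketch of a post-deploy gate on the info endpoint.
info_ok() {
  # Reads JSON from stdin; succeeds only when both fields are present and non-null.
  jq -e '.version and .uid' > /dev/null 2>&1
}

if curl -sS --max-time 5 http://127.0.0.1:19999/api/v1/info | info_ok; then
  echo "netdata API healthy"
else
  echo "netdata API not ready yet"
fi
```

Wiring this into deployment automation (or a CI smoke stage) means a half-started container fails loudly instead of being handed to on-call as "done".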
Configuration and secrets management
Production monitoring is not only about collection; it is about controlled change. Treat Netdata and proxy configuration as managed artifacts with explicit versioning and rollback notes. Keep a short runbook that captures:
- Who can edit Netdata config and where approvals are recorded
- How Caddy auth credentials are rotated and tested
- What data retention window is expected for incident forensics
- How to restore from backup tarballs during host recovery
When possible, move sensitive values from .env into a secret manager and template them into runtime at deploy time. Even small environments benefit from reducing static plaintext secrets.
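One way to template secrets at deploy time is sketched below. `fetch_secret` is a hypothetical stand-in for your real secret-manager CLI (for example `vault kv get`, `sops -d`, or `op read`); the subshell body keeps the restrictive umask from leaking into the caller.

```shell
# Sketch: build .env at deploy time instead of keeping long-lived plaintext.
# fetch_secret is a placeholder, not a real command: wire it to your manager.
render_env() (
  out=$1
  umask 177                      # new files land as 0600
  {
    echo "NETDATA_CLAIM_TOKEN=$(fetch_secret netdata/claim-token)"
    echo "NETDATA_CLAIM_URL=https://app.netdata.cloud"
    echo "TZ=UTC"
  } > "$out"
)
```

Run it just before `docker compose up`, and the plaintext secret exists on disk only for the lifetime of the deployment, already locked to owner-only permissions.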
Verification checklist
Run these checks immediately after deployment and after every significant configuration change:
curl -sS http://127.0.0.1:19999/api/v1/info | jq '.version, .hostname'
curl -sS http://127.0.0.1:19999/api/v1/charts | jq '.charts | keys | length'
curl -I https://metrics.example.com
time curl -sS https://metrics.example.com/api/v1/info > /dev/null
- Dashboard loads over HTTPS with valid certificate chain
- Host CPU, memory, disk, and network charts update in near real time
- No public listener on port 19999
- Container restarts cleanly without data loss
- On-call team can execute checks without tribal knowledge
Backup and recovery operations
Backups are useful only if they are tested. Add a lightweight backup script, retain a sensible number of archives, and perform quarterly restore drills to a staging host.
cat > /opt/netdata/backup-netdata.sh << 'EOF'
#!/usr/bin/env bash
set -euo pipefail
TS=$(date +%F-%H%M%S)
DEST=/srv/netdata/backups/$TS
mkdir -p "$DEST"
cp -a /srv/netdata/config "$DEST/config"
cp -a /srv/netdata/lib "$DEST/lib"
tar -C /srv/netdata/backups -czf /srv/netdata/backups/netdata-$TS.tgz "$TS"
rm -rf "$DEST"
echo "Backup created: /srv/netdata/backups/netdata-$TS.tgz"
EOF
chmod +x /opt/netdata/backup-netdata.sh
/opt/netdata/backup-netdata.sh
ls -1t /srv/netdata/backups/netdata-*.tgz | tail -n +15 | xargs -r rm -f
During restore testing, validate not only that files unpack correctly, but that dashboard continuity and alert behavior match expectations. Capture timing and bottlenecks in your runbook.
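A restore helper keeps drills repeatable. This is a sketch mirroring the layout the backup script above produces; the function name is ours, and the optional target-root argument exists so drills can run against a staging directory instead of /srv/netdata.

```shell
# Hypothetical restore helper for the backup layout created above.
# Usage: restore_netdata /srv/netdata/backups/netdata-<TS>.tgz [target-root]
restore_netdata() {
  local archive=$1 target=${2:-/srv/netdata} work ts_dir d
  work=$(mktemp -d)
  tar -C "$work" -xzf "$archive" || return 1          # unpack <TS>/config, <TS>/lib
  ts_dir=$(find "$work" -mindepth 1 -maxdepth 1 -type d | head -n 1)
  docker compose -f /opt/netdata/compose.yml stop netdata 2>/dev/null || true
  for d in config lib; do
    rm -rf "${target:?}/$d"                           # :? guards against empty target
    cp -a "$ts_dir/$d" "$target/$d"
  done
  docker compose -f /opt/netdata/compose.yml start netdata 2>/dev/null || true
  rm -rf "$work"
  echo "Restored $archive into $target"
}
```

During a quarterly drill, point the second argument at a scratch path on the staging host, then diff the result against a live config tree before attempting an in-place restore.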
Common issues and practical fixes
Issue: Dashboard loads slowly through proxy
Check for CPU saturation on the host, verify compression settings are not overly aggressive, and confirm upstream keepalive behavior in Caddy. In constrained VMs, disable or retune heavy collectors to reduce overhead.
Issue: Missing host-level metrics in containerized deployment
Review bind mounts for /proc, /sys, and host identity files. Ensure mounts are read-only and present exactly as expected by Netdata collectors.
Issue: TLS works, but access control is weak
Add authentication in Caddy immediately. For long-term posture, integrate SSO and group-based authorization rather than shared static credentials.
Issue: Data disappears after restart
Confirm volume paths map to persistent disk and are not accidentally pointed to ephemeral container storage. Verify ownership/permissions under /srv/netdata.
Issue: Security scan reports exposed local port
Apply host firewall deny on 19999 and re-check with ss/nmap from a remote vantage point.
sudo ufw allow 22/tcp
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
sudo ufw deny 19999/tcp
sudo ufw --force enable
sudo ss -tulpen | grep 19999 || true
FAQ
1) Is Netdata suitable for production, or only for quick diagnostics?
It is suitable for production when deployed with proper access controls, persistence, and operational procedures. The biggest risk is not Netdata itself, but weak proxy/auth and no runbook discipline.
2) Should I expose Netdata directly on 19999?
No. Keep 19999 internal and publish through Caddy on 443 with authentication and TLS. This reduces attack surface and gives a single ingress control point.
3) Do I need Netdata Cloud for this architecture?
No, this guide works without Netdata Cloud. Cloud features can be added later for fleet visibility. Start with a stable local baseline and add dependencies intentionally.
4) How much retention should I keep?
Retention depends on disk budget and investigation needs. Many teams start with a modest local window and export long-term metrics to another system for extended analytics.
5) What is the safest way to rotate credentials?
Rotate in stages: create new credentials, update proxy/secret source, reload services, validate access, then revoke old values. Always perform rotation during a staffed window.
6) Can I run this with rootless containers?
Possible, but host telemetry access and capability requirements can differ. Validate collector coverage and performance in staging before changing runtime model in production.
Related guides
If you are building a broader platform stack, these guides are useful next steps:
- Production Guide: Deploy ZITADEL with Docker Compose + Traefik + PostgreSQL
- Production Guide: Deploy Vaultwarden with Docker Compose + NGINX + PostgreSQL
- Production Guide: Deploy RabbitMQ with Docker Compose + Caddy
Talk to us
If you need help designing a resilient monitoring stack, hardening reverse-proxy exposure, or building alert runbooks for production operations, our team can help.