When teams move from ad-hoc server checks to reliable operations, they usually need three things quickly: a trustworthy metrics pipeline, clear dashboards for non-SRE stakeholders, and a deployment model that can be audited and recovered under pressure. This guide delivers exactly that by deploying Prometheus for scraping and storage, Grafana for visualization, and Nginx as a hardened public edge. The stack is intentionally opinionated for small and mid-size production environments where you need practical reliability without committing to full Kubernetes complexity on day one.
The workflow in this guide mirrors real operations: isolate services on a private Docker network, expose only a TLS-terminated reverse proxy, keep credentials outside image layers, and build verification checks that catch silent misconfiguration before users do. You will deploy, validate, and harden the environment, then add backup and restore procedures so the monitoring platform remains dependable during incidents and upgrades.
Architecture and flow overview
Request flow is straightforward: administrators access Grafana over HTTPS through Nginx, while Prometheus and exporters remain private on an internal bridge network. Prometheus scrapes configured targets on fixed intervals, stores time-series data in its local volume, and exposes query endpoints to Grafana through the internal network. This keeps public attack surface minimal while preserving low-latency data access between services.
For resilience, persistent volumes are used for both Prometheus TSDB and Grafana configuration/dashboards. If the host reboots, containers restart automatically and state remains intact. Operationally, this design separates concerns cleanly: Nginx controls certificates and edge policy, Prometheus handles collection and retention, and Grafana handles presentation, access control, and alerting integrations.
# Request flow
# Browser -> Nginx :443 -> Grafana :3000
# Prometheus :9090 and exporters stay on internal network
# Grafana queries Prometheus over docker network only
Prerequisites
- Ubuntu 22.04+ host (minimum 2 vCPU, 4 GB RAM; recommended 4 vCPU, 8 GB RAM).
- A DNS name pointed to your server (example: metrics.example.com).
- Docker Engine and Docker Compose plugin installed.
- Ports 80/443 reachable from the internet for TLS validation.
- A non-root sudo user for day-to-day operations.
- SMTP or chat webhook destination for alert notifications.
Step-by-step deployment
1) Prepare directories and strict permissions
Create an isolated project path and lock down secrets files so API keys and admin credentials are never world-readable.
sudo mkdir -p /opt/monitoring/{prometheus,grafana,nginx}
sudo chown -R "$USER": /opt/monitoring
cd /opt/monitoring
install -m 700 -d secrets
touch .env
chmod 600 .env
2) Generate secrets and environment variables
Use long random secrets and keep them in .env. Avoid hardcoding credentials in compose files or committing them to source control.
GRAFANA_ADMIN_PASSWORD=$(openssl rand -base64 36 | tr -d "\n")
cat > /opt/monitoring/.env <<EOF
GF_SECURITY_ADMIN_USER=admin
GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD}
GF_SERVER_ROOT_URL=https://metrics.example.com
PROM_RETENTION=15d
EOF
chmod 600 /opt/monitoring/.env
3) Create Prometheus configuration
Start with core scrape jobs for Prometheus itself and node-exporter. Keep scrape intervals explicit so capacity planning and troubleshooting are predictable.
cat > /opt/monitoring/prometheus/prometheus.yml <<'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['prometheus:9090']
  - job_name: node-exporter
    static_configs:
      - targets: ['node-exporter:9100']
EOF
4) Create Docker Compose stack
This stack keeps Prometheus internal, publishes Grafana only to localhost, and routes public traffic through Nginx. That pattern allows security policy and TLS controls to remain centralized.
cat > /opt/monitoring/docker-compose.yml <<'EOF'
services:
  prometheus:
    image: prom/prometheus:latest
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.retention.time=${PROM_RETENTION}
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus_data:/prometheus
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:latest
    command:
      - --path.rootfs=/host
    volumes:
      - /:/host:ro,rslave
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    env_file: .env
    volumes:
      - grafana_data:/var/lib/grafana
    ports:
      - "127.0.0.1:3000:3000"
    depends_on:
      - prometheus
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:
EOF
5) Configure Nginx reverse proxy with TLS
Nginx terminates TLS and forwards requests to Grafana on localhost. Add security headers and conservative request limits to reduce abuse risk.
sudo apt-get update && sudo apt-get install -y nginx certbot python3-certbot-nginx
sudo tee /etc/nginx/sites-available/grafana >/dev/null <<'EOF'
server {
    listen 80;
    server_name metrics.example.com;

    # Modest hardening and request limits
    add_header X-Content-Type-Options nosniff always;
    client_max_body_size 10m;

    location / {
        proxy_pass http://127.0.0.1:3000;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        # Grafana Live streaming uses WebSockets
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_read_timeout 60s;
    }
}
EOF
sudo ln -s /etc/nginx/sites-available/grafana /etc/nginx/sites-enabled/grafana
sudo nginx -t && sudo systemctl reload nginx
sudo certbot --nginx -d metrics.example.com --redirect --agree-tos -m [email protected] -n
6) Launch services and initialize Grafana datasource
Bring the stack up, verify container health, and register Prometheus as a data source. For repeatable provisioning, use Grafana provisioning files in production rather than manual clicks.
cd /opt/monitoring
docker compose up -d
docker compose ps
# quick health checks
curl -sf http://127.0.0.1:3000/api/health
docker compose exec prometheus wget -qO- http://localhost:9090/-/ready
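As a sketch of that provisioning-as-code approach, Grafana can register the Prometheus datasource from a file at startup. The directory layout follows Grafana's provisioning conventions; the compose volume mount shown in the comment is an assumption you would add to the grafana service yourself. Paths are relative to /opt/monitoring.

```shell
# Hedged sketch: provision the Prometheus datasource as code instead of UI clicks
mkdir -p grafana/provisioning/datasources
cat > grafana/provisioning/datasources/prometheus.yml <<'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
EOF
# Then mount it into the grafana service in docker-compose.yml:
#   - ./grafana/provisioning:/etc/grafana/provisioning:ro
```

After a `docker compose up -d --force-recreate grafana`, the datasource appears without any manual configuration and survives container rebuilds.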
7) Add retention, backups, and upgrade discipline
Monitoring data is most valuable during incidents, so backup discipline matters. Capture Grafana and Prometheus volumes with predictable retention and test restores monthly.
sudo tee /usr/local/bin/monitoring-backup.sh >/dev/null <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
TS=$(date +%F-%H%M%S)
DEST=/var/backups/monitoring/$TS
mkdir -p "$DEST"
# Archive each named volume via a throwaway container
docker run --rm -v monitoring_grafana_data:/src -v "$DEST":/dest alpine tar czf /dest/grafana.tgz -C /src .
docker run --rm -v monitoring_prometheus_data:/src -v "$DEST":/dest alpine tar czf /dest/prometheus.tgz -C /src .
# Rotate snapshots older than 14 days
find /var/backups/monitoring -mindepth 1 -maxdepth 1 -type d -mtime +14 -exec rm -rf {} +
EOF
sudo chmod +x /usr/local/bin/monitoring-backup.sh
( sudo crontab -l 2>/dev/null; echo "20 2 * * * /usr/local/bin/monitoring-backup.sh" ) | sudo crontab -
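Backups are only proven by restores. The sketch below writes a hypothetical restore helper; the volume names assume the compose project lives in /opt/monitoring, and the backup timestamp directory is passed as the first argument. Review and syntax-check it before installing to /usr/local/bin with sudo.

```shell
# Hypothetical restore helper (sketch, not a tested production script)
cat > monitoring-restore.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
SRC=/var/backups/monitoring/$1
cd /opt/monitoring
docker compose down
# Wipe and repopulate each named volume from its archive
docker run --rm -v monitoring_grafana_data:/dst -v "$SRC":/src alpine \
  sh -c 'rm -rf /dst/* && tar xzf /src/grafana.tgz -C /dst'
docker run --rm -v monitoring_prometheus_data:/dst -v "$SRC":/src alpine \
  sh -c 'rm -rf /dst/* && tar xzf /src/prometheus.tgz -C /dst'
docker compose up -d
EOF
bash -n monitoring-restore.sh   # syntax check; install with sudo once reviewed
```

Run it against last month's snapshot in a maintenance window and confirm dashboards and historical datapoints come back before you need it during an incident.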
Configuration and secrets handling best practices
Use environment variables only for non-sensitive defaults and prefer Docker secrets or external secret managers for high-trust environments. Restrict shell history in shared bastions, rotate Grafana admin credentials after handoff, and use separate service accounts for automation. If you integrate cloud metrics or managed databases, place API tokens in secret files mounted read-only into the container and avoid embedding tokens in dashboard JSON exports.
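As one concrete pattern for the token-in-a-file approach, keep the secret in a 600-mode file and mount it read-only. The file name and mount path below are illustrative assumptions, not requirements of Grafana or Prometheus.

```shell
# Hypothetical: store an API token as a restricted file instead of an env var
install -m 700 -d secrets
install -m 600 /dev/null secrets/cloud_api_token
# Populate it outside shell history, e.g. paste into an editor:
#   ${EDITOR:-nano} secrets/cloud_api_token
# Then mount it read-only in docker-compose.yml under the consuming service:
#   volumes:
#     - ./secrets/cloud_api_token:/run/secrets/cloud_api_token:ro
```

The container reads the token from /run/secrets at runtime, so it never appears in image layers, `docker inspect` output, or committed compose files.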
At the network level, enforce host firewall rules so only 80/443 are public and all monitoring backplane ports remain private. If teams need remote Prometheus access, expose it through authenticated private networking (VPN, Tailscale, WireGuard) rather than opening port 9090 publicly. Finally, define retention based on incident review windows and available disk IOPS; long retention without storage tuning causes preventable query latency and compaction churn.
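A minimal host-firewall sketch for that policy, assuming ufw on Ubuntu; keep the SSH rule (and adjust it to your access pattern) before enabling, or you will lock yourself out.

```shell
# Assumes ufw; only 80/443 and SSH are reachable, everything else stays private
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow OpenSSH
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
sudo ufw enable
```

Note that Docker publishes ports via iptables rules that can bypass ufw; binding Grafana to 127.0.0.1 in the compose file, as this guide does, is what actually keeps port 3000 off the public interface.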
Verification checklist
- docker compose ps shows all services as running and healthy.
- https://metrics.example.com loads with a valid TLS certificate.
- Grafana can query Prometheus and dashboard panels return recent datapoints.
- Node exporter metrics include CPU, memory, filesystem, and network timeseries.
- Backup script creates restorable archives and old snapshots rotate automatically.
cd /opt/monitoring
docker compose ps
curl -I https://metrics.example.com
curl -sf https://metrics.example.com/api/health
docker compose exec prometheus wget -qO- http://localhost:9090/api/v1/targets | jq ".data.activeTargets[] | {health: .health, labels: .labels.job}"
Common issues and fixes
Grafana login loops after reverse proxy setup
Set the external URL and protocol correctly via GF_SERVER_ROOT_URL and ensure Nginx forwards X-Forwarded-Proto. Mismatch between internal and external URL is the usual root cause.
Prometheus target is down for node-exporter
Confirm exporter container is attached to the same compose network and target hostname matches service name. Check for host firewall rules blocking internal container traffic if custom bridge networks are used.
Disk usage grows unexpectedly
Retention and scrape cardinality are often misaligned. Reduce label explosion, lengthen scrape intervals for noisy jobs, and match retention to available storage and recovery objectives.
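To see where series are actually coming from, Prometheus exposes TSDB head statistics. A quick check, assuming the stack from this guide and jq installed on the host:

```shell
# Top metric names by series count in the TSDB head (requires a running stack)
docker compose exec prometheus wget -qO- http://localhost:9090/api/v1/status/tsdb \
  | jq '.data.seriesCountByMetricName[:10]'
```

Metrics with unexpectedly large series counts usually point to an unbounded label (request IDs, user IDs, raw URLs) that should be dropped or aggregated at the exporter.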
Intermittent 502 from Nginx
Grafana may be restarting during plugin changes or under memory pressure. Inspect container logs, provision adequate memory and swap on the host, and keep upstream proxy timeouts conservative.
FAQ
Can I expose Prometheus directly for remote troubleshooting?
Avoid public exposure. Use private networking plus authentication, or route through a bastion with strict access control and audit logging.
How much retention should I keep in Prometheus?
Start with 15 days for small environments, then tune by incident review needs, cardinality, and disk performance. Increase only after measuring compaction overhead.
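A back-of-envelope sizing check helps before raising retention. Prometheus typically averages on the order of 1-2 bytes per sample after compression; the ingestion rate and retention below are assumptions to replace with your own measurements (the real rate is visible in the prometheus_tsdb_head_samples_appended_total metric).

```shell
# Rough TSDB disk estimate: retention_seconds * samples_per_second * bytes_per_sample
RETENTION_DAYS=15        # assumption: target retention window
SAMPLES_PER_SEC=10000    # assumption: measure your real ingestion rate
BYTES_PER_SAMPLE=2       # conservative end of the typical 1-2 bytes/sample
echo "$(( RETENTION_DAYS * 86400 * SAMPLES_PER_SEC * BYTES_PER_SAMPLE / 1073741824 )) GiB"
# prints "24 GiB" with these example numbers
```

Leave comfortable headroom on top of the estimate: compaction and WAL replay need temporary space beyond steady-state usage.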
Should I run Alertmanager in the same stack?
Yes for most production setups. Keep Alertmanager internal, integrate with email/chat/webhooks, and test escalation paths during daytime drills.
How do I make dashboards reproducible across environments?
Provision datasources and dashboards as code using Grafana provisioning files and version control. Avoid one-off UI-only changes in production.
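A hedged sketch of the dashboards side of that approach; the provider name and paths follow common Grafana provisioning conventions but are assumptions, and the compose mounts in the comments are ones you would add yourself. Paths are relative to /opt/monitoring.

```shell
# File-based dashboard provider; dashboard JSON lives in version control
mkdir -p grafana/provisioning/dashboards grafana/dashboards
cat > grafana/provisioning/dashboards/default.yml <<'EOF'
apiVersion: 1
providers:
  - name: default
    type: file
    options:
      path: /var/lib/grafana/dashboards
EOF
# Mount both directories into the grafana service in docker-compose.yml:
#   - ./grafana/provisioning:/etc/grafana/provisioning:ro
#   - ./grafana/dashboards:/var/lib/grafana/dashboards:ro
```

Exported dashboard JSON dropped into grafana/dashboards is then picked up automatically, giving every environment the same panels from the same commit.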
Is Docker Compose enough for production monitoring?
For many teams yes, especially single-region workloads with clear backup and restore procedures. Move to Kubernetes when you need multi-node scheduling, self-healing, and stronger orchestration guarantees.
How do I rotate Grafana admin credentials safely?
Create named admin users first, validate access, then rotate or disable bootstrap credentials in a maintenance window with rollback notes.
What is the safest way to add application metrics?
Expose application metrics on private interfaces and scrape through Prometheus service discovery. Do not place metrics endpoints directly on public networks.
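For example, an internal application on the same compose network can be added as a static scrape job; the service name myapp and port 8080 below are hypothetical placeholders, and the path is relative to /opt/monitoring.

```shell
# Hypothetical internal app target; the service must share the compose network
mkdir -p prometheus
cat >> prometheus/prometheus.yml <<'EOF'
  - job_name: myapp
    static_configs:
      - targets: ['myapp:8080']
EOF
```

Because the target is resolved over the internal Docker network, the application's /metrics endpoint never needs a published port.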
Related guides
For teams building a broader platform baseline, these guides complement this deployment:
- Production Guide: Deploy n8n with Docker Compose + Nginx + PostgreSQL on Ubuntu
- Production Guide: Deploy OpenObserve with Docker Compose + Traefik + ClickHouse on Ubuntu
- Production Guide: Deploy Docmost with Docker Compose, Traefik, and PostgreSQL on Ubuntu
Talk to us
If you want this implemented with hardened defaults, observability, and tested recovery playbooks, our team can help.