Production monitoring needs more than a running container. Teams need reliable startup, controlled upgrades, secure secrets, and a repeatable acceptance checklist. This guide shows a practical Grafana deployment pattern using Docker Compose with systemd supervision so the service behaves predictably during restarts, incidents, and maintenance windows.
The objective is operational confidence: any engineer on rotation should be able to deploy, verify, and recover the stack without tribal knowledge. We focus on clear structure, version pinning, and runbook-friendly commands instead of one-off setup shortcuts.
In real environments, monitoring failures are often discovered during an outage, exactly when stress is highest. A resilient baseline prevents avoidable firefighting by giving your team deterministic behavior on boot, clear health checks, and rollback-ready backups. This tutorial is intentionally production-oriented, with explicit controls for reliability and ownership.
Another practical advantage of this approach is onboarding speed. New engineers can review the same directory structure, the same systemd unit semantics, and the same verification checklist, rather than learning a custom setup from chat history. The result is fewer surprises, faster incident response, and cleaner accountability across platform and application teams.
Architecture and flow overview
Grafana runs as a container, receives traffic through a TLS reverse proxy, and stores persistent state on host storage. systemd controls lifecycle so service state remains consistent after reboot and during controlled restarts.
- Edge: HTTPS reverse proxy with forwarded headers
- App: Grafana container managed by Docker Compose
- State: Mounted persistent data path
- Control: systemd unit for startup and stop semantics
- Ops: Backup workflow plus restore drill cadence
This separation keeps each concern explicit and easier to audit.
Prerequisites
- Linux VM/server with sudo access
- Docker Engine + Compose plugin
- DNS record for monitoring domain
- Firewall access for SSH and HTTPS
- Password manager entry for admin credentials
- Owner assigned for monitoring operations
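The prerequisites above can be sanity-checked from the shell before starting. This is an illustrative preflight sketch, not official tooling; it only probes for the binaries the later steps assume:

```shell
# Preflight sketch: probe for the tools the deployment steps assume.
# "docker compose" is checked separately because Compose v2 ships as a
# CLI plugin rather than a standalone binary.
for cmd in docker curl tar; do
  if command -v "$cmd" >/dev/null 2>&1; then
    echo "found: $cmd"
  else
    echo "MISSING: $cmd"
  fi
done
if docker compose version >/dev/null 2>&1; then
  echo "found: docker compose plugin"
else
  echo "MISSING: docker compose plugin"
fi
```

Run it once per host before step 1; anything reported MISSING should be installed before continuing.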
Step-by-step deployment
1) Create directories and permissions
Use a fixed layout so every runbook references the same paths.
sudo mkdir -p /opt/grafana/{compose,provisioning,dashboards,backups}
sudo mkdir -p /var/lib/grafana-data
sudo chown -R 472:472 /var/lib/grafana-data
sudo chown -R $USER:$USER /opt/grafana
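As a runbook aid, the fixed layout can be verified with a small helper. This is a sketch (the `check_layout` name and its root parameter are ours, added so the same function can be exercised against a scratch directory); on the real host, call it with `/` as the root:

```shell
# Hypothetical layout check: confirm every path the runbooks reference
# exists under the given root ("/" on the real host, a temp dir in tests).
check_layout() {
  local root=$1 rc=0 d
  for d in opt/grafana/compose opt/grafana/provisioning \
           opt/grafana/dashboards opt/grafana/backups var/lib/grafana-data; do
    if [ -d "$root/$d" ]; then
      echo "ok: /$d"
    else
      echo "missing: /$d"
      rc=1
    fi
  done
  return "$rc"
}
# Example: check_layout /
```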
2) Store secrets in environment file
Keep credentials out of compose yaml and repositories.
cat > /opt/grafana/.env.prod <<'EOF'
GF_SECURITY_ADMIN_USER=admin
GF_SECURITY_ADMIN_PASSWORD=REPLACE_WITH_LONG_RANDOM_PASSWORD
GF_SERVER_ROOT_URL=https://monitoring.example.com
GF_USERS_ALLOW_SIGN_UP=false
GF_AUTH_ANONYMOUS_ENABLED=false
GF_LOG_MODE=console
EOF
chmod 600 /opt/grafana/.env.prod
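For the REPLACE_WITH_LONG_RANDOM_PASSWORD placeholder, one common approach is to draw from the kernel CSPRNG (shown here as a sketch; a password manager's generator works equally well):

```shell
# Generate a 32-character alphanumeric secret from /dev/urandom.
PW=$(tr -dc 'A-Za-z0-9' < /dev/urandom | head -c 32)
echo "generated password length: ${#PW}"   # prints: generated password length: 32
```

Store the value in your password manager first, then paste it into the env file; avoid leaving it in shell history.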
3) Define compose service
cat > /opt/grafana/compose/docker-compose.yml <<'EOF'
services:
  grafana:
    image: grafana/grafana:11.1.4
    container_name: grafana
    restart: unless-stopped
    env_file:
      - /opt/grafana/.env.prod
    ports:
      - "127.0.0.1:3000:3000"
    volumes:
      - /var/lib/grafana-data:/var/lib/grafana
      - /opt/grafana/provisioning:/etc/grafana/provisioning
      - /opt/grafana/dashboards:/var/lib/grafana/dashboards
    healthcheck:
      test: ["CMD", "wget", "-qO-", "http://127.0.0.1:3000/api/health"]
      interval: 30s
      timeout: 5s
      retries: 5
      start_period: 40s
EOF
4) Create systemd unit file
Use a unit wrapper so startup ordering and service state are predictable.
cat > /tmp/grafana-compose.service <<'EOF'
[Unit]
Description=Grafana Docker Compose Stack
Requires=docker.service
After=docker.service network-online.target
Wants=network-online.target

[Service]
Type=oneshot
RemainAfterExit=yes
WorkingDirectory=/opt/grafana/compose
ExecStart=/usr/bin/docker compose up -d
ExecStop=/usr/bin/docker compose down
TimeoutStartSec=0

[Install]
WantedBy=multi-user.target
EOF
Install the unit, reload systemd, and enable it so the stack starts on boot:
sudo install -m 0644 /tmp/grafana-compose.service /etc/systemd/system/grafana-compose.service
sudo systemctl daemon-reload
sudo systemctl enable --now grafana-compose.service
5) Verify runtime state
systemctl status grafana-compose.service --no-pager
docker compose -f /opt/grafana/compose/docker-compose.yml ps
curl -s https://monitoring.example.com/api/health
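Right after restarts, the health endpoint can lag the container start (note the 40s start_period in the compose healthcheck), so a one-shot curl can be flaky. A short polling helper makes runbook checks deterministic; this is a sketch, and the `wait_healthy` name and parameters are ours:

```shell
# Poll a health URL until it answers 2xx or the attempts run out.
# Usage: wait_healthy <url> [tries] [delay_seconds]
wait_healthy() {
  local url=$1 tries=${2:-10} delay=${3:-3} i
  for i in $(seq 1 "$tries"); do
    if curl -fsS --max-time 5 "$url" >/dev/null 2>&1; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
  done
  echo "unhealthy after $tries attempt(s)"
  return 1
}
# Example: wait_healthy https://monitoring.example.com/api/health
```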
6) Backup job
cat > /opt/grafana/backups/backup-grafana.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
TS=$(date +%F-%H%M)
mkdir -p /opt/grafana/backups/archive
sudo tar -czf /opt/grafana/backups/archive/grafana-$TS.tgz /var/lib/grafana-data /opt/grafana/provisioning /opt/grafana/dashboards
echo "backup created: grafana-$TS.tgz"
EOF
chmod +x /opt/grafana/backups/backup-grafana.sh
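The "restore drill cadence" from the overview can be scripted as well. This sketch (the `restore_drill` name is ours) unpacks the newest archive into a scratch directory and confirms Grafana's database file came back, without touching the live data path:

```shell
# Restore drill: extract the newest backup and verify grafana.db exists.
# tar strips the leading "/" on create, so archive members are relative.
restore_drill() {
  local archive_dir=$1 latest scratch
  latest=$(ls -t "$archive_dir"/grafana-*.tgz 2>/dev/null | head -n 1)
  if [ -z "$latest" ]; then
    echo "no archives found in $archive_dir"
    return 1
  fi
  scratch=$(mktemp -d)
  tar -xzf "$latest" -C "$scratch"
  if [ -f "$scratch/var/lib/grafana-data/grafana.db" ]; then
    echo "restore drill ok: $latest"
  else
    echo "restore drill FAILED: grafana.db missing in $latest"
    return 1
  fi
}
# Example: restore_drill /opt/grafana/backups/archive
```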
7) Acceptance checklist execution
docker compose -f /opt/grafana/compose/docker-compose.yml ps
curl -s https://monitoring.example.com/api/health
ls -lh /opt/grafana/backups/archive | tail -n 3
Configuration/secrets handling
Limit admin access to named maintainers and rotate credentials on a schedule. Document who can change dashboards, data sources, and alert policies. If you are in a regulated environment, keep change approvals linked to pull requests and maintenance tickets.
Secrets should be managed as first-class assets. At minimum, protect environment files with strict permissions and encrypted storage. In larger environments, migrate to a dedicated secret manager and keep key names stable so deployment templates do not drift between environments.
Treat provisioning as code: dashboards, folders, and data source settings should be reviewed and versioned. This reduces unexpected behavior during incidents and helps new team members understand intent quickly. Consistency is more important than cleverness.
Finally, define upgrade policy early. Pin versions, test in staging, and require validation evidence before production rollouts. Monitoring must be dependable under pressure; operational discipline is what makes that true over time.
From an operations-management perspective, assign a primary and secondary owner for this service and document decision rights clearly: who can approve upgrades, who can roll back, and who must be paged when acceptance tests fail. This governance layer is often missing from technical tutorials, but it is essential for stable production ownership and faster decision-making during incidents.
Verification
After deployment and each upgrade, verify service state, health endpoint response, and backup output. Save results in a runbook so anyone on-call can confirm baseline health in minutes.
- Service starts after host reboot
- Container reports healthy status
- Health endpoint responds successfully
- TLS path remains valid through proxy
- Backup artifact is generated on schedule
Verification is a contract with future incident response. If checks are skipped, downtime risk increases.
For stronger operational maturity, keep a short acceptance template with timestamp, environment, operator, and outcome. Store these records in your incident-management system so future postmortems can correlate service changes with platform behavior.
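The acceptance template described above (timestamp, environment, operator, outcome) can be a tiny helper in the runbook. A sketch, with names and log path of our choosing:

```shell
# Emit one acceptance record: UTC timestamp, environment, operator, outcome.
record_acceptance() {
  printf '%s env=%s operator=%s outcome=%s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2" "$3"
}
# Example: record_acceptance production alice pass >> /opt/grafana/backups/acceptance.log
```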
Common issues/fixes
- Container exits on startup: check environment-variable syntax and filesystem permissions on the mounted data paths.
- Login works but settings are not saved: usually a write-permission mismatch on persistent storage; reapply ownership (uid 472) and restart.
- Wrong callback URLs: set the public root URL and confirm forwarded-proto header handling in the proxy layer.
- Slow dashboards during incidents: reduce refresh intervals and optimize data-source queries before scaling compute.
- Plugin issues after upgrade: pin plugin versions and maintain rollback artifacts for the previous image and data.
- Service not available after reboot: validate unit dependencies so Docker is ready before the compose service starts.
FAQ
Is Compose enough for production Grafana?
Yes for many teams, if you add lifecycle control, verification, and backup discipline.
How often should restores be tested?
Monthly minimum, and after backup-script changes.
Should Grafana be internet-exposed directly?
No. Put it behind an HTTPS proxy and restrict direct port exposure.
What is a safe upgrade process?
Pin versions, test in staging, back up first, and verify after rollout.
Can we start with .env and migrate later?
Yes. Begin with strict permissions, then migrate to secret management as maturity increases.
What should the handoff include?
Runbook commands, acceptance checks, rollback steps, and escalation contacts.
How many admins are recommended?
Keep it minimal: two named maintainers plus a controlled break-glass account.
Internal links
- https://sysbrix.com/blog/guides-3/production-guide-deploy-uptime-kuma-with-docker-compose-nginx-postgresql-on-ubuntu-313
- https://sysbrix.com/blog/guides-3/production-guide-deploy-gitea-with-docker-compose-traefik-postgresql-on-ubuntu-307
- https://sysbrix.com/blog/guides-3/how-to-deploy-authentik-with-docker-compose-and-traefik-production-guide-299
Talk to us
If you want support designing or hardening your observability platform, we can help with architecture, migration planning, and production readiness.