When monitoring is unreliable, teams do not just lose graphs; they lose time in the most expensive minutes of an incident. A practical self-hosted uptime stack gives operations teams ownership over probe cadence, alert routing, retention, and change control. In this guide, you will deploy Uptime Kuma for health checks and status visibility, place it behind NGINX for clean edge handling, and persist state in PostgreSQL for durability and cleaner backup workflows.
The design here is intentionally production-oriented rather than demo-friendly. We focus on explicit network boundaries, least-privilege secrets handling, idempotent service startup, TLS handling, and verification steps that can be run by on-call engineers during active incidents without guesswork. You can adopt this pattern for a single VM today, then extend it later to multi-node failover as your requirements grow.
At the end, you will have a public HTTPS monitoring endpoint, persistent data volumes, baseline hardening controls, and a repeatable operations checklist for upgrades and recovery drills.
Why this matters operationally: uptime tooling only pays off when it is trusted during chaos. That trust comes from predictable alert routing, clear ownership boundaries, and testable recovery paths. The implementation below is structured to reduce ambiguity for the engineer who inherits the platform months later.
We also aim to minimize hidden coupling. Reverse proxy behavior, application behavior, and data persistence each fail in different ways. By separating these concerns and documenting each verification gate, your team can triage faster and avoid all-or-nothing outages caused by monolithic deployment decisions.
Finally, we keep this workflow practical: commands are copy-ready, fallback handling is explicit, and each step has a reason tied to reliability, security, or maintainability.
Architecture and flow overview
This deployment separates concerns into three layers:
- Edge layer (NGINX): handles HTTPS, headers, redirects, and proxy behavior.
- Application layer (Uptime Kuma): runs monitors, schedules probes, and dispatches notifications.
- Data layer (PostgreSQL): stores persistent state and monitor history.
Traffic flow: users hit NGINX on 443, NGINX proxies to the internal app service, and the app reads/writes durable records in PostgreSQL over a private bridge network. Only NGINX is exposed publicly; app and DB remain private.
This keeps blast radius smaller and simplifies upgrades: edge policy can evolve without touching database internals, and data maintenance can proceed without changing request routing. During incidents, this separation shortens mean-time-to-diagnosis because each layer has clear ownership and logs.
For teams with compliance constraints, this pattern is also easier to audit. You can demonstrate explicit ingress boundaries, encrypted transit at the edge, and principle-of-least-exposure for internal services.
If growth requires high availability later, this baseline still helps: NGINX can be replaced by a managed edge, and PostgreSQL can move to a managed cluster while preserving the same service contracts.
Prerequisites
- Ubuntu 22.04/24.04 server with sudo access (2 vCPU / 4GB RAM minimum).
- DNS record like status.example.com pointed at the host.
- Docker Engine + Docker Compose plugin installed.
- Notification channel endpoints (SMTP/chat webhook/on-call tool).
- Firewall allowing only necessary inbound ports.
Before deployment: patch OS packages, verify NTP/timezone, ensure disk headroom, and document ownership for this service. Monitoring stacks can silently fail when ownership is unclear, especially after handoffs.
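The preflight items above can be wrapped in a small script. A minimal sketch, assuming GNU coreutils and systemd are available; the 20% free-space threshold is an arbitrary example, not a recommendation:

```shell
#!/bin/sh
# Preflight sketch: verify timezone and disk headroom before deploying.
# check_disk MOUNT MIN_FREE_PCT -> succeeds if at least MIN_FREE_PCT% is free
check_disk() {
  mount="$1"; min_free="$2"
  used=$(df --output=pcent "$mount" | tail -1 | tr -dc '0-9')
  free=$((100 - used))
  [ "$free" -ge "$min_free" ]
}

preflight() {
  timedatectl show -p Timezone --value       # confirm host timezone is what you expect
  check_disk / 20 || { echo "low disk on /" >&2; return 1; }
}
```

Running `preflight` on the target host before `docker compose up` catches the two silent failure modes (wrong timezone in alerts, disk exhaustion) that are hardest to spot after launch.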
Establish naming conventions now (stack directory, backup prefixes, environment labels). Consistent naming pays off when searching logs, restoring backups, or rotating credentials under time pressure.
Decide retention and escalation policy before launch. Technical deployment is fast; operational policy drift is what usually causes noisy alerts and alert fatigue later.
Step-by-step deployment
Create runtime directories and environment variables first.
sudo mkdir -p /opt/uptime-kuma/{nginx,postgres,backups}
sudo chown -R $USER:$USER /opt/uptime-kuma
cd /opt/uptime-kuma
cat > .env <<'ENV'
DOMAIN=status.example.com
TZ=America/Chicago
POSTGRES_DB=uptimekuma
POSTGRES_USER=uptimekuma
POSTGRES_PASSWORD=REPLACE_WITH_LONG_RANDOM_PASSWORD
KUMA_DB_TYPE=postgres
KUMA_DB_HOST=postgres
KUMA_DB_PORT=5432
KUMA_DB_NAME=uptimekuma
KUMA_DB_USER=uptimekuma
KUMA_DB_PASSWORD=REPLACE_WITH_LONG_RANDOM_PASSWORD
ENV
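The REPLACE_WITH_LONG_RANDOM_PASSWORD placeholders should be filled with generated values, not hand-typed strings. A sketch, assuming openssl is installed; note that both PASSWORD lines must carry the same value, since the app and the database must agree:

```shell
# Generate a long random secret for the .env placeholders (assumes openssl).
gen_secret() {
  openssl rand -base64 48 | tr -d '\n='
}
# Apply one generated value to both PASSWORD lines, then lock the file down:
#   sed -i "s|REPLACE_WITH_LONG_RANDOM_PASSWORD|$(gen_secret)|" /opt/uptime-kuma/.env
#   chmod 600 /opt/uptime-kuma/.env
```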
Create docker-compose.yml in the same directory:

services:
  postgres:
    image: postgres:16-alpine
    container_name: kuma-postgres
    env_file: .env
    environment:
      - POSTGRES_DB=${POSTGRES_DB}
      - POSTGRES_USER=${POSTGRES_USER}
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
      - TZ=${TZ}
    volumes:
      - ./postgres:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER} -d ${POSTGRES_DB}"]
      interval: 10s
      timeout: 5s
      retries: 10
    restart: unless-stopped
    networks: [backend]
  uptime-kuma:
    image: louislam/uptime-kuma:1
    container_name: uptime-kuma
    env_file: .env
    volumes:
      # persist app data so container recreation does not lose local state
      - ./kuma-data:/app/data
    depends_on:
      postgres:
        condition: service_healthy
    restart: unless-stopped
    networks: [backend]
  nginx:
    image: nginx:1.27-alpine
    container_name: kuma-nginx
    volumes:
      - ./nginx/default.conf:/etc/nginx/conf.d/default.conf:ro
      # certificate paths referenced by default.conf
      - /etc/letsencrypt:/etc/letsencrypt:ro
    ports:
      - "80:80"
      - "443:443"
    depends_on: [uptime-kuma]
    restart: unless-stopped
    networks: [backend]
networks:
  backend:
    driver: bridge
Create nginx/default.conf. The certificate paths assume certificates have already been issued for the domain (for example with certbot) before NGINX starts:

server {
    listen 80;
    server_name status.example.com;
    return 301 https://$host$request_uri;
}
server {
    listen 443 ssl;
    http2 on;
    server_name status.example.com;
    ssl_certificate /etc/letsencrypt/live/status.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/status.example.com/privkey.pem;
    add_header X-Frame-Options SAMEORIGIN;
    add_header X-Content-Type-Options nosniff;
    location / {
        proxy_pass http://uptime-kuma:3001;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto https;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
cd /opt/uptime-kuma
docker compose up -d
docker compose ps
docker compose logs --tail=120 uptime-kuma nginx postgres
After initial startup, log into Uptime Kuma and create baseline monitors for homepage HTTP, API health endpoint, and at least one third-party dependency your product relies on. Tag monitors by service domain and severity so routing can remain clean as monitor count grows.
Set notification templates with consistent incident metadata: service name, environment, current state, failure reason, and runbook link. This significantly reduces triage time and avoids context switching during escalations.
Configuration and secrets handling best practices
Do not commit runtime secrets to repository files. Restrict permissions on environment files (owner read/write only, e.g. chmod 600) and rotate credentials on a schedule aligned with your broader secrets policy.
Maintain separate credentials for application DB access and operational maintenance. Shared credentials broaden blast radius and complicate audit trails.
Use two alert channels minimum for critical monitors. Route low-priority checks to asynchronous channels, and reserve paging for user-impacting paths. This avoids alert fatigue while preserving urgency for real incidents.
Define maintenance windows and monitor pause strategy ahead of upgrades. Planned downtime without suppression creates noisy incidents that reduce trust in the platform.
Implement a documented backup routine for PostgreSQL plus periodic restore testing. A backup strategy is only complete when restore drills are rehearsed.
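A concrete starting point for that routine, sketched under this guide's assumptions (container name kuma-postgres, database/user uptimekuma, backups under /opt/uptime-kuma/backups); adjust names and retention to your environment:

```shell
#!/bin/sh
# Nightly backup sketch: dump the database, compress, and prune old copies.
BACKUP_DIR="${BACKUP_DIR:-/opt/uptime-kuma/backups}"
RETENTION_DAYS="${RETENTION_DAYS:-14}"

backup() {
  stamp=$(date +%Y%m%d-%H%M%S)
  docker exec kuma-postgres pg_dump -U uptimekuma -d uptimekuma \
    | gzip > "$BACKUP_DIR/uptimekuma-$stamp.sql.gz"
}

prune() {
  # delete dumps older than the retention window
  find "$BACKUP_DIR" -name 'uptimekuma-*.sql.gz' -mtime +"$RETENTION_DAYS" -delete
}
```

Run it from cron daily, and pair it with a scheduled restore drill into a scratch database so the backup path is proven, not assumed.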
Harden the host: minimal exposed ports, updated packages, and restricted SSH ingress. Monitoring infrastructure should be treated as production control plane, not utility sandbox.
Capture runbooks close to the deployment artifacts. If one engineer knows the stack by memory alone, you have a people-risk problem that will surface during vacations and incidents.
Verification checklist
Run this checklist after deploy and after each upgrade.
curl -I https://status.example.com
docker compose ps
docker exec uptime-kuma sh -lc "nc -zv postgres 5432"
docker compose restart
sleep 8
docker compose ps
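The curl check can flap while containers are still warming up after a restart. A small retry helper (hypothetical, not part of the stack) makes the checklist deterministic:

```shell
# retry N CMD... : run CMD until it succeeds, up to N attempts, 1s apart.
retry() {
  attempts="$1"; shift
  i=0
  until "$@"; do
    i=$((i + 1))
    [ "$i" -ge "$attempts" ] && return 1
    sleep 1
  done
}
# usage after an upgrade or restart:
#   retry 30 curl -fsSI -o /dev/null https://status.example.com
```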
In the UI, run synthetic failure drills: temporarily point one HTTP monitor to a failing path and verify incident state transition, alert dispatch, and recovery notification behavior. Document timestamps for each phase to baseline your expected response timing.
Validate persistence by confirming monitor definitions and history remain intact after controlled restarts. If data is missing, stop rollout and inspect volume bindings before adding more production monitors.
Common issues and fixes
1) 502 from NGINX
Check app readiness and upstream name. Verify proxy_pass target and inspect app logs for startup errors.
2) Database auth failures
Confirm credentials match in environment and database initialization settings. Regenerate passwords if uncertain.
3) TLS cert errors
Validate the DNS target, certificate chain, and mounted certificate paths in NGINX. Certificates must already exist on the host (for example, issued with certbot) before NGINX can serve 443; a missing certificate file prevents the server block from loading at all.
4) No alerts received
Test channel credentials directly, then verify notification assignment per monitor and severity tag.
5) Disk growth too fast
Tune retention and monitor frequency. Move data volume to larger storage before emergency thresholds.
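A quick way to track data growth before it becomes an emergency (the path matches this guide's compose layout; substitute your own):

```shell
# Report a directory's size in megabytes, e.g. for the postgres data volume.
dir_mb() {
  du -sm "$1" | cut -f1
}
# Example: record dir_mb /opt/uptime-kuma/postgres weekly and watch the trend.
```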
6) WebSocket disconnects
Keep proxy upgrade headers and HTTP/1.1 enabled for app traffic.
7) Slow dashboard loads
Check host CPU saturation, prune stale records, and review probe concurrency settings.
8) Drift between staging and production
Pin image versions and keep compose definitions versioned so rollbacks are deterministic.
FAQ
Can I run this without PostgreSQL?
Yes, but PostgreSQL is strongly recommended for predictable durability and backup workflows in production.
How often should probes run?
Start with 60-second intervals for non-critical paths and lower only where SLOs justify tighter detection windows.
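A rough way to reason about interval choice: worst-case detection time is approximately interval × (retries + 1). This is a simplification (real timing also includes probe timeouts), but it is enough to compare options:

```shell
# detection_window INTERVAL_SECONDS RETRIES -> approximate worst-case seconds to alert
detection_window() {
  echo $(( $1 * ($2 + 1) ))
}
# e.g. a 60s interval with 3 retries tolerates up to ~4 minutes of undetected downtime
```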
What should page and what should not?
Only user-impacting failures should page by default. Route lower-priority monitors to async channels.
How do I handle planned maintenance?
Use maintenance windows and monitor pause groups to avoid false escalations during approved changes.
What is a good backup policy?
Nightly backups with periodic restore drills and clear RTO/RPO targets are a practical baseline.
Can I integrate with PagerDuty/Opsgenie/Slack?
Yes. Configure and test each integration with non-production monitors before enabling critical routes.
How do I scale later?
Keep edge, app, and data concerns separated so each layer can scale independently with minimal redesign.
Related internal guides
- https://sysbrix.com/blog/guides-3/production-guide-deploy-gitea-with-docker-compose-traefik-postgresql-on-ubuntu-307
- https://sysbrix.com/blog/guides-3/how-to-deploy-authentik-with-docker-compose-and-traefik-production-guide-299
- https://sysbrix.com/blog/guides-3/production-guide-deploy-outline-wiki-with-docker-compose-caddy-postgresql-on-ubuntu-303
Talk to us
If you want support designing or hardening your observability platform, we can help with architecture, migration planning, and production readiness.