Healthchecks is a focused monitoring service for work that should happen on a schedule but often fails quietly: cron jobs, backup scripts, certificate renewals, ETL pipelines, queue drains, and periodic maintenance tasks. Instead of scraping metrics from a daemon, each job pings Healthchecks when it starts or finishes. If the ping does not arrive inside the expected window, the team gets an alert before the missing job becomes an outage or a failed restore.
This guide follows the production style used throughout the SysBrix Guides section. We will run Healthchecks on Ubuntu with Docker Compose, PostgreSQL for durable state, and Caddy for automatic HTTPS. The result is a small, auditable dead-man-switch monitoring stack that can be backed up, restored, and operated by a lean infrastructure team.
Architecture and flow overview
The public flow is simple. Operators open https://checks.example.com, Caddy terminates TLS, and traffic is proxied to the Healthchecks web container. Healthchecks stores users, projects, check schedules, notification channels, and ping history in PostgreSQL. Your production jobs call unique ping URLs from shell scripts, CI pipelines, or application tasks. When a ping is late, Healthchecks sends alerts through email or another configured integration.
Keep Healthchecks separate from the systems it monitors. If the same server runs both the backup job and the monitoring system, a host failure can break the job and the alert at the same time. A small independent VPS is usually enough, and the service can monitor jobs across many other servers.
Prerequisites
- Ubuntu 22.04 or 24.04 server with SSH and sudo access.
- A DNS record such as checks.example.com pointing to the server.
- SMTP credentials for alert delivery, or a plan to configure another notification channel after login.
- Basic firewall access for ports 22, 80, and 443.
- A password manager for the application secret key, database password, and SMTP password.
Step-by-step deployment
1) Install Docker, Compose, and firewall basics
Start with the platform packages and a narrow firewall. If another proxy already owns ports 80 and 443, place this stack behind that proxy instead of exposing Caddy directly.
sudo apt update
sudo apt install -y ca-certificates curl gnupg ufw
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo $VERSION_CODENAME) stable" | sudo tee /etc/apt/sources.list.d/docker.list
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo ufw allow OpenSSH
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
sudo ufw --force enable
2) Create the application layout and secrets
Keep the whole deployment under /opt/healthchecks. The generated Django secret key and database password are required for restores, so save them securely before relying on the service.
sudo mkdir -p /opt/healthchecks/{postgres,caddy,backups}
sudo chown -R $USER:$USER /opt/healthchecks
cd /opt/healthchecks
openssl rand -base64 32 > .secret-key
openssl rand -base64 24 > .postgres-password
chmod 600 .secret-key .postgres-password
3) Write environment values
Update the domain, email sender, and SMTP settings for your environment. Registration is disabled by default so random visitors cannot create accounts on a public monitoring service.
cat > .env <<'EOF'
HC_DOMAIN=checks.example.com
POSTGRES_DB=healthchecks
POSTGRES_USER=healthchecks
POSTGRES_PASSWORD=replace-me
SECRET_KEY=replace-me
[email protected]
EMAIL_HOST=smtp.example.com
EMAIL_PORT=587
EMAIL_HOST_USER=smtp-user
EMAIL_HOST_PASSWORD=replace-with-smtp-password
EMAIL_USE_TLS=True
SITE_ROOT=https://checks.example.com
REGISTRATION_OPEN=False
EOF
python3 - <<'PY'
from pathlib import Path
s=Path('.env').read_text()
s=s.replace('POSTGRES_PASSWORD=replace-me','POSTGRES_PASSWORD='+Path('.postgres-password').read_text().strip())
s=s.replace('SECRET_KEY=replace-me','SECRET_KEY='+Path('.secret-key').read_text().strip())
Path('.env').write_text(s)
PY
chmod 600 .env
4) Define the Compose stack
The web container runs migrations before starting uWSGI. PostgreSQL remains private on the Docker network, and Caddy is the only service bound to public ports.
cat > docker-compose.yml <<'EOF'
services:
  postgres:
    image: postgres:16-alpine
    restart: unless-stopped
    environment:
      POSTGRES_DB: ${POSTGRES_DB}
      POSTGRES_USER: ${POSTGRES_USER}
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
    volumes:
      - ./postgres:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER} -d ${POSTGRES_DB}"]
      interval: 10s
      timeout: 5s
      retries: 6

  web:
    image: healthchecks/healthchecks:latest
    restart: unless-stopped
    depends_on:
      postgres:
        condition: service_healthy
    env_file: .env
    environment:
      DB: postgres
      DB_HOST: postgres
      DB_PORT: "5432"
    command: bash -c "./manage.py migrate && uwsgi /opt/healthchecks/docker/uwsgi.ini"
    expose:
      - "8000"

  caddy:
    image: caddy:2-alpine
    restart: unless-stopped
    depends_on:
      - web
    environment:
      HC_DOMAIN: ${HC_DOMAIN}
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./Caddyfile:/etc/caddy/Caddyfile:ro
      - ./caddy/data:/data
      - ./caddy/config:/config
EOF
5) Configure Caddy and start the service
Caddy obtains certificates automatically once DNS is correct. Always run docker compose config before startup to catch missing environment variables or YAML mistakes.
cat > Caddyfile <<'EOF'
{$HC_DOMAIN} {
    encode zstd gzip
    reverse_proxy web:8000
    header {
        Strict-Transport-Security "max-age=31536000; includeSubDomains; preload"
        X-Content-Type-Options "nosniff"
        Referrer-Policy "strict-origin-when-cross-origin"
    }
}
EOF
docker compose config
docker compose up -d
docker compose ps
Configuration and secrets handling best practices
Healthchecks ping URLs are operational secrets. Anyone with a check URL can mark a job as healthy, so do not paste them into public logs, screenshots, or shared documentation. Store URLs in root-readable environment files on the machines that run the jobs, or inject them through your CI secret store.
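One way to follow that advice on a monitored host is to keep the ping URL in a small env file that only root can read, and source it from the job script. Below is a minimal sketch of the pattern: the file path, variable name, and UUID are all illustrative, and a throwaway temp file stands in for a real /etc/healthchecks-pings.env.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Demo of the pattern: ping URLs live in a root-readable env file, not in
# crontabs or scripts. A throwaway temp file stands in for a real
# /etc/healthchecks-pings.env (root-owned, mode 600).
ENV_FILE="$(mktemp)"
cat > "$ENV_FILE" <<'EOF'
NIGHTLY_BACKUP_PING=https://checks.example.com/ping/your-uuid-here
EOF
chmod 600 "$ENV_FILE"

set -a            # export everything the file defines
. "$ENV_FILE"
set +a
: "${NIGHTLY_BACKUP_PING:?ping URL not set}"   # fail loudly if the URL is missing
rm -f "$ENV_FILE"

# A job script would then end with:
#   curl -fsS --retry 3 "$NIGHTLY_BACKUP_PING"
```

With this layout, `crontab -l`, shell history, and shared run-books never contain the URL itself, only the variable name.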
Use projects to separate teams or environments. Production backup checks, staging cleanup tasks, and customer-specific automation should not all live in one flat list. Name checks after the business function, not just the script name: prod-postgres-nightly-backup is better than backup.sh.
Set realistic grace periods. A job that normally takes five minutes should not alert after six minutes if package mirrors or object storage occasionally add delay. Conversely, a daily backup should not have a three-day grace period. Tune alerts to catch real risk without training the team to ignore noise.
Verification checklist
Create the first administrator, confirm HTTPS, check database readiness, and review application logs before creating production checks.
docker compose exec web ./manage.py createsuperuser
curl -I https://checks.example.com
docker compose exec postgres pg_isready -U healthchecks -d healthchecks
docker compose logs --tail=100 web
Next, create a test check in the UI and run a manual ping from another host. Confirm the status changes to up, then temporarily skip the ping or lower the schedule window to verify alerts are delivered.
# Example: signal a successful backup job
BACKUP_CHECK_URL="https://checks.example.com/ping/your-uuid-here"
/usr/local/bin/nightly-backup.sh && curl -fsS --retry 3 "$BACKUP_CHECK_URL"
Backups and recovery
Back up PostgreSQL plus the Compose files, Caddy state, and environment file. A database dump without the environment file is incomplete because notification configuration and the Django secret key are part of the recovery story.
cat > backup.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
cd /opt/healthchecks

# Load POSTGRES_USER and POSTGRES_DB from .env; the script runs outside
# Compose, so those values are not in its environment by default.
set -a
. ./.env
set +a

stamp=$(date -u +%Y%m%dT%H%M%SZ)
mkdir -p "backups/$stamp"
docker compose exec -T postgres pg_dump -U "$POSTGRES_USER" "$POSTGRES_DB" > "backups/$stamp/healthchecks.sql"
cp -a .env Caddyfile docker-compose.yml caddy "backups/$stamp/"
tar -C backups -czf "backups/healthchecks-$stamp.tgz" "$stamp"
find backups -name 'healthchecks-*.tgz' -mtime +14 -delete
EOF
chmod +x backup.sh
./backup.sh
ls -lh backups/*.tgz | tail
Test restores on a disposable VM quarterly. After restoration, disable outbound alerts until you are ready, otherwise old overdue checks may page the team during a drill.
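A drill can follow a rough sequence like this on the fresh VM. The tarball path and timestamp are illustrative, and Docker plus the Compose plugin are assumed to be installed as in step 1:

```shell
# Recreate the layout from the newest backup archive.
sudo mkdir -p /opt/healthchecks
sudo chown -R "$USER:$USER" /opt/healthchecks
cd /opt/healthchecks
tar -xzf /path/to/healthchecks-STAMP.tgz --strip-components=1   # STAMP = backup timestamp
chmod 600 .env

# Start PostgreSQL alone, wait for readiness, load the dump, then start the rest.
docker compose up -d postgres
until docker compose exec -T postgres pg_isready -U healthchecks -d healthchecks; do sleep 2; done
docker compose exec -T postgres psql -U healthchecks -d healthchecks < healthchecks.sql
docker compose up -d
```

Note that on a drill VM the public DNS name still points at production, so Caddy will not obtain a certificate there; verifying the login page over plain HTTP on localhost is enough to prove the data restored.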
Common issues and fixes
Caddy cannot obtain a certificate
Check DNS, make sure ports 80 and 443 are reachable, and inspect docker compose logs caddy. Avoid repeated retries if you hit ACME rate limits.
Email alerts do not send
Verify SMTP host, port, TLS mode, username, password, and sender address. Many providers require an app password or approved sender domain.
Checks stay down even though jobs run
Confirm the job uses the correct ping URL, that outbound HTTPS is allowed from the job host, and that failures are not hidden by shell pipelines. Use set -euo pipefail in critical scripts.
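A complementary pattern is to report the job's exit status rather than pinging only on success, using the Healthchecks convention that the bare URL records success while the same URL with /fail appended records an explicit failure. A minimal sketch:

```shell
#!/usr/bin/env bash
set -euo pipefail

# ping_endpoint <base-url> <exit-code>: pick the endpoint to report to.
# The bare URL marks the check up; <url>/fail marks it down immediately
# instead of waiting for the grace period to run out.
ping_endpoint() {
  local base="$1" rc="$2"
  if [ "$rc" -eq 0 ]; then
    printf '%s\n' "$base"
  else
    printf '%s/fail\n' "$base"
  fi
}

# Usage in a job script (URL is a placeholder):
#   rc=0
#   /usr/local/bin/nightly-backup.sh || rc=$?
#   curl -fsS --retry 3 "$(ping_endpoint "$BACKUP_CHECK_URL" "$rc")"
```

The `|| rc=$?` form matters under `set -e`: it captures the failure code without aborting the script before the failure ping is sent.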
Too many false positives
Increase grace periods, split long jobs into start and success pings, and avoid scheduling checks exactly when maintenance windows or package updates normally run.
The service is up but pages no one
Test every notification channel. Monitoring without alert verification is only a dashboard. Add a recurring test check that proves the alert path still works.
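One way to build that recurring proof is a canary: a check on a fixed schedule, pinged by cron, that you deliberately pause once in a while to confirm the page actually arrives. Sketched as a crontab fragment, with a placeholder schedule and UUID:

```shell
# Canary: pings every Monday at 09:00; the matching check expects a weekly ping.
# To test the alert path, comment this line out for one cycle and confirm the
# notification is delivered to every configured channel.
0 9 * * 1 curl -fsS --retry 3 https://checks.example.com/ping/canary-uuid-here
```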
FAQ
Is Healthchecks a replacement for Prometheus?
No. It complements metrics monitoring by focusing on scheduled jobs that must report in. Use Prometheus for time-series metrics and Healthchecks for missing heartbeats.
Should ping URLs use start and success signals?
For long-running or important jobs, yes. Start signals show that a job began, while success signals prove it completed. That distinction helps diagnose stuck tasks.
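A small wrapper can enforce that discipline in one place. The sketch below assumes the /start and /fail ping endpoints and uses placeholder values for the URL and job command:

```shell
#!/usr/bin/env bash
set -euo pipefail

URL="https://checks.example.com/ping/your-uuid-here"   # placeholder UUID

# run_with_pings <command...>: send /start before the job, the bare URL on
# success, and /fail on failure. The start ping is allowed to fail so a
# monitoring hiccup never blocks the job itself.
run_with_pings() {
  curl -fsS --retry 3 -o /dev/null "${URL}/start" || true
  if "$@"; then
    curl -fsS --retry 3 -o /dev/null "$URL"
  else
    curl -fsS --retry 3 -o /dev/null "${URL}/fail"
    return 1
  fi
}

# usage: run_with_pings /usr/local/bin/nightly-backup.sh
```

Because the wrapper returns the job's own success or failure, it can replace the bare command in cron entries without changing downstream behavior.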
Can I monitor jobs on private servers?
Yes. The monitored server only needs outbound HTTPS access to the Healthchecks URL. No inbound firewall rule is required on the job host.
How should I handle failed backup scripts?
Ping only after the backup and upload both succeed. If the script fails early, the missing ping should trigger an alert.
Can multiple teams share one instance?
Yes, but use projects, naming conventions, and separate notification channels. Limit administrative access to people who manage the monitoring platform.
What retention should I use?
Keep enough history to investigate recurring failures, but do not treat Healthchecks as a compliance archive. Export critical incident evidence elsewhere if needed.
How often should I update the stack?
Review releases monthly, back up first, update in a maintenance window, and send test pings after the upgrade to confirm alerts still work.
Internal links
These related SysBrix guides use the same operational pattern for production self-hosted services:
- Production Guide: Deploy Apache Superset with Docker Compose + Caddy + PostgreSQL + Redis on Ubuntu
- Production Guide: Deploy ntfy with Docker Compose + Caddy + Auth + Attachments on Ubuntu
- Production Guide: Deploy Meilisearch with Docker Compose + Caddy + Master Key on Ubuntu
Talk to us
If you want this implemented with hardened defaults, observability, and tested recovery playbooks, our team can help.