Uptime Kuma Setup: SLA Tracking, Certificate Management, Database Monitoring, and SRE Best Practices
The first three guides in this series covered deployment, advanced monitor types, and API automation. This final guide covers what most monitoring setups never get to: turning uptime data into meaningful reliability metrics, automating certificate expiry management before it causes incidents, monitoring internal services that never touch the public internet, and applying SRE practices to your Uptime Kuma setup so your data actually drives decisions rather than sitting in a dashboard nobody looks at.
This guide is the fourth in the series. For initial setup see our basic setup guide, for production configuration see our advanced monitoring guide, and for API automation and multi-location monitoring see our automation guide. This guide assumes a fully operational, production-configured Uptime Kuma instance.
Prerequisites
- A running Uptime Kuma instance with HTTPS and at least 10 monitors configured
- Uptime Kuma version 1.23+ for the certificate monitoring features covered here
- Docker socket mounted (for container health monitoring)
- At least 30 days of historical monitoring data for meaningful SLA calculations
- `python3` and `jq` on your workstation for the SLA reporting scripts
Verify your instance has sufficient history for SLA tracking:
# Check database size and earliest heartbeat records
docker exec uptime-kuma sqlite3 /app/data/kuma.db \
"SELECT MIN(time), MAX(time), COUNT(*) FROM heartbeat;"
# Should show at least 30 days between MIN and MAX
# COUNT(*) gives total heartbeat records
# Check how many monitors you have:
docker exec uptime-kuma sqlite3 /app/data/kuma.db \
"SELECT COUNT(*) as monitors, type FROM monitor GROUP BY type ORDER BY monitors DESC;"
# Verify Prometheus metrics are accessible (API key as the basic-auth password):
curl -s -u ":$API_KEY" https://monitor.yourdomain.com/metrics | grep -c '^monitor_status'
# Should equal your monitor count
SLA Tracking and Reliability Metrics
Uptime Kuma shows you real-time status and rolling uptime percentages. But SLA compliance requires something more specific: calculating uptime over defined time windows (calendar month, quarter, fiscal year), comparing against targets, and generating reports your team can actually use in planning and postmortems.
Understanding What Uptime Kuma Measures
Uptime Kuma's built-in uptime display (the percentage shown per monitor) is calculated over the last 24 hours and 30 days using heartbeat success rates. For SLA purposes, you need calendar-aligned calculations over specific periods. The formula:
# SLA calculation from Uptime Kuma's SQLite database
# Uptime Kuma stores heartbeats in the 'heartbeat' table:
# - status: 1 = up, 0 = down
# - time: Unix timestamp in milliseconds
# - monitor_id: links to monitor table
# Calculate SLA for each monitor over the current calendar month:
docker exec uptime-kuma sqlite3 /app/data/kuma.db << 'EOF'
.headers on
.mode csv
SELECT
    m.name AS monitor_name,
    COUNT(*) AS total_checks,
    SUM(CASE WHEN h.status = 1 THEN 1 ELSE 0 END) AS successful_checks,
    ROUND(
        CAST(SUM(CASE WHEN h.status = 1 THEN 1 ELSE 0 END) AS FLOAT) /
        COUNT(*) * 100, 4
    ) AS uptime_percent,
    -- SLA targets: 99.9% = 8.8h/year, 99.95% = 4.4h/year, 99.99% = 52.6m/year
    CASE
        WHEN ROUND(CAST(SUM(CASE WHEN h.status = 1 THEN 1 ELSE 0 END) AS FLOAT) / COUNT(*) * 100, 4) >= 99.99 THEN 'MEETING 99.99%'
        WHEN ROUND(CAST(SUM(CASE WHEN h.status = 1 THEN 1 ELSE 0 END) AS FLOAT) / COUNT(*) * 100, 4) >= 99.9 THEN 'MEETING 99.9%'
        WHEN ROUND(CAST(SUM(CASE WHEN h.status = 1 THEN 1 ELSE 0 END) AS FLOAT) / COUNT(*) * 100, 4) >= 99.0 THEN 'MEETING 99.0%'
        ELSE 'BELOW SLA'
    END AS sla_status
FROM heartbeat h
JOIN monitor m ON h.monitor_id = m.id
WHERE
    h.time >= strftime('%s', 'now', 'start of month') * 1000
    AND h.time < strftime('%s', 'now', 'start of month', '+1 month') * 1000
GROUP BY h.monitor_id, m.name
ORDER BY uptime_percent ASC;
EOF
Automated Monthly SLA Report
#!/usr/bin/env python3
# generate-sla-report.py
# Generates a monthly SLA report from Uptime Kuma's SQLite database and
# prints it to stdout (pipe it to mail(1) or a chat webhook to distribute)
import sqlite3
from datetime import datetime, date
from dateutil.relativedelta import relativedelta  # pip install python-dateutil
DB_PATH = "/app/data/kuma.db"  # Path inside container; adjust for host mount
# SLA targets per monitor name pattern
SLA_TARGETS = {
"Production": 99.99,
"Staging": 99.9,
"Internal": 99.0,
"default": 99.9,
}
def get_sla_target(monitor_name: str) -> float:
    for pattern, target in SLA_TARGETS.items():
        if pattern.lower() in monitor_name.lower():
            return target
    return SLA_TARGETS["default"]

def calculate_monthly_sla(db_path: str, year: int, month: int) -> list:
    # Calculate period boundaries as Unix milliseconds
    period_start = int(datetime(year, month, 1).timestamp() * 1000)
    if month == 12:
        period_end = int(datetime(year + 1, 1, 1).timestamp() * 1000)
    else:
        period_end = int(datetime(year, month + 1, 1).timestamp() * 1000)
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    cursor.execute("""
        SELECT
            m.name,
            m.url,
            COUNT(*) as total,
            SUM(CASE WHEN h.status = 1 THEN 1 ELSE 0 END) as up_count,
            ROUND(
                CAST(SUM(CASE WHEN h.status = 1 THEN 1 ELSE 0 END) AS FLOAT)
                / COUNT(*) * 100, 4
            ) as uptime_pct
        FROM heartbeat h
        JOIN monitor m ON h.monitor_id = m.id
        WHERE h.time >= ? AND h.time < ?
          AND m.active = 1
        GROUP BY m.id, m.name
        ORDER BY uptime_pct ASC
    """, (period_start, period_end))
    results = []
    for row in cursor.fetchall():
        name, url, total, up_count, uptime_pct = row
        target = get_sla_target(name)
        # Each failed check ≈ 1 minute of downtime at a 60s check interval
        downtime_minutes = round((total - up_count) * 1.0, 1)
        results.append({
            "name": name,
            "url": url or "",
            "total": total,
            "uptime_pct": uptime_pct or 0,
            "target": target,
            "met_sla": (uptime_pct or 0) >= target,
            "downtime_min": downtime_minutes
        })
    conn.close()
    return results
if __name__ == "__main__":
    # Report for the previous month
    today = date.today()
    report_month = today - relativedelta(months=1)
    results = calculate_monthly_sla(
        DB_PATH, report_month.year, report_month.month
    )
    failing = [r for r in results if not r["met_sla"]]
    period_label = report_month.strftime("%B %Y")
    print(f"SLA Report: {period_label}")
    print(f"Total monitors: {len(results)} | SLA violations: {len(failing)}")
    print()
    for r in results:
        status = "✅" if r["met_sla"] else "❌"
        print(f"{status} {r['name']}: {r['uptime_pct']:.4f}% (target: {r['target']}%) | Downtime: {r['downtime_min']}m")
# Run from the host (where python-dateutil is installed) against the mounted
# database, with DB_PATH adjusted — add to crontab for 08:00 on the 1st of each month:
# 0 8 1 * * python3 /opt/scripts/generate-sla-report.py
SLA Error Budget Tracking
SRE teams use error budgets — the allowable downtime before SLA violation — to make deployment and risk decisions. Track how much error budget remains for each service:
# Error budget calculations for common SLA targets
# (per 30-day month = 43,200 minutes)
# 99.0% SLA → 432 minutes/month allowable downtime (7.2 hours)
# 99.5% SLA → 216 minutes/month allowable downtime (3.6 hours)
# 99.9% SLA → 43.2 minutes/month allowable downtime
# 99.95% SLA → 21.6 minutes/month allowable downtime
# 99.99% SLA → 4.3 minutes/month allowable downtime
# Query current month's error budget consumption:
docker exec uptime-kuma sqlite3 /app/data/kuma.db << 'QUERY'
.headers on
SELECT
    m.name,
    -- Downtime in minutes (assuming a 60s check interval)
    ROUND((COUNT(*) - SUM(CASE WHEN h.status = 1 THEN 1 ELSE 0 END)) * 1.0, 1) as downtime_minutes,
    -- Error budget for a 99.9% SLA on a 30-day month
    43.2 as budget_minutes,
    ROUND(
        ((COUNT(*) - SUM(CASE WHEN h.status = 1 THEN 1 ELSE 0 END)) * 1.0 / 43.2) * 100, 1
    ) as budget_consumed_pct
FROM heartbeat h
JOIN monitor m ON h.monitor_id = m.id
WHERE h.time >= strftime('%s', 'now', 'start of month') * 1000
GROUP BY m.id
HAVING budget_consumed_pct > 0
ORDER BY budget_consumed_pct DESC;
QUERY
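The arithmetic behind the budget table is one line, so it's worth centralizing. A small helper (illustrative, not part of Uptime Kuma) keeps the SLA math consistent across your reporting scripts:

```python
def allowed_downtime_minutes(sla_pct: float, window_minutes: int = 43_200) -> float:
    """Downtime budget (minutes) for an SLA target over a window (default: 30-day month)."""
    return round(window_minutes * (1 - sla_pct / 100), 1)

def budget_consumed_pct(downtime_min: float, sla_pct: float, window_minutes: int = 43_200) -> float:
    """Share of the error budget already burned, as a percentage."""
    return round(100 * downtime_min / allowed_downtime_minutes(sla_pct, window_minutes), 1)
```

`allowed_downtime_minutes(99.9)` returns 43.2, matching the table above, and `budget_consumed_pct` mirrors the SQL query's `budget_consumed_pct` column for any target, not just 99.9%.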
Certificate Monitoring: Never Let SSL Expire Again
SSL certificate expiry is one of the most embarrassing and preventable causes of outages. Uptime Kuma has built-in certificate monitoring, but getting it configured correctly — with the right alert lead times and the right people notified — requires more than just enabling the checkbox.
Configuring Certificate Expiry Monitoring
For every HTTPS monitor in Uptime Kuma, certificate monitoring is enabled by default. The key settings are in the monitor's Advanced section:
- Notify on certificate expiry: Enable on every HTTPS monitor
- Expiry notification days: Set to 30, 14, 7 — alert progressively as expiry approaches
- Notification channel for expiry: Use a lower-urgency channel (email or Slack) rather than PagerDuty for 30-day warnings; escalate urgency as days decrease
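For a fleet-wide view of those expiry dates, the Prometheus endpoint exposes a days-remaining gauge per monitor (the metric name below is what recent Uptime Kuma versions emit; verify it against your instance's /metrics output). A small shell helper sorts monitors by urgency:

```shell
# Sort cert-expiry metric lines by days remaining, soonest first.
# Expects lines like: monitor_cert_days_remaining{monitor_name="API"} 23
kuma_cert_report() {
  grep '^monitor_cert_days_remaining' | sort -t' ' -k2 -n
}

# Usage (API key as the basic-auth password, as in the /metrics check earlier):
# curl -s -u ":$API_KEY" https://monitor.yourdomain.com/metrics | kuma_cert_report | head
```

The top lines of the output are the certificates to deal with first.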
Dedicated Certificate Monitors
For domains that don't have HTTP services but still need certificate monitoring (SMTP, IMAP, database TLS), create dedicated certificate-type monitors:
#!/usr/bin/env python3
# audit-certificates.py
# Scans all HTTPS endpoints and reports certificate expiry status
# Run independently of Uptime Kuma for a comprehensive audit
import ssl
import socket
from datetime import datetime, timezone
from typing import Optional
ENDPOINTS = [
    # Format: (hostname, port, display_name)
    ("app.yourdomain.com", 443, "Main Application"),
    ("api.yourdomain.com", 443, "REST API"),
    ("mail.yourdomain.com", 993, "IMAP (SSL)"),
    ("smtp.yourdomain.com", 465, "SMTP (SSL)"),
    # PostgreSQL negotiates TLS in-protocol, so the plain handshake below reports
    # ERROR for it; check it separately with: openssl s_client -starttls postgres
    ("db.yourdomain.com", 5432, "PostgreSQL TLS"),
    ("monitor.yourdomain.com", 443, "Uptime Kuma"),
]
def get_cert_expiry(hostname: str, port: int) -> Optional[datetime]:
    try:
        ctx = ssl.create_default_context()
        with ctx.wrap_socket(
            socket.create_connection((hostname, port), timeout=10),
            server_hostname=hostname
        ) as s:
            cert = s.getpeercert()
            expiry_str = cert["notAfter"]
            return datetime.strptime(expiry_str, "%b %d %H:%M:%S %Y %Z").replace(tzinfo=timezone.utc)
    except Exception:
        return None
results = []
now = datetime.now(timezone.utc)
for hostname, port, name in ENDPOINTS:
    expiry = get_cert_expiry(hostname, port)
    if expiry:
        days_remaining = (expiry - now).days
        status = "OK"
        if days_remaining <= 7:
            status = "CRITICAL"
        elif days_remaining <= 14:
            status = "WARNING"
        elif days_remaining <= 30:
            status = "NOTICE"
        results.append({
            "name": name,
            "endpoint": f"{hostname}:{port}",
            "expires": expiry.strftime("%Y-%m-%d"),
            "days_remaining": days_remaining,
            "status": status
        })
    else:
        results.append({
            "name": name,
            "endpoint": f"{hostname}:{port}",
            "expires": "ERROR",
            "days_remaining": -1,
            "status": "ERROR"
        })

# Sort by urgency
results.sort(key=lambda x: x["days_remaining"])
print(f"Certificate Audit Report — {now.strftime('%Y-%m-%d')}")
print("=" * 70)
for r in results:
    icon = {"OK": "✅", "NOTICE": "📅", "WARNING": "⚠️", "CRITICAL": "🚨", "ERROR": "❌"}[r["status"]]
    print(f"{icon} {r['name']:30} Expires: {r['expires']:12} ({r['days_remaining']}d)")
# Run weekly via cron:
# 0 9 * * 1 python3 /opt/scripts/audit-certificates.py | mail -s "Weekly Cert Audit" [email protected]
Certificate Renewal Automation
If you're using Let's Encrypt (via Certbot or Traefik's ACME), wire cert renewal success into a push monitor so Uptime Kuma alerts if auto-renewal stops working:
#!/bin/bash
# certbot-renew-with-monitoring.sh
# Runs certbot renewal and pings Uptime Kuma push monitor on success
# If this script fails silently, the push monitor alerts after the interval
set -euo pipefail
PUSH_URL="https://monitor.yourdomain.com/api/push/YOUR_CERT_MONITOR_PUSH_TOKEN"
# Attempt renewal
certbot renew --quiet --non-interactive
# Reload services that use the certs
systemctl reload nginx
# Ping push monitor — only runs if above succeeded
curl -fsS --retry 3 \
"${PUSH_URL}?status=up&msg=Certbot+renewal+check+OK&ping=" \
> /dev/null
# In your crontab:
# 0 3 * * * /opt/scripts/certbot-renew-with-monitoring.sh >> /var/log/certbot-kuma.log 2>&1
# Set the push monitor interval to 25 hours — if renewal runs daily and the
# push doesn't arrive within 25 hours, Uptime Kuma alerts
# For Traefik ACME (automatic renewal):
# Create a separate certificate check script that just validates expiry:
#!/bin/bash
# check-traefik-certs.sh
DOMAIN="app.yourdomain.com"
PUSH_URL="https://monitor.yourdomain.com/api/push/YOUR_TRAEFIK_CERT_PUSH_TOKEN"
EXPIRY=$(echo | openssl s_client -servername "$DOMAIN" -connect "${DOMAIN}:443" 2>/dev/null | \
  openssl x509 -noout -enddate | cut -d= -f2)
EXPIRY_EPOCH=$(date -d "$EXPIRY" +%s)
NOW_EPOCH=$(date +%s)
DAYS_LEFT=$(( (EXPIRY_EPOCH - NOW_EPOCH) / 86400 ))
if [ "$DAYS_LEFT" -gt 14 ]; then
  curl -fsS "${PUSH_URL}?status=up&msg=Cert+valid+${DAYS_LEFT}+days&ping=" > /dev/null
else
  curl -fsS "${PUSH_URL}?status=down&msg=CERT+EXPIRY+WARNING+${DAYS_LEFT}+days&ping=" > /dev/null
fi
Monitoring Internal Services and Databases
Most infrastructure has more to monitor than public HTTP endpoints. Databases, message queues, internal APIs, and background workers all need health checks — but they're not publicly reachable, so Uptime Kuma needs to be deployed in the same network to reach them.
Database TCP Monitors
TCP port monitors confirm a database is accepting connections — the first layer of database health monitoring:
# Database TCP monitors to configure in Uptime Kuma:
# (All assume Uptime Kuma is on the same Docker network as your databases)
# PostgreSQL
# Monitor Type: TCP Port
# Hostname: postgres-container-name (or internal IP)
# Port: 5432
# Interval: 30s
# Friendly Name: PostgreSQL — Production
# Redis
# Monitor Type: TCP Port
# Hostname: redis-container-name
# Port: 6379
# Interval: 30s
# MySQL/MariaDB
# Monitor Type: TCP Port
# Hostname: mysql-container-name
# Port: 3306
# MongoDB
# Monitor Type: TCP Port
# Hostname: mongo-container-name
# Port: 27017
# For databases that expose a health endpoint:
# Redis has a PING command — create an HTTP monitor against a Redis proxy
# or use a small sidecar that exposes /health
# Example Redis health sidecar (add to your docker-compose.yml):
redis-health:
  image: redis:alpine  # ships redis-cli; plain alpine does not
  command: >
    sh -c 'while true; do
      if redis-cli -h redis -a "$$REDIS_PASSWORD" ping 2>/dev/null | grep -q PONG; then
        printf "HTTP/1.1 200 OK\r\n\r\nOK" | nc -l -p 8888;
      else
        printf "HTTP/1.1 503 Service Unavailable\r\n\r\nDOWN" | nc -l -p 8888;
      fi;
    done'
  environment:
    - REDIS_PASSWORD=${REDIS_PASSWORD}
  ports:
    - "8888:8888"
  networks:
    - app_network
Application-Level Health Endpoints
TCP port checks confirm the process is running — they don't confirm it's actually healthy. Build proper health endpoints into every service and monitor those instead:
# Comprehensive health endpoint pattern — implement in every service
# Returns 200 only when ALL dependencies are healthy
# Python (FastAPI) example:
import os
from fastapi import FastAPI, HTTPException
import asyncpg
import redis.asyncio as redis
import httpx

DATABASE_URL = os.environ["DATABASE_URL"]
REDIS_URL = os.environ["REDIS_URL"]
PAYMENT_SERVICE_URL = os.environ["PAYMENT_SERVICE_URL"]
APP_VERSION = os.environ.get("APP_VERSION", "dev")

app = FastAPI()

@app.get("/health")
async def health_check():
    checks = {}
    healthy = True
    # Check PostgreSQL
    try:
        conn = await asyncpg.connect(DATABASE_URL)
        await conn.fetchval("SELECT 1")
        await conn.close()
        checks["postgres"] = "ok"
    except Exception as e:
        checks["postgres"] = f"error: {str(e)[:50]}"
        healthy = False
    # Check Redis
    try:
        r = redis.from_url(REDIS_URL)
        await r.ping()
        await r.close()
        checks["redis"] = "ok"
    except Exception as e:
        checks["redis"] = f"error: {str(e)[:50]}"
        healthy = False
    # Check downstream service
    try:
        async with httpx.AsyncClient(timeout=5) as client:
            resp = await client.get(f"{PAYMENT_SERVICE_URL}/health")
            checks["payment_service"] = "ok" if resp.status_code == 200 else f"status:{resp.status_code}"
            if resp.status_code != 200:
                healthy = False
    except Exception as e:
        checks["payment_service"] = f"error: {str(e)[:50]}"
        healthy = False
    if not healthy:
        raise HTTPException(status_code=503, detail={"status": "degraded", "checks": checks})
    return {"status": "ok", "checks": checks, "version": APP_VERSION}
# Configure in Uptime Kuma:
# Monitor Type: HTTP(s) — Keyword
# URL: http://myapp:8000/health (internal Docker network URL)
# Keyword: "status":"ok"
# Accepted Status Codes: 200-299
# Interval: 30s
Queue and Worker Health Monitoring
Background job queues fail silently — jobs stop processing, queue depth grows, but nothing obviously breaks at the HTTP layer. Use push monitors to detect worker failures:
#!/usr/bin/env python3
# worker-heartbeat.py
# Run inside your worker process — pings Uptime Kuma to confirm worker is alive
# Call this function at the end of each successful job batch
import requests
import os
import logging
from functools import wraps
PUSH_URL = os.environ.get(
"UPTIME_KUMA_PUSH_URL",
"https://monitor.yourdomain.com/api/push/YOUR_WORKER_PUSH_TOKEN"
)
def ping_uptime_kuma(msg: str = "OK", status: str = "up") -> None:
    """Send heartbeat to Uptime Kuma push monitor."""
    try:
        requests.get(
            PUSH_URL,
            params={"status": status, "msg": msg, "ping": ""},
            timeout=10
        )
    except Exception as e:
        # Don't let a monitoring failure affect the worker
        logging.warning(f"Failed to ping Uptime Kuma: {e}")

def monitored_job(job_name: str = ""):
    """Decorator that reports job success/failure to Uptime Kuma."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            try:
                result = func(*args, **kwargs)
                ping_uptime_kuma(msg=f"{job_name or func.__name__} completed", status="up")
                return result
            except Exception as e:
                ping_uptime_kuma(msg=f"{job_name or func.__name__} FAILED: {str(e)[:50]}", status="down")
                raise
        return wrapper
    return decorator

# Usage in your worker code:
@monitored_job(job_name="Email Queue Processor")
def process_email_queue():
    # Your actual job logic
    emails = fetch_pending_emails()
    for email in emails:
        send_email(email)
    return len(emails)

# Also add a queue depth check that reports down if the queue is too deep:
def check_queue_health(max_depth: int = 1000):
    depth = get_queue_depth()
    if depth > max_depth:
        ping_uptime_kuma(
            msg=f"Queue depth critical: {depth} items",
            status="down"
        )
    else:
        ping_uptime_kuma(msg=f"Queue healthy: {depth} items")
SRE Practices for Your Monitoring Stack
Monitor Hygiene: The Four Golden Signals
SRE methodology defines four golden signals for monitoring: latency, traffic, errors, and saturation. Translate these into Uptime Kuma monitors:
- Latency — HTTP monitors record response time on every check. Watch for P95 degradation, not just failures: Uptime Kuma charts response-time trends per monitor, and you can set the Request Timeout low enough that an unacceptably slow response is recorded as a failed check rather than a slow success.
- Traffic — push monitors from your API gateway that fire when traffic drops below expected minimums (sudden traffic drops = likely upstream failure).
- Errors — keyword monitors that check your health endpoint's error rates, not just 200/500 status codes.
- Saturation — TCP monitors for connection acceptance + application health endpoints that include queue depth and memory pressure.
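For the latency signal, Uptime Kuma already stores each check's response time in the heartbeat table's `ping` column (milliseconds), so P95 can be computed offline. A nearest-rank percentile sketch (the commented query is an assumption based on the schema described earlier in this guide):

```python
def p95(values):
    """Nearest-rank 95th percentile of a non-empty sequence."""
    ordered = sorted(values)
    rank = max(1, round(0.95 * len(ordered)))  # 1-based nearest rank
    return ordered[rank - 1]

# Pull this month's response times for one monitor and report P95:
# import sqlite3
# conn = sqlite3.connect("kuma.db")  # host-side copy of the database
# pings = [row[0] for row in conn.execute(
#     "SELECT ping FROM heartbeat WHERE monitor_id = ? AND ping IS NOT NULL", (1,))]
# print(f"P95 latency: {p95(pings)} ms")
```

Run it weekly and alert your team when a service's P95 creeps up even though its uptime is still green.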
On-Call Runbook Integration
Every monitor in Uptime Kuma supports a Description field. Use it as an inline runbook reference so on-call engineers know exactly what to do when an alert fires:
# Monitor description template (set in Uptime Kuma monitor settings):
# Add this to the Description field of each critical monitor:
"""
Service: Payment Processing API
Owner: Backend Team
Severity: P0
Runbook: https://wiki.yourdomain.com/runbooks/payment-api
Immediate steps:
1. Check container health: docker logs payment-api --tail 50
2. Check database connectivity: docker exec payment-api npm run db:check
3. Check Stripe API status: https://status.stripe.com
4. Escalate to: @backend-team on Slack | +1-555-0100 (PagerDuty)
Recent incidents: https://git.yourdomain.com/ops/incidents
"""
# Reference this in webhook-driven alerts by including the monitor
# description in the payload (Uptime Kuma webhook includes full monitor object)
# Your incident handler can then extract and include the runbook link
# in every Slack/PagerDuty alert automatically
Reducing Alert Fatigue: Maintenance Windows and Alert Policies
#!/usr/bin/env python3
# schedule-maintenance.py
# Creates a maintenance window in Uptime Kuma before planned deployments
# Suppresses alerts during the window so on-call isn't paged
from uptime_kuma_api import UptimeKumaApi, MaintenanceStrategy
import os
from datetime import datetime, timedelta, timezone

api = UptimeKumaApi(os.environ["UPTIME_KUMA_URL"])
api.login(os.environ["UPTIME_KUMA_USER"], os.environ["UPTIME_KUMA_PASSWORD"])

# Get monitor IDs for services being deployed
monitors = api.get_monitors()
deployment_monitors = [
    m["id"] for m in monitors
    if "payment" in m["name"].lower() or "checkout" in m["name"].lower()
]

# Create a 30-minute maintenance window starting now.
# NOTE: parameter names follow the uptime-kuma-api docs (dateRange for the
# SINGLE strategy, add_monitor_maintenance to attach monitors) — verify
# against your installed version of the library.
start_time = datetime.now(timezone.utc)
end_time = start_time + timedelta(minutes=30)

result = api.add_maintenance(
    title="Payment Service Deploy - v2.4.1",
    description="Automated window created by CI",
    strategy=MaintenanceStrategy.SINGLE,
    active=True,
    dateRange=[
        start_time.strftime("%Y-%m-%d %H:%M:%S"),
        end_time.strftime("%Y-%m-%d %H:%M:%S"),
    ],
)
maintenance_id = result["maintenanceID"]

# Attach the affected monitors to the window
api.add_monitor_maintenance(
    maintenance_id,
    [{"id": mid} for mid in deployment_monitors],
)

print(f"Maintenance window created: ID={maintenance_id}")
print(f"Window: {start_time.strftime('%H:%M')} - {end_time.strftime('%H:%M')} UTC")
print(f"Covering {len(deployment_monitors)} monitors")
api.disconnect()
# Wire this into your CI/CD pipeline:
# - name: Create maintenance window
# run: |
# pip install uptime-kuma-api --quiet
# python3 schedule-maintenance.py
# env:
# UPTIME_KUMA_URL: ${{ secrets.UPTIME_KUMA_URL }}
# UPTIME_KUMA_USER: ${{ secrets.UPTIME_KUMA_USER }}
# UPTIME_KUMA_PASSWORD: ${{ secrets.UPTIME_KUMA_PASSWORD }}
Tips, Gotchas, and Troubleshooting
SLA Numbers Don't Match Your Perception of Uptime
# Common causes of SLA numbers seeming wrong:
# 1. Uptime Kuma itself was down during the period
#    If Kuma was down, no heartbeats were recorded — the gap isn't counted as
#    downtime, so the SLA queries above silently overstate uptime
# Check for gaps in heartbeat records:
docker exec uptime-kuma sqlite3 /app/data/kuma.db << 'EOF'
SELECT
monitor_id,
datetime(time/1000, 'unixepoch') as check_time,
status
FROM heartbeat
WHERE monitor_id = 1
ORDER BY time DESC
LIMIT 10;
EOF
# Look for gaps larger than 2× the check interval
# 2. Checks failing during maintenance windows
# Uptime Kuma still records heartbeats during maintenance but doesn't alert
# Decide whether to exclude maintenance windows from SLA calculations
# (Most SLAs exclude planned maintenance)
# 3. Check interval matters for short outages
# With 60s intervals, a 30-second outage may not be detected at all
# Reduce interval to 30s or 20s for P0 services
docker exec uptime-kuma sqlite3 /app/data/kuma.db \
"SELECT name, interval FROM monitor WHERE active = 1 ORDER BY interval DESC LIMIT 10;"
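Point 3 can be quantified. With a single probe per interval and no retries, an outage of duration d starting at a random phase relative to a check interval i is caught with probability min(1, d/i). A back-of-envelope helper (a modeling assumption, not an Uptime Kuma feature):

```python
def detection_probability(outage_s: float, interval_s: float) -> float:
    """Chance at least one check lands inside an outage, assuming the outage
    start is uniformly random relative to the check schedule (no retries)."""
    return min(1.0, outage_s / interval_s)
```

A 30-second outage under 60-second checks is caught only half the time; dropping to 20-second checks guarantees detection of anything lasting 20 seconds or longer.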
Internal Service Monitors Showing as Down From Uptime Kuma
# Test reachability from inside the Uptime Kuma container:
docker exec uptime-kuma curl -sv http://your-internal-service:8080/health 2>&1 | tail -20
# If it fails:
# 1. Check Docker network — Uptime Kuma must be on the same network as the service
docker inspect uptime-kuma | jq '.[0].NetworkSettings.Networks | keys'
docker inspect your-service | jq '.[0].NetworkSettings.Networks | keys'
# Both must share at least one network
# 2. Add Uptime Kuma to the service's network:
# In your Uptime Kuma docker-compose.yml:
# networks:
# - uptime_kuma_net
# - app_network ← Add this
#
# networks:
# uptime_kuma_net:
# app_network:
# external: true ← Reference the existing app network
# 3. Restart Uptime Kuma after network changes:
docker compose up -d --force-recreate uptime-kuma
Certificate Monitor Showing Wrong Expiry Date
# Verify what certificate is actually being served:
echo | openssl s_client -servername your-domain.com -connect your-domain.com:443 2>/dev/null | \
openssl x509 -noout -dates -subject -issuer
# If the cert was recently renewed but Uptime Kuma still shows old expiry:
# Force a fresh check by temporarily disabling and re-enabling the monitor
# Or restart Uptime Kuma to clear the certificate cache
# Check from inside the Uptime Kuma container:
docker exec uptime-kuma sh -c \
"echo | openssl s_client -servername your-domain.com -connect your-domain.com:443 2>/dev/null | \
openssl x509 -noout -enddate"
# If the cert check fails for internal hostnames that use self-signed certs:
# Enable "Ignore TLS/SSL error" in the monitor's Advanced settings
# This allows monitoring while still tracking the cert expiry date
Pro Tips
- Run the SLA report script monthly and share it with stakeholders — a report that reaches engineering leadership every month creates accountability and surfaces reliability trends before they become problems. The data is already in Uptime Kuma; the only cost is writing the query once.
- Set tighter check intervals for payment and auth flows — a 60-second check interval means you could have a 59-second outage that's never detected. For revenue-critical paths, use 20-second intervals and 1-retry with 10-second retry interval. The additional load on Uptime Kuma is negligible.
- Use tags to define SLA tiers — tag monitors with `sla:99.99`, `sla:99.9`, or `sla:99.0`. Your SLA reporting scripts can then automatically apply the right target per monitor without maintaining a separate configuration file.
- Monitor Uptime Kuma's own SLA — use the external monitoring approach from our advanced guide (a second independent instance, or a free-tier Uptime Robot) to track Kuma's own availability. Include it in your SLA reports as infrastructure uptime.
- Automate runbook links in alerts — wire your webhook handler to extract the monitor's description field and include it in every alert. When the on-call engineer gets paged at 3am, the first message they see contains the runbook link, the owning team, and the immediate troubleshooting steps.
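The tag-driven SLA tiers above need only a small resolver in your reporting scripts. A sketch, assuming tag dicts shaped like those returned by the community uptime-kuma-api client (a `name` field per tag):

```python
def sla_target_from_tags(tags: list, default: float = 99.9) -> float:
    """Resolve an SLA target from tags like 'sla:99.99'; fall back to default."""
    for tag in tags:
        name = tag.get("name", "") if isinstance(tag, dict) else str(tag)
        if name.startswith("sla:"):
            try:
                return float(name.split(":", 1)[1])
            except ValueError:
                continue  # malformed tag; keep scanning
    return default
```

Drop this into generate-sla-report.py in place of the name-pattern lookup and the SLA targets live next to the monitors they govern.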
Wrapping Up
This fourth guide completes the Uptime Kuma series. Across the four guides, the progression is clear: from getting alerts when services go down, to configuring production-grade monitoring for teams, to automating monitor management and incident workflows, to this guide's SLA tracking, certificate management, and SRE-grade practices.
The end state is a monitoring setup that doesn't just tell you when something broke — it tracks your reliability over time, warns you about certificates before they expire, monitors the internal services that HTTP checks can't reach, and generates the data your engineering leadership needs to make informed decisions about where to invest in reliability improvements.
Need a Complete Reliability Engineering Practice Built for Your Team?
SLA tracking, error budget management, incident workflows, multi-location monitoring, and SRE practices require both tooling and process. The sysbrix team designs complete reliability engineering setups — from monitoring infrastructure through incident management processes — for engineering organizations that need to make real SLA commitments and have the data to back them up.
Talk to Us →