Uptime Kuma Setup: Grafana Integration, Custom Dashboards, Alertmanager, and Enterprise Observability
The four previous guides in this series covered the complete Uptime Kuma operational picture: basic deployment, advanced monitors and status pages, API automation and multi-location monitoring, and SLA tracking and SRE practices. This final guide covers the integration layer: pulling Uptime Kuma's metrics into Grafana for unified infrastructure dashboards, routing alerts through Prometheus Alertmanager for sophisticated deduplication and escalation, embedding live status widgets in your internal tools, and operating Uptime Kuma as a first-class component of an enterprise observability stack.
Prerequisites
- A fully configured Uptime Kuma instance — see the series starting with our advanced monitoring guide
- Grafana running (standalone or as part of a monitoring stack)
- Prometheus with the Uptime Kuma metrics endpoint being scraped
- Prometheus Alertmanager deployed (separate from Grafana alerting)
- At least 30 days of Uptime Kuma history for meaningful dashboard visualizations
Verify your Prometheus is scraping Uptime Kuma:
# Verify Uptime Kuma metrics are being scraped:
curl -s http://localhost:9090/api/v1/query?query=monitor_status | \
jq '.data.result | length'
# Should return your monitor count (e.g., 24)
# Check the last scrape time:
curl -s 'http://localhost:9090/api/v1/query?query=up{job="uptime-kuma"}' | \
jq '.data.result[0] | {up: .value[1], scrape_time: (.value[0] | todate)}'
# Verify key metrics exist:
curl -s http://localhost:9090/api/v1/label/__name__/values | \
jq '.data | map(select(startswith("monitor_")))'
# Should show: monitor_status, monitor_response_time, monitor_cert_days_remaining
Grafana Integration: Uptime Kuma Dashboards
Uptime Kuma provides a clean per-service view, but it doesn't show how uptime correlates with your infrastructure metrics — CPU spikes before an outage, memory pressure causing slowdowns, deployment events triggering degraded response times. Grafana unifies these views so you can see availability alongside everything else in one place.
Creating the Uptime Kuma Data Source in Grafana
# Provision the Prometheus data source pointing at your Prometheus instance
# (which already scrapes Uptime Kuma)
# grafana/provisioning/datasources/prometheus.yml:
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
    jsonData:
      timeInterval: '30s'   # Match the Prometheus scrape interval for Uptime Kuma
      queryTimeout: '60s'
      httpMethod: POST
      exemplarTraceIdDestinations:
        - name: trace_id
          datasourceUid: tempo
# Apply by restarting Grafana:
docker compose restart grafana
# Verify data source connectivity:
curl -s http://admin:password@localhost:3000/api/datasources | \
jq '[.[] | {name: .name, type: .type, url: .url}]'
# Test the Prometheus connection:
curl -s 'http://admin:password@localhost:3000/api/datasources/proxy/1/api/v1/query?query=monitor_status' | \
jq '.data.result | length'
Building the Unified Availability Dashboard
# Grafana dashboard JSON — paste in Dashboard → Import → JSON
# This creates a comprehensive availability overview dashboard
cat > uptime-kuma-dashboard.json << 'EOF'
{
  "title": "Service Availability Overview",
  "tags": ["uptime", "availability", "sla"],
  "timezone": "browser",
  "refresh": "1m",
  "panels": [
    {
      "title": "🔴 Services Currently Down",
      "type": "stat",
      "datasource": "Prometheus",
      "targets": [{
        "expr": "count(monitor_status == 0) or vector(0)",
        "legendFormat": "Down"
      }],
      "fieldConfig": {
        "defaults": {
          "color": {"mode": "thresholds"},
          "thresholds": {
            "steps": [
              {"value": 0, "color": "green"},
              {"value": 1, "color": "orange"},
              {"value": 3, "color": "red"}
            ]
          }
        }
      },
      "gridPos": {"x": 0, "y": 0, "w": 4, "h": 4}
    },
    {
      "title": "Overall Availability (24h)",
      "type": "stat",
      "datasource": "Prometheus",
      "targets": [{
        "expr": "avg(avg_over_time(monitor_status[24h])) * 100",
        "legendFormat": "Availability %"
      }],
      "fieldConfig": {
        "defaults": {
          "unit": "percent",
          "decimals": 3,
          "thresholds": {
            "steps": [
              {"value": 0, "color": "red"},
              {"value": 99, "color": "orange"},
              {"value": 99.9, "color": "green"}
            ]
          }
        }
      },
      "gridPos": {"x": 4, "y": 0, "w": 4, "h": 4}
    },
    {
      "title": "P95 Response Time (5m avg)",
      "type": "stat",
      "datasource": "Prometheus",
      "targets": [{
        "expr": "quantile(0.95, avg_over_time(monitor_response_time[5m]))",
        "legendFormat": "P95 Response"
      }],
      "fieldConfig": {
        "defaults": {
          "unit": "ms",
          "thresholds": {
            "steps": [
              {"value": 0, "color": "green"},
              {"value": 500, "color": "orange"},
              {"value": 2000, "color": "red"}
            ]
          }
        }
      },
      "gridPos": {"x": 8, "y": 0, "w": 4, "h": 4}
    }
  ]
}
EOF
# Import the dashboard:
curl -X POST http://admin:password@localhost:3000/api/dashboards/import \
-H 'Content-Type: application/json' \
-d "{\"dashboard\": $(cat uptime-kuma-dashboard.json), \"overwrite\": true}" | \
jq .url
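The reverse direction is just as useful: exporting a dashboard back out of Grafana so it can be committed to Git (a practice this guide returns to in the pro tips). This is a minimal sketch; the UID, credentials, and host are placeholders for your instance:

```shell
# Export a dashboard by UID so it can be committed to Git.
# GRAFANA_URL and DASH_UID are placeholders; substitute your own values.
set -o pipefail
GRAFANA_URL="${GRAFANA_URL:-http://admin:password@localhost:3000}"
DASH_UID="${DASH_UID:-uptime-kuma-overview}"

mkdir -p dashboards
# Strip runtime fields (id, version) so the JSON diffs cleanly in Git:
if curl -sf "${GRAFANA_URL}/api/dashboards/uid/${DASH_UID}" \
     | jq '.dashboard | del(.id, .version)' > "dashboards/${DASH_UID}.json"; then
  echo "exported dashboards/${DASH_UID}.json"
else
  echo "export failed (is Grafana reachable?)"
fi
```

Dropping `id` and `version` keeps deploy-specific fields out of version control so reviews only show meaningful panel changes.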
Key PromQL Queries for Uptime Kuma Dashboards
# Essential PromQL queries for Uptime Kuma Grafana panels:
# 1. Heatmap: Monitor status over time (green=up, red=down)
# Use type: State timeline with this query:
monitor_status
# Splits by monitor_name label automatically
# 2. Response time time series (per service):
avg_over_time(monitor_response_time{monitor_name=~"$service"}[5m])
# 3. Uptime percentage for last 30 days:
avg_over_time(monitor_status{monitor_name=~"$service"}[30d]) * 100
# 4. Services not meeting 99.9% SLA this month:
avg_over_time(monitor_status[30d]) * 100 < 99.9
# 5. Certificate days remaining (for cert monitoring panels):
monitor_cert_days_remaining
# Alert threshold: < 30 days = warning, < 7 days = critical
# 6. Count of outages in the last 7 days per service:
count_over_time((
monitor_status == 0
and
(monitor_status offset 1m) == 1
)[7d:])
# 7. Average time to recover (MTTR approximation):
# This requires custom recording rules in Prometheus:
# Record: job:monitor_state_changes:rate5m
# Expression: changes(monitor_status[5m])
# 8. Correlation: response time spike vs status change
# Overlay on the same panel:
monitor_response_time > 2000
monitor_status == 0
# Configure alerts when both are true simultaneously
Prometheus Alertmanager: Sophisticated Alert Routing
Uptime Kuma's built-in notifications fire on every status change. Alertmanager adds what Uptime Kuma can't do natively: alert deduplication (don't page twice for the same outage), grouping (one Slack message for 5 simultaneous failures, not 5 messages), silence rules, inhibition (don't alert about API being down if the database is already known down), and multi-receiver routing based on severity and service ownership.
Alertmanager Configuration
# alertmanager/alertmanager.yml
# Note: Alertmanager does not expand environment variables in its config file.
# Render the ${...} placeholders with envsubst (or your deploy tooling) before startup.
global:
  resolve_timeout: 5m
  slack_api_url: '${SLACK_WEBHOOK_URL}'
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

# Routing tree:
route:
  receiver: default-receiver
  group_by: ['alertname', 'environment']
  group_wait: 30s       # Wait 30s to group related alerts before firing
  group_interval: 5m    # Re-notify every 5 minutes if still firing
  repeat_interval: 4h   # Don't re-alert if nothing changed after 4 hours
  routes:
    # P0 production critical — immediate PagerDuty
    - matchers:
        - severity = critical
        - environment = production
      receiver: pagerduty-critical
      group_wait: 0s        # No grouping delay for critical
      repeat_interval: 30m
    # P1 production warning — Slack with short delay
    - matchers:
        - severity = warning
        - environment = production
      receiver: slack-production
      group_wait: 2m
      repeat_interval: 2h
    # Staging — Slack only, lower urgency
    - matchers:
        - environment = staging
      receiver: slack-staging
      group_wait: 5m
      repeat_interval: 8h
    # SSL certificate warnings — email to ops
    - matchers:
        - alertname = SSLCertExpiringSoon
      receiver: email-ops
      group_wait: 1h        # Cert alerts don't need immediate grouping
      repeat_interval: 24h

# Inhibition: suppress downstream alerts when the root cause is known
inhibit_rules:
  # If the database is down, suppress all API alerts (root cause is the DB)
  - source_matchers:
      - alertname = DatabaseDown
    target_matchers:
      - alertname = ServiceDown
    equal: ['environment']  # Only inhibit within the same environment
  # If an entire datacenter is down, suppress individual service alerts
  - source_matchers:
      - alertname = DatacenterDown
    target_matchers:
      - severity =~ "warning|critical"

# Receivers:
receivers:
  - name: default-receiver
    slack_configs:
      - channel: '#monitoring'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: '${PAGERDUTY_KEY}'
        description: '{{ .GroupLabels.alertname }}: {{ .GroupLabels.environment }}'
        severity: critical
        details:
          services: '{{ range .Alerts }}{{ .Labels.monitor_name }}, {{ end }}'
  - name: slack-production
    slack_configs:
      - channel: '#incidents'
        color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
        title: >-
          {{ if eq .Status "firing" }}🔴{{ else }}✅{{ end }}
          {{ .GroupLabels.alertname }}
        text: |
          *Environment:* {{ .GroupLabels.environment }}
          *Services:* {{ range .Alerts }}{{ .Labels.monitor_name }}, {{ end }}
          *Started:* {{ (index .Alerts 0).StartsAt.Format "15:04 UTC" }}
          {{ if .CommonAnnotations.runbook }}*Runbook:* {{ .CommonAnnotations.runbook }}{{ end }}
  - name: slack-staging
    slack_configs:
      - channel: '#monitoring-staging'
        title: '[Staging] {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Labels.monitor_name }}: {{ .Annotations.description }}{{ end }}'
  - name: email-ops
    email_configs:
      - to: '[email protected]'
        from: '[email protected]'
        smarthost: '${SMTP_HOST}:587'
        auth_username: '${SMTP_USER}'
        auth_password: '${SMTP_PASS}'
        subject: '{{ .GroupLabels.alertname }}'
        html: '{{ range .Alerts }}{{ .Annotations.description }}<br>{{ end }}'
Prometheus Alert Rules for Uptime Kuma Metrics
# prometheus/rules/uptime-kuma.yml
groups:
  - name: uptime-kuma
    interval: 30s
    rules:
      # Service down alert with an environment label derived from the monitor name:
      - alert: ServiceDown
        expr: monitor_status == 0
        for: 2m   # Must be down for 2 consecutive minutes
        labels:
          severity: critical
          # Environment extracted from the monitor name (templated label):
          environment: '{{ if match "staging" $labels.monitor_name }}staging{{ else }}production{{ end }}'
        annotations:
          summary: "{{ $labels.monitor_name }} is DOWN"
          description: "{{ $labels.monitor_name }} has been unavailable for at least 2 minutes"
          runbook: "https://wiki.company.com/runbooks/{{ $labels.monitor_name }}"
      # Slow response time warning:
      - alert: SlowResponseTime
        expr: avg_over_time(monitor_response_time[5m]) > 2000
        for: 5m
        labels:
          severity: warning
          environment: production
        annotations:
          summary: "{{ $labels.monitor_name }} response time degraded"
          description: "5m avg response: {{ $value | humanize }}ms (threshold: 2000ms)"
      # SSL certificate expiry (warning band stops at 7 days so it doesn't overlap the critical alert):
      - alert: SSLCertExpiringSoon
        expr: monitor_cert_days_remaining < 30 and monitor_cert_days_remaining >= 7
        labels:
          severity: warning
        annotations:
          summary: "SSL cert expiring: {{ $labels.monitor_name }}"
          description: "Certificate expires in {{ $value | humanize }} days"
      - alert: SSLCertExpiringCritical
        expr: monitor_cert_days_remaining < 7
        labels:
          severity: critical
        annotations:
          summary: "URGENT: SSL cert expiring: {{ $labels.monitor_name }}"
          description: "Certificate expires in {{ $value | humanize }} days — IMMEDIATE ACTION REQUIRED"
      # SLA violation:
      - alert: SLAViolation
        expr: avg_over_time(monitor_status[30d]) * 100 < 99.9
        labels:
          severity: warning
        annotations:
          summary: "SLA violation: {{ $labels.monitor_name }}"
          description: "30-day uptime {{ $value | humanize }}% is below the 99.9% SLA target"
      # Monitoring itself is down:
      - alert: UptimeKumaDown
        expr: absent(monitor_status)
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Uptime Kuma is not reporting metrics"
          description: "No monitor_status metrics received for 3+ minutes. Uptime Kuma may be down."
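Before Prometheus loads these rules, it is worth validating them with promtool. This is a minimal sketch that assumes the rules file path used above and skips gracefully when promtool is not installed:

```shell
# Validate the alert rules file before reloading Prometheus.
# RULES_FILE path is an assumption; adjust to where your rules actually live.
RULES_FILE="${RULES_FILE:-prometheus/rules/uptime-kuma.yml}"

if command -v promtool >/dev/null 2>&1; then
  # Checks YAML structure and parses every PromQL expression:
  promtool check rules "$RULES_FILE"
else
  echo "promtool not installed; skipping validation of $RULES_FILE"
fi
```

Running this in CI catches broken expressions before they silently disable an alert in production.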
Embedded Status Widgets and API-Driven Status Pages
Uptime Kuma's public status page is great for external communication. But your internal tools — engineering wikis, incident dashboards, Notion pages — can embed live status data without requiring people to open a separate tab.
Embedding Status Badges in Internal Tools
# Uptime Kuma badge API:
# /api/badge/{monitor_id}/status
# /api/badge/{monitor_id}/uptime/24 (24-hour uptime %)
# /api/badge/{monitor_id}/uptime/720 (30-day uptime %)
# /api/badge/{monitor_id}/response (avg response time)
# /api/badge/{monitor_id}/ping (current ping)
# /api/badge/{monitor_id}/cert-exp (days until cert expiry)
# Embed in Markdown (GitHub wikis, Notion, etc.):
# ![API status](https://monitor.yourdomain.com/api/badge/<monitor_id>/status)
# ![API uptime, 30d](https://monitor.yourdomain.com/api/badge/<monitor_id>/uptime/720)
# Customizable badge parameters:
# ?label=API&upLabel=Operational&downLabel=Degraded
# ?color=green&downColor=red
# ?style=for-the-badge (shields.io style)
# Example embedded in HTML (monitor IDs 1 and 2 are illustrative):
cat << 'EOF'
<h3>Service Status</h3>
<p>Production API <img src="https://monitor.yourdomain.com/api/badge/1/status" alt="Production API status"></p>
<p>Database <img src="https://monitor.yourdomain.com/api/badge/2/status" alt="Database status"></p>
EOF
# Fetch the monitor list for a status page programmatically
# ("all-services" here is the status page slug):
curl -s https://monitor.yourdomain.com/api/status-page/all-services | \
  jq '.publicGroupList[].monitorList[] | {id: .id, name: .name}'
# Quick name → status listing scraped straight from the metrics endpoint
# (the raw data for a custom widget that refreshes every 60 seconds):
curl -s https://monitor.yourdomain.com/metrics | \
  grep '^monitor_status{' | \
  sed -E 's/.*monitor_name="([^"]+)".*} /\1 /'
Building a Custom Consolidated Status API
#!/usr/bin/env python3
# status-api.py — Thin API wrapper that aggregates Uptime Kuma + other sources
# Exposes a clean /status endpoint for your internal dashboards
from flask import Flask, jsonify
import requests
import re
from functools import lru_cache
from datetime import datetime
import os

app = Flask(__name__)
KUMA_URL = os.environ["UPTIME_KUMA_URL"]
PROMETHEUS_URL = os.environ.get("PROMETHEUS_URL", "http://prometheus:9090")

@lru_cache(maxsize=1)
def get_cached_status(ttl_hash=None):
    """Cache status for 30 seconds to avoid hammering the metrics endpoint."""
    _ = ttl_hash
    metrics_text = requests.get(f"{KUMA_URL}/metrics", timeout=10).text
    services = {}
    for line in metrics_text.split("\n"):
        # Parse monitor_status
        status_match = re.match(
            r'monitor_status\{.*?monitor_name="([^"]+)".*?\} (\d+)', line
        )
        if status_match:
            name, status = status_match.group(1), int(status_match.group(2))
            services.setdefault(name, {})["status"] = "up" if status == 1 else "down"
        # Parse response time (may be a float, or -1 while a monitor is down)
        rt_match = re.match(
            r'monitor_response_time\{.*?monitor_name="([^"]+)".*?\} (-?\d+(?:\.\d+)?)', line
        )
        if rt_match:
            name, rt = rt_match.group(1), float(rt_match.group(2))
            services.setdefault(name, {})["response_ms"] = rt
    return services

def make_ttl_hash(seconds=30):
    # Changes every `seconds`, busting the lru_cache entry
    return round(datetime.utcnow().timestamp() / seconds)

@app.route("/status")
def status():
    services = get_cached_status(ttl_hash=make_ttl_hash(30))
    down_services = [name for name, data in services.items() if data.get("status") == "down"]
    overall = "degraded" if down_services else "operational"
    return jsonify({
        "overall_status": overall,
        "services": services,
        "down_count": len(down_services),
        "total_count": len(services),
        "down_services": down_services,
        "last_updated": datetime.utcnow().isoformat() + "Z"
    })

@app.route("/status/<service_name>")
def service_status(service_name):
    services = get_cached_status(ttl_hash=make_ttl_hash(30))
    if service_name not in services:
        return jsonify({"error": "service not found"}), 404
    return jsonify({"service": service_name, **services[service_name]})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8090)
# Deploy as a Docker container:
# docker run -d \
# -p 8090:8090 \
# -e UPTIME_KUMA_URL=https://monitor.yourdomain.com \
# -e PROMETHEUS_URL=http://prometheus:9090 \
# your-status-api:latest
# Use in other services:
# curl http://status-api:8090/status | jq .overall_status
Enterprise Observability Stack Integration
At enterprise scale, Uptime Kuma is one component of a larger observability platform alongside distributed tracing, log aggregation, synthetic monitoring, and business metrics. The integration patterns here connect Uptime Kuma into that broader ecosystem without making it a single point of orchestration.
Connecting to OpenTelemetry for Trace Correlation
#!/usr/bin/env python3
# kuma-trace-enricher.py — Webhook handler that adds trace context to Kuma alerts
# When Uptime Kuma detects an outage, this enriches the alert with recent trace
# data by querying Tempo for traces around the failure time.
from flask import Flask, request, jsonify
import requests
import os
from datetime import datetime

app = Flask(__name__)
TEMPO_URL = os.environ.get("TEMPO_URL", "http://tempo:3200")
SLACK_WEBHOOK = os.environ["SLACK_WEBHOOK"]

@app.route("/kuma-alert", methods=["POST"])
def handle_kuma_alert():
    data = request.json
    heartbeat = data.get("heartbeat", {})
    monitor = data.get("monitor", {})
    if heartbeat.get("status") != 0:  # Only process DOWN events
        return jsonify({"status": "ok"})

    monitor_name = monitor.get("name", "unknown")
    failure_time = heartbeat.get("time", "")

    # Query Tempo for recent error traces around the failure time
    recent_traces = query_recent_error_traces(monitor_name, failure_time)

    # Send enriched alert to Slack
    message = {
        "blocks": [
            {
                "type": "header",
                "text": {"type": "plain_text", "text": f"🔴 {monitor_name} is DOWN"}
            },
            {
                "type": "section",
                "fields": [
                    {"type": "mrkdwn", "text": f"*URL:*\n{monitor.get('url', 'N/A')}"},
                    {"type": "mrkdwn", "text": f"*Error:*\n{heartbeat.get('msg', 'Unknown')}"},
                ]
            }
        ]
    }
    if recent_traces:
        # Slack-formatted links into the Tempo UI (adjust the base URL for your setup)
        trace_links = "\n".join([
            f"• <{TEMPO_URL}/trace/{t.get('traceID', '')}|{t.get('traceID', '')[:8]}> ({t.get('durationMs', '?')}ms)"
            for t in recent_traces[:3]
        ])
        message["blocks"].append({
            "type": "section",
            "text": {"type": "mrkdwn", "text": f"*Recent Error Traces:*\n{trace_links}"}
        })
    requests.post(SLACK_WEBHOOK, json=message)
    return jsonify({"status": "ok"})

def query_recent_error_traces(service_name: str, failure_time: str) -> list:
    """Query Tempo for error traces near the failure time."""
    try:
        # Tempo's search API takes start/end as unix epoch seconds
        end_time = int(datetime.utcnow().timestamp())
        start_time = end_time - 5 * 60  # Last 5 minutes
        resp = requests.get(
            f"{TEMPO_URL}/api/search",
            params={
                "q": f'{{.service.name="{service_name}" && status=error}}',
                "start": start_time,
                "end": end_time,
                "limit": 5
            },
            timeout=10
        )
        return resp.json().get("traces", []) if resp.status_code == 200 else []
    except Exception:
        return []

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8091)
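To exercise the enricher without waiting for a real outage, you can post a hand-built payload shaped like Uptime Kuma's webhook body. The field values below are illustrative, and localhost:8091 assumes the enricher from this section is running:

```shell
# Hand-built payload mimicking Uptime Kuma's webhook body
# (status 0 = down, which is the only case the handler processes):
payload='{
  "monitor": {"name": "production-api", "url": "https://api.example.com/health"},
  "heartbeat": {"status": 0, "time": "2024-01-15 03:12:45", "msg": "request timed out"}
}'

# Sanity-check the JSON before sending:
echo "$payload" | python3 -m json.tool > /dev/null && echo "payload OK"

# Post it to the enricher (assumes it is running locally on :8091):
curl -s -X POST http://localhost:8091/kuma-alert \
  -H 'Content-Type: application/json' \
  -d "$payload" || echo "enricher not reachable"
```

A healthy run ends with a Slack message in your configured channel; a recovered-service payload (status 1) should produce no message at all, which is also worth verifying.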
Runbook Automation via Kuma Webhooks
#!/bin/bash
# automated-remediation.sh
# Triggered by Uptime Kuma webhook when specific monitors go down
# Runs predefined remediation steps before paging on-call
set -euo pipefail

# Required notification credentials (fail fast if missing):
SLACK_WEBHOOK="${SLACK_WEBHOOK:?SLACK_WEBHOOK must be set}"
PAGERDUTY_KEY="${PAGERDUTY_KEY:?PAGERDUTY_KEY must be set}"

# Parse webhook payload:
MONITOR_NAME="${MONITOR_NAME:-unknown}"
MONITOR_URL="${MONITOR_URL:-}"
STATUS="${STATUS:-0}"

if [ "$STATUS" != "0" ]; then
  echo "Service recovered — no remediation needed"
  exit 0
fi

echo "[$(date -u)] Automated remediation triggered for: $MONITOR_NAME"

# Remediation playbook based on monitor name:
case "$MONITOR_NAME" in
  *"redis"*)
    echo "Attempting Redis restart..."
    docker restart redis 2>/dev/null || true
    sleep 10
    # Check if recovered:
    if curl -sf "${MONITOR_URL}" > /dev/null 2>&1; then
      echo "Redis recovered after restart — notifying team"
      curl -X POST "$SLACK_WEBHOOK" \
        -d "{\"text\": \"✅ Redis auto-recovered after restart for $MONITOR_NAME\"}"
      exit 0
    fi
    ;;
  *"nginx"*)
    echo "Attempting Nginx config reload..."
    { nginx -t && nginx -s reload; } 2>/dev/null || true
    sleep 5
    if curl -sf "${MONITOR_URL}" > /dev/null 2>&1; then
      echo "Nginx recovered after reload"
      curl -X POST "$SLACK_WEBHOOK" \
        -d "{\"text\": \"✅ Nginx auto-recovered after config reload\"}"
      exit 0
    fi
    ;;
  *"disk"*|*"storage"*)
    echo "Checking disk space..."
    DISK_USAGE=$(df / | awk 'NR==2{print $5}' | tr -d '%')
    if [ "$DISK_USAGE" -gt 90 ]; then
      echo "Disk >90% — running emergency cleanup"
      docker system prune -f --volumes 2>/dev/null || true
      journalctl --vacuum-size=500M 2>/dev/null || true
    fi
    ;;
esac
# If we reach here, auto-remediation failed — escalate:
echo "Auto-remediation unsuccessful — escalating to on-call"
curl -X POST "https://events.pagerduty.com/v2/enqueue" \
-H 'Content-Type: application/json' \
-d "{
\"routing_key\": \"$PAGERDUTY_KEY\",
\"event_action\": \"trigger\",
\"dedup_key\": \"kuma-$MONITOR_NAME\",
\"payload\": {
\"summary\": \"$MONITOR_NAME is DOWN (auto-remediation failed)\",
\"severity\": \"critical\",
\"source\": \"uptime-kuma\"
}
}"
Tips, Gotchas, and Troubleshooting
Alertmanager Receiving Alerts But Not Routing Correctly
# Test Alertmanager routing without waiting for a real alert:
# Use amtool to simulate an alert and see which receiver it goes to:
# Install amtool:
go install github.com/prometheus/alertmanager/cmd/amtool@latest
# Validate the config file:
amtool check-config /etc/alertmanager/alertmanager.yml

# Show the complete routing tree:
amtool config routes show \
  --config.file=/etc/alertmanager/alertmanager.yml

# Test which receiver a specific label set would route to:
amtool config routes test \
  --config.file=/etc/alertmanager/alertmanager.yml \
  severity=critical \
  environment=production \
  alertname=ServiceDown
# Prints the receiver that would handle this alert
# Send a test alert to verify end-to-end (use the v2 API; the v1 API was
# removed in Alertmanager 0.27):
curl -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
    "labels": {
      "alertname": "TestAlert",
      "severity": "warning",
      "environment": "staging",
      "monitor_name": "test-service"
    },
    "annotations": {
      "summary": "This is a test alert",
      "description": "Testing Alertmanager routing"
    },
    "generatorURL": "http://prometheus:9090"
  }]'
# View active alerts in Alertmanager:
curl -s http://localhost:9093/api/v2/alerts | \
  jq '[.[] | {name: .labels.alertname, state: .status.state}]'
Grafana Dashboard Shows No Data for monitor_status
# Debug the data pipeline from Uptime Kuma to Grafana:
# Step 1: Confirm Uptime Kuma is exposing metrics:
curl -s https://monitor.yourdomain.com/metrics | grep monitor_status | head -3
# Should show: monitor_status{...} 1
# Step 2: Confirm Prometheus is scraping successfully:
curl -s 'http://prometheus:9090/api/v1/query?query=monitor_status' | \
jq '.data.result | length'
# Should return > 0
# Step 3: Check the last successful scrape time:
curl -s 'http://prometheus:9090/api/v1/query?query=up{job="uptime-kuma"}' | \
jq '.data.result[0].value'
# Step 4: Check Prometheus scrape targets:
curl -s http://prometheus:9090/api/v1/targets | \
jq '.data.activeTargets[] | select(.labels.job == "uptime-kuma") | {health: .health, lastScrape: .lastScrape, lastError: .lastError}'
# Step 5: In Grafana query editor, check for label mismatches:
# Run: monitor_status
# Look at the actual label names in the results
# dashboard variables must match exactly: monitor_name not monitorName
# Step 6: Time range issues — Uptime Kuma metrics reset on restart
# If Prometheus shows data but Grafana shows empty:
# Check the Grafana time range — set to "Last 1 hour" for recent data
# Older data may have different label values if monitors were renamed
Uptime Kuma Metrics Cardinality Issues in Large Deployments
# With 100+ monitors, Prometheus cardinality can become a concern
# Check current cardinality:
curl -s http://prometheus:9090/api/v1/label/__name__/values | \
jq '.data | length'
# Check how many time series uptime-kuma metrics create:
curl -s 'http://prometheus:9090/api/v1/query?query=count({job="uptime-kuma"})' | \
jq '.data.result[0].value[1]'
# With 100 monitors × 4 metrics = 400 series — well within Prometheus limits
# At 1000+ monitors, consider:
# Option 1: Use recording rules to pre-aggregate.
# Add to prometheus/rules/uptime-kuma-recording.yml:
groups:
  - name: uptime_kuma_aggregations
    interval: 5m
    rules:
      # Pre-aggregate 30-day uptime per monitor:
      - record: monitor:uptime_30d:avg
        expr: avg_over_time(monitor_status[30d]) * 100
      # Average response time per monitor (5m):
      - record: monitor:response_time_5m:avg
        expr: avg_over_time(monitor_response_time[5m])
# Then reload Prometheus (requires --web.enable-lifecycle):
curl -X POST http://prometheus:9090/-/reload
# Option 2: Drop high-cardinality labels before storing.
# In prometheus.yml, under the uptime-kuma scrape config:
# metric_relabel_configs:
#   # monitor_name is enough to identify a series — drop the extra
#   # per-target labels that multiply cardinality:
#   - regex: 'monitor_hostname|monitor_port|monitor_url'
#     action: labeldrop
Pro Tips
- Use Grafana's alert annotations to mark outages on all panels simultaneously — configure Grafana to query Alertmanager for alert history and render it as annotations on your infrastructure panels. When an outage is visible as a red band across CPU, memory, and uptime panels simultaneously, root cause analysis is dramatically faster.
- Throttle automated remediation scripts aggressively — a remediation script that restarts a service should have a cooldown: if it ran in the last 10 minutes, skip and escalate directly. Without throttling, a flapping service can trigger hundreds of restarts.
- Export Grafana dashboards as JSON and commit to Git — your Uptime Kuma dashboards are as important as your alert rules. Treat them like code: version, review, and deploy them through the same pipeline as everything else.
- Build a separate Grafana dashboard for the on-call engineer — not a comprehensive infrastructure overview, but a focused view that answers the four questions an on-call needs at 3am: what's down, how long has it been down, what were conditions right before it failed, and what's the runbook link?
- Test your Alertmanager inhibition rules before you need them — inhibition rules that silence downstream alerts during root-cause failures are only valuable if they work correctly. Run quarterly chaos tests where you manually fire the root-cause alert and verify downstream alerts are properly suppressed.
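The remediation-throttling tip above can be implemented with a per-monitor timestamp file. This is a minimal sketch; the 600-second window and the /tmp stamp path are arbitrary choices:

```shell
# Minimal cooldown guard: returns 0 (proceed) only if the last remediation run
# for this monitor was more than COOLDOWN_SECONDS ago. Otherwise the caller
# should escalate to on-call instead of retrying.
COOLDOWN_SECONDS=600

cooldown_ok() {
  local monitor="$1"
  local stamp="/tmp/remediation-${monitor}.stamp"
  local now last
  now=$(date +%s)
  if [ -f "$stamp" ]; then
    last=$(cat "$stamp")
    if [ $((now - last)) -lt "$COOLDOWN_SECONDS" ]; then
      return 1   # still cooling down
    fi
  fi
  echo "$now" > "$stamp"   # record this run
  return 0
}

# Usage at the top of the remediation script:
if cooldown_ok "demo-service"; then
  echo "running remediation"
else
  echo "cooldown active — escalating to on-call"
fi
```

Because the guard records the timestamp before remediation runs, a flapping service gets exactly one restart attempt per window and every subsequent failure pages a human.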
Wrapping Up
This five-guide series now covers the complete Uptime Kuma lifecycle from initial deployment through enterprise observability integration. The progression: start with advanced monitors and status pages, add API automation and multi-location coverage, implement SLA tracking and SRE practices, and finally integrate with Grafana, Alertmanager, and distributed tracing for a complete enterprise observability platform.
Uptime Kuma at this integration depth stops being a standalone tool and becomes an active participant in your observability stack — enriching alerts with trace context, feeding unified dashboards alongside infrastructure metrics, and routing through Alertmanager's sophisticated deduplication and escalation logic. That's the difference between knowing things are down and understanding why, how long, and what's affected.
Need an Enterprise Observability Platform Built for Your Infrastructure?
Designing and integrating Uptime Kuma, Grafana, Prometheus, Alertmanager, distributed tracing, and automated remediation into a coherent enterprise observability platform — with proper alert routing, SLA reporting, and on-call workflows — is a significant engineering project. The sysbrix team builds complete observability platforms that give engineering organizations the visibility they need to run reliable services.
Talk to Us →