Uptime Kuma Setup: Grafana Integration, Custom Dashboards, Alertmanager, and Enterprise Observability
The four previous guides in this series covered the complete Uptime Kuma operational picture: basic deployment, advanced monitors and status pages, API automation and multi-location monitoring, and SLA tracking and SRE practices. This final guide covers the integration layer: pulling Uptime Kuma's metrics into Grafana for unified infrastructure dashboards, routing alerts through Prometheus Alertmanager for sophisticated deduplication and escalation, embedding live status widgets in your internal tools, and operating Uptime Kuma as a first-class component of an enterprise observability stack.
Prerequisites
- A fully configured Uptime Kuma instance — see the series starting with our advanced monitoring guide
- Grafana running (standalone or as part of a monitoring stack)
- Prometheus with the Uptime Kuma metrics endpoint being scraped
- Prometheus Alertmanager deployed (separate from Grafana alerting)
- At least 30 days of Uptime Kuma history for meaningful dashboard visualizations
Verify your Prometheus is scraping Uptime Kuma:
# Verify Uptime Kuma metrics are being scraped:
curl -s http://localhost:9090/api/v1/query?query=monitor_status | \
jq '.data.result | length'
# Should return your monitor count (e.g., 24)
# Check the last scrape time:
curl -s 'http://localhost:9090/api/v1/query?query=up{job="uptime-kuma"}' | \
jq '.data.result[0] | {up: .value[1], scrape_time: (.value[0] | todate)}'
# Verify key metrics exist:
curl -s http://localhost:9090/api/v1/label/__name__/values | \
jq '.data | map(select(startswith("monitor_")))'
# Should show: monitor_status, monitor_response_time, monitor_cert_days_remaining
Grafana Integration: Uptime Kuma Dashboards
Uptime Kuma provides a clean per-service view, but it doesn't show how uptime correlates with your infrastructure metrics — CPU spikes before an outage, memory pressure causing slowdowns, deployment events triggering degraded response times. Grafana unifies these views so you can see availability alongside everything else in one place.
Creating the Uptime Kuma Data Source in Grafana
# Provision the Prometheus data source pointing at your Prometheus instance
# (which already scrapes Uptime Kuma)
# grafana/provisioning/datasources/prometheus.yml:
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
    jsonData:
      timeInterval: '30s'   # Match the Prometheus scrape interval for Uptime Kuma
      queryTimeout: '60s'
      httpMethod: POST
      exemplarTraceIdDestinations:
        - name: trace_id
          datasourceUid: tempo
# Apply by restarting Grafana:
docker compose restart grafana
# Verify data source connectivity:
curl -s http://admin:password@localhost:3000/api/datasources | \
jq '[.[] | {name: .name, type: .type, url: .url}]'
# Test the Prometheus connection:
curl -s 'http://admin:password@localhost:3000/api/datasources/proxy/1/api/v1/query?query=monitor_status' | \
jq '.data.result | length'
Building the Unified Availability Dashboard
# Grafana dashboard JSON — paste in Dashboard → Import → JSON
# This creates a comprehensive availability overview dashboard
cat > uptime-kuma-dashboard.json << 'EOF'
{
  "title": "Service Availability Overview",
  "tags": ["uptime", "availability", "sla"],
  "timezone": "browser",
  "refresh": "1m",
  "panels": [
    {
      "title": "🔴 Services Currently Down",
      "type": "stat",
      "datasource": "Prometheus",
      "targets": [{
        "expr": "count(monitor_status == 0) or vector(0)",
        "legendFormat": "Down"
      }],
      "fieldConfig": {
        "defaults": {
          "color": {"mode": "thresholds"},
          "thresholds": {
            "steps": [
              {"value": 0, "color": "green"},
              {"value": 1, "color": "orange"},
              {"value": 3, "color": "red"}
            ]
          }
        }
      },
      "gridPos": {"x": 0, "y": 0, "w": 4, "h": 4}
    },
    {
      "title": "Overall Availability (24h)",
      "type": "stat",
      "datasource": "Prometheus",
      "targets": [{
        "expr": "avg(avg_over_time(monitor_status[24h])) * 100",
        "legendFormat": "Availability %"
      }],
      "fieldConfig": {
        "defaults": {
          "unit": "percent",
          "decimals": 3,
          "thresholds": {
            "steps": [
              {"value": 0, "color": "red"},
              {"value": 99, "color": "orange"},
              {"value": 99.9, "color": "green"}
            ]
          }
        }
      },
      "gridPos": {"x": 4, "y": 0, "w": 4, "h": 4}
    },
    {
      "title": "P95 Response Time (5m avg)",
      "type": "stat",
      "datasource": "Prometheus",
      "targets": [{
        "expr": "quantile(0.95, avg_over_time(monitor_response_time[5m]))",
        "legendFormat": "P95 Response"
      }],
      "fieldConfig": {
        "defaults": {
          "unit": "ms",
          "thresholds": {
            "steps": [
              {"value": 0, "color": "green"},
              {"value": 500, "color": "orange"},
              {"value": 2000, "color": "red"}
            ]
          }
        }
      },
      "gridPos": {"x": 8, "y": 0, "w": 4, "h": 4}
    }
  ]
}
EOF
# Import the dashboard:
curl -X POST http://admin:password@localhost:3000/api/dashboards/import \
-H 'Content-Type: application/json' \
-d "{\"dashboard\": $(cat uptime-kuma-dashboard.json), \"overwrite\": true}" | \
jq .url
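The reverse direction is just as useful: exporting a dashboard back out of Grafana so it can be committed to Git (a practice this guide returns to in the pro tips). This is a minimal sketch; the UID, credentials, and host are placeholders for your instance:

```shell
# Export a dashboard by UID so it can be committed to Git.
# GRAFANA_URL and DASH_UID are placeholders; substitute your own values.
set -o pipefail
GRAFANA_URL="${GRAFANA_URL:-http://admin:password@localhost:3000}"
DASH_UID="${DASH_UID:-uptime-kuma-overview}"

mkdir -p dashboards
# Strip runtime fields (id, version) so the JSON diffs cleanly in Git:
if curl -sf "${GRAFANA_URL}/api/dashboards/uid/${DASH_UID}" \
     | jq '.dashboard | del(.id, .version)' > "dashboards/${DASH_UID}.json"; then
  echo "exported dashboards/${DASH_UID}.json"
else
  echo "export failed (is Grafana reachable?)"
fi
```

Dropping `id` and `version` keeps deploy-specific fields out of version control so reviews only show meaningful panel changes.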
Key PromQL Queries for Uptime Kuma Dashboards
# Essential PromQL queries for Uptime Kuma Grafana panels:
# 1. Heatmap: Monitor status over time (green=up, red=down)
# Use type: State timeline with this query:
monitor_status
# Splits by monitor_name label automatically
# 2. Response time time series (per service):
avg_over_time(monitor_response_time{monitor_name=~"$service"}[5m])
# 3. Uptime percentage for last 30 days:
avg_over_time(monitor_status{monitor_name=~"$service"}[30d]) * 100
# 4. Services not meeting 99.9% SLA this month:
avg_over_time(monitor_status[30d]) * 100 < 99.9
# 5. Certificate days remaining (for cert monitoring panels):
monitor_cert_days_remaining
# Alert threshold: < 30 days = warning, < 7 days = critical
# 6. Count of outages in the last 7 days per service:
count_over_time((
monitor_status == 0
and
(monitor_status offset 1m) == 1
)[7d:])
# 7. Average time to recover (MTTR approximation):
# This requires custom recording rules in Prometheus:
# Record: job:monitor_state_changes:rate5m
# Expression: changes(monitor_status[5m])
# 8. Correlation: response time spike vs status change
# Overlay on the same panel:
monitor_response_time > 2000
monitor_status == 0
# Configure alerts when both are true simultaneously
Prometheus Alertmanager: Sophisticated Alert Routing
Uptime Kuma's built-in notifications fire on every status change. Alertmanager adds what Uptime Kuma can't do natively: alert deduplication (don't page twice for the same outage), grouping (one Slack message for 5 simultaneous failures, not 5 messages), silence rules, inhibition (don't alert about API being down if the database is already known down), and multi-receiver routing based on severity and service ownership.
Alertmanager Configuration
# alertmanager/alertmanager.yml
# Note: Alertmanager does not expand environment variables in its config file.
# Render the ${...} placeholders with envsubst (or your deploy tooling) before startup.
global:
  resolve_timeout: 5m
  slack_api_url: '${SLACK_WEBHOOK_URL}'
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

# Routing tree:
route:
  receiver: default-receiver
  group_by: ['alertname', 'environment']
  group_wait: 30s       # Wait 30s to group related alerts before firing
  group_interval: 5m    # Re-notify every 5 minutes if still firing
  repeat_interval: 4h   # Don't re-alert if nothing changed after 4 hours
  routes:
    # P0 production critical — immediate PagerDuty
    - matchers:
        - severity = critical
        - environment = production
      receiver: pagerduty-critical
      group_wait: 0s        # No grouping delay for critical
      repeat_interval: 30m
    # P1 production warning — Slack with short delay
    - matchers:
        - severity = warning
        - environment = production
      receiver: slack-production
      group_wait: 2m
      repeat_interval: 2h
    # Staging — Slack only, lower urgency
    - matchers:
        - environment = staging
      receiver: slack-staging
      group_wait: 5m
      repeat_interval: 8h
    # SSL certificate warnings — email to ops
    - matchers:
        - alertname = SSLCertExpiringSoon
      receiver: email-ops
      group_wait: 1h        # Cert alerts don't need immediate grouping
      repeat_interval: 24h

# Inhibition: suppress downstream alerts when the root cause is known
inhibit_rules:
  # If the database is down, suppress all API alerts (root cause is the DB)
  - source_matchers:
      - alertname = DatabaseDown
    target_matchers:
      - alertname = ServiceDown
    equal: ['environment']  # Only inhibit within the same environment
  # If an entire datacenter is down, suppress individual service alerts
  - source_matchers:
      - alertname = DatacenterDown
    target_matchers:
      - severity =~ "warning|critical"

# Receivers:
receivers:
  - name: default-receiver
    slack_configs:
      - channel: '#monitoring'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: '${PAGERDUTY_KEY}'
        description: '{{ .GroupLabels.alertname }}: {{ .GroupLabels.environment }}'
        severity: critical
        details:
          services: '{{ range .Alerts }}{{ .Labels.monitor_name }}, {{ end }}'
  - name: slack-production
    slack_configs:
      - channel: '#incidents'
        color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
        title: >-
          {{ if eq .Status "firing" }}🔴{{ else }}✅{{ end }}
          {{ .GroupLabels.alertname }}
        text: |
          *Environment:* {{ .GroupLabels.environment }}
          *Services:* {{ range .Alerts }}{{ .Labels.monitor_name }}, {{ end }}
          *Started:* {{ (index .Alerts 0).StartsAt.Format "15:04 UTC" }}
          {{ if .CommonAnnotations.runbook }}*Runbook:* {{ .CommonAnnotations.runbook }}{{ end }}
  - name: slack-staging
    slack_configs:
      - channel: '#monitoring-staging'
        title: '[Staging] {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Labels.monitor_name }}: {{ .Annotations.description }}{{ end }}'
  - name: email-ops
    email_configs:
      - to: '[email protected]'
        from: '[email protected]'
        smarthost: '${SMTP_HOST}:587'
        auth_username: '${SMTP_USER}'
        auth_password: '${SMTP_PASS}'
        subject: '{{ .GroupLabels.alertname }}'
        html: '{{ range .Alerts }}{{ .Annotations.description }}<br>{{ end }}'
Prometheus Alert Rules for Uptime Kuma Metrics
# prometheus/rules/uptime-kuma.yml
groups:
  - name: uptime-kuma
    interval: 30s
    rules:
      # Service down alert with an environment label derived from the monitor name:
      - alert: ServiceDown
        expr: monitor_status == 0
        for: 2m   # Must be down for 2 consecutive minutes
        labels:
          severity: critical
          # Environment extracted from the monitor name (templated label):
          environment: '{{ if match "staging" $labels.monitor_name }}staging{{ else }}production{{ end }}'
        annotations:
          summary: "{{ $labels.monitor_name }} is DOWN"
          description: "{{ $labels.monitor_name }} has been unavailable for at least 2 minutes"
          runbook: "https://wiki.company.com/runbooks/{{ $labels.monitor_name }}"
      # Slow response time warning:
      - alert: SlowResponseTime
        expr: avg_over_time(monitor_response_time[5m]) > 2000
        for: 5m
        labels:
          severity: warning
          environment: production
        annotations:
          summary: "{{ $labels.monitor_name }} response time degraded"
          description: "5m avg response: {{ $value | humanize }}ms (threshold: 2000ms)"
      # SSL certificate expiry (warning band stops at 7 days so it doesn't overlap the critical alert):
      - alert: SSLCertExpiringSoon
        expr: monitor_cert_days_remaining < 30 and monitor_cert_days_remaining >= 7
        labels:
          severity: warning
        annotations:
          summary: "SSL cert expiring: {{ $labels.monitor_name }}"
          description: "Certificate expires in {{ $value | humanize }} days"
      - alert: SSLCertExpiringCritical
        expr: monitor_cert_days_remaining < 7
        labels:
          severity: critical
        annotations:
          summary: "URGENT: SSL cert expiring: {{ $labels.monitor_name }}"
          description: "Certificate expires in {{ $value | humanize }} days — IMMEDIATE ACTION REQUIRED"
      # SLA violation:
      - alert: SLAViolation
        expr: avg_over_time(monitor_status[30d]) * 100 < 99.9
        labels:
          severity: warning
        annotations:
          summary: "SLA violation: {{ $labels.monitor_name }}"
          description: "30-day uptime {{ $value | humanize }}% is below the 99.9% SLA target"
      # Monitoring itself is down:
      - alert: UptimeKumaDown
        expr: absent(monitor_status)
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Uptime Kuma is not reporting metrics"
          description: "No monitor_status metrics received for 3+ minutes. Uptime Kuma may be down."
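Before Prometheus loads these rules, it is worth validating them with promtool. This is a minimal sketch that assumes the rules file path used above and skips gracefully when promtool is not installed:

```shell
# Validate the alert rules file before reloading Prometheus.
# RULES_FILE path is an assumption; adjust to where your rules actually live.
RULES_FILE="${RULES_FILE:-prometheus/rules/uptime-kuma.yml}"

if command -v promtool >/dev/null 2>&1; then
  # Checks YAML structure and parses every PromQL expression:
  promtool check rules "$RULES_FILE"
else
  echo "promtool not installed; skipping validation of $RULES_FILE"
fi
```

Running this in CI catches broken expressions before they silently disable an alert in production.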
Embedded Status Widgets and API-Driven Status Pages
Uptime Kuma's public status page is great for external communication. But your internal tools — engineering wikis, incident dashboards, Notion pages — can embed live status data without requiring people to open a separate tab.
Embedding Status Badges in Internal Tools
# Uptime Kuma badge API:
# /api/badge/{monitor_id}/status
# /api/badge/{monitor_id}/uptime/24 (24-hour uptime %)
# /api/badge/{monitor_id}/uptime/720 (30-day uptime %)
# /api/badge/{monitor_id}/response (avg response time)
# /api/badge/{monitor_id}/ping (current ping)
# /api/badge/{monitor_id}/cert-exp (days until cert expiry)
# Embed in Markdown (GitHub wikis, Notion, etc.):
# ![API status](https://monitor.yourdomain.com/api/badge/<monitor_id>/status)
# ![API uptime, 30d](https://monitor.yourdomain.com/api/badge/<monitor_id>/uptime/720)
# Customizable badge parameters:
# ?label=API&upLabel=Operational&downLabel=Degraded
# ?color=green&downColor=red
# ?style=for-the-badge (shields.io style)
# Example embedded in HTML (monitor IDs 1 and 2 are illustrative):
cat << 'EOF'
<h3>Service Status</h3>
<p>Production API <img src="https://monitor.yourdomain.com/api/badge/1/status" alt="Production API status"></p>
<p>Database <img src="https://monitor.yourdomain.com/api/badge/2/status" alt="Database status"></p>
EOF
# Fetch the monitor list for a status page programmatically
# ("all-services" here is the status page slug):
curl -s https://monitor.yourdomain.com/api/status-page/all-services | \
  jq '.publicGroupList[].monitorList[] | {id: .id, name: .name}'
# Quick name → status listing scraped straight from the metrics endpoint
# (the raw data for a custom widget that refreshes every 60 seconds):
curl -s https://monitor.yourdomain.com/metrics | \
  grep '^monitor_status{' | \
  sed -E 's/.*monitor_name="([^"]+)".*} /\1 /'
Building a Custom Consolidated Status API
#!/usr/bin/env python3
# status-api.py — Thin API wrapper that aggregates Uptime Kuma + other sources
# Exposes a clean /status endpoint for your internal dashboards
from flask import Flask, jsonify
import requests
import re
from functools import lru_cache
from datetime import datetime
import os

app = Flask(__name__)
KUMA_URL = os.environ["UPTIME_KUMA_URL"]
PROMETHEUS_URL = os.environ.get("PROMETHEUS_URL", "http://prometheus:9090")

@lru_cache(maxsize=1)
def get_cached_status(ttl_hash=None):
    """Cache status for 30 seconds to avoid hammering the metrics endpoint."""
    _ = ttl_hash
    metrics_text = requests.get(f"{KUMA_URL}/metrics", timeout=10).text
    services = {}
    for line in metrics_text.split("\n"):
        # Parse monitor_status
        status_match = re.match(
            r'monitor_status\{.*?monitor_name="([^"]+)".*?\} (\d+)', line
        )
        if status_match:
            name, status = status_match.group(1), int(status_match.group(2))
            services.setdefault(name, {})["status"] = "up" if status == 1 else "down"
        # Parse response time (may be a float, or -1 while a monitor is down)
        rt_match = re.match(
            r'monitor_response_time\{.*?monitor_name="([^"]+)".*?\} (-?\d+(?:\.\d+)?)', line
        )
        if rt_match:
            name, rt = rt_match.group(1), float(rt_match.group(2))
            services.setdefault(name, {})["response_ms"] = rt
    return services

def make_ttl_hash(seconds=30):
    # Changes every `seconds`, busting the lru_cache entry
    return round(datetime.utcnow().timestamp() / seconds)

@app.route("/status")
def status():
    services = get_cached_status(ttl_hash=make_ttl_hash(30))
    down_services = [name for name, data in services.items() if data.get("status") == "down"]
    overall = "degraded" if down_services else "operational"
    return jsonify({
        "overall_status": overall,
        "services": services,
        "down_count": len(down_services),
        "total_count": len(services),
        "down_services": down_services,
        "last_updated": datetime.utcnow().isoformat() + "Z"
    })

@app.route("/status/<service_name>")
def service_status(service_name):
    services = get_cached_status(ttl_hash=make_ttl_hash(30))
    if service_name not in services:
        return jsonify({"error": "service not found"}), 404
    return jsonify({"service": service_name, **services[service_name]})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8090)
# Deploy as a Docker container:
# docker run -d \
# -p 8090:8090 \
# -e UPTIME_KUMA_URL=https://monitor.yourdomain.com \
# -e PROMETHEUS_URL=http://prometheus:9090 \
# your-status-api:latest
# Use in other services:
# curl http://status-api:8090/status | jq .overall_status
Enterprise Observability Stack Integration
At enterprise scale, Uptime Kuma is one component of a larger observability platform alongside distributed tracing, log aggregation, synthetic monitoring, and business metrics. The integration patterns here connect Uptime Kuma into that broader ecosystem without making it a single point of orchestration.
Connecting to OpenTelemetry for Trace Correlation
#!/usr/bin/env python3
# kuma-trace-enricher.py — Webhook handler that adds trace context to Kuma alerts
# When Uptime Kuma detects an outage, this enriches the alert with recent trace
# data by querying Tempo for traces around the failure time.
from flask import Flask, request, jsonify
import requests
import os
from datetime import datetime

app = Flask(__name__)
TEMPO_URL = os.environ.get("TEMPO_URL", "http://tempo:3200")
SLACK_WEBHOOK = os.environ["SLACK_WEBHOOK"]

@app.route("/kuma-alert", methods=["POST"])
def handle_kuma_alert():
    data = request.json
    heartbeat = data.get("heartbeat", {})
    monitor = data.get("monitor", {})
    if heartbeat.get("status") != 0:  # Only process DOWN events
        return jsonify({"status": "ok"})

    monitor_name = monitor.get("name", "unknown")
    failure_time = heartbeat.get("time", "")

    # Query Tempo for recent error traces around the failure time
    recent_traces = query_recent_error_traces(monitor_name, failure_time)

    # Send enriched alert to Slack
    message = {
        "blocks": [
            {
                "type": "header",
                "text": {"type": "plain_text", "text": f"🔴 {monitor_name} is DOWN"}
            },
            {
                "type": "section",
                "fields": [
                    {"type": "mrkdwn", "text": f"*URL:*\n{monitor.get('url', 'N/A')}"},
                    {"type": "mrkdwn", "text": f"*Error:*\n{heartbeat.get('msg', 'Unknown')}"},
                ]
            }
        ]
    }
    if recent_traces:
        # Slack-formatted links into the Tempo UI (adjust the base URL for your setup)
        trace_links = "\n".join([
            f"• <{TEMPO_URL}/trace/{t.get('traceID', '')}|{t.get('traceID', '')[:8]}> ({t.get('durationMs', '?')}ms)"
            for t in recent_traces[:3]
        ])
        message["blocks"].append({
            "type": "section",
            "text": {"type": "mrkdwn", "text": f"*Recent Error Traces:*\n{trace_links}"}
        })
    requests.post(SLACK_WEBHOOK, json=message)
    return jsonify({"status": "ok"})

def query_recent_error_traces(service_name: str, failure_time: str) -> list:
    """Query Tempo for error traces near the failure time."""
    try:
        # Tempo's search API takes start/end as unix epoch seconds
        end_time = int(datetime.utcnow().timestamp())
        start_time = end_time - 5 * 60  # Last 5 minutes
        resp = requests.get(
            f"{TEMPO_URL}/api/search",
            params={
                "q": f'{{.service.name="{service_name}" && status=error}}',
                "start": start_time,
                "end": end_time,
                "limit": 5
            },
            timeout=10
        )
        return resp.json().get("traces", []) if resp.status_code == 200 else []
    except Exception:
        return []

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8091)
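To exercise the enricher without waiting for a real outage, you can post a hand-built payload shaped like Uptime Kuma's webhook body. The field values below are illustrative, and localhost:8091 assumes the enricher from this section is running:

```shell
# Hand-built payload mimicking Uptime Kuma's webhook body
# (status 0 = down, which is the only case the handler processes):
payload='{
  "monitor": {"name": "production-api", "url": "https://api.example.com/health"},
  "heartbeat": {"status": 0, "time": "2024-01-15 03:12:45", "msg": "request timed out"}
}'

# Sanity-check the JSON before sending:
echo "$payload" | python3 -m json.tool > /dev/null && echo "payload OK"

# Post it to the enricher (assumes it is running locally on :8091):
curl -s -X POST http://localhost:8091/kuma-alert \
  -H 'Content-Type: application/json' \
  -d "$payload" || echo "enricher not reachable"
```

A healthy run ends with a Slack message in your configured channel; a recovered-service payload (status 1) should produce no message at all, which is also worth verifying.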
Runbook Automation via Kuma Webhooks
#!/bin/bash
# automated-remediation.sh
# Triggered by Uptime Kuma webhook when specific monitors go down
# Runs predefined remediation steps before paging on-call
set -euo pipefail

# Required notification credentials (fail fast if missing):
SLACK_WEBHOOK="${SLACK_WEBHOOK:?SLACK_WEBHOOK must be set}"
PAGERDUTY_KEY="${PAGERDUTY_KEY:?PAGERDUTY_KEY must be set}"

# Parse webhook payload:
MONITOR_NAME="${MONITOR_NAME:-unknown}"
MONITOR_URL="${MONITOR_URL:-}"
STATUS="${STATUS:-0}"

if [ "$STATUS" != "0" ]; then
  echo "Service recovered — no remediation needed"
  exit 0
fi

echo "[$(date -u)] Automated remediation triggered for: $MONITOR_NAME"

# Remediation playbook based on monitor name:
case "$MONITOR_NAME" in
  *"redis"*)
    echo "Attempting Redis restart..."
    docker restart redis 2>/dev/null || true
    sleep 10
    # Check if recovered:
    if curl -sf "${MONITOR_URL}" > /dev/null 2>&1; then
      echo "Redis recovered after restart — notifying team"
      curl -X POST "$SLACK_WEBHOOK" \
        -d "{\"text\": \"✅ Redis auto-recovered after restart for $MONITOR_NAME\"}"
      exit 0
    fi
    ;;
  *"nginx"*)
    echo "Attempting Nginx config reload..."
    { nginx -t && nginx -s reload; } 2>/dev/null || true
    sleep 5
    if curl -sf "${MONITOR_URL}" > /dev/null 2>&1; then
      echo "Nginx recovered after reload"
      curl -X POST "$SLACK_WEBHOOK" \
        -d "{\"text\": \"✅ Nginx auto-recovered after config reload\"}"
      exit 0
    fi
    ;;
  *"disk"*|*"storage"*)
    echo "Checking disk space..."
    DISK_USAGE=$(df / | awk 'NR==2{print $5}' | tr -d '%')
    if [ "$DISK_USAGE" -gt 90 ]; then
      echo "Disk >90% — running emergency cleanup"
      docker system prune -f --volumes 2>/dev/null || true
      journalctl --vacuum-size=500M 2>/dev/null || true
    fi
    ;;
esac
# If we reach here, auto-remediation failed — escalate:
echo "Auto-remediation unsuccessful — escalating to on-call"
curl -X POST "https://events.pagerduty.com/v2/enqueue" \
-H 'Content-Type: application/json' \
-d "{
\"routing_key\": \"$PAGERDUTY_KEY\",
\"event_action\": \"trigger\",
\"dedup_key\": \"kuma-$MONITOR_NAME\",
\"payload\": {
\"summary\": \"$MONITOR_NAME is DOWN (auto-remediation failed)\",
\"severity\": \"critical\",
\"source\": \"uptime-kuma\"
}
}"
Tips, Gotchas, and Troubleshooting
Alertmanager Receiving Alerts But Not Routing Correctly
# Test Alertmanager routing without waiting for a real alert:
# Use amtool to simulate an alert and see which receiver it goes to:
# Install amtool:
go install github.com/prometheus/alertmanager/cmd/amtool@latest
# Validate the config file:
amtool check-config /etc/alertmanager/alertmanager.yml

# Show the complete routing tree:
amtool config routes show \
  --config.file=/etc/alertmanager/alertmanager.yml

# Test which receiver a specific label set would route to:
amtool config routes test \
  --config.file=/etc/alertmanager/alertmanager.yml \
  severity=critical \
  environment=production \
  alertname=ServiceDown
# Prints the receiver that would handle this alert
# Send a test alert to verify end-to-end (use the v2 API; the v1 API was
# removed in Alertmanager 0.27):
curl -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
    "labels": {
      "alertname": "TestAlert",
      "severity": "warning",
      "environment": "staging",
      "monitor_name": "test-service"
    },
    "annotations": {
      "summary": "This is a test alert",
      "description": "Testing Alertmanager routing"
    },
    "generatorURL": "http://prometheus:9090"
  }]'
# View active alerts in Alertmanager:
curl -s http://localhost:9093/api/v2/alerts | \
  jq '[.[] | {name: .labels.alertname, state: .status.state}]'
Grafana Dashboard Shows No Data for monitor_status
# Debug the data pipeline from Uptime Kuma to Grafana:
# Step 1: Confirm Uptime Kuma is exposing metrics:
curl -s https://monitor.yourdomain.com/metrics | grep monitor_status | head -3
# Should show: monitor_status{...} 1
# Step 2: Confirm Prometheus is scraping successfully:
curl -s 'http://prometheus:9090/api/v1/query?query=monitor_status' | \
jq '.data.result | length'
# Should return > 0
# Step 3: Check the last successful scrape time:
curl -s 'http://prometheus:9090/api/v1/query?query=up{job="uptime-kuma"}' | \
jq '.data.result[0].value'
# Step 4: Check Prometheus scrape targets:
curl -s http://prometheus:9090/api/v1/targets | \
jq '.data.activeTargets[] | select(.labels.job == "uptime-kuma") | {health: .health, lastScrape: .lastScrape, lastError: .lastError}'
# Step 5: In Grafana query editor, check for label mismatches:
# Run: monitor_status
# Look at the actual label names in the results
# dashboard variables must match exactly: monitor_name not monitorName
# Step 6: Time range issues — Uptime Kuma metrics reset on restart
# If Prometheus shows data but Grafana shows empty:
# Check the Grafana time range — set to "Last 1 hour" for recent data
# Older data may have different label values if monitors were renamed
Uptime Kuma Metrics Cardinality Issues in Large Deployments
# With 100+ monitors, Prometheus cardinality can become a concern
# Check current cardinality:
curl -s http://prometheus:9090/api/v1/label/__name__/values | \
jq '.data | length'
# Check how many time series uptime-kuma metrics create:
curl -s 'http://prometheus:9090/api/v1/query?query=count({job="uptime-kuma"})' | \
jq '.data.result[0].value[1]'
# With 100 monitors × 4 metrics = 400 series — well within Prometheus limits
# At 1000+ monitors, consider:
# Option 1: Use recording rules to pre-aggregate.
# Add to prometheus/rules/uptime-kuma-recording.yml:
groups:
  - name: uptime_kuma_aggregations
    interval: 5m
    rules:
      # Pre-aggregate 30-day uptime per monitor:
      - record: monitor:uptime_30d:avg
        expr: avg_over_time(monitor_status[30d]) * 100
      # Average response time per monitor (5m):
      - record: monitor:response_time_5m:avg
        expr: avg_over_time(monitor_response_time[5m])
# Then reload Prometheus (requires --web.enable-lifecycle):
curl -X POST http://prometheus:9090/-/reload
# Option 2: Drop high-cardinality labels before storing.
# In prometheus.yml, under the uptime-kuma scrape config:
# metric_relabel_configs:
#   # monitor_name is enough to identify a series — drop the extra
#   # per-target labels that multiply cardinality:
#   - regex: 'monitor_hostname|monitor_port|monitor_url'
#     action: labeldrop
Pro Tips
- Use Grafana's alert annotations to mark outages on all panels simultaneously — configure Grafana to query Alertmanager for alert history and render it as annotations on your infrastructure panels. When an outage is visible as a red band across CPU, memory, and uptime panels simultaneously, root cause analysis is dramatically faster.
- Throttle automated remediation scripts aggressively — a remediation script that restarts a service should have a cooldown: if it ran in the last 10 minutes, skip and escalate directly. Without throttling, a flapping service can trigger hundreds of restarts.
- Export Grafana dashboards as JSON and commit to Git — your Uptime Kuma dashboards are as important as your alert rules. Treat them like code: version, review, and deploy them through the same pipeline as everything else.
- Build a separate Grafana dashboard for the on-call engineer — not a comprehensive infrastructure overview, but a focused view that answers the four questions an on-call needs at 3am: what's down, how long has it been down, what were conditions right before it failed, and what's the runbook link?
- Test your Alertmanager inhibition rules before you need them — inhibition rules that silence downstream alerts during root-cause failures are only valuable if they work correctly. Run quarterly chaos tests where you manually fire the root-cause alert and verify downstream alerts are properly suppressed.
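The remediation-throttling tip above can be implemented with a per-monitor timestamp file. This is a minimal sketch; the 600-second window and the /tmp stamp path are arbitrary choices:

```shell
# Minimal cooldown guard: returns 0 (proceed) only if the last remediation run
# for this monitor was more than COOLDOWN_SECONDS ago. Otherwise the caller
# should escalate to on-call instead of retrying.
COOLDOWN_SECONDS=600

cooldown_ok() {
  local monitor="$1"
  local stamp="/tmp/remediation-${monitor}.stamp"
  local now last
  now=$(date +%s)
  if [ -f "$stamp" ]; then
    last=$(cat "$stamp")
    if [ $((now - last)) -lt "$COOLDOWN_SECONDS" ]; then
      return 1   # still cooling down
    fi
  fi
  echo "$now" > "$stamp"   # record this run
  return 0
}

# Usage at the top of the remediation script:
if cooldown_ok "demo-service"; then
  echo "running remediation"
else
  echo "cooldown active — escalating to on-call"
fi
```

Because the guard records the timestamp before remediation runs, a flapping service gets exactly one restart attempt per window and every subsequent failure pages a human.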
Wrapping Up
This five-guide series now covers the complete Uptime Kuma lifecycle from initial deployment through enterprise observability integration. The progression: start with advanced monitors and status pages, add API automation and multi-location coverage, implement SLA tracking and SRE practices, and finally integrate with Grafana, Alertmanager, and distributed tracing for a complete enterprise observability platform.
Uptime Kuma at this integration depth stops being a standalone tool and becomes an active participant in your observability stack — enriching alerts with trace context, feeding unified dashboards alongside infrastructure metrics, and routing through Alertmanager's sophisticated deduplication and escalation logic. That's the difference between knowing things are down and understanding why, how long, and what's affected.
Need an Enterprise Observability Platform Built for Your Infrastructure?
Designing and integrating Uptime Kuma, Grafana, Prometheus, Alertmanager, distributed tracing, and automated remediation into a coherent enterprise observability platform — with proper alert routing, SLA reporting, and on-call workflows — is a significant engineering project. The sysbrix team builds complete observability platforms that give engineering organizations the visibility they need to run reliable services.
Talk to Us →