
Self-Host Grafana: High Availability, Business Metrics, On-Call Management, and SLO Tracking

Complete your self-hosted Grafana observability platform with a highly available clustered setup, business-level dashboards that connect technical metrics to outcomes, automated on-call rotation management, and SLO tracking that makes reliability commitments measurable.

Three guides into this Grafana series, you have a solid observability foundation: Prometheus, Node Exporter, and infrastructure dashboards; Loki for logs, Tempo for traces, and the rest of the LGTM stack; and Uptime Kuma integration for availability monitoring. This fourth guide addresses the questions that matter at organizational scale: how do you keep Grafana itself highly available, how do you build dashboards that connect infrastructure metrics to business outcomes, how do you manage on-call rotation without paying for PagerDuty, and how do you track SLOs in a way your engineering leadership can actually use in quarterly planning?


Prerequisites

  • A running Grafana + Prometheus stack — see our getting started guide
  • Loki and Tempo deployed — see our LGTM stack guide
  • Grafana version 10.0+ — HA, SLO panels, and on-call features require recent releases
  • For HA: at least two servers and a PostgreSQL instance for shared Grafana state
  • At least 30 days of metrics history for meaningful SLO calculations
  • Application-level metrics exposed to Prometheus (request rates, error rates, durations)

Verify your current stack is ready for the patterns in this guide:

# Check Grafana version (the health API reports it):
curl -s http://localhost:3000/api/health | jq .version

# Verify you have 30+ days of Prometheus data
# (a non-empty result means data exists 30 days back):
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=count(up offset 30d)' | \
  jq '.data.result | length'

# Check if application metrics exist (needed for SLOs):
curl -s 'http://localhost:9090/api/v1/label/__name__/values' | \
  jq '.data | map(select(startswith("http_requests")))'
# Should show http_requests_total or similar application metrics

# Current Grafana database backend (sqlite3 means no shared state yet):
docker exec grafana env | grep GF_DATABASE_TYPE \
  || docker exec grafana grep -A2 '^\[database\]' /etc/grafana/grafana.ini

High Availability: Grafana Without a Single Point of Failure

A single Grafana instance means your entire observability platform goes dark whenever that container restarts or its server has an issue, which is exactly when you most want it available. HA Grafana runs multiple instances sharing a PostgreSQL backend (and optionally a Redis remote cache), with a load balancer routing dashboard requests to any healthy instance.

HA Architecture

Grafana's HA model is simpler than you might expect:

  • Database — swap SQLite for PostgreSQL. All dashboard definitions, user accounts, alert rules, and data sources live here.
  • Cache — auth tokens live in the shared database, so logins survive failover automatically; point Grafana's remote cache at Redis so ephemeral state is shared across instances too.
  • Storage — dashboard images and PDF reports can use S3-compatible storage instead of local disk.
  • Alerting — Grafana's built-in Alertmanager handles deduplication in HA mode automatically via the database.
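A quick way to confirm both nodes really do share state is to compare what each instance serves over Grafana's HTTP search API. A minimal sketch (the node hostnames and credentials are placeholders for your environment):

```python
import base64
import json
import urllib.request

def dashboard_uids(base_url, user, password):
    """Fetch the set of dashboard UIDs one Grafana node serves via /api/search."""
    req = urllib.request.Request(f"{base_url}/api/search?type=dash-db")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(req) as resp:
        return {d["uid"] for d in json.load(resp)}

def state_drift(uids_a, uids_b):
    """UIDs visible on one node but not the other; should be empty when
    both nodes read from the same PostgreSQL backend."""
    return uids_a ^ uids_b  # symmetric difference

# Usage (hit both HA nodes directly, bypassing the load balancer):
# a = dashboard_uids("http://grafana-node1.internal:3000", "admin", "pass")
# b = dashboard_uids("http://grafana-node2.internal:3000", "admin", "pass")
# print("consistent" if not state_drift(a, b) else f"drift: {state_drift(a, b)}")
```

Any non-empty drift set points at a node still writing to local SQLite instead of the shared database.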

HA Grafana with PostgreSQL and Redis

# docker-compose.ha.yml — Deploy on EACH Grafana node
# Both nodes share the same PostgreSQL and Redis

version: '3.8'

services:
  grafana:
    image: grafana/grafana-oss:latest
    container_name: grafana_node
    restart: unless-stopped
    ports:
      - "3000:3000"
    environment:
      # Database — use PostgreSQL instead of SQLite:
      GF_DATABASE_TYPE: postgres
      GF_DATABASE_HOST: pg.internal.yourdomain.com:5432
      GF_DATABASE_NAME: grafana
      GF_DATABASE_USER: grafana
      GF_DATABASE_PASSWORD: ${PG_PASSWORD}
      GF_DATABASE_SSL_MODE: require

      # Remote cache — share ephemeral state across nodes via Redis
      # (auth tokens live in PostgreSQL, so logins already survive failover):
      GF_REMOTE_CACHE_TYPE: redis
      GF_REMOTE_CACHE_CONNSTR: addr=redis.internal.yourdomain.com:6379,pool_size=100,db=0,password=${REDIS_PASSWORD}

      # Security:
      GF_SECURITY_ADMIN_USER: admin
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD}
      GF_SECURITY_SECRET_KEY: ${GRAFANA_SECRET_KEY}
      # Must be identical on all nodes — used for cookie signing

      # Server config:
      GF_SERVER_ROOT_URL: https://grafana.yourdomain.com
      GF_SERVER_DOMAIN: grafana.yourdomain.com

      # HA mode — enables distributed alerting:
      GF_UNIFIED_ALERTING_HA_PEERS: grafana-node1.internal:9094,grafana-node2.internal:9094
      GF_UNIFIED_ALERTING_HA_LISTEN_ADDRESS: :9094
      GF_UNIFIED_ALERTING_HA_ADVERTISE_ADDRESS: ${NODE_IP}:9094

      # Disable user signup (manage users centrally):
      GF_USERS_ALLOW_SIGN_UP: "false"

      # SMTP for alert notifications:
      GF_SMTP_ENABLED: "true"
      GF_SMTP_HOST: ${SMTP_HOST}:587
      GF_SMTP_USER: ${SMTP_USER}
      GF_SMTP_PASSWORD: ${SMTP_PASSWORD}
      GF_SMTP_FROM_ADDRESS: [email protected]

    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
    healthcheck:
      test: ["CMD-SHELL", "wget -q --spider http://localhost:3000/api/health || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s

volumes:
  grafana_data:

Nginx Load Balancer and Database Migration

# First: migrate existing SQLite dashboards to PostgreSQL
# This is a one-time operation before enabling HA

# Export all dashboards from current SQLite Grafana:
curl -s 'http://admin:pass@localhost:3000/api/search?type=dash-db' | \
  jq -r '.[].uid' | while read -r uid; do
    # Null the database id so the export imports cleanly on a fresh instance:
    curl -s "http://admin:pass@localhost:3000/api/dashboards/uid/$uid" | \
      jq '.dashboard | .id = null' > "backups/dashboard_${uid}.json"
    echo "Exported: $uid"
done

# Export data sources:
curl -s http://admin:pass@localhost:3000/api/datasources | \
  jq '.' > backups/datasources.json

# Start new HA Grafana pointing at PostgreSQL:
docker compose -f docker-compose.ha.yml up -d

# Wait for first-time database migration to complete:
docker compose logs -f grafana_node | grep -E '(migration|ready|started)'

# Import data sources into the new HA instance:
jq -c '.[]' backups/datasources.json | while IFS= read -r ds; do
    curl -s -X POST http://admin:newpass@localhost:3000/api/datasources \
      -H 'Content-Type: application/json' \
      -d "$ds" | jq .message
done

# Import dashboards (POST /api/dashboards/db creates or overwrites by uid):
for f in backups/dashboard_*.json; do
    uid=$(basename "$f" .json | sed 's/dashboard_//')
    curl -s -X POST http://admin:newpass@localhost:3000/api/dashboards/db \
      -H 'Content-Type: application/json' \
      -d "{\"dashboard\": $(cat "$f"), \"overwrite\": true}" | jq .status
    echo "Imported: $uid"
done

# Load balancer for Grafana HA:
# /etc/nginx/sites-available/grafana-ha
upstream grafana_nodes {
    ip_hash;  # Sticky sessions for SSO compatibility
    server grafana-node1.internal:3000 max_fails=3 fail_timeout=30s;
    server grafana-node2.internal:3000 max_fails=3 fail_timeout=30s;
}

server {
    listen 443 ssl http2;
    server_name grafana.yourdomain.com;
    ssl_certificate /etc/letsencrypt/live/grafana.yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/grafana.yourdomain.com/privkey.pem;

    location / {
        proxy_pass http://grafana_nodes;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        # WebSocket for live dashboard updates:
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}

Business Metrics Dashboards

Infrastructure dashboards show CPU and memory. Business metrics dashboards answer the questions your CTO asks: Is revenue processing working? Are orders being fulfilled? How many paying customers are active right now? These dashboards connect your technical infrastructure to business outcomes — making them legible to non-technical stakeholders and giving on-call engineers the business context they need during incidents.

Instrumenting Business Events with Prometheus

# Add business event counters to your application code
# These are exposed via the /metrics endpoint alongside technical metrics

# Node.js example with prom-client:
const client = require('prom-client');

// Business event counters:
const ordersCreated = new client.Counter({
  name: 'business_orders_created_total',
  help: 'Total orders created',
  labelNames: ['status', 'payment_method', 'region']
});

const revenueProcessed = new client.Counter({
  name: 'business_revenue_processed_cents_total',
  help: 'Total revenue processed in cents',
  labelNames: ['currency', 'payment_method']
});

const activeSubscriptions = new client.Gauge({
  name: 'business_active_subscriptions',
  help: 'Currently active paid subscriptions',
  labelNames: ['plan_type']
});

const userSignups = new client.Counter({
  name: 'business_user_signups_total',
  help: 'Total user registrations',
  labelNames: ['source', 'plan']
});

// Instrument your business logic:
app.post('/orders', async (req, res) => {
  try {
    const order = await createOrder(req.body);
    ordersCreated.inc({ status: 'created', payment_method: order.payment_method, region: order.region });
    revenueProcessed.inc({ currency: order.currency, payment_method: order.payment_method }, order.total_cents);
    res.json(order);
  } catch (err) {
    ordersCreated.inc({ status: 'failed', payment_method: req.body.payment_method, region: req.body.region });
    throw err;
  }
});

// Python example with prometheus_client:
from prometheus_client import Counter, Gauge, Histogram

orders_created = Counter(
    'business_orders_created_total',
    'Total orders created',
    ['status', 'payment_method']
)

checkout_duration = Histogram(
    'business_checkout_duration_seconds',
    'Time spent in checkout flow',
    buckets=[0.5, 1.0, 2.5, 5.0, 10.0, 30.0]
)

Business Dashboard PromQL Queries

# PromQL queries for business metrics dashboards:
# Create these as panels in a "Business Overview" dashboard

# Current order rate (orders per minute, last 5 minutes):
rate(business_orders_created_total{status="created"}[5m]) * 60

# Order failure rate (% of orders failing):
rate(business_orders_created_total{status="failed"}[5m])
  /
  rate(business_orders_created_total[5m]) * 100

# Revenue rate (USD/hour):
rate(business_revenue_processed_cents_total[1h]) * 3600 / 100

# Active subscriptions by plan:
business_active_subscriptions

# User signup rate (last 24h):
sum(increase(business_user_signups_total[24h]))

# Revenue by payment method (last 24h):
sum by (payment_method) (increase(business_revenue_processed_cents_total[24h])) / 100

# Order success rate trend (7-day rolling):
sum(rate(business_orders_created_total{status="created"}[7d]))
  /
  sum(rate(business_orders_created_total[7d])) * 100

# Checkout abandonment (started but not completed):
1 - (
  rate(business_orders_created_total{status="created"}[1h])
  /
  rate(business_checkout_started_total[1h])
)

# Alert: Revenue processing has stopped
# Alert if no revenue has been processed in 30 minutes during business hours:
sum(increase(business_revenue_processed_cents_total[30m])) == 0
  and on() (hour() >= 8 and hour() <= 22)  # Business-hours filter (note: hour() uses UTC)

SLO Tracking: Making Reliability Measurable

Service Level Objectives are the formal commitment your engineering team makes about reliability. Without SLO tracking in Grafana, reliability discussions are vague. With it, you can say: "Payment processing had 99.94% availability last month against a 99.9% SLO target, consuming 60% of our error budget." That's a concrete number your leadership can plan around.
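The arithmetic behind a statement like that is worth sanity-checking by hand. A minimal sketch of the error-budget calculation (pure math, no Prometheus required):

```python
def error_budget_consumed(good_requests, total_requests, slo_target):
    """Fraction of the error budget consumed over a window.
    Budget = 1 - slo_target; consumed = (1 - availability) / budget."""
    availability = good_requests / total_requests
    return (1 - availability) / (1 - slo_target)

# 99.94% availability against a 99.9% target: the budget allows 0.1% of
# requests to fail, and 0.06% actually did, so 60% of the budget is spent.
consumed = error_budget_consumed(999_400, 1_000_000, slo_target=0.999)
print(f"{consumed:.0%} of error budget consumed")
```

The recording rules below encode exactly this ratio over a rolling 30-day window.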

Defining and Calculating SLOs

# Add Prometheus recording rules to pre-compute SLO metrics
# Recording rules make SLO queries fast even over long time ranges
# prometheus/rules/slo-rules.yml

groups:
  - name: slo_computations
    interval: 30s
    rules:

      # Payment API — Availability SLO (99.9%)
      # Good requests = any HTTP response that isn't 5xx
      - record: slo:payment_api:request_success_rate_5m
        expr: |
          sum(rate(http_requests_total{service="payment-api",status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total{service="payment-api"}[5m]))

      # Payment API — Latency SLO (P95 < 500ms)
      - record: slo:payment_api:p95_latency_5m
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket{service="payment-api"}[5m]))
            by (le)
          )

      # Error budget consumption — rolling 30 days
      # Error budget = 1 - SLO target = 1 - 0.999 = 0.001
      # Budget consumed = (1 - actual_availability) / error_budget
      - record: slo:payment_api:error_budget_consumed_30d
        expr: |
          (
            1 - (
              sum(increase(http_requests_total{service="payment-api",status!~"5.."}[30d]))
              /
              sum(increase(http_requests_total{service="payment-api"}[30d]))
            )
          )
          / 0.001  # Error budget: 0.1% of requests can fail

      # Checkout service — Multi-window availability (used for burn rate alerts)
      - record: slo:checkout:availability_1h
        expr: |
          sum(rate(http_requests_total{service="checkout",status!~"5.."}[1h]))
          / sum(rate(http_requests_total{service="checkout"}[1h]))

      - record: slo:checkout:availability_6h
        expr: |
          sum(rate(http_requests_total{service="checkout",status!~"5.."}[6h]))
          / sum(rate(http_requests_total{service="checkout"}[6h]))

      - record: slo:checkout:availability_3d
        expr: |
          sum(rate(http_requests_total{service="checkout",status!~"5.."}[3d]))
          / sum(rate(http_requests_total{service="checkout"}[3d]))

      - record: slo:checkout:availability_30d
        expr: |
          sum(rate(http_requests_total{service="checkout",status!~"5.."}[30d]))
          / sum(rate(http_requests_total{service="checkout"}[30d]))

SLO Dashboard and Burn Rate Alerting

# SLO Dashboard panels — create a dedicated "SLO Status" dashboard
# All queries reference the recording rules for fast rendering

# Panel 1: Error Budget Remaining (Gauge)
# Shows how much of your monthly error budget is left
# Expression:
(1 - slo:payment_api:error_budget_consumed_30d) * 100
# Threshold colors:
# Green: >50% remaining
# Yellow: 25-50% remaining
# Red: <25% remaining (burning budget fast)

# Panel 2: Current Availability (Stat)
# Shows real-time availability vs SLO target
# Expression:
slo:payment_api:request_success_rate_5m * 100
# With threshold at 99.9 (the SLO target)

# Panel 3: 30-Day Availability Timeline (Time series)
# Shows availability trend over the month
# Expression:
avg_over_time(slo:payment_api:request_success_rate_5m[30d:5m]) * 100

# Panel 4: Burn Rate (burn rate > 1 means the error budget is being consumed
# faster than a 30-day budget allows; at 14.4 the whole budget is gone in 50 hours)
(1 - slo:payment_api:request_success_rate_5m) / (1 - 0.999)

# Multi-window Burn Rate Alert Rules:
# These catch both fast burns (sudden outage) and slow burns (degraded service)
# In Grafana Alerting → Alert Rules:

# Fast burn alert (page now — a 14.4x burn exhausts the 30-day budget in 50 hours):
# 1h burn rate > 14.4 AND 6h burn rate > 14.4 (the longer window filters brief blips)
(
  (1 - slo:checkout:availability_1h) / (1 - 0.999) > 14.4
)
and
(
  (1 - slo:checkout:availability_6h) / (1 - 0.999) > 14.4
)

# Slow burn alert (notify during business hours — a 6x burn exhausts the budget in 5 days):
# 6h burn rate > 6 AND 3d burn rate > 6
(
  (1 - slo:checkout:availability_6h) / (1 - 0.999) > 6
)
and
(
  (1 - slo:checkout:availability_3d) / (1 - 0.999) > 6
)
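The 14.4 and 6 thresholds above fall straight out of simple arithmetic. A quick sketch to build intuition for how burn rate maps to time-to-exhaustion:

```python
def burn_rate(availability, slo_target=0.999):
    """How many times faster than budget-neutral the error budget is burning."""
    return (1 - availability) / (1 - slo_target)

def hours_to_exhaustion(rate, budget_window_days=30):
    """At a constant burn rate, hours until a full window's budget is gone."""
    return budget_window_days * 24 / rate

print(hours_to_exhaustion(14.4))       # fast-burn threshold: budget gone in 50 hours
print(hours_to_exhaustion(6.0) / 24)   # slow-burn threshold: budget gone in 5 days
print(burn_rate(0.9900, 0.999))        # 99.0% availability is roughly a 10x burn
```

A burn rate of exactly 1 means errors arrive at precisely the pace a 30-day budget tolerates, which is why sustained values above 1 always warrant attention.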

On-Call Management with Grafana OnCall

Grafana OnCall is an open-source on-call management system that integrates natively with Grafana's unified alerting. It handles rotation schedules, escalation policies, and alert routing — replacing PagerDuty or OpsGenie for teams who want to keep their entire observability stack self-hosted.

Deploying Grafana OnCall

# Add Grafana OnCall to your monitoring stack:
# docker-compose.yml additions:

  oncall-engine:
    image: grafana/oncall:latest
    container_name: oncall_engine
    restart: unless-stopped
    command: uwsgi --ini uwsgi.ini
    environment:
      DATABASE_TYPE: postgresql
      DATABASE_HOST: postgres
      DATABASE_PORT: 5432
      DATABASE_NAME: oncall
      DATABASE_USER: oncall
      DATABASE_PASSWORD: ${ONCALL_DB_PASSWORD}
      BROKER_TYPE: redis
      REDIS_URI: redis://:${REDIS_PASSWORD}@redis:6379/3
      SECRET_KEY: ${ONCALL_SECRET_KEY}
      GRAFANA_API_URL: http://grafana:3000
      GRAFANA_ONCALL_DJANGO_SETTINGS_MODULE: settings.prod_without_db_initialization
      # Telegram/Slack notifications:
      TELEGRAM_TOKEN: ${TELEGRAM_BOT_TOKEN:-}
      SLACK_CLIENT_ID: ${SLACK_CLIENT_ID:-}
      SLACK_CLIENT_SECRET: ${SLACK_CLIENT_SECRET:-}
    depends_on:
      - postgres
      - redis
    networks:
      - monitoring

  oncall-celery:
    image: grafana/oncall:latest
    container_name: oncall_celery
    restart: unless-stopped
    command: celery -A engine worker -l info -c 2
    environment:
      # Same as engine
      DATABASE_TYPE: postgresql
      DATABASE_HOST: postgres
      DATABASE_PORT: 5432
      DATABASE_NAME: oncall
      DATABASE_USER: oncall
      DATABASE_PASSWORD: ${ONCALL_DB_PASSWORD}
      BROKER_TYPE: redis
      REDIS_URI: redis://:${REDIS_PASSWORD}@redis:6379/3
      SECRET_KEY: ${ONCALL_SECRET_KEY}
    depends_on:
      - oncall-engine
    networks:
      - monitoring

# Create the OnCall database role and database:
docker exec postgres psql -U postgres -c "CREATE USER oncall WITH PASSWORD '${ONCALL_DB_PASSWORD}';"
docker exec postgres psql -U postgres -c "CREATE DATABASE oncall OWNER oncall;"

docker compose up -d oncall-engine oncall-celery

# Install OnCall plugin in Grafana:
# Grafana → Administration → Plugins → search "Grafana OnCall" → Install
# Or via CLI:
docker exec grafana grafana-cli plugins install grafana-oncall-app
docker compose restart grafana

Configuring On-Call Rotations and Escalation Policies

# Configure OnCall via Grafana OnCall plugin UI:
# Grafana → OnCall → Users → Connect notification channels
# Each team member connects: Telegram, Slack, Email, Phone (via webhook)

# Example on-call schedule setup via OnCall API:
ONCALL_URL="http://localhost:8080"  # OnCall engine
ONCALL_TOKEN="your-oncall-api-token"

# Create an on-call schedule (weekly rotation):
curl -X POST "${ONCALL_URL}/api/v1/schedules/" \
  -H "Authorization: ${ONCALL_TOKEN}" \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "Backend On-Call",
    "type": "web",
    "time_zone": "UTC",
    "shifts": [{
      "name": "Week 1",
      "type": "recurrent_event",
      "duration": 604800,
      "rotation_start": "2026-04-06T00:00:00Z",
      "frequency": "weekly",
      "interval": 1,
      "by_day": ["MO"],
      "users": ["USER_ID_ALICE", "USER_ID_BOB", "USER_ID_CAROL"]
    }]
  }' | jq .id

# Create escalation policy:
# Step 1 notifies the on-call engineer immediately; step 2 waits 15 minutes;
# step 3 escalates to the manager if the alert is still unacknowledged.
# (JSON does not allow comments, so the step descriptions live up here.)
curl -X POST "${ONCALL_URL}/api/v1/escalation_policies/" \
  -H "Authorization: ${ONCALL_TOKEN}" \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "Backend Escalation",
    "steps": [
      {
        "type": "notify_on_call_from_schedule",
        "notify_on_call_from_schedule": "SCHEDULE_ID",
        "wait_delay": "00:00:00"
      },
      {
        "type": "wait",
        "wait_delay": "00:15:00"
      },
      {
        "type": "notify_persons",
        "persons_to_notify": ["MANAGER_USER_ID"],
        "wait_delay": "00:00:00"
      }
    ]
  }' | jq .id

# Connect Grafana alerts to OnCall:
# Grafana → Alerting → Contact Points → Add contact point
# Type: Grafana OnCall
# Select: your escalation policy
# Now Grafana alerts automatically page on-call through OnCall
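The escalation chain above is easiest to reason about as a timeline. A small sketch that computes when each notifying step fires from the wait_delay values (the step names are illustrative, not part of the OnCall API):

```python
from datetime import timedelta

def parse_delay(hhmmss):
    """Parse an OnCall-style 'HH:MM:SS' wait_delay string into a timedelta."""
    h, m, s = (int(part) for part in hhmmss.split(":"))
    return timedelta(hours=h, minutes=m, seconds=s)

def escalation_timeline(steps):
    """Return (step_name, minutes_after_alert) for each notifying step."""
    elapsed = timedelta()
    timeline = []
    for name, kind, delay in steps:
        elapsed += parse_delay(delay)
        if kind != "wait":  # wait steps only advance the clock
            timeline.append((name, elapsed.total_seconds() / 60))
    return timeline

policy = [
    ("page on-call engineer", "notify_on_call_from_schedule", "00:00:00"),
    ("hold for ack", "wait", "00:15:00"),
    ("escalate to manager", "notify_persons", "00:00:00"),
]
print(escalation_timeline(policy))
# [('page on-call engineer', 0.0), ('escalate to manager', 15.0)]
```

Fifteen minutes to manager escalation is aggressive; tune the wait step to match how quickly your team realistically acknowledges pages.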

Advanced Dashboard Techniques

Provisioning Dashboards as Code

# Provision dashboards via YAML config — version-controlled and reproducible
# grafana/provisioning/dashboards/default.yaml

cat > grafana/provisioning/dashboards/default.yaml << 'EOF'
apiVersion: 1

providers:
  - name: 'default'
    orgId: 1
    folder: 'Production'
    folderUid: 'production'
    type: file
    disableDeletion: true       # Prevent UI deletions of provisioned dashboards
    updateIntervalSeconds: 10
    allowUiUpdates: false       # Read-only in the UI; changes go through Git
    options:
      path: /etc/grafana/provisioning/dashboards

  - name: 'slos'
    orgId: 1
    folder: 'SLOs'
    type: file
    disableDeletion: true
    options:
      path: /etc/grafana/provisioning/dashboards/slos
EOF

# Create a parameterized dashboard JSON and commit to Git:
# Every team member can view dashboards without Grafana access
# Changes go through pull request review

# Export a dashboard for version control:
# File provisioning expects the bare dashboard JSON (no {dashboard: ...} wrapper):
curl -s http://admin:pass@localhost:3000/api/dashboards/uid/payment-overview | \
  jq '.dashboard | .id = null' > \
  grafana/provisioning/dashboards/payment-overview.json

# Mount provisioning directory in Grafana:
# In docker-compose.yml:
# volumes:
#   - ./grafana/provisioning:/etc/grafana/provisioning:ro

# Provisioned dashboard files auto-reload every 10s (updateIntervalSeconds);
# restart Grafana only when the provider YAML itself changes:
docker compose restart grafana
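Exporting one dashboard at a time gets tedious once you have dozens, and the loop is easy to script. A sketch of a bulk exporter (the URL, credentials, and output directory are placeholders) that also nulls the instance-local id field so each file imports cleanly elsewhere:

```python
import base64
import json
import os
import urllib.request

def sanitize(dashboard):
    """Strip instance-local fields so the JSON is portable across Grafana instances."""
    clean = dict(dashboard)
    clean["id"] = None  # ids are per-database; uid is the stable identifier
    return clean

def export_all(base_url, user, password, out_dir="grafana/provisioning/dashboards"):
    """Write every dashboard on a Grafana instance to out_dir as bare JSON."""
    os.makedirs(out_dir, exist_ok=True)
    token = base64.b64encode(f"{user}:{password}".encode()).decode()

    def get(path):
        req = urllib.request.Request(base_url + path)
        req.add_header("Authorization", f"Basic {token}")
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    for item in get("/api/search?type=dash-db"):
        dash = get(f"/api/dashboards/uid/{item['uid']}")["dashboard"]
        path = os.path.join(out_dir, f"dashboard_{item['uid']}.json")
        with open(path, "w") as f:
            json.dump(sanitize(dash), f, indent=2, sort_keys=True)
        print(f"exported {path}")

# export_all("http://localhost:3000", "admin", "pass")
```

Run it from CI on a schedule and commit the diff; sort_keys keeps the JSON stable so pull requests show only real dashboard changes.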

Tips, Gotchas, and Troubleshooting

HA Instances Not Synchronizing Alert State

# Symptom: Alert fires on node1, resolved on node2, but node1 still shows firing
# Cause: Alerting HA peer discovery not working

# Check HA peer connectivity (9094 speaks the memberlist gossip protocol,
# not HTTP, so test the raw TCP socket instead of curling it):
docker exec grafana_node nc -z grafana-node2.internal 9094 && echo reachable || echo unreachable
# If unreachable, the peers can't gossip alert state to each other

# Verify the HA advertise addresses are correct:
docker exec grafana_node env | grep -i 'HA\|ALERTING'
# GF_UNIFIED_ALERTING_HA_ADVERTISE_ADDRESS should be THIS node's IP, not 0.0.0.0

# Check HA status in Grafana logs:
docker logs grafana_node | grep -iE '(ha|gossip|peer|cluster)'
# Should show: "Starting ha scheduler" and "joining cluster"

# Test by checking both nodes show the same alert state:
curl -s http://grafana-node1.internal:3000/api/alertmanager/grafana/api/v2/alerts | \
  jq 'length'
curl -s http://grafana-node2.internal:3000/api/alertmanager/grafana/api/v2/alerts | \
  jq 'length'
# Should be identical

# If firewall is blocking port 9094 (alerting HA port):
sudo ufw allow 9094/tcp
# From the monitoring network to both Grafana nodes

SLO Calculations Return NaN or Unexpected Values

# NaN in SLO queries usually means the denominator is 0 (no requests in window)

# Debug by querying the raw metric (use --data-urlencode so the braces
# survive the shell and curl's URL globbing):
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(http_requests_total{service="payment-api"}[5m]))' | jq .

# If the result is empty: service label doesn't match your actual metric labels
# Check actual label names:
curl -s 'http://localhost:9090/api/v1/label/service/values' | jq .data

# If the service label doesn't exist, check what labels your metrics have:
curl -s 'http://localhost:9090/api/v1/query?query=http_requests_total' | \
  jq '.data.result[0].metric'
# Shows all labels on the first series; use the correct label name in your SLO query

# Prevent division by zero with or vector(0):
(
  sum(rate(http_requests_total{service="payment-api",status!~"5.."}[5m]))
  or vector(0)
)
/
(
  sum(rate(http_requests_total{service="payment-api"}[5m]))
  or vector(1)  # Default to 100% if no traffic
)

# If recording rules aren't updating:
curl -s http://localhost:9090/api/v1/rules | \
  jq '.data.groups[] | select(.name=="slo_computations") | .evaluationTime'
# Should be < 30s (our evaluation interval)

OnCall Not Sending Notifications

# Check OnCall engine logs:
docker logs oncall_engine --tail 50 | grep -iE '(error|notification|telegram|slack)'

# Check Celery worker is processing tasks:
docker logs oncall_celery --tail 20 | grep -iE '(task|error|received)'

# Verify the integration is connected to Grafana:
curl -s ${ONCALL_URL}/api/v1/integrations/ \
  -H "Authorization: ${ONCALL_TOKEN}" | jq '[.[] | {name: .name, type: .type}]'

# Test notification delivery:
curl -X POST ${ONCALL_URL}/api/v1/send_demo_alert/ \
  -H "Authorization: ${ONCALL_TOKEN}" \
  -d '{"integration_id": "YOUR_INTEGRATION_ID"}'
# Should trigger a test notification through your notification channels

# Check user notification preferences are set:
curl -s ${ONCALL_URL}/api/v1/users/me/ \
  -H "Authorization: ${ONCALL_TOKEN}" | \
  jq '.notification_rules'
# Each user needs at least one notification rule (Telegram, email, etc.)

# If Telegram bot isn't responding:
# Verify the bot token:
curl -s https://api.telegram.org/bot${TELEGRAM_TOKEN}/getMe | jq .ok
# Should return: true

Pro Tips

  • Start with 99.0% SLO targets and tighten from actual data — beginning with a 99.9% SLO target on a service that's historically 98.5% available creates immediate violations. Start with achievable targets, measure actual performance for 90 days, then set targets that represent meaningful improvement rather than aspirational fiction.
  • Use Grafana annotations for deployments and incidents — annotate every production deployment on your dashboards. When you see a spike in error rate, you can immediately tell whether it correlates with a deploy. Set this up via CI/CD: after each successful deploy, curl -X POST http://grafana:3000/api/annotations -H 'Content-Type: application/json' -d '{"text": "Deploy: v1.2.3", "tags": ["deploy"]}'
  • Build your SLO dashboard before your first major incident, not after — SLO tracking takes time to calibrate. The first time you need to report on reliability to leadership shouldn't be the first time you're building the calculation. Get it running in the background for at least a full month before it matters.
  • Connect OnCall's schedule calendar to your team calendar — OnCall can export ICS calendar files. Have every on-call engineer subscribe to the OnCall calendar in Google Calendar or Outlook so they see their on-call weeks alongside their regular work schedule, not just in a monitoring tool they rarely open.
  • In HA mode, test dashboard persistence by stopping one node while editing — create a new dashboard, save it, then stop that Grafana node. On the other node, verify the dashboard exists. If it doesn't, your PostgreSQL connection isn't working correctly — the dashboard was only saved to local state.
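The deploy-annotation tip above is worth wrapping in a small helper your CI job calls after every release. A sketch (the Grafana URL and API token are placeholders; /api/annotations is Grafana's standard annotations endpoint):

```python
import json
import time
import urllib.request

def annotation_payload(version, tags=("deploy",), when_ms=None):
    """Build the JSON body for Grafana's /api/annotations endpoint."""
    return {
        "text": f"Deploy: {version}",
        "tags": list(tags),
        # Grafana expects epoch milliseconds; default to "now"
        "time": when_ms if when_ms is not None else int(time.time() * 1000),
    }

def post_annotation(base_url, api_token, version):
    """POST a deploy annotation using a service-account token."""
    body = json.dumps(annotation_payload(version)).encode()
    req = urllib.request.Request(
        f"{base_url}/api/annotations", data=body, method="POST"
    )
    req.add_header("Content-Type", "application/json")
    req.add_header("Authorization", f"Bearer {api_token}")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# post_annotation("http://grafana:3000", "<service-account-token>", "v1.2.3")
```

Tag annotations consistently ("deploy", plus the service name) so dashboard panels can filter them per service.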

Wrapping Up

The four Grafana guides now cover the complete observability platform: infrastructure metrics and dashboards, the full LGTM stack with logs and traces, external availability monitoring integration, and this guide's HA clustering, business metrics, SLO tracking, and on-call management.

SLO tracking is the capability that most transforms how engineering teams talk about reliability. When you can show leadership a dashboard with error budget remaining, burn rate, and month-over-month SLO performance — all derived from production metrics — reliability stops being a vague feeling and becomes a measurable property of your system that you can commit to, track, and improve deliberately.


Need a Complete Observability Platform Built for Your Engineering Organization?

Designing HA Grafana with proper SLO tracking for your specific services, business metrics dashboards that connect technical metrics to the outcomes your leadership cares about, and self-hosted on-call management that integrates with your existing alerting — the sysbrix team builds observability platforms for engineering organizations that need to make reliability commitments they can actually keep.

Talk to Us →