Self-Host Grafana: High Availability, Business Metrics, On-Call Management, and SLO Tracking
Three guides into this Grafana series, you have a solid observability foundation: Prometheus, Node Exporter, and infrastructure dashboards from the first guide; Loki for logs, Tempo for traces, and the full LGTM stack from the second; and Uptime Kuma integration for availability monitoring from the third. This fourth guide addresses the questions that matter at organizational scale: how do you keep Grafana itself highly available, how do you build dashboards that connect infrastructure metrics to business outcomes, how do you manage on-call rotations without paying for PagerDuty, and how do you track SLOs in a way your engineering leadership can actually use in quarterly planning?
Prerequisites
- A running Grafana + Prometheus stack — see our getting started guide
- Loki and Tempo deployed — see our LGTM stack guide
- Grafana version 10.0+ — HA, SLO panels, and on-call features require recent releases
- For HA: at least two servers and a PostgreSQL instance for shared Grafana state
- At least 30 days of metrics history for meaningful SLO calculations
- Application-level metrics exposed to Prometheus (request rates, error rates, durations)
Verify your current stack is ready for the patterns in this guide:
# Check Grafana version:
docker exec grafana grafana-cli --version
# Verify you have 30+ days of Prometheus data (query at a timestamp 30 days back):
curl -s "http://localhost:9090/api/v1/query?query=up&time=$(date -d '30 days ago' +%s)" | \
jq '.data.result | length'
# A non-zero count means samples existed 30 days ago
# Check if application metrics exist (needed for SLOs):
curl -s 'http://localhost:9090/api/v1/label/__name__/values' | \
jq '.data | map(select(startswith("http_requests")))'
# Should show http_requests_total or similar application metrics
# Current Grafana database backend (no GF_DATABASE_* overrides means the SQLite default):
docker exec grafana env | grep GF_DATABASE_ || echo 'Using the SQLite default'
High Availability: Grafana Without a Single Point of Failure
A single Grafana instance means your entire observability platform goes dark whenever that container restarts or its server has an issue — exactly when you'd most want it available. HA Grafana runs multiple instances sharing a PostgreSQL backend and a distributed session store, with a load balancer routing dashboard requests to any healthy instance.
HA Architecture
Grafana's HA model is simpler than you might expect:
- Database — swap SQLite for PostgreSQL. All dashboard definitions, user accounts, alert rules, and data sources live here.
- Session store — in recent Grafana releases, login sessions and auth tokens live in the shared database, so they survive node failover by default; add Redis as the remote cache so cached auth state is shared when the load balancer moves a user to a different instance.
- Storage — dashboard images and PDF reports can use S3-compatible storage instead of local disk.
- Alerting — Grafana's built-in Alertmanager deduplicates notifications in HA mode via gossip between the nodes (the HA peer settings in the compose file below), not via the database.
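Before standing up the load balancer, it helps to have a quick probe for node health. This is a sketch (the node hostnames are illustrative placeholders) built on Grafana's real /api/health endpoint, which reports the database connection state; a fake fetcher is injected so the logic runs without live nodes:

```python
import json
from urllib import request, error

def check_node(base_url, fetch=None):
    """True if the node's /api/health reports database: ok."""
    if fetch is None:
        def fetch(url):
            with request.urlopen(url, timeout=3) as resp:
                return resp.status, resp.read()
    try:
        status, body = fetch(base_url + "/api/health")
        return status == 200 and json.loads(body).get("database") == "ok"
    except (error.URLError, OSError, ValueError):
        return False

def healthy_nodes(nodes, fetch=None):
    """Filter a node list down to the ones a load balancer should route to."""
    return [n for n in nodes if check_node(n, fetch)]

# Simulated responses so the sketch runs without live Grafana nodes:
responses = {
    "http://grafana-node1.internal:3000/api/health": (200, b'{"database": "ok"}'),
    "http://grafana-node2.internal:3000/api/health": (503, b'{"database": "failing"}'),
}
nodes = ["http://grafana-node1.internal:3000", "http://grafana-node2.internal:3000"]
print(healthy_nodes(nodes, fetch=lambda url: responses[url]))
```

Run the same check from cron against your real hostnames and you have a cheap independent watchdog for the HA pair.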
HA Grafana with PostgreSQL and Redis
# docker-compose.ha.yml — Deploy on EACH Grafana node
# Both nodes share the same PostgreSQL and Redis
version: '3.8'
services:
grafana:
image: grafana/grafana-oss:latest # Pin a specific version in production so both nodes run identical builds
container_name: grafana_node
restart: unless-stopped
ports:
- "3000:3000"
environment:
# Database — use PostgreSQL instead of SQLite:
GF_DATABASE_TYPE: postgres
GF_DATABASE_HOST: pg.internal.yourdomain.com:5432
GF_DATABASE_NAME: grafana
GF_DATABASE_USER: grafana
GF_DATABASE_PASSWORD: ${PG_PASSWORD}
GF_DATABASE_SSL_MODE: require
# Remote cache — Redis shares cached auth state across nodes
# (login sessions themselves are stored in the shared PostgreSQL database):
GF_REMOTE_CACHE_TYPE: redis
GF_REMOTE_CACHE_CONNSTR: addr=redis.internal.yourdomain.com:6379,password=${REDIS_PASSWORD},db=0
# Security:
GF_SECURITY_ADMIN_USER: admin
GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD}
GF_SECURITY_SECRET_KEY: ${GRAFANA_SECRET_KEY}
# Must be identical on all nodes — used for cookie signing
# Server config:
GF_SERVER_ROOT_URL: https://grafana.yourdomain.com
GF_SERVER_DOMAIN: grafana.yourdomain.com
# HA mode — enables distributed alerting:
GF_UNIFIED_ALERTING_HA_PEERS: grafana-node1.internal:9094,grafana-node2.internal:9094
GF_UNIFIED_ALERTING_HA_LISTEN_ADDRESS: :9094
GF_UNIFIED_ALERTING_HA_ADVERTISE_ADDRESS: ${NODE_IP}:9094
# Disable user signup (manage users centrally):
GF_USERS_ALLOW_SIGN_UP: "false"
# SMTP for alert notifications:
GF_SMTP_ENABLED: "true"
GF_SMTP_HOST: ${SMTP_HOST}:587
GF_SMTP_USER: ${SMTP_USER}
GF_SMTP_PASSWORD: ${SMTP_PASSWORD}
GF_SMTP_FROM_ADDRESS: [email protected]
volumes:
- grafana_data:/var/lib/grafana
- ./grafana/provisioning:/etc/grafana/provisioning:ro
healthcheck:
test: ["CMD-SHELL", "curl -f http://localhost:3000/api/health || exit 1"]
interval: 30s
timeout: 10s
retries: 3
start_period: 60s
volumes:
grafana_data:
Nginx Load Balancer and Database Migration
# First: migrate existing SQLite dashboards to PostgreSQL
# This is a one-time operation before enabling HA
# Export all dashboards from current SQLite Grafana:
curl -s 'http://admin:pass@localhost:3000/api/search?type=dash-db' | \
jq -r '.[].uid' | while read uid; do
curl -s "http://admin:pass@localhost:3000/api/dashboards/uid/$uid" | \
jq .dashboard > "backups/dashboard_${uid}.json"
echo "Exported: $uid"
done
# Export data sources:
curl -s http://admin:pass@localhost:3000/api/datasources | \
jq '.' > backups/datasources.json
# Start new HA Grafana pointing at PostgreSQL:
docker compose -f docker-compose.ha.yml up -d
# Wait for first-time database migration to complete:
docker logs -f grafana_node 2>&1 | grep -E '(migration|ready|started)'
# Import data sources into the new HA instance:
jq -c '.[]' backups/datasources.json | while read ds; do
curl -s -X POST http://admin:newpass@localhost:3000/api/datasources \
-H 'Content-Type: application/json' \
-d "$ds" | jq .message
done
# Import dashboards:
for f in backups/dashboard_*.json; do
uid=$(basename $f .json | sed 's/dashboard_//')
curl -s -X POST http://admin:newpass@localhost:3000/api/dashboards/db \
-H 'Content-Type: application/json' \
-d "{\"dashboard\": $(jq '.id = null' "$f"), \"overwrite\": true}" | jq .status
done
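Dashboard ids are instance-specific while uids travel with the dashboard, a detail that trips up migrations. Here is a sketch of the payload shape Grafana's dashboard API expects on import (POST /api/dashboards/db):

```python
import json

def import_payload(exported):
    """Build the POST body for /api/dashboards/db from an exported dashboard.

    The numeric id is instance-specific and cleared so the target Grafana
    assigns its own; the uid is preserved so dashboard URLs keep working.
    """
    dash = dict(exported["dashboard"])
    dash["id"] = None
    return {"dashboard": dash, "overwrite": True}

exported = {"dashboard": {"id": 42, "uid": "payments", "title": "Payments"}}
print(json.dumps(import_payload(exported)))
```

With overwrite true, re-running the migration is idempotent: a dashboard with the same uid is replaced rather than duplicated.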
# Load balancer for Grafana HA:
# /etc/nginx/sites-available/grafana-ha
upstream grafana_nodes {
ip_hash; # Sticky sessions for SSO compatibility
server grafana-node1.internal:3000 max_fails=3 fail_timeout=30s;
server grafana-node2.internal:3000 max_fails=3 fail_timeout=30s;
}
server {
listen 443 ssl http2;
server_name grafana.yourdomain.com;
ssl_certificate /etc/letsencrypt/live/grafana.yourdomain.com/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/grafana.yourdomain.com/privkey.pem;
location / {
proxy_pass http://grafana_nodes;
proxy_http_version 1.1;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
# WebSocket for live dashboard updates:
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
}
}
Business Metrics Dashboards
Infrastructure dashboards show CPU and memory. Business metrics dashboards answer the questions your CTO asks: Is revenue processing working? Are orders being fulfilled? How many paying customers are active right now? These dashboards connect your technical infrastructure to business outcomes — making them legible to non-technical stakeholders and giving on-call engineers the business context they need during incidents.
Instrumenting Business Events with Prometheus
# Add business event counters to your application code
# These are exposed via the /metrics endpoint alongside technical metrics
# Node.js example with prom-client:
const client = require('prom-client');
// Business event counters:
const ordersCreated = new client.Counter({
name: 'business_orders_created_total',
help: 'Total orders created',
labelNames: ['status', 'payment_method', 'region']
});
const revenueProcessed = new client.Counter({
name: 'business_revenue_processed_cents_total',
help: 'Total revenue processed in cents',
labelNames: ['currency', 'payment_method']
});
const activeSubscriptions = new client.Gauge({
name: 'business_active_subscriptions',
help: 'Currently active paid subscriptions',
labelNames: ['plan_type']
});
const userSignups = new client.Counter({
name: 'business_user_signups_total',
help: 'Total user registrations',
labelNames: ['source', 'plan']
});
// Instrument your business logic:
app.post('/orders', async (req, res) => {
try {
const order = await createOrder(req.body);
ordersCreated.inc({ status: 'created', payment_method: order.payment_method, region: order.region });
revenueProcessed.inc({ currency: order.currency, payment_method: order.payment_method }, order.total_cents);
res.json(order);
} catch (err) {
ordersCreated.inc({ status: 'failed', payment_method: req.body.payment_method, region: req.body.region });
throw err;
}
});
// Python example with prometheus_client:
from prometheus_client import Counter, Gauge, Histogram
orders_created = Counter(
'business_orders_created_total',
'Total orders created',
['status', 'payment_method']
)
checkout_duration = Histogram(
'business_checkout_duration_seconds',
'Time spent in checkout flow',
buckets=[0.5, 1.0, 2.5, 5.0, 10.0, 30.0]
)
Business Dashboard PromQL Queries
# PromQL queries for business metrics dashboards:
# Create these as panels in a "Business Overview" dashboard
# Current order rate (orders per minute, last 5 minutes):
rate(business_orders_created_total{status="created"}[5m]) * 60
# Order failure rate (% of orders failing):
rate(business_orders_created_total{status="failed"}[5m])
/
rate(business_orders_created_total[5m]) * 100
# Revenue rate (USD/hour):
rate(business_revenue_processed_cents_total[1h]) * 3600 / 100
# Active subscriptions by plan:
business_active_subscriptions
# User signups (total over the last 24h):
sum(increase(business_user_signups_total[24h]))
# Revenue by payment method (last 24h):
sum by (payment_method) (increase(business_revenue_processed_cents_total[24h])) / 100
# Order success rate trend (7-day rolling):
sum(rate(business_orders_created_total{status="created"}[7d]))
/
sum(rate(business_orders_created_total[7d])) * 100
# Checkout abandonment (started but not completed) — assumes a
# business_checkout_started_total counter incremented when a checkout begins:
1 - (
rate(business_orders_created_total{status="created"}[1h])
/
rate(business_checkout_started_total[1h])
)
# Alert: Revenue processing has stopped
# Alert if no revenue has been processed in 30 minutes during business hours:
sum(increase(business_revenue_processed_cents_total[30m])) == 0
and on() (hour() >= 8 and hour() <= 22) # Business hours filter
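Under the hood, rate() is roughly the increase of a counter divided by the window; a simplified Python sketch (ignoring counter resets and Prometheus's extrapolation, and using made-up sample numbers) shows how the order-failure panel turns counters into a percentage:

```python
def simple_rate(samples, window_s):
    """Per-second increase of a counter over its window — simplified:
    real rate() also handles counter resets and extrapolates to window edges."""
    first, last = samples[0], samples[-1]
    return (last - first) / window_s

# Counter values sampled at the start and end of a 5-minute window:
failed = simple_rate([120, 131], 300)      # 11 failed orders in 5m
created = simple_rate([9000, 9209], 300)   # 209 created orders in 5m

# Same shape as the PromQL: rate(failed) / rate(all orders) * 100
failure_pct = failed / (failed + created) * 100
print(round(failure_pct, 1))  # 5.0
```

Because the window length cancels in the ratio, failure rate is really just failed increase over total increase, which is why the PromQL reads naturally.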
SLO Tracking: Making Reliability Measurable
Service Level Objectives are the formal commitment your engineering team makes about reliability. Without SLO tracking in Grafana, reliability discussions are vague. With it, you can say: "Payment processing had 99.94% availability last month against a 99.9% SLO target, consuming 60% of our error budget." That's a concrete number your leadership can plan around.
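Error budget consumption is simple arithmetic on the SLI: the actual failure ratio divided by the allowed failure ratio. A quick check for 99.94% measured availability against a 99.9% target:

```python
def error_budget_consumed(availability, slo_target):
    """Fraction of the error budget used: actual failure ratio over allowed."""
    return (1 - availability) / (1 - slo_target)

consumed = error_budget_consumed(0.9994, 0.999)
print(f"{consumed:.0%} consumed, {1 - consumed:.0%} remaining")
```

A 99.9% target allows 0.1% of requests to fail; 0.06% actually failed, so 60% of the budget is gone and 40% remains.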
Defining and Calculating SLOs
# Add Prometheus recording rules to pre-compute SLO metrics
# Recording rules make SLO queries fast even over long time ranges
# prometheus/rules/slo-rules.yml
groups:
- name: slo_computations
interval: 30s
rules:
# Payment API — Availability SLO (99.9%)
# Good requests = any HTTP response that isn't 5xx
- record: slo:payment_api:request_success_rate_5m
expr: |
sum(rate(http_requests_total{service="payment-api",status!~"5.."}[5m]))
/
sum(rate(http_requests_total{service="payment-api"}[5m]))
# Payment API — Latency SLO (P95 < 500ms)
- record: slo:payment_api:p95_latency_5m
expr: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket{service="payment-api"}[5m]))
by (le)
)
# Error budget consumption — rolling 30 days
# Error budget = 1 - SLO target = 1 - 0.999 = 0.001
# Budget consumed = (1 - actual_availability) / error_budget
- record: slo:payment_api:error_budget_consumed_30d
expr: |
(
1 - (
sum(increase(http_requests_total{service="payment-api",status!~"5.."}[30d]))
/
sum(increase(http_requests_total{service="payment-api"}[30d]))
)
)
/ 0.001 # Error budget: 0.1% of requests can fail
# Checkout service — Multi-window availability (used for burn rate alerts)
- record: slo:checkout:availability_1h
expr: |
sum(rate(http_requests_total{service="checkout",status!~"5.."}[1h]))
/ sum(rate(http_requests_total{service="checkout"}[1h]))
- record: slo:checkout:availability_6h
expr: |
sum(rate(http_requests_total{service="checkout",status!~"5.."}[6h]))
/ sum(rate(http_requests_total{service="checkout"}[6h]))
- record: slo:checkout:availability_3d
expr: |
sum(rate(http_requests_total{service="checkout",status!~"5.."}[3d]))
/ sum(rate(http_requests_total{service="checkout"}[3d]))
- record: slo:checkout:availability_30d
expr: |
sum(rate(http_requests_total{service="checkout",status!~"5.."}[30d]))
/ sum(rate(http_requests_total{service="checkout"}[30d]))
SLO Dashboard and Burn Rate Alerting
# SLO Dashboard panels — create a dedicated "SLO Status" dashboard
# All queries reference the recording rules for fast rendering
# Panel 1: Error Budget Remaining (Gauge)
# Shows how much of your monthly error budget is left
# Expression:
(1 - slo:payment_api:error_budget_consumed_30d) * 100
# Threshold colors:
# Green: >50% remaining
# Yellow: 25-50% remaining
# Red: <25% remaining (burning budget fast)
# Panel 2: Current Availability (Stat)
# Shows real-time availability vs SLO target
# Expression:
slo:payment_api:request_success_rate_5m * 100
# With threshold at 99.9 (the SLO target)
# Panel 3: 30-Day Availability Timeline (Time series)
# Shows availability trend over the month
# Expression:
avg_over_time(slo:payment_api:request_success_rate_5m[30d:5m]) * 100
# Panel 4: Burn Rate (burn rate > 1 = burning budget faster than monthly budget)
# If burn rate > 14.4: you'll exhaust your 30-day budget in 50 hours
(1 - slo:payment_api:request_success_rate_5m) / (1 - 0.999)
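The 14.4 threshold and the 50-hour figure are connected by simple arithmetic: burn rate is the speed of budget consumption relative to spending it evenly across the window, so exhaustion time is the window length divided by the burn rate.

```python
def hours_to_exhaustion(burn_rate, window_days=30):
    """At a constant burn rate, hours until the window's error budget is gone."""
    return window_days * 24 / burn_rate

print(hours_to_exhaustion(14.4))  # ~50 hours: the fast-burn page threshold
print(hours_to_exhaustion(1.0))   # 720 hours, i.e. exactly the 30-day window
```

A burn rate of exactly 1 means you would land precisely on budget at month's end; anything sustained above 1 means a violation is coming.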
# Multi-window Burn Rate Alert Rules:
# These catch both fast burns (sudden outage) and slow burns (degraded service)
# In Grafana Alerting → Alert Rules:
# Fast burn alert (page now — at this rate the 30-day budget is gone in ~50 hours):
# 1h burn rate > 14.4 AND 6h burn rate > 14.4
# (the canonical multi-window recipe pairs 1h with a 5m short window;
# this variant reuses the 6h recording rule defined above)
(
(1 - slo:checkout:availability_1h) / (1 - 0.999) > 14.4
)
and
(
(1 - slo:checkout:availability_6h) / (1 - 0.999) > 14.4
)
# Slow burn alert (notify during business hours — at this rate the budget is gone in ~5 days):
# 6h burn rate > 6 AND 3d burn rate > 6
(
(1 - slo:checkout:availability_6h) / (1 - 0.999) > 6
)
and
(
(1 - slo:checkout:availability_3d) / (1 - 0.999) > 6
)
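The two burn-rate alerts condense into a small decision function. The thresholds follow the expressions above (14.4 for fast burn, 6 for slow burn); inputs are the per-window availabilities from the recording rules:

```python
def burn_rate(availability, slo_target=0.999):
    """How fast the error budget is burning relative to an even 30-day spend."""
    return (1 - availability) / (1 - slo_target)

def classify(avail_1h, avail_6h, avail_3d, slo_target=0.999):
    b1h, b6h, b3d = (burn_rate(a, slo_target) for a in (avail_1h, avail_6h, avail_3d))
    if b1h > 14.4 and b6h > 14.4:
        return "page"    # fast burn: budget gone in ~50 hours
    if b6h > 6 and b3d > 6:
        return "ticket"  # slow burn: budget gone in ~5 days
    return "ok"

print(classify(0.97, 0.98, 0.999))       # severe recent outage -> page
print(classify(0.9999, 0.9999, 0.9999))  # well under budget -> ok
```

Requiring both windows is what suppresses flapping: a brief blip trips the short window but not the long one, so nobody gets paged for noise.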
On-Call Management with Grafana OnCall
Grafana OnCall is an open-source on-call management system that integrates natively with Grafana's unified alerting. It handles rotation schedules, escalation policies, and alert routing — replacing PagerDuty or OpsGenie for teams who want to keep their entire observability stack self-hosted.
Deploying Grafana OnCall
# Add Grafana OnCall to your monitoring stack:
# docker-compose.yml additions:
oncall-engine:
image: grafana/oncall:latest
container_name: oncall_engine
restart: unless-stopped
command: uwsgi --ini uwsgi.ini
environment:
DATABASE_TYPE: postgresql
DATABASE_HOST: postgres
DATABASE_PORT: 5432
DATABASE_NAME: oncall
DATABASE_USER: oncall
DATABASE_PASSWORD: ${ONCALL_DB_PASSWORD}
BROKER_TYPE: redis
REDIS_URI: redis://:${REDIS_PASSWORD}@redis:6379/3
SECRET_KEY: ${ONCALL_SECRET_KEY}
GRAFANA_API_URL: http://grafana:3000
GRAFANA_ONCALL_DJANGO_SETTINGS_MODULE: settings.prod_without_db_initialization
# Telegram/Slack notifications:
TELEGRAM_TOKEN: ${TELEGRAM_BOT_TOKEN:-}
SLACK_CLIENT_ID: ${SLACK_CLIENT_ID:-}
SLACK_CLIENT_SECRET: ${SLACK_CLIENT_SECRET:-}
depends_on:
- postgres
- redis
networks:
- monitoring
oncall-celery:
image: grafana/oncall:latest
container_name: oncall_celery
restart: unless-stopped
command: celery -A engine worker -l info -c 2
environment:
# Same as engine
DATABASE_TYPE: postgresql
DATABASE_HOST: postgres
DATABASE_PORT: 5432
DATABASE_NAME: oncall
DATABASE_USER: oncall
DATABASE_PASSWORD: ${ONCALL_DB_PASSWORD}
BROKER_TYPE: redis
REDIS_URI: redis://:${REDIS_PASSWORD}@redis:6379/3
SECRET_KEY: ${ONCALL_SECRET_KEY}
depends_on:
- oncall-engine
networks:
- monitoring
# Create the OnCall database role and database:
docker exec postgres psql -U postgres -c "CREATE USER oncall WITH PASSWORD '${ONCALL_DB_PASSWORD}';"
docker exec postgres psql -U postgres -c "CREATE DATABASE oncall OWNER oncall;"
docker compose up -d oncall-engine oncall-celery
# Install OnCall plugin in Grafana:
# Grafana → Administration → Plugins → search "Grafana OnCall" → Install
# Or via CLI:
docker exec grafana grafana-cli plugins install grafana-oncall-app
docker compose restart grafana
Configuring On-Call Rotations and Escalation Policies
# Configure OnCall via Grafana OnCall plugin UI:
# Grafana → OnCall → Users → Connect notification channels
# Each team member connects: Telegram, Slack, Email, Phone (via webhook)
# Example on-call schedule setup via OnCall API:
ONCALL_URL="http://localhost:8080" # OnCall engine
ONCALL_TOKEN="your-oncall-api-token"
# Create an on-call schedule (weekly rotation):
curl -X POST "${ONCALL_URL}/api/v1/schedules/" \
-H "Authorization: ${ONCALL_TOKEN}" \
-H 'Content-Type: application/json' \
-d '{
"name": "Backend On-Call",
"type": "web",
"time_zone": "UTC",
"shifts": [{
"name": "Week 1",
"type": "recurrent_event",
"duration": 604800,
"rotation_start": "2026-04-06T00:00:00Z",
"frequency": "weekly",
"interval": 1,
"by_day": ["MO"],
"users": ["USER_ID_ALICE", "USER_ID_BOB", "USER_ID_CAROL"]
}]
}' | jq .id
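The weekly rotation defined above (duration 604800 seconds = 7 days, three users in order) can be sanity-checked with a few lines of Python; hypothetical names stand in for the user IDs:

```python
from datetime import datetime, timezone

def on_call_now(users, rotation_start, now, shift_seconds=604800):
    """Who is on call: fixed-length shifts cycling through users in order."""
    elapsed = (now - rotation_start).total_seconds()
    return users[int(elapsed // shift_seconds) % len(users)]

users = ["alice", "bob", "carol"]
start = datetime(2026, 4, 6, tzinfo=timezone.utc)  # rotation_start from the API call

print(on_call_now(users, start, datetime(2026, 4, 8, tzinfo=timezone.utc)))   # alice (week 1)
print(on_call_now(users, start, datetime(2026, 4, 15, tzinfo=timezone.utc)))  # bob (week 2)
```

This is also a handy cross-check against the schedule OnCall renders: if the two disagree, your rotation_start or time zone is off.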
# Create escalation policy:
curl -X POST "${ONCALL_URL}/api/v1/escalation_policies/" \
-H "Authorization: ${ONCALL_TOKEN}" \
-H 'Content-Type: application/json' \
-d '{
"name": "Backend Escalation",
"steps": [
{
"type": "notify_on_call_from_schedule",
"notify_on_call_from_schedule": "SCHEDULE_ID",
"wait_delay": "00:00:00"
},
{
"type": "wait",
"wait_delay": "00:15:00"
},
{
"type": "notify_persons",
"persons_to_notify": ["MANAGER_USER_ID"],
"wait_delay": "00:00:00"
}
]
}' | jq .id
# Steps: notify the on-call engineer immediately, wait 15 minutes,
# then escalate to the manager if the alert is still unacknowledged
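It helps to see when each escalation step actually fires. This sketch parses the HH:MM:SS wait_delay strings and lays out the timeline, using the same step types as the policy above:

```python
def delay_seconds(hhmmss):
    """Parse an HH:MM:SS delay string into seconds."""
    h, m, s = (int(x) for x in hhmmss.split(":"))
    return h * 3600 + m * 60 + s

def timeline(steps):
    """Return (elapsed_seconds, step_type) for each notifying step."""
    t, out = 0, []
    for step in steps:
        t += delay_seconds(step.get("wait_delay", "00:00:00"))
        if step["type"] != "wait":
            out.append((t, step["type"]))
    return out

policy = [
    {"type": "notify_on_call_from_schedule", "wait_delay": "00:00:00"},
    {"type": "wait", "wait_delay": "00:15:00"},
    {"type": "notify_persons", "wait_delay": "00:00:00"},
]
print(timeline(policy))  # [(0, 'notify_on_call_from_schedule'), (900, 'notify_persons')]
```

So the on-call engineer has a 15-minute acknowledgment window before the manager is pulled in; tune that wait to match your response-time expectations.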
# Connect Grafana alerts to OnCall:
# Grafana → Alerting → Contact Points → Add contact point
# Type: Grafana OnCall
# Select: your escalation policy
# Now Grafana alerts automatically page on-call through OnCall
Advanced Dashboard Techniques
Provisioning Dashboards as Code
# Provision dashboards via YAML config — version-controlled and reproducible
# grafana/provisioning/dashboards/default.yaml
cat > grafana/provisioning/dashboards/default.yaml << 'EOF'
apiVersion: 1
providers:
- name: 'default'
orgId: 1
folder: 'Production'
folderUid: 'production'
type: file
disableDeletion: true # Prevent UI deletions of provisioned dashboards
updateIntervalSeconds: 10
allowUiUpdates: false # Provisioned dashboards are read-only in the UI
options:
path: /etc/grafana/provisioning/dashboards
- name: 'slos'
orgId: 1
folder: 'SLOs'
type: file
disableDeletion: true
options:
path: /etc/grafana/provisioning/dashboards/slos
EOF
# Create a parameterized dashboard JSON and commit to Git:
# Every team member can view dashboards without Grafana access
# Changes go through pull request review
# Export a dashboard for version control:
curl -s http://admin:pass@localhost:3000/api/dashboards/uid/payment-overview | \
jq '.dashboard' > \
grafana/provisioning/dashboards/payment-overview.json
# File provisioning expects the bare dashboard JSON, not an API import wrapper
# Mount provisioning directory in Grafana:
# In docker-compose.yml:
# volumes:
# - ./grafana/provisioning:/etc/grafana/provisioning:ro
# Apply changes (Grafana watches the directory):
docker compose restart grafana
# Or: Grafana auto-reloads provisioned dashboards every 10s
Tips, Gotchas, and Troubleshooting
HA Instances Not Synchronizing Alert State
# Symptom: Alert fires on node1, resolved on node2, but node1 still shows firing
# Cause: Alerting HA peer discovery not working
# Check HA peer connectivity — 9094 is the gossip port (plain TCP, not HTTP):
docker exec grafana_node curl -v --connect-timeout 3 telnet://grafana-node2.internal:9094 2>&1 | grep -i 'connected\|refused'
# "Connected to" means the peer is reachable; "Connection refused" means it is not
# Verify the HA advertise addresses are correct:
docker exec grafana_node env | grep -i 'HA\|ALERTING'
# GF_UNIFIED_ALERTING_HA_ADVERTISE_ADDRESS should be THIS node's IP, not 0.0.0.0
# Check HA status in Grafana logs:
docker logs grafana_node | grep -iE '(ha|gossip|peer|cluster)'
# Look for log lines about gossip peers and the node joining the cluster
# Test by checking both nodes show the same alert state:
curl -s http://grafana-node1.internal:3000/api/alertmanager/grafana/api/v2/alerts | \
jq 'length'
curl -s http://grafana-node2.internal:3000/api/alertmanager/grafana/api/v2/alerts | \
jq 'length'
# Should be identical
# If firewall is blocking port 9094 (alerting HA port):
sudo ufw allow 9094/tcp
# From the monitoring network to both Grafana nodes
SLO Calculations Return NaN or Unexpected Values
# NaN in SLO queries usually means the denominator is 0 (no requests in window)
# Debug by querying the raw metric:
curl -s 'http://localhost:9090/api/v1/query?query=sum(rate(http_requests_total{service="payment-api"}[5m]))' | jq .
# If the result is empty: service label doesn't match your actual metric labels
# Check actual label names:
curl -s 'http://localhost:9090/api/v1/label/service/values' | jq .data
# If service label doesn't exist, check what labels your metrics have:
curl -s 'http://localhost:9090/api/v1/query?query=http_requests_total[5m]' | \
jq '.data.result[0].metric'
# Shows all labels on the metric — use the correct label name in your SLO query
# Prevent division by zero with "or vector(...)" — compute the error ratio and
# default to 0 errors over 1 request, so a no-traffic window reads as 100% available:
1 - (
(
sum(rate(http_requests_total{service="payment-api",status=~"5.."}[5m]))
or vector(0) # No 5xx series seen → zero errors
)
/
(
sum(rate(http_requests_total{service="payment-api"}[5m]))
or vector(1) # No traffic at all → denominator of 1
)
)
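To make the fallback semantics concrete, here is the same logic modeled in Python, with None standing in for an empty query result. This mirrors the error-ratio form, where a no-traffic window counts as fully available:

```python
def availability(error_rate, total_rate):
    """1 - (errors or 0) / (total or 1): models the PromQL or-vector fallback."""
    errors = error_rate if error_rate is not None else 0.0  # no 5xx series seen
    total = total_rate if total_rate is not None else 1.0   # no traffic at all
    return 1 - errors / total

print(availability(None, None))   # 1.0 — no traffic counts as fully available
print(availability(0.5, 100.0))   # ≈ 0.995
print(availability(None, 100.0))  # 1.0 — traffic but zero errors
```

The key edge case is the first one: with no traffic at all, the ratio evaluates to 0/1 instead of 0/0, so the SLI reads 100% rather than NaN.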
# If recording rules aren't updating:
curl -s http://localhost:9090/api/v1/rules | \
jq '.data.groups[] | select(.name=="slo_computations") | .evaluationTime'
# Should be < 30s (our evaluation interval)
OnCall Not Sending Notifications
# Check OnCall engine logs:
docker logs oncall_engine --tail 50 | grep -iE '(error|notification|telegram|slack)'
# Check Celery worker is processing tasks:
docker logs oncall_celery --tail 20 | grep -iE '(task|error|received)'
# Verify the integration is connected to Grafana:
curl -s ${ONCALL_URL}/api/v1/integrations/ \
-H "Authorization: ${ONCALL_TOKEN}" | jq '[.[] | {name: .name, type: .type}]'
# Test notification delivery:
curl -X POST ${ONCALL_URL}/api/v1/send_demo_alert/ \
-H "Authorization: ${ONCALL_TOKEN}" \
-d '{"integration_id": "YOUR_INTEGRATION_ID"}'
# Should trigger a test notification through your notification channels
# Check user notification preferences are set:
curl -s ${ONCALL_URL}/api/v1/users/me/ \
-H "Authorization: ${ONCALL_TOKEN}" | \
jq '.notification_rules'
# Each user needs at least one notification rule (Telegram, email, etc.)
# If Telegram bot isn't responding:
# Verify the bot token:
curl -s https://api.telegram.org/bot${TELEGRAM_TOKEN}/getMe | jq .ok
# Should return: true
Pro Tips
- Start with 99.0% SLO targets and tighten from actual data — beginning with a 99.9% SLO target on a service that's historically 98.5% available creates immediate violations. Start with achievable targets, measure actual performance for 90 days, then set targets that represent meaningful improvement rather than aspirational fiction.
- Use Grafana annotations for deployments and incidents — annotate every production deployment on your dashboards. When you see a spike in error rate, you can immediately tell whether it correlates with a deploy. Set this up via CI/CD: after each successful deploy, run curl -X POST http://grafana:3000/api/annotations -H 'Content-Type: application/json' -d '{"text": "Deploy: v1.2.3", "tags": ["deploy"]}'
- Build your SLO dashboard before your first major incident, not after — SLO tracking takes time to calibrate. The first time you need to report on reliability to leadership shouldn't be the first time you're building the calculation. Get it running in the background for at least a full month before it matters.
- Connect OnCall's schedule calendar to your team calendar — OnCall can export ICS calendar files. Have every on-call engineer subscribe to the OnCall calendar in Google Calendar or Outlook so they see their on-call weeks alongside their regular work schedule, not just in a monitoring tool they rarely open.
- In HA mode, test dashboard persistence by stopping one node while editing — create a new dashboard, save it, then stop that Grafana node. On the other node, verify the dashboard exists. If it doesn't, your PostgreSQL connection isn't working correctly — the dashboard was only saved to local state.
Wrapping Up
The four Grafana guides now cover the complete observability platform: infrastructure metrics and dashboards, the full LGTM stack with logs and traces, external availability monitoring integration, and this guide's HA clustering, business metrics, SLO tracking, and on-call management.
SLO tracking is the capability that most transforms how engineering teams talk about reliability. When you can show leadership a dashboard with error budget remaining, burn rate, and month-over-month SLO performance — all derived from production metrics — reliability stops being a vague feeling and becomes a measurable property of your system that you can commit to, track, and improve deliberately.
Need a Complete Observability Platform Built for Your Engineering Organization?
Designing HA Grafana with proper SLO tracking for your specific services, business metrics dashboards that connect technical metrics to the outcomes your leadership cares about, and self-hosted on-call management that integrates with your existing alerting — the sysbrix team builds observability platforms for engineering organizations that need to make reliability commitments they can actually keep.
Talk to Us →