Self-Host Grafana: Log Aggregation with Loki, Distributed Tracing with Tempo, and the Complete LGTM Stack
The first Grafana guide covered the essentials: Prometheus, Node Exporter, cAdvisor, and building dashboards with PromQL. This guide completes the observability picture with the full LGTM stack — Loki for log aggregation so you can search logs across every container without SSH, Tempo for distributed tracing that shows exactly where a slow API request spends its time, Mimir for long-term metrics storage beyond Prometheus's default retention, and unified alerting that correlates metrics, logs, and traces to surface root causes instead of just symptoms. If you want to understand why something broke, not just that it broke, this is what you need.
Prerequisites
- A running Grafana + Prometheus stack — see our Grafana getting started guide
- Docker and Docker Compose v2 on your monitoring server
- At least 4GB RAM — the full LGTM stack is significantly heavier than Prometheus alone
- At least 50GB free disk — logs and traces accumulate quickly
- Applications instrumented with OpenTelemetry (for tracing — covered in this guide)
- The Grafana dashboard running and accessible via HTTPS
Verify your existing stack is healthy before adding new components:
cd ~/monitoring
docker compose ps
# Verify Grafana is running and has data:
curl -s http://admin:password@localhost:3000/api/health | jq .database
# Should return: "ok"
# Verify Prometheus is scraping successfully:
curl -s 'http://localhost:9090/api/v1/query?query=up' | \
jq '.data.result | length'
# Should return your number of targets
# Check available disk space:
df -h /
# Need at least 20GB free before adding Loki
Loki: Centralized Log Aggregation
Loki is Grafana's log aggregation system. It works like Prometheus but for logs: instead of scraping metrics endpoints, log shippers (Promtail or Alloy) tail log files and container outputs, then push to Loki. Logs are stored with label-based indexing and queried with LogQL. The critical difference from Elasticsearch: Loki doesn't index log content — only the labels. This makes it dramatically cheaper to run at scale.
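That cost model can be made concrete with a toy sketch: selecting streams by label is a cheap index lookup, while matching content is a linear scan, but only over the streams the labels already selected. Everything here (the `streams` dict, the `query` function) is illustrative, not Loki's actual internals:

```python
# Toy model of Loki's query path: labels are indexed, content is not.
# Illustrative only; not Loki's real data structures.
streams = {
    ("app=nginx",): ["GET /health 200", "GET /login 500"],
    ("app=api",): ["request ok", "error: db timeout"],
}

def query(label_selector, line_filter):
    # Cheap part: pick streams by exact label match (index lookup).
    selected = [lines for labels, lines in streams.items()
                if label_selector in labels]
    # Expensive part: scan line content, but only inside selected streams.
    return [l for lines in selected for l in lines if line_filter in l]

print(query("app=api", "error"))  # only the api stream is scanned
```

This is why high-cardinality labels (user IDs, request IDs) are an anti-pattern in Loki: each unique label set creates a new stream, bloating the index that is supposed to stay small.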
Adding Loki and Promtail to Your Stack
# Add Loki and Promtail to your existing docker-compose.yml:
  loki:
    image: grafana/loki:latest
    container_name: loki
    restart: unless-stopped
    ports:
      - "3100:3100"
    volumes:
      - loki_data:/loki
      - ./loki/loki-config.yml:/etc/loki/local-config.yaml:ro
    command: -config.file=/etc/loki/local-config.yaml
    networks:
      - monitoring
    healthcheck:
      test: ["CMD-SHELL", "wget --quiet --tries=1 --spider http://localhost:3100/ready || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 5

  promtail:
    image: grafana/promtail:latest
    container_name: promtail
    restart: unless-stopped
    volumes:
      # Tail Docker container logs:
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /run/docker.sock:/run/docker.sock:ro
      - ./loki/promtail-config.yml:/etc/promtail/config.yml:ro
    command: -config.file=/etc/promtail/config.yml
    networks:
      - monitoring
    user: root  # Required to read Docker socket and container logs

volumes:
  loki_data:
    driver: local
Loki and Promtail Configuration
mkdir -p loki
# loki/loki-config.yml
cat > loki/loki-config.yml << 'EOF'
auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096

common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

ruler:
  alertmanager_url: http://alertmanager:9093

limits_config:
  # Retain logs for 30 days:
  retention_period: 720h
  # Reject individual log lines larger than 1MB:
  max_line_size: 1MB
  # Per-stream ingestion rate limits:
  ingestion_rate_mb: 16
  ingestion_burst_size_mb: 32

compactor:
  working_directory: /loki/compactor
  retention_enabled: true
  # Required when retention_enabled is true (Loki 2.9+):
  delete_request_store: filesystem
EOF
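Before committing to those limits, it helps to estimate disk usage. A rough sizing sketch, where the ingest rate and compression ratio are assumptions to replace with your own measurements:

```python
# Back-of-envelope disk estimate for the retention settings above.
# Ingest rate and compression ratio are assumptions; measure your own.
retention_days = 30            # retention_period: 720h
avg_ingest_mb_per_s = 0.1      # assumed steady rate, far below the 16 MB/s cap
compression_ratio = 10         # assumed; Loki chunks compress text logs well

raw_gb = avg_ingest_mb_per_s * 86_400 * retention_days / 1024
on_disk_gb = raw_gb / compression_ratio
print(f"~{raw_gb:.0f} GB raw over {retention_days} days, ~{on_disk_gb:.0f} GB on disk")
```

Even a modest 0.1 MB/s of sustained logging works out to roughly 25 GB on disk over 30 days under these assumptions, which is why the prerequisites call for 50 GB free.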
# loki/promtail-config.yml
cat > loki/promtail-config.yml << 'EOF'
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  # Scrape all Docker container logs automatically:
  - job_name: docker-containers
    docker_sd_configs:
      - host: unix:///run/docker.sock
        refresh_interval: 5s
    relabel_configs:
      # Use the container name as the app label:
      - source_labels: [__meta_docker_container_name]
        regex: /(.*)
        target_label: app
      # Preserve the container ID:
      - source_labels: [__meta_docker_container_id]
        target_label: container_id
      # Add environment label from container label:
      - source_labels: [__meta_docker_container_label_env]
        target_label: env
    pipeline_stages:
      # Parse JSON logs (many apps log structured JSON):
      - json:
          expressions:
            level: level
            message: message
            timestamp: time
      # Extract log level for filtering:
      - labels:
          level:
      # Drop debug logs to reduce storage cost:
      - drop:
          source: level
          value: debug

  # Also tail system logs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: syslog
          __path__: /var/log/syslog
    pipeline_stages:
      - regex:
          expression: '(?P<timestamp>\S+\s+\S+\s+\S+) (?P<host>\S+) (?P<service>\S+): (?P<message>.*)'
      - labels:
          service:
EOF
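The pipeline stages in that config can be approximated in a few lines: parse JSON, promote `level` to a label, drop debug entries. A behavioral sketch, not Promtail's code:

```python
import json

def pipeline(line):
    """Mimics the json -> labels -> drop stages from the config above."""
    try:
        fields = json.loads(line)
    except json.JSONDecodeError:
        return {"labels": {}, "line": line}  # non-JSON lines pass through unlabeled
    if fields.get("level") == "debug":
        return None                          # drop stage: entry is discarded
    return {"labels": {"level": fields.get("level")}, "line": line}

print(pipeline('{"level":"debug","message":"noise"}'))   # dropped entirely
print(pipeline('{"level":"error","message":"db down"}')["labels"])
```

Note the ordering matters: the `drop` stage can only act on `level` because the `json` stage extracted it first.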
docker compose up -d loki promtail
docker compose logs -f loki promtail | head -30
Querying Logs with LogQL
# Add Loki as a data source in Grafana:
# Connections → Data Sources → Add data source → Loki
# URL: http://loki:3100
# Save & test
# Essential LogQL queries for the Explore view:
# All logs from a specific container in the last hour:
{app="nginx"}
# Error logs across ALL containers:
{app=~".+"} |= "error" | json | level="error"
# Logs from your API with parsing:
{app="api"} | json | status >= 500
# Per-second error rate per app, over 1-minute windows:
sum(rate({app=~".+"} |= "error" [1m])) by (app)
# Find logs around a specific timestamp (useful with traces):
{app="payment-api"}
| json
| line_format "{{.timestamp}} {{.level}} {{.message}}"
# Find all logs for a specific request trace ID:
{app=~".+"} |= "trace_id=abc123"
# Log volume rate per service (for volume dashboard):
sum by (app) (rate({app=~".+"}[5m]))
# Verify Loki is receiving logs:
curl -s 'http://localhost:3100/loki/api/v1/labels' | jq .data
# Should list: app, container_id, env, level, etc.
Tempo: Distributed Tracing
Logs tell you what happened. Metrics tell you how often. Traces tell you where time was spent. A trace follows a single request through your entire system — API gateway, backend service, database query, cache lookup — showing exactly which component added latency and how services depend on each other.
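The idea becomes concrete with a toy trace: spans with start/end times, where a span's "self time" is its own duration minus the time spent in its children. This is an illustrative model, not the OpenTelemetry SDK; the span names and timings are invented:

```python
# Toy trace: where did a 480 ms request actually spend its time?
spans = [
    {"name": "GET /api/orders", "parent": None,              "start": 0,  "end": 480},
    {"name": "auth-check",      "parent": "GET /api/orders", "start": 10, "end": 40},
    {"name": "db-query",        "parent": "GET /api/orders", "start": 50, "end": 450},
]

def self_time(span):
    # Duration of this span minus the time covered by its direct children.
    children = [s for s in spans if s["parent"] == span["name"]]
    child_time = sum(c["end"] - c["start"] for c in children)
    return (span["end"] - span["start"]) - child_time

for s in spans:
    print(f'{s["name"]}: {self_time(s)} ms self time')
# db-query dominates, so that is where to look first
```

This self-time breakdown is exactly what Tempo's trace view renders as a flame-graph-style timeline.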
Adding Tempo to Your Stack
# Add Tempo to docker-compose.yml:
  tempo:
    image: grafana/tempo:latest
    container_name: tempo
    restart: unless-stopped
    ports:
      - "3200:3200"   # Tempo HTTP API
      - "4317:4317"   # OTLP gRPC (OpenTelemetry)
      - "4318:4318"   # OTLP HTTP (OpenTelemetry)
      - "9411:9411"   # Zipkin compatibility
      - "14268:14268" # Jaeger HTTP thrift
    volumes:
      - tempo_data:/var/tempo
      - ./tempo/tempo-config.yml:/etc/tempo/tempo.yaml:ro
    command: -config.file=/etc/tempo/tempo.yaml
    networks:
      - monitoring

volumes:
  tempo_data:
    driver: local
---
# tempo/tempo-config.yml
mkdir -p tempo
cat > tempo/tempo-config.yml << 'EOF'
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318
    jaeger:
      protocols:
        thrift_http:
          endpoint: 0.0.0.0:14268
    zipkin:
      endpoint: 0.0.0.0:9411

ingester:
  max_block_duration: 5m

compactor:
  compaction:
    block_retention: 336h  # 14 days of traces

storage:
  trace:
    backend: local
    local:
      path: /var/tempo/traces
    wal:
      path: /var/tempo/wal

metrics_generator:
  # Generate RED metrics (Rate, Errors, Duration) from traces:
  registry:
    external_labels:
      source: tempo
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      # Prometheus must run with --web.enable-remote-write-receiver
      # for this endpoint to accept writes:
      - url: http://prometheus:9090/api/v1/write
        send_exemplars: true

overrides:
  defaults:
    metrics_generator:
      processors: [service-graphs, span-metrics]
      generate_native_histograms: both
EOF
docker compose up -d tempo
docker compose logs tempo --tail 20
# Verify Tempo is accepting traces:
curl -s http://localhost:3200/ready
# Should return: ready
Instrumenting Applications with OpenTelemetry
# Instrument a Node.js application with OpenTelemetry
# This auto-instruments HTTP, Express, database drivers, etc.
npm install @opentelemetry/sdk-node \
@opentelemetry/auto-instrumentations-node \
@opentelemetry/exporter-trace-otlp-http
# Create tracing.js — load BEFORE your application code:
cat > tracing.js << 'EOF'
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { Resource } = require('@opentelemetry/resources');
const { SEMRESATTRS_SERVICE_NAME, SEMRESATTRS_SERVICE_VERSION } = require('@opentelemetry/semantic-conventions');

const sdk = new NodeSDK({
  resource: new Resource({
    [SEMRESATTRS_SERVICE_NAME]: process.env.SERVICE_NAME || 'my-api',
    [SEMRESATTRS_SERVICE_VERSION]: process.env.SERVICE_VERSION || '1.0.0',
  }),
  traceExporter: new OTLPTraceExporter({
    // Send traces to Tempo:
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://tempo:4318/v1/traces',
  }),
  instrumentations: [getNodeAutoInstrumentations({
    '@opentelemetry/instrumentation-http': { enabled: true },
    '@opentelemetry/instrumentation-express': { enabled: true },
    '@opentelemetry/instrumentation-pg': { enabled: true },
    '@opentelemetry/instrumentation-redis': { enabled: true },
  })],
});

sdk.start();
process.on('SIGTERM', () => sdk.shutdown());
EOF
# Start your app with tracing:
node -r ./tracing.js server.js
# Or in package.json:
# "start": "node -r ./tracing.js server.js"
# Add to your Dockerfile:
# ENV NODE_OPTIONS="--require ./tracing.js"
# For Python applications:
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install
# The otlp exporter defaults to gRPC, so point it at Tempo's 4317 port:
opentelemetry-instrument \
  --service_name my-python-api \
  --traces_exporter otlp \
  --exporter_otlp_endpoint http://tempo:4317 \
  python app.py
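Under the hood, context propagation between services rides on the W3C `traceparent` HTTP header, which both SDKs above inject and extract automatically. Its format is fixed by the Trace Context spec (`version-traceid-spanid-flags`), so it is easy to inspect by hand; the parser below is a quick sketch, not a spec-complete implementation:

```python
# Parse a W3C Trace Context "traceparent" header:
#   version (2 hex) - trace-id (32 hex) - parent-span-id (16 hex) - flags (2 hex)
def parse_traceparent(header):
    version, trace_id, span_id, flags = header.split("-")
    assert len(trace_id) == 32 and len(span_id) == 16
    return {"trace_id": trace_id, "span_id": span_id,
            "sampled": flags == "01"}  # bit 0 of flags = sampled

hdr = "00-aaaaaaaabbbbbbbbccccccccdddddddd-eeeeeeeeffffffff-01"
print(parse_traceparent(hdr)["trace_id"])
```

If traces from two services are not joining into one trace in Tempo, dumping this header at each hop is usually the fastest way to find where propagation breaks.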
Connecting Traces to Logs (TraceID Correlation)
# Add trace context to your application logs so you can
# jump from a trace span directly to the logs for that request
# Node.js: inject trace ID into log output:
const { trace } = require('@opentelemetry/api');

// Custom logger that includes trace context:
function log(level, message, extra = {}) {
  const span = trace.getActiveSpan();
  const traceContext = span ? {
    trace_id: span.spanContext().traceId,
    span_id: span.spanContext().spanId,
  } : {};
  console.log(JSON.stringify({
    level,
    message,
    timestamp: new Date().toISOString(),
    service: process.env.SERVICE_NAME,
    ...traceContext,
    ...extra
  }));
}

// Usage:
app.get('/api/orders/:id', async (req, res) => {
  log('info', 'Processing order request', { order_id: req.params.id });
  // The log now contains a trace_id that matches the Tempo trace.
  // In Grafana, you can click a trace span and jump directly
  // to the correlated logs in Loki.
});
# Configure Grafana to link Loki logs to Tempo traces:
# Loki Data Source → Derived Fields:
#   Name: TraceID
#   Regex: "trace_id":"(\w+)"   (match how the ID appears in your logs)
#   Query: ${__value.raw}
#   Internal link: enabled, pointing at the Tempo data source
# Now in the Explore view:
# When you see a log line with a trace_id,
# a "Tempo" link appears — click it to jump to the full trace
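One subtlety: the derived-field regex must match how the trace ID actually appears in your log lines. A `key=value` pattern will not match the JSON logger above, which emits `"trace_id":"…"`. A quick check (patterns and sample line are illustrative):

```python
import re

# Derived-field regexes for the two common log shapes:
kv_pattern   = r"trace_id=(\w+)"       # key=value logs: trace_id=abc123
json_pattern = r'"trace_id":"(\w+)"'   # JSON logs from the logger above

json_line = '{"level":"info","message":"ok","trace_id":"abc123"}'
assert re.search(kv_pattern, json_line) is None      # kv regex misses JSON logs
print(re.search(json_pattern, json_line).group(1))   # extracts abc123
```

If the Tempo link never appears in Explore, testing your regex against a real log line like this is the first thing to try.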
Mimir: Long-Term Metrics Storage
Prometheus's default storage is excellent for recent data but limited in retention — beyond 15-30 days, storage costs escalate and query performance degrades. Grafana Mimir is a horizontally scalable, long-term storage backend that Prometheus remote-writes to, enabling years of metrics retention with fast queries.
Adding Mimir for Extended Retention
# Add Mimir to docker-compose.yml (single-binary mode for self-hosted):
  mimir:
    image: grafana/mimir:latest
    container_name: mimir
    restart: unless-stopped
    ports:
      - "9009:9009"  # Mimir HTTP API
    volumes:
      - mimir_data:/data
      - ./mimir/mimir-config.yml:/etc/mimir/mimir.yaml:ro
    command: --config.file=/etc/mimir/mimir.yaml
    networks:
      - monitoring

volumes:
  mimir_data:
    driver: local
---
# mimir/mimir-config.yml
mkdir -p mimir
cat > mimir/mimir-config.yml << 'EOF'
# Single-binary mode — simple, no clustering
target: all,alertmanager
# Single-tenant setup: no X-Scope-OrgID header needed on requests
multitenancy_enabled: false

server:
  http_listen_port: 9009
  grpc_listen_port: 9095
  log_level: info

limits:
  # 365 days of metrics:
  compactor_blocks_retention_period: 8760h

blocks_storage:
  backend: filesystem
  filesystem:
    dir: /data/blocks
  tsdb:
    dir: /data/tsdb

compactor:
  data_dir: /data/compactor

distributor:
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: memberlist

ingester:
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: memberlist
    replication_factor: 1

ruler_storage:
  backend: filesystem
  filesystem:
    dir: /data/rules

alertmanager_storage:
  backend: filesystem
  filesystem:
    dir: /data/alertmanager

memberlist:
  bind_port: 7946
  join_members: []
EOF
# Configure Prometheus to remote_write to Mimir:
# Add to prometheus/prometheus.yml:
cat >> prometheus/prometheus.yml << 'EOF'
remote_write:
  - url: http://mimir:9009/api/v1/push
    send_exemplars: true  # Send exemplars (links from metrics to traces)
EOF
docker compose up -d mimir
docker compose restart prometheus # Pick up new remote_write config
# Verify Mimir is receiving metrics:
curl -s http://localhost:9009/ready
# Should return: ready
# Add Mimir as a Prometheus-compatible data source in Grafana:
# Data Sources → Add → Prometheus
# URL: http://mimir:9009/prometheus
# Name: Mimir (Long-term)
# Now you can query metrics going back 12 months in Grafana
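Before committing to 12 months of retention, estimate the disk cost. A common rule of thumb is one to two bytes per sample after compression; the series count and bytes-per-sample below are assumptions, so substitute your own `prometheus_tsdb_head_series` value:

```python
# Rough Mimir disk estimate for 365-day retention.
active_series = 100_000    # assumed; check prometheus_tsdb_head_series on your server
scrape_interval_s = 15
bytes_per_sample = 1.5     # assumed compressed size (common rule of thumb)

samples_per_year = active_series * (365 * 86_400 / scrape_interval_s)
gb = samples_per_year * bytes_per_sample / 1024**3
print(f"~{gb:.0f} GB for one year of metrics")
```

Under these assumptions, 100k active series at a 15s scrape interval lands near 300 GB per year, which is the kind of number that justifies moving long-term storage off the Prometheus host.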
Unified Alerting: Correlating Metrics, Logs, and Traces
Individual alerts on individual signals are noisy and incomplete. An alert that fires for a high error rate doesn't tell you which service, which endpoint, or why. Grafana's unified alerting system lets you write alert rules that query any data source — Prometheus, Loki, Tempo — and combine them into actionable notifications that include context from all three signals.
Multi-Signal Alert Rules
# Grafana Unified Alerting rules via API
# (Configure in Grafana UI: Alerting → Alert Rules → New rule)
# Rule 1: High error rate with log context
# This alert fires when error rate > 5% AND includes a link to relevant logs
# Alert rule YAML (Grafana file-provisioning format):
cat > alert-rules.yaml << 'EOF'
apiVersion: 1
groups:
  - name: Application Alerts
    folder: Production
    interval: 1m
    rules:
      # High API error rate:
      - uid: api-error-rate
        title: High API Error Rate
        condition: C
        data:
          - refId: A
            queryType: range
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: prometheus-uid
            model:
              expr: >
                sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
                /
                sum(rate(http_requests_total[5m])) by (service)
                * 100
          # Reduce the range query to a single number:
          - refId: B
            datasourceUid: __expr__
            model:
              type: reduce
              reducer: max
              expression: A
          # Fire when that number crosses the threshold:
          - refId: C
            datasourceUid: __expr__
            model:
              type: threshold
              expression: B
              conditions:
                - evaluator:
                    type: gt
                    params: [5]  # Alert if > 5% error rate
        noDataState: NoData
        execErrState: Error
        for: 5m
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate is {{ $values.B }}% (threshold: 5%)"
          # Link to the runbook for this service:
          runbook: "https://wiki.yourdomain.com/runbooks/{{ $labels.service }}"
        labels:
          severity: warning
          team: backend

      # Database slow queries (Loki-based alert):
      - uid: db-slow-queries
        title: Database Slow Queries Detected
        condition: C
        data:
          - refId: A
            queryType: range
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: loki-uid
            model:
              expr: >-
                sum(rate({app="postgresql"} |= "duration" | regexp `duration=(?P<duration>\d+)ms` | duration > 1000 [5m]))
          - refId: B
            datasourceUid: __expr__
            model:
              type: reduce
              reducer: max
              expression: A
          - refId: C
            datasourceUid: __expr__
            model:
              type: threshold
              expression: B
              conditions:
                - evaluator:
                    type: gt
                    params: [0]
        for: 5m
        annotations:
          summary: "PostgreSQL slow queries detected"
          description: "Multiple queries taking >1000ms in the last 5 minutes"
        labels:
          severity: warning
EOF
# Apply by mounting the file into Grafana's alerting provisioning
# directory (add to the grafana service volumes in docker-compose.yml):
#   - ./alert-rules.yaml:/etc/grafana/provisioning/alerting/alert-rules.yaml:ro
docker compose restart grafana
# The rules appear under Alerting → Alert rules in the Production folder
Integrating with the Uptime Kuma Observability Stack
For teams running Uptime Kuma alongside Grafana, the combined setup creates a complete observability platform. Uptime Kuma handles external availability checks while Grafana covers internal metrics, logs, and traces. For the complete integration pattern, see our guide on connecting Uptime Kuma to Grafana with Alertmanager routing.
# Add Uptime Kuma metrics to your Prometheus scrape config:
# prometheus/prometheus.yml — add to scrape_configs:
  - job_name: 'uptime-kuma'
    scrape_interval: 30s
    static_configs:
      - targets: ['uptime-kuma:3001']
    metrics_path: /metrics
# Useful correlation dashboard queries:
# When Uptime Kuma shows a service is down AND Grafana shows the error:
# 1. Which service is down right now? (from Uptime Kuma):
monitor_status == 0
# 2. What's the error rate for the same service? (from Prometheus):
rate(http_requests_total{service="$service",status=~"5.."}[5m])
# 3. What do the logs show during the outage? (from Loki):
{app="$service"} | json | level="error"
# Build a unified dashboard that overlays all three:
# - Uptime Kuma availability as status annotations on the timeline
# - Prometheus error rate as the main graph
# - Loki log volume as a bar chart at the bottom
# - When you see a dip in availability, click to expand logs below
Tips, Gotchas, and Troubleshooting
Loki Not Ingesting Docker Logs
# Check Promtail is running and can reach Loki:
docker logs promtail --tail 30 | grep -iE '(error|warn|send|failed)'
# Verify Promtail can read the Docker socket:
docker exec promtail ls -la /run/docker.sock
# Should show the socket file — if permission denied, add user: root to Promtail service
# Check what Promtail is discovering: its targets page runs on
# port 9080 (publish "9080:9080" on the promtail service, then
# open http://localhost:9080/targets in a browser to see any
# targets that are failing)
# Verify logs are reaching Loki:
curl -s 'http://localhost:3100/loki/api/v1/labels' | jq .data
# Should show your labels: app, container_id, env, level, etc.
# If empty, Promtail isn't pushing logs
# Check Loki stream creation (the counter should grow as containers log):
curl -s 'http://localhost:3100/metrics' | grep loki_ingester_streams_created_total
# Common fix: Promtail needs to run as root to read Docker socket:
# In docker-compose.yml under promtail:
# user: root
# security_opt:
# - no-new-privileges:true
# Test log ingestion manually:
curl -X POST http://localhost:3100/loki/api/v1/push \
-H 'Content-Type: application/json' \
-d '{"streams":[{"stream":{"app":"test"},"values":[["'$(date +%s)'000000000","test log entry"]]}]}'
# Then query: {app="test"} in Grafana Explore
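The push payload's timestamp is nanoseconds since the epoch as a string, which is easy to get wrong by hand. A small helper that builds the same body as the curl command above (`loki_payload` is an illustrative helper name, not part of any Loki client library):

```python
import json, time

# Build a Loki push payload; timestamps are nanoseconds since epoch, as strings.
def loki_payload(app, line, ts_s=None):
    if ts_s is None:
        ts_s = int(time.time())
    ts_ns = str(ts_s * 1_000_000_000)  # integer math avoids float rounding
    return json.dumps({"streams": [{"stream": {"app": app},
                                    "values": [[ts_ns, line]]}]})

print(loki_payload("test", "test log entry"))
# POST the printed body to http://localhost:3100/loki/api/v1/push
```

If Loki rejects the push with "entry too far behind" or silently drops it, a stale or mis-scaled timestamp (seconds or milliseconds instead of nanoseconds) is the usual culprit.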
Tempo Not Receiving Traces
# Check Tempo is running:
curl -s http://localhost:3200/ready
# Should return: ready
# Check if traces are being received:
curl -s http://localhost:3200/metrics | grep tempo_distributor_spans_received_total
# Should increment after sending test traces
# Send a test trace directly to Tempo (timestamps must be recent,
# in nanoseconds, or the trace won't appear in time-range searches):
NOW=$(date +%s)
curl -X POST http://localhost:4318/v1/traces \
  -H 'Content-Type: application/json' \
  -d '{
    "resourceSpans": [{
      "resource": {"attributes": [{"key": "service.name", "value": {"stringValue": "test-service"}}]},
      "scopeSpans": [{
        "spans": [{
          "traceId": "aaaaaaaabbbbbbbbccccccccdddddddd",
          "spanId": "eeeeeeeeffffffff",
          "name": "test-span",
          "kind": 1,
          "startTimeUnixNano": "'${NOW}'000000000",
          "endTimeUnixNano": "'${NOW}'500000000"
        }]
      }]
    }]
  }'
# Then search for it in Grafana → Explore → Tempo:
# Service Name: test-service
# If app can't reach Tempo:
# The OTLP endpoint must be reachable from the app container
# For apps on the monitoring network:
OTEL_EXPORTER_OTLP_ENDPOINT=http://tempo:4318
# For apps on a different Docker network:
# Connect the app's network to the monitoring network:
docker network connect monitoring your-app-container
Mimir Remote Write Failing
# Check that remote_write made it into the running Prometheus config:
curl -s http://localhost:9090/api/v1/status/config | jq -r '.data.yaml' | grep -A1 remote_write
# Check remote_write errors in Prometheus logs:
docker logs prometheus --tail 30 | grep -iE '(remote_write|failed|error)'
# Monitor remote_write queue health:
curl -s 'http://localhost:9090/metrics' | grep 'prometheus_remote_storage'
# Key metrics:
# prometheus_remote_storage_samples_pending — samples waiting to be sent
# prometheus_remote_storage_failed_samples_total — samples that failed
# prometheus_remote_storage_sent_bytes_total — bytes successfully sent
# If remote_write is lagging (pending > 10000):
# Increase remote_write queue capacity in prometheus.yml:
# remote_write:
# - url: http://mimir:9009/api/v1/push
# queue_config:
# max_samples_per_send: 10000
# max_shards: 10
# capacity: 50000
# Verify Mimir is accepting the writes (its Prometheus-compatible
# API lives under the /prometheus prefix):
curl -s 'http://localhost:9009/prometheus/api/v1/query?query=up' | jq '.status'
# Should return: "success" once Prometheus data is in Mimir
Pro Tips
- Use Grafana Alloy instead of running Promtail and other shippers separately — Alloy is the unified collector that replaces Promtail, Grafana Agent, and various other individual shippers. One process collects metrics, logs, and traces with a single configuration file. For new deployments it's the cleaner choice; for existing setups it's worth migrating when you have time.
- Drop debug and trace logs at the Promtail level, not at Loki — debug logs from verbose applications can flood Loki and spike costs. Add a drop stage in Promtail's pipeline to discard debug/trace-level logs before they reach Loki. This is dramatically cheaper than ingesting them and then never querying them.
- Set Loki stream limits per application to prevent one noisy service from overwhelming others — a misconfigured application that logs every request body can exhaust your Loki ingestion budget. Set per_stream_rate_limit: 5MB and per_stream_rate_limit_burst: 20MB in Loki's limits_config to cap any single stream's ingestion rate.
- Use Tempo's service graph to find hidden dependencies — the Tempo service graph (Explore → Tempo → Service Graph) auto-generates a visual map of how your services call each other, derived from traces. This is often the first time teams discover an unexpected service dependency that's creating latency.
- Run your monitoring stack on a separate server from what it monitors — if your monitoring stack shares a server with your production applications, a resource spike on the production apps degrades your ability to monitor the crisis. A dedicated $20/month monitoring server prevents this coupling.
Wrapping Up
The complete LGTM stack — Loki for logs, Grafana for visualization, Tempo for traces, Mimir for long-term metrics — gives you genuine observability rather than just monitoring. Monitoring tells you something is wrong. Observability lets you ask arbitrary questions about your system's behavior and get answers from the data. That distinction is what separates teams that debug production issues in minutes from teams that spend hours.
Start by adding Loki and Promtail — centralized logs are immediately valuable to every developer on your team. Add Tempo when you have multi-service applications where latency is hard to attribute. Add Mimir when your Prometheus retention limit starts causing problems with historical analysis or on-call investigations.
Together with the foundational Grafana guide covering Prometheus, Node Exporter, and dashboard building, these two guides give you a complete, self-hosted observability platform that costs a fraction of commercial alternatives and keeps all your operational data on infrastructure you control.
Need a Complete Observability Platform Designed for Your Infrastructure?
Designing the LGTM stack for your specific infrastructure takes real design work: proper retention policies, log sampling strategies, application instrumentation, and unified alerting that actually surfaces root causes. The sysbrix team builds observability platforms for engineering teams that need to understand their systems, not just watch green dots turn red.
Talk to Us →