
Production Guide: Deploy Grafana Loki on Kubernetes with Helm, Ingress NGINX, and Cert-Manager

Run a resilient Loki stack with TLS, S3 object storage, secure secrets, and operational guardrails for real production workloads.

Centralized logs are easy to prototype and surprisingly hard to operate in production. Teams often start with node-level files, add ad hoc shipping, and then discover too late that query latency, retention cost, and noisy high-cardinality labels can make incident response slower instead of faster.

This guide shows a production-first Loki deployment on Kubernetes using Helm, Ingress NGINX, and cert-manager. The target environment is a multi-node Ubuntu cluster with object storage for chunks and indexes, persistent volumes for stateful components, and strict handling for credentials. The same pattern works in managed Kubernetes as long as your ingress class and storage classes are adjusted.

The objective is not just to get Loki running. The objective is to get predictable operations: controlled retention, sane resource isolation, safe upgrades, and fast verification steps that help you trust the platform during a real outage.

Architecture and flow overview

In this architecture, logs are scraped by Promtail from Kubernetes pods and sent to Loki through internal services. Loki stores indexes and chunks in S3-compatible object storage to separate compute from durable log data. Ingress NGINX exposes the query endpoint over HTTPS, with cert-manager issuing and rotating certificates automatically.

The deployment keeps read and write paths explicit. Write components ingest and flush chunks; read components serve queries and cache hot results. This separation helps performance tuning because ingestion bursts and query spikes can be scaled independently. It also reduces noisy-neighbor effects when dashboard users run broad range queries during incidents.

Operationally, you should treat label design as a first-class architecture decision. Over-labeling creates cardinality explosions and can multiply storage costs. Under-labeling makes correlation harder. A practical baseline is namespace, app, pod, and environment labels, with high-cardinality values normalized or dropped before ingestion.
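As a sketch of that baseline, a Promtail scrape config can promote only the low-cardinality Kubernetes metadata to labels and drop dynamic ones before they reach Loki. The label names and the static `environment` value below are illustrative assumptions; adapt them to your chart values:

```yaml
# Illustrative Promtail relabeling: keep namespace/app/pod plus a static
# environment label, and drop per-request identifiers before ingestion.
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - target_label: environment
        replacement: prod          # static per-cluster label
    pipeline_stages:
      # Drop high-cardinality labels if an upstream stage promoted them.
      - labeldrop:
          - request_id
```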

Prerequisites

  • Kubernetes cluster (v1.27+) with at least 3 worker nodes and kubectl admin access.
  • Helm 3.13+ installed locally.
  • Ingress NGINX controller installed and set as the active ingress class.
  • cert-manager installed with a ClusterIssuer (for example Let's Encrypt).
  • S3-compatible object storage bucket and IAM/access credentials dedicated to Loki.
  • DNS record for loki.example.com pointed to your ingress load balancer.
  • A dedicated Kubernetes namespace (recommended: observability).

Step-by-step deployment

Step 1: Prepare namespace and baseline policies

Create a dedicated namespace and apply basic quota/limit policies before deploying charts. This prevents surprise resource contention when ingestion spikes.

kubectl create namespace observability
kubectl label namespace observability app.kubernetes.io/part-of=observability

cat <<'EOF' | kubectl apply -n observability -f -
apiVersion: v1
kind: ResourceQuota
metadata:
  name: observability-quota
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
EOF
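The quota above caps the namespace total; a LimitRange adds per-container defaults so that a single unconfigured pod cannot claim the whole budget. The values here are illustrative starting points, not tuned recommendations:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: observability-limits
spec:
  limits:
    - type: Container
      default:             # applied when a container sets no limits
        cpu: "1"
        memory: 1Gi
      defaultRequest:      # applied when a container sets no requests
        cpu: 250m
        memory: 256Mi
```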


Step 2: Add Helm repositories and pin chart versions

Pinning chart versions protects you from unplanned upstream changes. Track upgrades intentionally in change windows.

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

helm search repo grafana/loki -l | head -n 20


Step 3: Create Loki values file with production defaults

Use object storage and explicit read/write/backend replicas. Keep retention and ingestion limits in values so they remain auditable in Git.

loki:
  auth_enabled: false  # single-tenant mode; place authentication in front of the gateway before exposing it
  commonConfig:
    replication_factor: 3
  storage:
    type: s3
    bucketNames:
      chunks: loki-prod-chunks
      ruler: loki-prod-ruler
      admin: loki-prod-admin
    s3:
      endpoint: s3.us-east-1.amazonaws.com
      region: us-east-1
      secretAccessKey: ${LOKI_S3_SECRET_KEY}  # expanded from env at startup; see Step 4
      accessKeyId: ${LOKI_S3_ACCESS_KEY}      # expanded from env at startup; see Step 4
      s3ForcePathStyle: false
  schemaConfig:
    configs:
      - from: "2024-01-01"
        store: tsdb
        object_store: s3
        schema: v13
        index:
          prefix: loki_index_
          period: 24h
  limits_config:
    ingestion_rate_mb: 10
    ingestion_burst_size_mb: 20
    max_label_value_length: 2048
    retention_period: 744h
  compactor:
    retention_enabled: true       # without this, retention_period is not enforced
    delete_request_store: s3

singleBinary:
  replicas: 0

backend:
  replicas: 3
read:
  replicas: 3
write:
  replicas: 3

gateway:
  enabled: true
  ingress:
    enabled: true
    ingressClassName: nginx
    hosts:
      - host: loki.example.com
        paths:
          - path: /
            pathType: Prefix
    tls:
      - secretName: loki-tls
        hosts:
          - loki.example.com
    annotations:
      cert-manager.io/cluster-issuer: letsencrypt-prod

monitoring:
  serviceMonitor:
    enabled: true


Step 4: Handle secrets safely

Never hardcode S3 keys in Helm values. Store them in Kubernetes secrets (or External Secrets) and reference them through environment variables. Rotate keys on a fixed schedule.

kubectl -n observability create secret generic loki-s3 \
  --from-literal=LOKI_S3_ACCESS_KEY='REPLACE_ME' \
  --from-literal=LOKI_S3_SECRET_KEY='REPLACE_ME'

kubectl -n observability get secret loki-s3 -o yaml
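The ${LOKI_S3_*} placeholders in the values file only resolve if the secret is injected as environment variables and Loki runs with env expansion enabled. One way to wire this up is a values fragment along the following lines; the exact key names vary by chart version, so verify them against `helm show values grafana/loki` before applying:

```yaml
# Hedged sketch: inject the loki-s3 secret into each Loki component and
# tell Loki to expand ${VAR} references in its rendered config.
write:
  extraArgs:
    - -config.expand-env=true
  extraEnvFrom:
    - secretRef:
        name: loki-s3
read:
  extraArgs:
    - -config.expand-env=true
  extraEnvFrom:
    - secretRef:
        name: loki-s3
backend:
  extraArgs:
    - -config.expand-env=true
  extraEnvFrom:
    - secretRef:
        name: loki-s3
```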


Step 5: Deploy Loki stack

Install into the observability namespace and wait for workloads to become ready before exposing traffic.

helm upgrade --install loki grafana/loki \
  -n observability \
  -f loki-values.yaml

kubectl -n observability rollout status deploy/loki-gateway --timeout=300s
kubectl -n observability get pods -l app.kubernetes.io/name=loki -o wide


Step 6: Deploy Promtail for Kubernetes log collection

Promtail ships pod logs to Loki and should apply relabeling to avoid cardinality blowups. Start with conservative labels and add more only when justified by search needs.

helm upgrade --install promtail grafana/promtail \
  -n observability \
  --set "config.clients[0].url=http://loki-gateway.observability.svc.cluster.local/loki/api/v1/push"

kubectl -n observability get daemonset promtail


Step 7: Verify TLS, ingestion, and query path

Verification should cover both the control plane and the data plane: confirm certificate issuance, endpoint health, and successful log writes and queries.

kubectl -n observability get certificate,challenge,order
curl -I https://loki.example.com/ready

kubectl -n observability run loggen --rm -it --image=busybox --restart=Never -- \
  sh -c 'for i in $(seq 1 50); do echo "loki smoke test $i"; sleep 1; done'

curl -G -s 'https://loki.example.com/loki/api/v1/query_range' \
  --data-urlencode 'query={namespace="observability"} |= "loki smoke test"' | jq .status
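Beyond the pod-based smoke test, you can exercise the push API directly. This builds a payload in Loki's push format (a nanosecond-precision timestamp string plus the log line inside streams[].values) and shows the curl call, left commented because it assumes DNS and TLS are already live:

```shell
# Construct a Loki push-API payload by hand for a manual ingestion test.
TS="$(date +%s%N)"
PAYLOAD='{"streams":[{"stream":{"job":"smoke"},"values":[["'"$TS"'","manual push test"]]}]}'
echo "$PAYLOAD"
# Send it once the endpoint is reachable (assumes this guide's hostname):
# curl -s -XPOST https://loki.example.com/loki/api/v1/push \
#   -H 'Content-Type: application/json' -d "$PAYLOAD"
```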


Step 8: Operational hardening and backups

Production readiness requires disaster recovery drills and budget controls. Test object-store restore, set alerting for ingestion lag, and cap expensive broad queries.

# Example: retention and compactor checks
kubectl -n observability logs statefulset/loki-backend | grep -i -E 'compactor|retention|error' | tail -n 30

# Capacity snapshot
kubectl -n observability top pod | sort -k3 -h

# Suggested alerts:
# - Loki request error ratio > 2%
# - Ingestion rate saturation > 80%
# - Object storage errors > 0
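The suggested alerts can be expressed as a PrometheusRule if you run the Prometheus Operator. The metric below comes from Loki's standard instrumentation, but the threshold, duration, and labels are assumptions to tune for your environment:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: loki-alerts
  namespace: observability
spec:
  groups:
    - name: loki
      rules:
        - alert: LokiRequestErrors
          expr: |
            sum(rate(loki_request_duration_seconds_count{status_code=~"5.."}[5m]))
              /
            sum(rate(loki_request_duration_seconds_count[5m])) > 0.02
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: Loki request error ratio above 2%
```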


Configuration and secrets handling best practices

Store chart values in Git, but never commit live credentials. If you use External Secrets (Vault, AWS Secrets Manager, GCP Secret Manager), map secrets into the namespace and rotate without redeploying the full stack.

Use separate credentials per environment (dev/staging/prod) and restrict S3 IAM scope to Loki buckets only. Block wildcard object permissions where possible. Audit access logs regularly; log platform credentials are high-value because they can expose sensitive events from every service.
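As an example of that scoping, an IAM policy restricted to the three Loki buckets could look like the following. Bucket names match this guide's values file; the action list is a minimal assumption and may need additions (for example, multipart-upload permissions) depending on your setup:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::loki-prod-chunks",
        "arn:aws:s3:::loki-prod-ruler",
        "arn:aws:s3:::loki-prod-admin"
      ]
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": [
        "arn:aws:s3:::loki-prod-chunks/*",
        "arn:aws:s3:::loki-prod-ruler/*",
        "arn:aws:s3:::loki-prod-admin/*"
      ]
    }
  ]
}
```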

For compliance-heavy environments, add namespace network policies so only Promtail and trusted observability components can talk to Loki gateway/service endpoints. Combine this with RBAC restrictions for query access to reduce accidental data exposure.
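A minimal sketch of such a policy, allowing only Promtail pods in the namespace to reach the Loki components. The pod selector labels are assumptions; match them against the labels your chart releases actually generate:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: loki-ingress-allowlist
  namespace: observability
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: loki
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: promtail
```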

Verification checklist

  • Availability: all Loki read/write/backend pods Ready and gateway responding on /ready.
  • Security: TLS certificate valid, auto-renew path healthy, and secret objects restricted to observability operators.
  • Data path: synthetic logs visible within expected latency (typically under 30 seconds).
  • Cost controls: retention period and query limits enforced; no uncontrolled high-cardinality labels entering ingestion.
  • Recovery: documented restore test from object storage performed and timed.

Common issues and fixes

Queries are slow after deployment

Check index/chunk storage latency first. S3 endpoint or DNS misconfiguration often causes read stalls. Validate bucket region, endpoint, and Loki schema settings.

High memory usage on read components

Reduce query parallelism and enforce dashboard query ranges. Add caching where appropriate and watch for wildcard-heavy labels.

No logs from specific namespaces

Inspect Promtail relabel rules and service account permissions. Ensure the DaemonSet can read node log paths and Kubernetes metadata.

TLS certificate not issued

Review cert-manager ClusterIssuer, DNS records, and ingress annotations. Most failures are challenge propagation or wrong ingressClassName.

Unexpected storage growth

Audit label cardinality and retention settings. Drop unneeded labels at ingestion and verify compactor activity.

Intermittent 5xx from gateway

Check upstream service endpoints and resource pressure. Burst ingestion without limits can starve query traffic.

FAQ

Can I run Loki in single-binary mode in production?

For small environments, yes, but distributed mode is recommended once multiple teams or high query concurrency are involved. It scales read/write independently and improves fault tolerance.

Do I need object storage or can I use only PVCs?

Object storage is strongly recommended for durability and operational flexibility. PVC-only designs are harder to scale and recover during node/storage disruptions.

How much retention should I set initially?

Start from compliance and incident-response requirements, then model cost. Many teams begin with 30 days hot retention and export longer-term archives separately.

What labels should I avoid?

Avoid highly dynamic values (request IDs, timestamps, random hashes) as labels. Keep labels low-cardinality and move dynamic fields into log body content.

How do I secure multi-tenant access?

Use network policies, auth in front of gateway, and role-separated dashboards. For strict tenancy, isolate environments or use dedicated Loki tenants with policy controls.

How should I upgrade with minimal risk?

Pin chart versions, test in staging with replayed query load, then roll through production during low-traffic windows. Keep rollback values and backup validation ready.

Talk to us

If you want support designing or hardening your observability platform, we can help with architecture, migration planning, and production readiness.

Contact Us
