Production Guide: Deploy Grafana Loki with Kubernetes + Helm + ingress-nginx on Ubuntu

A production-oriented Loki deployment on Kubernetes with Helm, ingress-nginx, secure ingestion, retention controls, and operational runbooks.

Log management usually becomes a production problem long before teams plan for it. A few services become dozens, incidents start spanning multiple systems, and engineers waste critical time SSHing into hosts to grep rotating files that may already be gone. Loki is a practical answer when you want centralized, cost-conscious logs that work naturally with Grafana and Kubernetes labels. In this guide, you will deploy Grafana Loki on Ubuntu-based Kubernetes using Helm and ingress-nginx, with a production-oriented setup focused on retention, secure exposure, secret handling, and operational reliability.

The workflow is intentionally built for platform and DevOps teams that need repeatable operations under pressure: deterministic install steps, clear rollback points, practical verification commands, and troubleshooting based on real failure modes. You will deploy Loki with object storage support, wire authentication for ingestion and query paths, front access through ingress-nginx and TLS, and establish runbook-friendly checks for uptime, indexing, and query health.

Architecture and flow overview

This deployment uses Kubernetes as the runtime control plane, Helm as the release and configuration layer, ingress-nginx as the north-south traffic gateway, and Loki in scalable mode for durable ingestion and query operations. Promtail (or Alloy) agents ship logs from nodes and workloads into the distributor. The distributor validates and routes log streams to ingesters, which buffer and flush chunks to object storage. Index metadata is maintained to support fast label and time-range queries. Querier and query-frontend components serve Grafana and API consumers.

At a high level, the request and data flow is:

  • Workload stdout/stderr and node logs are collected by agents.
  • Agents push entries to Loki distributor over authenticated endpoints.
  • Ingesters batch and persist chunks to object storage.
  • Index and query components resolve labels/time windows for retrieval.
  • Grafana dashboards and incident responders query via ingress TLS endpoint.

This separation matters in production: ingestion spikes and query spikes can be scaled independently, storage can be tuned without reworking pipeline logic, and ingress policy remains centralized with your existing Kubernetes edge controls.

Prerequisites

  • Ubuntu host(s) running a healthy Kubernetes cluster (v1.28+ recommended).
  • kubectl with cluster-admin access.
  • helm v3.13+ installed on the operator machine.
  • ingress-nginx controller deployed and reachable.
  • A DNS record for your Loki endpoint (for example loki.sysbrix.internal).
  • TLS strategy in place (cert-manager or pre-provisioned secret).
  • Object storage bucket and credentials for long-term retention.
kubectl version
helm version
kubectl get nodes -o wide
kubectl -n ingress-nginx get pods


Step-by-step deployment

1) Create namespace and baseline policies

Start with a dedicated namespace so quotas, RBAC, and network policies remain isolated from unrelated workloads. Labeling early also helps governance tooling and policy engines classify the stack correctly.

kubectl create namespace loki
kubectl label namespace loki app=loki tier=observability env=prod

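Since the namespace is meant to carry quota boundaries, it helps to apply one up front. The numbers below are placeholder assumptions, not sizing guidance; tune them to your cluster capacity and expected log volume.

```yaml
# Hypothetical baseline quota for the loki namespace; values are illustrative.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: loki-quota
  namespace: loki
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
    persistentvolumeclaims: "10"
```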

2) Create object storage and auth secrets

Never hardcode credentials in Helm values committed to Git. Store secrets in Kubernetes and reference them from chart values. For production, use external secret managers (Vault, ESO, cloud KMS-backed secret stores). Note that --from-literal values do end up in shell history; prefer --from-file, piped input, or an external secret operator where that matters. The baseline below at least keeps sensitive data out of plain manifests.

kubectl -n loki create secret generic loki-s3 \
  --from-literal=AWS_ACCESS_KEY_ID='replace-me' \
  --from-literal=AWS_SECRET_ACCESS_KEY='replace-me' \
  --from-literal=S3_BUCKET='replace-me' \
  --from-literal=S3_REGION='us-east-1'

kubectl -n loki create secret generic loki-basic-auth \
  --from-literal=LOKI_USERNAME='loki_reader' \
  --from-literal=LOKI_PASSWORD='replace-with-strong-password'


3) Add Helm repository and pin chart version

Pinning avoids surprise behavior changes from upstream releases during routine redeploys. In production, chart upgrades should be explicit, tested in staging, and rolled with rollback plans.

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm search repo grafana/loki -l | head -n 20


4) Create production values file

The values file below enables scalable mode, configures ingress, and sets conservative resource defaults. Note that Helm does not expand ${S3_BUCKET}-style placeholders: substitute them before applying (for example with envsubst in your CI pipeline), or inject them as environment variables and run Loki with env expansion enabled. Adjust replicas and limits based on workload volume and query concurrency. Keep this file in your infra repository and track changes through pull requests.

loki:
  auth_enabled: true
  commonConfig:
    replication_factor: 2
  storage:
    type: s3
    bucketNames:
      chunks: ${S3_BUCKET}
      ruler: ${S3_BUCKET}
      admin: ${S3_BUCKET}
    s3:
      region: ${S3_REGION}
  schemaConfig:
    configs:
      - from: "2025-01-01"
        store: tsdb
        object_store: s3
        schema: v13
        index:
          prefix: loki_index_
          period: 24h

deploymentMode: SimpleScalable

read:
  replicas: 2
write:
  replicas: 2
backend:
  replicas: 2

singleBinary:
  replicas: 0

gateway:
  enabled: true
  ingress:
    enabled: true
    ingressClassName: nginx
    hosts:
      - host: loki.sysbrix.internal
        paths:
          - path: /
            pathType: Prefix
    tls:
      - secretName: loki-tls
        hosts:
          - loki.sysbrix.internal

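To connect the pre-created loki-s3 secret to the placeholders above, one approach is to mount the secret as environment variables per component and enable Loki's config env expansion. The sketch below assumes the grafana/loki chart's per-component extraEnvFrom and extraArgs keys; verify against the values reference for your pinned chart version.

```yaml
# Sketch: inject the loki-s3 secret and let Loki expand ${...} placeholders
# at startup via -config.expand-env. Key layout assumes chart 6.x.
write:
  extraEnvFrom:
    - secretRef:
        name: loki-s3
  extraArgs:
    - -config.expand-env=true
read:
  extraEnvFrom:
    - secretRef:
        name: loki-s3
  extraArgs:
    - -config.expand-env=true
backend:
  extraEnvFrom:
    - secretRef:
        name: loki-s3
  extraArgs:
    - -config.expand-env=true
```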

5) Deploy Loki release with Helm

Render and lint templates before install so obvious schema errors are caught early. Use --atomic to roll back automatically if resources fail readiness during install windows.

helm upgrade --install loki grafana/loki \
  --namespace loki \
  --version 6.6.3 \
  --values ./values-loki-prod.yaml \
  --atomic \
  --timeout 15m

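After the install returns, confirm the release actually converged before moving on. Resource kinds and names below assume the chart's SimpleScalable defaults (read as a Deployment, write and backend as StatefulSets); adjust if your chart version differs.

```shell
# Confirm the release converged and all components report ready.
helm -n loki status loki
kubectl -n loki rollout status deploy/loki-read --timeout=5m
kubectl -n loki rollout status statefulset/loki-write --timeout=5m
kubectl -n loki rollout status statefulset/loki-backend --timeout=5m
kubectl -n loki get pods -l app.kubernetes.io/instance=loki
```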

6) Configure ingestion clients (Promtail or Alloy)

Your cluster is only useful if logs actually arrive. Configure agents to push to the gateway endpoint with tenant and authentication headers if multi-tenant mode is enabled. In single-tenant mode, keep auth strict anyway; unauthenticated log ingestion is an incident waiting to happen.

clients:
  - url: https://loki.sysbrix.internal/loki/api/v1/push
    tenant_id: prod  # required when auth_enabled: true; sent as X-Scope-OrgID
    basic_auth:
      username: loki_reader
      password: ${LOKI_PASSWORD}  # expanded only if the agent runs with -config.expand-env=true
    external_labels:
      cluster: prod-main
      environment: production

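Before trusting the agents, it is worth pushing one test line by hand through the gateway. A successful push returns HTTP 204. The tenant value "prod" is an assumption; use whatever X-Scope-OrgID your queries will use, and note the timestamp must be in nanoseconds.

```shell
# Push a single smoke-test line; expect "204" on success.
NOW_NS=$(date +%s%N)
curl -s -o /dev/null -w '%{http_code}\n' \
  -u loki_reader:'replace-with-strong-password' \
  -H 'Content-Type: application/json' \
  -H 'X-Scope-OrgID: prod' \
  https://loki.sysbrix.internal/loki/api/v1/push \
  --data @- <<EOF
{"streams":[{"stream":{"job":"smoke-test"},"values":[["${NOW_NS}","hello from smoke test"]]}]}
EOF
```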

7) Expose Grafana data source and enforce RBAC

For operational safety, bind query permissions through role-based access. Incident responders may need broad read scope, while application teams should usually be scoped by namespace labels to avoid accidental data exposure.

kubectl -n loki get pods
kubectl -n loki get ingress
kubectl -n loki logs deploy/loki-read --tail=100
kubectl auth can-i get pods --as system:serviceaccount:loki:default -n loki

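On the Kubernetes side, a namespaced Role keeps responders read-only over the Loki stack's objects. This is a sketch; the group name "incident-responders" is an assumption you would map to your identity provider.

```yaml
# Sketch: read-only access to Loki's Kubernetes objects for responders.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: loki-viewer
  namespace: loki
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log", "services"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: loki-viewer-binding
  namespace: loki
subjects:
  - kind: Group
    name: incident-responders  # assumption: map to your IdP group
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: loki-viewer
  apiGroup: rbac.authorization.k8s.io
```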

8) Add retention and query guardrails

Without guardrails, expensive wildcard queries and excessive retention can quietly become your biggest observability cost driver. Set retention windows by environment and workload criticality, then define query limits so one ad-hoc dashboard cannot exhaust resources during incidents.

loki:
  limits_config:
    retention_period: 720h
    max_query_parallelism: 16
    max_query_series: 5000
    split_queries_by_interval: 15m
    ingestion_rate_mb: 8
    ingestion_burst_size_mb: 16

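Note that limits_config only defines the retention window; actual deletion of expired chunks is performed by the compactor, which must have retention enabled. A sketch, assuming the chart 6.x / Loki 3.x key layout:

```yaml
# Sketch: retention is enforced by the compactor, not limits_config alone.
loki:
  compactor:
    retention_enabled: true
    delete_request_store: s3   # required when retention_enabled is true
    working_directory: /var/loki/compactor
    compaction_interval: 10m
```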

Configuration and secrets handling best practices

Treat logging infrastructure credentials as production secrets. Rotate object storage and basic-auth credentials on a schedule and after personnel changes. Use external secret operators where available so credentials are not manually copied into cluster commands. Restrict secret access to dedicated service accounts and avoid broad namespace-wide permissions. Keep Loki values in version control, but never store static plaintext secrets in the repository.

For compliance-sensitive environments, enforce encrypted transport on every path: agent to gateway, ingress to backend services, and storage endpoints. Add Kubernetes network policies so only expected namespaces can push logs. Also add per-team label conventions (team, service, env) early; consistent labels are what make your logs useful at 3 AM during an outage.
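A NetworkPolicy along these lines restricts ingestion to the expected sources. The namespace labels here are assumptions; match them to your own labeling scheme.

```yaml
# Sketch: only ingress-nginx and namespaces labeled logging=enabled may reach Loki pods.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: loki-ingress-allowlist
  namespace: loki
spec:
  podSelector: {}
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
        - namespaceSelector:
            matchLabels:
              logging: enabled
```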

# Example: rotate basic auth secret without a delete/create gap, then restart gateway
kubectl -n loki create secret generic loki-basic-auth \
  --from-literal=LOKI_USERNAME='loki_reader' \
  --from-literal=LOKI_PASSWORD='new-strong-password' \
  --dry-run=client -o yaml | kubectl apply -f -
kubectl -n loki rollout restart deploy/loki-gateway


Verification checklist

  • All Loki read/write/backend pods are ready with zero crash loops.
  • Ingress endpoint presents a valid TLS certificate and resolves publicly or internally as intended.
  • Agent pipelines can push sample logs and receive HTTP 204 responses.
  • Grafana can query labels and recent logs without auth or timeout failures.
  • Object storage bucket receives chunks and index objects as expected.
  • Retention and query limits are enforced in live behavior.
kubectl -n loki get pods -o wide
kubectl -n loki get svc,ingress
curl -I https://loki.sysbrix.internal/ready
curl -u loki_reader:'replace-with-strong-password' \
  -G 'https://loki.sysbrix.internal/loki/api/v1/query' \
  --data-urlencode 'query={namespace="loki"}'


Common issues and fixes

Ingress returns 502/503 even though pods are running

This usually indicates service or port mismatches, not TLS itself. Confirm ingress backend service names and target ports match chart output. Also verify readiness probes are passing before expecting stable ingress responses.
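These checks walk the path from ingress to backend. Resource names assume the chart defaults (loki-gateway service, ingress-nginx-controller deployment); adjust to your install.

```shell
# Trace ingress -> service -> endpoints; empty endpoints explain most 502/503s.
kubectl -n loki describe ingress
kubectl -n loki get svc loki-gateway -o wide
kubectl -n loki get endpoints loki-gateway
kubectl -n ingress-nginx logs deploy/ingress-nginx-controller --tail=50 | grep -i loki
```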

No logs appear for application namespaces

Check the agent DaemonSet first. Missing hostPath mounts, namespace filters, or denied egress to Loki are common root causes. Validate by sending a test log line and tracing client response status codes.
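A quick triage pass over the agent, assuming a Promtail DaemonSet named "promtail" in a "promtail" namespace (adjust names to your install):

```shell
# Confirm the DaemonSet is scheduled everywhere, then scan for push failures.
kubectl -n promtail get daemonset promtail
kubectl -n promtail logs daemonset/promtail --tail=100 | grep -iE 'error|401|403|failed'
```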

Queries are slow during incidents

Most slowdowns come from unbounded queries over broad time ranges. Enforce query splitting, cap parallelism, and train responders to start with narrow labels and short windows before broad expansion.
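The narrow-first pattern looks like this with logcli (Grafana's Loki CLI); the namespace and app labels are illustrative, and auth is supplied via logcli's LOKI_USERNAME/LOKI_PASSWORD environment variables.

```shell
# Start narrow: specific labels, short window, capped result size.
export LOKI_ADDR=https://loki.sysbrix.internal
logcli query '{namespace="payments", app="checkout"}' --since=15m --limit=100
# Widen only after the narrow query confirms the right streams.
logcli query '{namespace="payments"}' --since=1h --limit=500
```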

Storage costs rise unexpectedly

Validate retention policy and confirm old chunks are actually expiring. Large cardinality labels and verbose debug logs can multiply storage consumption quickly; review label design and sampling policies with application teams.
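The label APIs make cardinality reviews concrete. The tenant header value "prod" is an assumption; a label like pod with tens of thousands of values is a typical cost multiplier.

```shell
# List label names, then count values for a suspected high-cardinality label.
curl -s -u loki_reader:'replace-with-strong-password' \
  -H 'X-Scope-OrgID: prod' \
  'https://loki.sysbrix.internal/loki/api/v1/labels' | jq .
curl -s -u loki_reader:'replace-with-strong-password' \
  -H 'X-Scope-OrgID: prod' \
  'https://loki.sysbrix.internal/loki/api/v1/label/pod/values' | jq '.data | length'
```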

Frequent out-of-memory on write components

Increase ingester resources gradually and evaluate chunk settings. Extremely bursty workloads may need higher write replicas and stricter ingestion limits to prevent cascading instability.

Authentication errors from Grafana data source

Double-check credentials, tenant headers, and TLS trust chain. If credentials were rotated recently, force data source save/test again and restart any sidecars caching old settings.

FAQ

Can I run Loki without object storage in production?

You can, but it is not recommended for durable production operations. Local disk modes simplify setup but reduce resilience and make retention/scale operations harder. Object storage gives better durability and operational flexibility.

Should I use Promtail or Grafana Alloy for ingestion?

Both can work. Promtail is straightforward for many teams, while Alloy provides broader pipeline capabilities in unified telemetry estates. Choose based on your current observability roadmap and operational maturity.

How much retention should we start with?

A common starting point is 30 days for broad platform logs and longer windows only for regulated workloads. Start conservative, then tune based on actual incident and audit needs to control cost growth.

Is multi-tenancy necessary for internal platform teams?

Not always. If you have strict team boundaries or compliance requirements, multi-tenancy helps. For smaller teams, single tenant with strong auth and label conventions may be simpler while still secure.

Can I deploy Loki and Grafana in separate clusters?

Yes. Many organizations centralize Loki while keeping Grafana near each environment. Ensure secure network connectivity, low-latency query paths where possible, and clear ownership for upgrades and incident response.

What is the safest upgrade approach?

Pin chart versions, test in staging with production-like load, back up critical configuration, and perform canary or maintenance-window upgrades. Always verify ingestion and query health before declaring completion.

How do we prevent sensitive data from entering logs?

Implement redaction in application middleware and ingestion pipelines. Enforce secure coding reviews for logging statements and add detection alerts for known secret patterns so leaks are caught early.

Talk to us

If you want this implemented with hardened defaults, observability, and tested recovery playbooks, our team can help.

Contact Us
