How to Deploy Kestra with Docker Compose and Caddy for Production Workflows

A production-ready guide to running Kestra reliably with TLS, queue workers, secrets handling, observability, and failure recovery.

When teams move from ad-hoc cron jobs to event-driven orchestration, they need a platform that coordinates APIs, retries, schedules, and notifications without fragile glue scripts. This guide deploys Kestra for a practical production scenario: running business-critical workflows with recoverable retries and clear audit trails.

We use Docker Compose and Caddy so you get a clean operational baseline with HTTPS, reverse proxying, durable state, and predictable troubleshooting. The goal is not a demo stack—it is a resilient setup you can run, monitor, and evolve safely.

Architecture and Flow Overview

The architecture is split into edge, control, execution, and state layers. Caddy handles TLS and inbound HTTP policy, Kestra orchestrates workflows, workers execute tasks, and PostgreSQL stores durable run metadata.

  • Edge: Caddy with automatic certificates and secure headers.
  • Control plane: Kestra API/UI and scheduler.
  • Execution: worker tasks with retries and backoff.
  • State: PostgreSQL with backup and restore procedures.

Traffic flows from users and service webhooks to Caddy, then into Kestra. Kestra records each run state in Postgres and dispatches task execution. Workers report progress back to the API, making incident response straightforward because state and logs are centralized.

In production, this separation matters. If the edge layer has a temporary issue, run state is still persisted. If workers are overloaded, orchestration metadata remains intact and queued tasks can resume after scaling. This gives you operational leverage under failure instead of opaque single-container behavior.

Prerequisites

  • Ubuntu 22.04/24.04 server (2 vCPU minimum, 4+ vCPU recommended).
  • Domain DNS record pointing to the host (e.g., flows.example.com).
  • Docker Engine and Docker Compose v2.
  • Open inbound 80/443 for TLS issuance and renewal.
  • Strong Postgres password and long random Kestra secret.

Before cutover, verify clock synchronization, disk thresholds, and snapshot policy. Workflow systems often fail when peripheral operational assumptions (time synchronization, storage headroom, and credential hygiene) are not handled upfront.
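The strong password and long random secret from the prerequisites can be generated directly on the host. A minimal sketch using openssl (installed by default on Ubuntu); the variable names are illustrative, and the values get pasted into the .env file created in the next steps:

```shell
# Generate a strong database password and a long random application secret.
DB_PASSWORD=$(openssl rand -base64 32 | tr -d '\n')
APP_SECRET=$(openssl rand -base64 48 | tr -d '\n')

# Print them once for pasting into a secret manager and .env.
echo "POSTGRES_PASSWORD=${DB_PASSWORD}"
echo "KESTRA_SECRET=${APP_SECRET}"
```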

Step-by-Step Deployment

1) Create host directories

sudo mkdir -p /opt/kestra/{data,caddy,backups}
cd /opt/kestra
sudo chown -R $USER:$USER /opt/kestra


2) Define environment variables

cat > /opt/kestra/.env <<'EOF'
KESTRA_HOST=flows.example.com
POSTGRES_DB=kestra
POSTGRES_USER=kestra
POSTGRES_PASSWORD=CHANGE_ME_STRONG_DB_PASSWORD
KESTRA_SECRET=CHANGE_ME_LONG_RANDOM_SECRET
TZ=America/Chicago
EOF


Store real values in a secret manager and keep this file out of version control. During rotations, update secrets and restart services in a controlled window.
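A controlled rotation window can be scripted. This sketch rotates only the admin secret and recreates only the service that consumes it; rotating the Postgres password additionally requires an `ALTER USER` inside the database before restarting Kestra. It assumes /opt/kestra/.env is the stack's single source of truth:

```shell
# Rotate KESTRA_SECRET in place, then recreate only the Kestra service.
cd /opt/kestra
NEW_SECRET=$(openssl rand -base64 48 | tr -d '\n')
sed -i "s|^KESTRA_SECRET=.*|KESTRA_SECRET=${NEW_SECRET}|" .env
docker compose --env-file .env up -d kestra
```

Recreating a single service keeps the database and edge layer untouched, so runs queued during the restart resume once the API is back.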

3) Create docker-compose.yml

services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_DB: ${POSTGRES_DB}
      POSTGRES_USER: ${POSTGRES_USER}
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
    volumes:
      - ./data/postgres:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER} -d ${POSTGRES_DB}"]
      interval: 10s
      timeout: 5s
      retries: 10
    restart: unless-stopped

  kestra:
    image: kestra/kestra:latest
    command: server standalone
    depends_on:
      postgres:
        condition: service_healthy
    environment:
      KESTRA_CONFIGURATION: |
        datasources:
          postgres:
            url: jdbc:postgresql://postgres:5432/${POSTGRES_DB}
            driverClassName: org.postgresql.Driver
            username: ${POSTGRES_USER}
            password: ${POSTGRES_PASSWORD}
        kestra:
          server:
            basic-auth:
              enabled: true
              username: admin
              password: ${KESTRA_SECRET}
          repository:
            type: postgres
          queue:
            type: postgres
          storage:
            type: local
            local:
              base-path: /app/storage
    volumes:
      - ./data/storage:/app/storage
    restart: unless-stopped

  caddy:
    image: caddy:2.8
    ports:
      - "80:80"
      - "443:443"
    environment:
      KESTRA_HOST: ${KESTRA_HOST}
    volumes:
      - ./caddy/Caddyfile:/etc/caddy/Caddyfile:ro
      - ./data/caddy:/data
      - ./data/config:/config
    depends_on:
      - kestra
    restart: unless-stopped


4) Configure Caddy

{$KESTRA_HOST} {
  encode gzip zstd
  reverse_proxy kestra:8080
  header {
    Strict-Transport-Security "max-age=31536000; includeSubDomains; preload"
    X-Content-Type-Options "nosniff"
    X-Frame-Options "SAMEORIGIN"
    Referrer-Policy "strict-origin-when-cross-origin"
  }
}


5) Start and validate services

cd /opt/kestra
docker compose --env-file .env up -d
docker compose ps
docker compose logs --tail=80 caddy kestra postgres


At this point, confirm that the UI is reachable over HTTPS and certificate renewal succeeds. If TLS fails, verify DNS records and ensure no other service is binding to ports 80 or 443.
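To confirm issuance from the host, inspect the certificate Caddy is actually serving. A quick sketch; substitute the hostname from your .env:

```shell
# Print the issuer and expiry date of the certificate on the public endpoint.
HOST=flows.example.com
echo | openssl s_client -connect "${HOST}:443" -servername "${HOST}" 2>/dev/null \
  | openssl x509 -noout -issuer -enddate
```

A Let's Encrypt issuer and an expiry date roughly 90 days out indicate issuance succeeded; an empty response usually means Caddy never obtained a certificate, which points back to DNS or port conflicts.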

For production hardening, pin image versions after initial validation instead of continuously tracking latest tags. Maintain a changelog for upgrades and test critical flows in staging before broad rollout.
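Pinning amounts to replacing the floating tags in docker-compose.yml with explicit ones. The tags below are illustrative, not recommendations; use the versions you actually validated:

```yaml
services:
  kestra:
    image: kestra/kestra:v0.22.0   # illustrative pinned tag
  postgres:
    image: postgres:16.4           # illustrative pinned tag
  caddy:
    image: caddy:2.8.4             # illustrative pinned tag
```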

Configuration and Secrets Handling

Separate runtime credentials from workflow definitions. Keep database and admin secrets at infrastructure level, and use scoped credentials for downstream systems invoked by flows. This reduces blast radius when one integration key is compromised.

Recommended controls include: explicit timeout defaults, retry limits to prevent runaway loops, namespace-level governance for team ownership, and periodic credential rotation with runbook validation. Also configure access control so only authorized users can modify production workflows.
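Timeout defaults and retry limits from the list above translate directly into task-level properties in a Kestra flow. A minimal sketch; the endpoint and durations are placeholders:

```yaml
id: bounded_retry_example
namespace: ops.examples
tasks:
  - id: call_downstream
    type: io.kestra.plugin.core.http.Request
    uri: https://api.example.com/health   # placeholder endpoint
    method: GET
    timeout: PT30S          # hard cap per attempt
    retry:
      type: exponential     # backoff between attempts
      interval: PT10S
      maxInterval: PT5M
      maxAttempt: 5         # bounded: prevents runaway retry loops
```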

When onboarding teams, define clear conventions for workflow IDs, namespaces, and version comments. Strong naming discipline reduces debugging time and helps responders quickly identify which service owns a failing flow. Add owner tags and escalation hints to every critical workflow so on-call engineers can route incidents without triage delays.
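Kestra flows support labels, which are a convenient place for the owner tags and escalation hints described above. The values here are illustrative:

```yaml
id: billing_sync
namespace: prod.billing
labels:
  owner: payments-team
  escalation: oncall-payments
tasks:
  - id: placeholder
    type: io.kestra.plugin.core.log.Log
    message: "placeholder task body"
```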

For regulatory environments, maintain immutable audit exports for execution metadata. A simple monthly export with checksum verification is often enough to satisfy audit requests without expensive data warehousing projects. The key is repeatability and documented ownership.
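The monthly export can be as simple as dumping execution metadata and recording a checksum alongside it. In this sketch the table name is an assumption; verify it against the schema of your Kestra version before relying on it:

```shell
# Export execution metadata and write a checksum for later verification.
STAMP=$(date +%Y-%m)
OUT=/opt/kestra/backups/executions-${STAMP}.sql
docker compose exec -T postgres sh -c \
  'pg_dump -U "$POSTGRES_USER" -d "$POSTGRES_DB" --table=executions' > "${OUT}"
sha256sum "${OUT}" > "${OUT}.sha256"
sha256sum -c "${OUT}.sha256"   # repeatable verification step for auditors
```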

Operational Runbook and Day-2 Practices

Day-2 operations determine whether the platform remains useful after the first month. Define a lightweight on-call runbook with explicit checks: queue depth thresholds, database health, certificate age, and the top failing workflows by namespace. Keep this list short enough that responders can execute it in minutes.

Set weekly maintenance windows for dependency updates, stale flow cleanup, and credential-age reviews. Every update should have a rollback note and a fast validation checklist. In practice, this prevents small breakages from stacking into multi-hour incidents. If your team supports multiple business units, establish namespace ownership and escalation paths so failures route to the right engineers quickly.

Finally, run game-day drills for realistic failure cases: database restart, worker crash loop, DNS break, and downstream API throttling. The objective is confidence, not perfection. Each drill should produce at least one concrete improvement in alerts, docs, or deployment defaults.

Verification

Use a smoke test workflow and validate the platform at application and infrastructure levels. A complete verification pass should confirm route availability, auth policy, queue behavior, and post-failure recoverability.

id: smoke_test
namespace: ops.healthchecks
triggers:
  - id: every_15m
    type: io.kestra.plugin.core.trigger.Schedule
    cron: "*/15 * * * *"
tasks:
  - id: ping_api
    type: io.kestra.plugin.core.http.Request
    uri: https://httpbin.org/status/200
    method: GET
  - id: log_status
    type: io.kestra.plugin.core.log.Log
    message: "Status: {{ outputs.ping_api.code }}"


curl -I https://flows.example.com
docker compose exec postgres sh -c 'pg_isready -U "$POSTGRES_USER" -d "$POSTGRES_DB"'
docker compose logs --tail=120 kestra


Track queue depth, failed runs, median runtime, and API latency. Alert on sustained backlog growth and repeated retry exhaustion so incidents are detected before business SLAs are breached. During incident reviews, correlate workflow-level failures with infrastructure events to separate application defects from platform capacity constraints.

Common Issues and Fixes

TLS does not issue

Check DNS propagation, open ports 80/443, and conflicting reverse proxies.

Queue latency spikes

Increase worker capacity, set task concurrency limits, and isolate heavy flows.

Database connection saturation

Tune connection pools and inspect bursty workflows creating too many sessions.

Secrets appear in logs

Mask values, avoid echoing env vars, and enforce redaction in shell-based tasks.

After upgrade, plugins fail

Pin tested versions, stage upgrades, and validate critical workflows before production rollout.

Unexpected retry storms

Use bounded retries with exponential backoff and circuit-breaker style guard conditions.

FAQ

Can I use SQLite instead of Postgres?

Kestra has no SQLite backend; its local development mode uses an embedded H2 database. For production, use Postgres for durability and concurrency.

How many workers should I run at launch?

Start conservatively, then scale from queue and runtime metrics instead of guesses.

Is Docker Compose enough for enterprise use?

It can be, if you enforce backup, patching, observability, and recovery discipline.

How do I rotate secrets with minimal disruption?

Use staged rotation windows and restart components in sequence while monitoring active runs.

How often should I back up Postgres?

At least daily logical backups plus snapshot policy aligned to your RPO requirements.
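A daily logical backup matching this guide's layout can be a one-line cron job. A sketch; the seven-day retention is illustrative and should follow your RPO policy:

```shell
# Dump, compress, and prune backups older than seven days.
cd /opt/kestra
docker compose exec -T postgres sh -c 'pg_dump -U "$POSTGRES_USER" -d "$POSTGRES_DB"' \
  | gzip > backups/kestra-$(date +%F).sql.gz
find backups -name 'kestra-*.sql.gz' -mtime +7 -delete
```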

What should I monitor first?

Queue depth, failed runs, execution latency, API availability, and certificate renewal status.

Can I add SSO later?

Yes—start with basic auth and migrate to SSO once identity and role boundaries are defined.
