Teams that centralize workflow orchestration often hit the same wall: ad hoc scripts spread across CI runners, cron jobs, and serverless glue code eventually become hard to audit, hard to retry, and risky to scale. Kestra is a strong option for consolidating jobs, scheduling, event handling, and operational visibility into one platform that engineering and operations teams can run themselves.
This guide walks through a production-ready deployment of Kestra using Docker Compose + Traefik + PostgreSQL on Ubuntu. You'll set up isolated services, persistent storage, TLS termination, secret management, health checks, and a repeatable validation workflow. The focus is practical operations: what to configure, how to verify it, and what usually breaks first in real environments.
Architecture and flow overview
The deployment follows a layered pattern that keeps responsibilities clear and reduces blast radius during incidents:
- Application layer: Kestra web/API and background workers.
- Data layer: PostgreSQL for durable metadata and execution state.
- Edge layer: Traefik for TLS termination, routing, and optional middleware controls.
- Host layer: Ubuntu hardening, UFW policy, log rotation, and backup jobs.
Operationally, users and API clients reach Kestra via HTTPS through Traefik. Internal traffic between services stays on a private Docker network. Backups run from the host, and recovery drills validate that restored state can replay scheduled workloads cleanly.
Prerequisites
- Ubuntu 22.04/24.04 VM with at least 4 vCPU, 8 GB RAM, and fast SSD storage.
- DNS A record for kestra.example.com pointing to your server.
- Docker Engine + Docker Compose plugin installed.
- Ports 22, 80, and 443 allowed from trusted networks.
- A secure mailbox for TLS/Let's Encrypt notifications.
- Time sync enabled (chrony/systemd-timesyncd) to avoid certificate and scheduling drift.
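Before touching the stack, it helps to confirm the prerequisites mechanically rather than by memory. The sketch below is a hypothetical preflight helper (the `check` function and the default `DOMAIN` value are assumptions, not part of any tool); adapt it to your environment.

```shell
#!/usr/bin/env bash
# Preflight checks (a sketch; set DOMAIN for your environment).
set -u
DOMAIN="${DOMAIN:-kestra.example.com}"
fail=0

check() {
  # check <description> <command...> — run the command, report pass/fail.
  local desc="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "OK   $desc"
  else
    echo "FAIL $desc"
    fail=1
  fi
}

check "docker binary present"   command -v docker
check "compose plugin responds" docker compose version
check "curl present"            command -v curl
check "DNS record resolves"     getent hosts "$DOMAIN"

if [ "$fail" -eq 0 ]; then
  echo "preflight passed"
else
  echo "fix the FAIL lines before deploying"
fi
```

Run it once before step 1 and again after any host rebuild; a FAIL line here is far cheaper to fix than a half-deployed stack.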
Step-by-step deployment
1) Prepare host and base packages
Start by updating packages and installing baseline tooling for diagnostics, backups, and TLS lifecycle operations.
sudo apt update && sudo apt -y upgrade
sudo apt -y install ca-certificates curl jq unzip ufw fail2ban
sudo timedatectl set-timezone America/Chicago
sudo mkdir -p /opt/kestra/{traefik,postgres,data,backup}
sudo chown -R $USER:$USER /opt/kestra
2) Create environment file for secrets
Keep credentials in a tightly permissioned environment file (readable only by the deploy user) and never commit this file to version control.
cat >/opt/kestra/.env <<'EOF'
POSTGRES_DB=kestra
POSTGRES_USER=kestra
POSTGRES_PASSWORD=replace-with-strong-random-password
KESTRA_BASIC_AUTH_USER=opsadmin
KESTRA_BASIC_AUTH_PASSWORD=replace-with-long-passphrase
[email protected]
DOMAIN=kestra.example.com
EOF
chmod 600 /opt/kestra/.env
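Rather than inventing passwords by hand, generate them from a CSPRNG. This sketch assumes `openssl` is installed (any CSPRNG-backed generator works equally well):

```shell
# Generate strong random values before filling in .env.
PG_PASS=$(openssl rand -base64 32)
UI_PASS=$(openssl rand -base64 32)
printf 'POSTGRES_PASSWORD=%s\n' "$PG_PASS"
printf 'KESTRA_BASIC_AUTH_PASSWORD=%s\n' "$UI_PASS"
```

Paste the printed values over the placeholders in /opt/kestra/.env, then re-check file permissions with `ls -l /opt/kestra/.env`.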
3) Create Docker Compose stack
This compose file separates edge, database, and application services with explicit restart behavior and health checks.
cat >/opt/kestra/docker-compose.yml <<'EOF'
services:
  traefik:
    image: traefik:v3.1
    command:
      - --providers.docker=true
      - --providers.docker.exposedbydefault=false
      - --entrypoints.web.address=:80
      - --entrypoints.websecure.address=:443
      - --certificatesresolvers.le.acme.email=${LETSENCRYPT_EMAIL}
      - --certificatesresolvers.le.acme.storage=/letsencrypt/acme.json
      - --certificatesresolvers.le.acme.tlschallenge=true
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /opt/kestra/traefik:/letsencrypt
    restart: unless-stopped

  postgres:
    image: postgres:16
    environment:
      POSTGRES_DB: ${POSTGRES_DB}
      POSTGRES_USER: ${POSTGRES_USER}
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
    volumes:
      - /opt/kestra/postgres:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER} -d ${POSTGRES_DB}"]
      interval: 10s
      timeout: 5s
      retries: 12
    restart: unless-stopped

  kestra:
    image: kestra/kestra:latest  # pin to a tested release tag in production
    command: server standalone
    depends_on:
      postgres:
        condition: service_healthy
    env_file: /opt/kestra/.env
    environment:
      KESTRA_CONFIGURATION: |
        datasources:
          postgres:
            url: jdbc:postgresql://postgres:5432/${POSTGRES_DB}
            driverClassName: org.postgresql.Driver
            username: ${POSTGRES_USER}
            password: ${POSTGRES_PASSWORD}
        kestra:
          repository:
            type: postgres
          queue:
            type: postgres
          storage:
            type: local
            local:
              basePath: /app/storage
          server:
            basicAuth:
              enabled: true
              username: ${KESTRA_BASIC_AUTH_USER}
              password: ${KESTRA_BASIC_AUTH_PASSWORD}
    volumes:
      - /opt/kestra/data:/app/storage
    labels:
      - traefik.enable=true
      - traefik.http.routers.kestra.rule=Host(`${DOMAIN}`)
      - traefik.http.routers.kestra.entrypoints=websecure
      - traefik.http.routers.kestra.tls.certresolver=le
      - traefik.http.services.kestra.loadbalancer.server.port=8080
    restart: unless-stopped
EOF
4) Start stack and verify health
cd /opt/kestra
docker compose --env-file .env up -d
docker compose ps
docker compose logs --tail=100 postgres
docker compose logs --tail=100 kestra
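On first boot, Kestra runs database migrations before it starts answering requests, so naive health checks tend to flake. A small retry helper keeps verification scripts patient; the commented curl example assumes DNS and TLS already work for your domain.

```shell
# Retry a command until it succeeds or attempts run out.
wait_for() {
  # wait_for <attempts> <delay-seconds> <command...>
  local tries="$1" delay="$2" i
  shift 2
  for ((i = 1; i <= tries; i++)); do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}

# On the server: wait up to ~2.5 minutes for the UI to answer over HTTPS.
# wait_for 30 5 curl -fsS -o /dev/null https://kestra.example.com
```

The same helper is reusable in later smoke tests and restart drills.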
5) Harden firewall and brute-force protection
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow OpenSSH
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
sudo ufw --force enable
sudo systemctl enable --now fail2ban
Configuration and secret-handling best practices
For production, treat orchestration credentials as high-value assets. Keep secrets in a dedicated secret manager where possible (Vault, AWS Secrets Manager, or SOPS-encrypted files), and inject them at runtime rather than baking them into images. If you must use .env, limit file permissions and rotate values on a regular schedule.
Use separate credentials for database access, API automation, and human administration. Avoid sharing admin credentials across teams. Add an approval policy for high-impact changes to flows, and define break-glass access with auditing enabled.
For change safety, deploy updates with a staging environment first, validate flow execution semantics, and then promote to production. Keep Compose and image tags explicit to avoid accidental major-version jumps that alter behavior.
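One low-effort way to keep image tags explicit is a Compose override file. The version numbers below are placeholders, not recommendations; substitute whichever releases you have actually validated in staging.

```yaml
# docker-compose.override.yml — explicit image pins (placeholder versions)
services:
  traefik:
    image: traefik:v3.1.4
  postgres:
    image: postgres:16.4
  kestra:
    image: kestra/kestra:v0.19.0
```

Docker Compose merges docker-compose.override.yml automatically on `docker compose up -d`, so upgrades become a one-line diff you can review and roll back.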
Operations, backups, and lifecycle management
Running orchestration in production is less about first deployment and more about disciplined lifecycle management. Define service-level objectives for workflow success rate, median execution latency, and queue drain time during incident scenarios. These objectives help your team decide when to scale vertically, when to split workloads, and when to tune retry behavior for noisy dependencies.
Backups should include both PostgreSQL and the local workflow storage directory. Keep at least one off-host encrypted copy so host compromise or disk failure does not remove both primary and backup data. A practical baseline is daily full backups retained for 14 to 30 days, plus weekly immutable snapshots for compliance and post-incident forensics.
Patch management matters for reliability and security. Track upstream releases for Kestra, PostgreSQL, and Traefik, then test upgrades against a staging clone using representative flows. Establish a predictable maintenance window and communicate expected impact to stakeholders in advance. After upgrades, rerun smoke tests and compare execution metrics against pre-upgrade baselines.
For governance, limit who can create production schedules and who can edit secrets. Add audit review checkpoints for privileged changes and require peer review for workflow definitions that touch billing, identity systems, or customer-facing APIs.
cat >/opt/kestra/backup.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
source /opt/kestra/.env
STAMP=$(date +%F-%H%M)
mkdir -p /opt/kestra/backup/$STAMP
cd /opt/kestra
docker compose exec -T postgres pg_dump -U "$POSTGRES_USER" "$POSTGRES_DB" > /opt/kestra/backup/$STAMP/kestra.sql
tar -czf /opt/kestra/backup/$STAMP/storage.tar.gz -C /opt/kestra data
find /opt/kestra/backup -maxdepth 1 -type d -mtime +21 -exec rm -rf {} +
EOF
chmod 700 /opt/kestra/backup.sh
sudo ln -sf /opt/kestra/backup.sh /etc/cron.daily/kestra-backup
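A backup you have never restored is a hope, not a plan. The helper below picks the newest timestamped directory created by backup.sh; the commented restore commands are a sketch and should run against a scratch database or staging clone, never the live instance.

```shell
# Print the newest timestamped backup directory under the given root.
latest_backup() {
  ls -1d "$1"/*/ 2>/dev/null | sort | tail -n 1
}

# Restore drill sketch (run on a scratch/staging host, not production):
# DIR=$(latest_backup /opt/kestra/backup)
# docker compose exec -T postgres psql -U "$POSTGRES_USER" -d "$POSTGRES_DB" < "${DIR}kestra.sql"
# tar -xzf "${DIR}storage.tar.gz" -C /opt/kestra
```

The timestamp format from backup.sh (`%F-%H%M`) sorts lexicographically, which is why a plain `sort | tail` reliably returns the newest run.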
Verification checklist
- HTTPS endpoint resolves and returns valid TLS chain.
- Login succeeds with basic auth and expected role permissions.
- Database health remains healthy under normal load.
- A sample scheduled workflow executes and writes expected logs.
- Restart drill confirms service recovery and job state persistence.
curl -I https://kestra.example.com
cd /opt/kestra && docker compose ps
cd /opt/kestra && set -a && . ./.env && set +a && docker compose exec -T postgres psql -U "$POSTGRES_USER" -d "$POSTGRES_DB" -c "select now();"
cd /opt/kestra && docker compose logs --tail=200 kestra
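For the scheduled-workflow check, a minimal flow like the one below is enough. The plugin type identifiers shown match recent Kestra releases but have changed across versions, so confirm them against your instance's documentation before saving.

```yaml
# Minimal scheduled smoke-test flow (type identifiers vary by Kestra version)
id: smoke_test
namespace: ops.checks
tasks:
  - id: hello
    type: io.kestra.plugin.core.log.Log
    message: "smoke test ran"
triggers:
  - id: every_15_min
    type: io.kestra.plugin.core.trigger.Schedule
    cron: "*/15 * * * *"
```

Save it via the UI editor, then confirm an execution with a success state appears within one schedule interval.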
Common issues and fixes
Traefik certificate not issuing
Check DNS propagation first, then ensure port 443 is reachable from the public internet. If ACME storage file permissions are wrong, Traefik cannot persist cert state. Confirm /opt/kestra/traefik/acme.json exists and is writable by the container.
Application starts but cannot reach PostgreSQL
Most failures come from wrong credentials, a typo in the DB host, or a race condition before the database is healthy. Keep depends_on with condition: service_healthy and validate the pg_isready health check.
Slow UI during peak execution windows
Profile queue depth and DB I/O. Move to larger instance class, tune PostgreSQL shared buffers, and split worker-heavy flows if a single host is saturating CPU or disk throughput.
Unexpected workflow failures after image update
Pin image tags and review release notes before upgrades. Run a smoke-test suite of representative workflows in staging, then roll forward in production during a low-risk maintenance window.
FAQ
Can I run Kestra behind Cloudflare instead of direct public ingress?
Yes. Keep origin locked down to Cloudflare egress ranges, enforce Full (Strict) TLS, and still maintain host-level firewall controls. Validate websocket behavior for interactive UI features.
How often should I back up PostgreSQL and workflow storage?
At minimum, run nightly full backups plus more frequent WAL or incremental snapshots for lower RPO. Test restore monthly and after major schema changes.
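For lower RPO than nightly dumps, WAL archiving is PostgreSQL's standard mechanism. The excerpt below follows the example in the PostgreSQL continuous-archiving documentation; the /backup/wal path is a placeholder you must create, mount into the container, and copy off-host.

```yaml
# postgresql.conf excerpt: continuous WAL archiving for point-in-time recovery
# (archive path is a placeholder; ensure it exists and is backed up off-host)
# wal_level = replica
# archive_mode = on
# archive_command = 'test ! -f /backup/wal/%f && cp %p /backup/wal/%f'
```

Pair this with periodic base backups so archived WAL segments have a starting point to replay from.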
Is Docker Compose enough for enterprise production?
For many teams, yes, especially for single-region moderate workloads. Once you need multi-zone HA, autoscaling workers, and policy-heavy governance, evaluate Kubernetes.
What is the safest way to rotate credentials?
Create new credentials, deploy updated secrets, verify service health, then revoke old secrets. Avoid in-place overwrite without rollback checkpoints.
How do I monitor failed workflows proactively?
Export logs/metrics to your monitoring stack (Prometheus/Grafana, ELK, or OpenTelemetry pipelines), then alert on failure rate spikes, queue growth, and latency percentiles.
Can I use external managed PostgreSQL instead of a containerized DB?
Absolutely. Managed PostgreSQL can improve durability and reduce operational load. Ensure private networking, TLS connections, and version compatibility before migration.
Related guides
- Production Guide: Deploy Outline with Docker Compose, Caddy, PostgreSQL, and Redis on Ubuntu
- Deploy Grafana with Docker Compose and Traefik on Ubuntu: Production-Ready Observability Guide
- Deploy Nextcloud with Docker Compose, Nginx, and Redis on Ubuntu (Production Guide)
Talk to us
Need help deploying a production-ready Kestra orchestration platform, integrating SSO, or building secure backup and upgrade runbooks for your team? We can help with architecture, hardening, migration, and operational readiness.