Windmill Self-Host Setup: High Availability, Multi-Workspace Governance, Custom Runtimes, and Disaster Recovery
The first three guides in this Windmill series covered the complete operational picture: deployment and basic scripts; Git sync, worker groups, and enterprise workflows; and AI workflows, integrations, and production scaling. This fourth guide addresses the operational resilience requirements that enterprise deployments can't skip: high availability so a server failure doesn't stop your automations, multi-workspace governance for managing Windmill across multiple teams and business units, custom Docker runtimes for workloads with specialized dependencies, and a tested disaster recovery process your team can execute under pressure.
Prerequisites
- A production Windmill deployment with PostgreSQL and worker groups — see our advanced guide
- At least two VPS instances or cloud VMs for the HA setup
- A PostgreSQL instance accessible from all Windmill nodes (managed PostgreSQL recommended)
- A load balancer (Nginx, HAProxy, or cloud load balancer)
- Docker and Docker Compose v2 on all nodes
- At least 4 vCPU and 8GB RAM per node for HA production deployments
Verify your current deployment is stable before attempting HA migration:
# Check current Windmill is healthy:
docker compose ps
# Verify job queue is empty (quiesce before migration):
docker exec windmill_db psql -U windmill windmill \
-c "SELECT COUNT(*) as queued FROM queue;"
# Ideally this returns 0 before starting HA migration
# Check PostgreSQL version and connection:
docker exec windmill_db psql -U windmill -c "SELECT version();"
# List all current workspaces:
wmill workspace list
# Export current workspace for safety:
wmill sync pull --yes
git add -A && git commit -m "Pre-HA migration backup" && git push
echo "Pre-migration backup complete"
High Availability: Running Multiple Windmill Server Nodes
A single Windmill server is a single point of failure for every automation your business depends on. If that server is rebooted, updated, or crashes, scheduled jobs miss their windows, in-flight jobs fail, and teams lose access to their scripts and apps. An HA deployment eliminates this.
HA Architecture Overview
Windmill's HA model is straightforward because all state lives in PostgreSQL:
- Multiple server nodes — serve the UI, API, and webhook endpoints. Any node can handle any request.
- Multiple worker nodes — pull jobs from PostgreSQL queue. Workers are stateless and horizontally scalable.
- Shared PostgreSQL — the single source of truth. HA here means managed PostgreSQL with automatic failover.
- Load balancer — distributes HTTP traffic across server nodes. Removes failed nodes automatically via health checks.
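Since any server node can answer any request, verifying the server tier amounts to probing each node's /api/version endpoint. Here is a minimal stdlib sketch of such a probe (the hostnames are placeholders for your own nodes):

```python
import urllib.request

NODES = [
    "http://node1.internal.yourdomain.com:8000",  # placeholder hostnames
    "http://node2.internal.yourdomain.com:8000",
]

def probe(base_url: str, timeout: float = 5.0) -> dict:
    """Probe one server node's /api/version endpoint; never raises."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as r:
            return {"url": base_url, "healthy": True,
                    "version": r.read().decode().strip()}
    except Exception:
        return {"url": base_url, "healthy": False, "version": None}

def healthy_count(results: list) -> int:
    """Count nodes that answered the version probe."""
    return sum(1 for r in results if r["healthy"])
```

Run `[probe(n) for n in NODES]` from cron on the load balancer host and alert when `healthy_count` drops below the number of nodes you deployed.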
Multi-Node Docker Compose Setup
# docker-compose.node1.yml — Deploy on Server 1
# docker-compose.node2.yml — Identical, deploy on Server 2
# Both connect to shared PostgreSQL on a dedicated DB server
version: '3.8'

services:
  # Windmill Server — handles UI, API, webhooks
  windmill_server:
    image: ghcr.io/windmill-labs/windmill:main
    container_name: windmill_server
    restart: unless-stopped
    ports:
      - "8000:8000"
    environment:
      # Both nodes point to the SAME PostgreSQL:
      DATABASE_URL: postgresql://windmill:${PG_PASSWORD}@pg.internal.yourdomain.com:5432/windmill
      BASE_URL: https://windmill.yourdomain.com
      MODE: server
      # Node identity (helps with distributed lock debugging):
      INSTANCE_NAME: node-1  # Change to node-2 on the second server
      # Job coordination:
      ZOMBIE_JOB_TIMEOUT: 300
      RESTART_ZOMBIE_JOBS: true
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/api/version"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s
    volumes:
      - windmill_server_logs:/tmp/windmill/logs

  # Default worker — runs standard jobs
  windmill_worker:
    image: ghcr.io/windmill-labs/windmill:main
    container_name: windmill_worker
    restart: unless-stopped
    environment:
      DATABASE_URL: postgresql://windmill:${PG_PASSWORD}@pg.internal.yourdomain.com:5432/windmill
      MODE: worker
      WORKER_GROUP: default
      NUM_WORKERS: 8
      SLEEP_QUEUE: 50
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - worker_cache:/tmp/windmill/cache  # Cache Python/TS deps between jobs
    deploy:
      resources:
        limits:
          cpus: '4.0'
          memory: 4G

  # Native worker — for jobs that need system access
  windmill_worker_native:
    image: ghcr.io/windmill-labs/windmill:main
    container_name: windmill_worker_native
    restart: unless-stopped
    environment:
      DATABASE_URL: postgresql://windmill:${PG_PASSWORD}@pg.internal.yourdomain.com:5432/windmill
      MODE: worker
      WORKER_GROUP: native
      NUM_WORKERS: 4
      SLEEP_QUEUE: 100
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - worker_cache:/tmp/windmill/cache

volumes:
  windmill_server_logs:
  worker_cache:
Nginx Load Balancer for Windmill HA
# /etc/nginx/sites-available/windmill-ha
# Deploy on a dedicated load balancer server (or use a cloud LB)
upstream windmill_servers {
    # ip_hash keeps each client on one node; optional, since all state
    # lives in PostgreSQL and any node can serve any request
    ip_hash;
    server node1.internal.yourdomain.com:8000 max_fails=3 fail_timeout=30s;
    server node2.internal.yourdomain.com:8000 max_fails=3 fail_timeout=30s;
    keepalive 32;
}

server {
    listen 443 ssl http2;
    server_name windmill.yourdomain.com;

    ssl_certificate /etc/letsencrypt/live/windmill.yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/windmill.yourdomain.com/privkey.pem;
    ssl_protocols TLSv1.2 TLSv1.3;

    # Large file uploads (scripts, SQL files):
    client_max_body_size 100M;

    # Windmill uses WebSocket for live job output:
    location /ws {
        proxy_pass http://windmill_servers;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "Upgrade";
        proxy_set_header Host $host;
        proxy_read_timeout 3600s;
    }

    location / {
        proxy_pass http://windmill_servers;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_read_timeout 300s;
    }
}

server {
    listen 80;
    server_name windmill.yourdomain.com;
    return 301 https://$host$request_uri;
}

# Health check endpoint (for monitoring):
server {
    listen 8080;
    location /health {
        access_log off;
        proxy_pass http://windmill_servers/api/version;
    }
}
sudo nginx -t && sudo systemctl reload nginx
# Test HA failover:
# docker compose -f docker-compose.node1.yml stop windmill_server
# curl https://windmill.yourdomain.com/api/version
# Should still return 200 — served by node2
Verifying HA Is Working
#!/bin/bash
# test-ha-failover.sh — Run quarterly to verify HA works as expected
set -euo pipefail
WINDMILL_URL="https://windmill.yourdomain.com"
NODE1_URL="http://node1.internal.yourdomain.com:8000"
NODE2_URL="http://node2.internal.yourdomain.com:8000"
echo "=== Windmill HA Failover Test ==="
# Step 1: Verify both nodes are healthy
for node_url in "$NODE1_URL" "$NODE2_URL"; do
  STATUS=$(curl -sf "${node_url}/api/version" | jq -r .version 2>/dev/null || echo "FAILED")
  echo "Node ${node_url}: ${STATUS}"
done
# Step 2: Submit a test job through the load balancer:
JOB_RESULT=$(wmill script run u/admin/health_check --timeout 30 2>/dev/null)
echo "Job via LB: $JOB_RESULT"
# Step 3: Stop Node 1 server:
echo "Stopping Node 1 server..."
ssh node1.internal.yourdomain.com 'docker stop windmill_server'
# Step 4: Verify the load balancer routes to Node 2:
sleep 10
FAILOVER_VERSION=$(curl -sf "${WINDMILL_URL}/api/version" | jq -r .version)
echo "Version via failover (Node 2): $FAILOVER_VERSION"
# Step 5: Submit another job — should succeed via Node 2:
JOB_RESULT2=$(wmill script run u/admin/health_check --timeout 30 2>/dev/null)
echo "Job via failover: $JOB_RESULT2"
# Step 6: Restore Node 1:
echo "Restoring Node 1..."
ssh node1.internal.yourdomain.com 'docker start windmill_server'
sleep 15
# Step 7: Verify both nodes healthy again:
for node_url in "$NODE1_URL" "$NODE2_URL"; do
  STATUS=$(curl -sf "${node_url}/api/version" | jq -r .version 2>/dev/null || echo "FAILED")
  echo "Node ${node_url}: ${STATUS}"
done
echo "=== HA Test Complete ==="
Multi-Workspace Governance
Large organizations running Windmill for multiple teams need proper isolation and governance. The data engineering team's workflows shouldn't be able to accidentally access the finance team's database credentials. The platform team needs visibility across workspaces without individual teams interfering with each other.
Workspace Architecture for Multiple Teams
# Create separate workspaces for each team or business unit:
# Option A: Via Windmill UI:
# Super Admin → Workspaces → Create Workspace
# Option B: Via API (for scripted provisioning):
WINDMILL_URL="https://windmill.yourdomain.com"
SUPERADMIN_TOKEN="your-superadmin-token"
create_workspace() {
  local workspace_id="$1"
  local workspace_name="$2"
  local admin_email="$3"

  curl -X POST "${WINDMILL_URL}/api/workspaces/create" \
    -H "Authorization: Bearer ${SUPERADMIN_TOKEN}" \
    -H 'Content-Type: application/json' \
    -d "{
      \"id\": \"${workspace_id}\",
      \"name\": \"${workspace_name}\",
      \"username\": \"${admin_email}\"
    }" | jq .id
}
# Provision workspaces for each team:
create_workspace "engineering" "Engineering Platform" "[email protected]"
create_workspace "data-team" "Data & Analytics" "[email protected]"
create_workspace "finance-ops" "Finance Operations" "[email protected]"
create_workspace "devops" "DevOps & Infrastructure" "[email protected]"
# Verify workspaces created:
curl -s "${WINDMILL_URL}/api/workspaces/list" \
-H "Authorization: Bearer ${SUPERADMIN_TOKEN}" | \
jq '[.[] | {id: .id, name: .name}]'
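Workspace ids become URL path segments and Git folder names, so a pre-flight validity check keeps scripted provisioning from failing halfway through a batch. The slug rules below are an assumption for illustration (lowercase alphanumerics, dashes, underscores); check them against what your Windmill version actually rejects:

```python
import re

# Assumed slug constraint -- verify against your Windmill version's rules
WORKSPACE_ID_RE = re.compile(r"^[a-z0-9][a-z0-9_-]{1,49}$")

def valid_workspace_id(ws_id: str) -> bool:
    """True if ws_id looks like a safe workspace slug."""
    return bool(WORKSPACE_ID_RE.match(ws_id))
```

Call it on every id before hitting the create endpoint, and fail the whole batch up front rather than leaving some workspaces created and others not.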
Cross-Workspace Policy and Governance Script
#!/usr/bin/env python3
# windmill-governance.py
# Audits governance posture across all Windmill workspaces:
# - Flags workspaces with no admin or only a single admin
# - Counts users and scripts per workspace
# - Surfaces scripts unmodified for 180+ days
# Run monthly via cron (see the crontab entry at the bottom)
import requests
import json
import os
from datetime import datetime
WINDMILL_URL = os.environ["WINDMILL_URL"]
SUPERADMIN_TOKEN = os.environ["WINDMILL_SUPERADMIN_TOKEN"]
HEADERS = {"Authorization": f"Bearer {SUPERADMIN_TOKEN}"}
def get_all_workspaces() -> list:
    resp = requests.get(f"{WINDMILL_URL}/api/workspaces/list", headers=HEADERS)
    resp.raise_for_status()
    return resp.json()

def get_workspace_users(workspace_id: str) -> list:
    resp = requests.get(
        f"{WINDMILL_URL}/api/w/{workspace_id}/users/list",
        headers=HEADERS
    )
    return resp.json() if resp.status_code == 200 else []

def get_workspace_jobs(workspace_id: str, per_page: int = 100) -> dict:
    # Helper for job-volume reporting (not used in the report below);
    # returns the size of one page of recent jobs, not a true total
    resp = requests.get(
        f"{WINDMILL_URL}/api/w/{workspace_id}/jobs/list",
        headers=HEADERS,
        params={"per_page": per_page}
    )
    if resp.status_code != 200:
        return {"count": 0}
    jobs = resp.json()
    return {"count": len(jobs) if isinstance(jobs, list) else 0}

def get_workspace_scripts(workspace_id: str) -> list:
    resp = requests.get(
        f"{WINDMILL_URL}/api/w/{workspace_id}/scripts/list",
        headers=HEADERS,
        params={"per_page": 100}
    )
    return resp.json() if resp.status_code == 200 else []
# Generate governance report:
print(f"Windmill Governance Report — {datetime.now().strftime('%Y-%m-%d')}")
print("=" * 70)
workspaces = get_all_workspaces()
print(f"Total workspaces: {len(workspaces)}")
from datetime import timezone, timedelta

for ws in workspaces:
    ws_id = ws.get("id")
    ws_name = ws.get("name", ws_id)
    users = get_workspace_users(ws_id)
    admins = [u for u in users if u.get("role") == "Admin"]
    scripts = get_workspace_scripts(ws_id)
    print(f"\n📁 {ws_name} ({ws_id})")
    print(f"   Users: {len(users)} | Admins: {len(admins)} | Scripts: {len(scripts)}")
    if not admins:
        print("   ⚠️ WARNING: No admin user configured for this workspace")
    if len(admins) == 1:
        print(f"   ⚠️ Single admin ({admins[0].get('email')}) — consider adding a backup admin")
    # Check for scripts that haven't been updated recently (potential stale automation):
    stale_threshold = datetime.now(timezone.utc) - timedelta(days=180)
    stale_scripts = [
        s for s in scripts
        if s.get("created_at") and
        datetime.fromisoformat(s["created_at"].replace("Z", "+00:00")) < stale_threshold
    ]
    if stale_scripts:
        print(f"   📋 {len(stale_scripts)} scripts not modified in 180+ days (review for staleness)")
# Run monthly for governance reporting:
# 0 9 1 * * python3 /opt/scripts/windmill-governance.py | mail -s "Monthly Windmill Governance" [email protected]
Custom Docker Runtimes for Specialized Workloads
Some workflows need dependencies that can't be installed via pip or npm — proprietary libraries, GPU drivers, CUDA toolkits, legacy system dependencies, or specific OS packages. Windmill's Docker execution mode lets individual scripts run in custom container images with exactly the environment they need.
Building Custom Windmill Worker Images
# Build a custom image for data science workloads
# (numpy, pandas, scikit-learn, matplotlib pre-installed)
# This avoids re-installing heavy packages on every job run
# Dockerfile.datascience:
cat > Dockerfile.datascience << 'EOF'
FROM ghcr.io/windmill-labs/windmill:main
# Install Python data science packages globally
# So scripts don't need to re-install them each run
USER root
# Install system dependencies:
RUN apt-get update && apt-get install -y \
build-essential \
libpq-dev \
libffi-dev \
&& rm -rf /var/lib/apt/lists/*
# Pre-install heavy Python packages:
RUN pip3 install --no-cache-dir \
numpy==1.26.4 \
pandas==2.2.0 \
scikit-learn==1.4.0 \
matplotlib==3.8.0 \
plotly==5.19.0 \
pyarrow==15.0.0 \
psycopg2-binary==2.9.9
# Pre-install from internal PyPI mirror (for proprietary packages):
RUN pip3 install --no-cache-dir \
--index-url https://pypi.internal.yourdomain.com/simple/ \
internal-data-utils==2.1.0
USER windmill
EOF
# Build and push to your internal registry:
docker build -t git.yourdomain.com/platform/windmill-datascience:1.0 \
-f Dockerfile.datascience .
docker push git.yourdomain.com/platform/windmill-datascience:1.0
# Deploy as a custom worker group:
# In docker-compose.yml, add:
  worker_datascience:
    image: git.yourdomain.com/platform/windmill-datascience:1.0
    container_name: windmill_worker_datascience
    restart: unless-stopped
    environment:
      DATABASE_URL: ${DATABASE_URL}
      MODE: worker
      WORKER_GROUP: datascience
      NUM_WORKERS: 3
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
# Scripts assigned to the 'datascience' worker group
# automatically run in this custom environment
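As a sketch of what runs there, here is a hypothetical script (the path and arguments are invented for illustration) that you would assign to the datascience worker group via its tag in the Windmill UI. It uses only the stdlib so the example stays self-contained; in practice this is where you'd reach for the preinstalled pandas or scikit-learn:

```python
# u/data/summarize_csv.py -- hypothetical example script for the
# 'datascience' worker group (assigned via the script's tag in the UI)
import csv
import io

def main(csv_text: str) -> dict:
    """Summarize an in-memory CSV. On the datascience worker group the
    heavy lifting would use the image's preinstalled pandas instead."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return {
        "rows": len(rows),
        "columns": list(rows[0].keys()) if rows else [],
    }
```

Because the image already contains the heavy packages, a job like this skips the per-run dependency install entirely, which is the whole point of the custom runtime.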
Docker-in-Docker Scripts for Container Operations
# Windmill scripts can use Docker directly via the Docker socket
# mount (already configured in the worker Compose service)
# u/devops/rebuild_and_deploy_service.py
# A script that builds a Docker image from Git and deploys it
# requirements:
# docker
# gitpython
import docker
import git
import json
import tempfile

def main(
    repo_url: str,
    branch: str = "main",
    image_name: str = "",
    registry: str = "git.yourdomain.com",
    deploy_command: str = ""
) -> dict:
    """
    Clones a repo, builds a Docker image, pushes it to the registry,
    and optionally runs a deploy command.
    """
    if not image_name:
        # Derive image name from repo URL:
        image_name = repo_url.split("/")[-1].replace(".git", "")
    client = docker.from_env()
    logs = []
    with tempfile.TemporaryDirectory() as tmpdir:
        # Clone the repository:
        logs.append(f"Cloning {repo_url}@{branch}...")
        repo = git.Repo.clone_from(repo_url, tmpdir, branch=branch, depth=1)
        commit_sha = repo.head.commit.hexsha[:8]
        # Build the image:
        tag = f"{registry}/{image_name}:{commit_sha}"
        latest_tag = f"{registry}/{image_name}:latest"
        logs.append(f"Building image: {tag}")
        build_output = client.api.build(path=tmpdir, tag=tag, rm=True, pull=True)
        for line in build_output:
            # Each chunk is a JSON object; progress text lives under 'stream'
            try:
                msg = json.loads(line.decode("utf-8")).get("stream", "").strip()
            except (ValueError, UnicodeDecodeError):
                continue
            if msg:
                logs.append(msg)
        # Tag as latest:
        client.images.get(tag).tag(latest_tag)
        # Push to registry:
        logs.append(f"Pushing {tag}...")
        client.images.push(f"{registry}/{image_name}", commit_sha)
        client.images.push(f"{registry}/{image_name}", "latest")
        # Optional deploy command:
        if deploy_command:
            logs.append(f"Running deploy: {deploy_command}")
            result = client.containers.run(
                "alpine",
                deploy_command,
                remove=True,
                network_mode="host"
            )
            logs.append(result.decode("utf-8").strip())
    return {
        "status": "deployed",
        "image": tag,
        "commit": commit_sha,
        "log_lines": len(logs),
        "last_log_line": logs[-1] if logs else ""
    }
Disaster Recovery: Tested Restore Procedures
A Windmill disaster recovery plan that's never been tested is a hypothesis, not a plan. This section covers what to back up, how to back it up automatically, and the exact steps to restore from total server loss — with timing so you know your actual RTO.
Complete Backup Strategy
#!/bin/bash
# /opt/scripts/backup-windmill.sh
# Complete Windmill backup: PostgreSQL + Git sync export
# RPO: 24 hours (the cron entry at the bottom runs this daily at 2am)
set -euo pipefail
DATE=$(date +%Y-%m-%d-%H%M)
BACKUP_DIR="/opt/backups/windmill"
S3_BUCKET="s3://your-backup-bucket/windmill"
WINDMILL_URL="https://windmill.yourdomain.com"
PG_CONTAINER="windmill_db"
PG_USER="windmill"
PG_DB="windmill"
mkdir -p "$BACKUP_DIR"
echo "[$(date -u)] Starting Windmill backup: ${DATE}"
# Step 1: PostgreSQL backup (all Windmill state — users, scripts, schedules, secrets)
echo "[1/4] Backing up PostgreSQL..."
docker exec "$PG_CONTAINER" pg_dump \
-U "$PG_USER" \
--format=custom \
--compress=9 \
"$PG_DB" > "${BACKUP_DIR}/windmill-db-${DATE}.pgdump"
echo "PostgreSQL backup size: $(du -sh ${BACKUP_DIR}/windmill-db-${DATE}.pgdump | cut -f1)"
# Step 2: Git sync export (scripts, flows, apps in plaintext)
# This is your most human-readable backup — can be browsed without restoring
echo "[2/4] Exporting workspace via Git sync..."
# The wmill CLI must already be configured and authenticated for this
# instance; the sync pull below uses its stored credentials
wmill workspace switch production
wmill sync pull --yes --output-dir "${BACKUP_DIR}/workspace-${DATE}"
tar czf "${BACKUP_DIR}/workspace-${DATE}.tar.gz" \
-C "${BACKUP_DIR}" "workspace-${DATE}"
rm -rf "${BACKUP_DIR}/workspace-${DATE}"
# Step 3: Worker cache (optional — saves time on restore, not critical)
# echo "[3/4] Backing up worker dependency cache..."
# docker run --rm -v windmill_worker_cache:/cache \
# -v ${BACKUP_DIR}:/backup alpine \
# tar czf /backup/worker-cache-${DATE}.tar.gz /cache
# Step 4: Upload to S3
echo "[3/4] Uploading to S3..."
aws s3 cp "${BACKUP_DIR}/windmill-db-${DATE}.pgdump" \
"${S3_BUCKET}/db/windmill-db-${DATE}.pgdump" \
--storage-class STANDARD_IA
aws s3 cp "${BACKUP_DIR}/workspace-${DATE}.tar.gz" \
"${S3_BUCKET}/workspace/workspace-${DATE}.tar.gz"
# Step 5: Cleanup local backups older than 7 days
echo "[4/4] Cleaning up old local backups..."
find "$BACKUP_DIR" -name 'windmill-db-*.pgdump' -mtime +7 -delete
find "$BACKUP_DIR" -name 'workspace-*.tar.gz' -mtime +7 -delete
echo "[$(date -u)] Backup complete"
# Add to crontab:
# 0 2 * * * /opt/scripts/backup-windmill.sh >> /var/log/windmill-backup.log 2>&1
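A backup job that silently stops running is the classic DR failure mode. A small freshness check can catch it; the directory and suffix below match the backup script above, and the 26-hour threshold (24h RPO plus slack) is an example to tune:

```python
import os
import time

def newest_backup_age_hours(backup_dir: str, suffix: str = ".pgdump") -> float:
    """Hours since the most recent backup file, or inf if none exist."""
    mtimes = [
        os.path.getmtime(os.path.join(backup_dir, f))
        for f in os.listdir(backup_dir)
        if f.endswith(suffix)
    ]
    return (time.time() - max(mtimes)) / 3600.0 if mtimes else float("inf")
```

Call `newest_backup_age_hours("/opt/backups/windmill")` from a cron-run wrapper and alert when it exceeds 26; checking S3 object timestamps the same way guards against the upload step failing while local backups keep landing.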
Full Restore Procedure
#!/bin/bash
# restore-windmill.sh
# Full Windmill restore from backup
# Usage: ./restore-windmill.sh windmill-db-2026-04-09-0200.pgdump
# Estimated RTO: 15-30 minutes
set -euo pipefail
BACKUP_FILE="${1:-}"
if [ -z "$BACKUP_FILE" ]; then
  echo "Usage: $0 <backup-file.pgdump>"
  echo ""
  echo "Available backups:"
  aws s3 ls s3://your-backup-bucket/windmill/db/ | tail -10
  exit 1
fi
START_TIME=$(date +%s)
BACKUP_DIR="/tmp/windmill-restore"
mkdir -p "$BACKUP_DIR"
echo "[$(date -u)] Starting Windmill restore from: $BACKUP_FILE"
# Step 1: Download backup if not local
if [ ! -f "$BACKUP_FILE" ]; then
  echo "[1/6] Downloading backup from S3..."
  aws s3 cp "s3://your-backup-bucket/windmill/db/${BACKUP_FILE}" \
    "${BACKUP_DIR}/${BACKUP_FILE}"
  BACKUP_FILE="${BACKUP_DIR}/${BACKUP_FILE}"
fi
# Step 2: Stop all Windmill services:
echo "[2/6] Stopping Windmill services..."
docker compose -f docker-compose.node1.yml stop windmill_server windmill_worker windmill_worker_native 2>/dev/null || true
docker compose -f docker-compose.node2.yml stop windmill_server windmill_worker windmill_worker_native 2>/dev/null || true
echo "All Windmill services stopped"
# Step 3: Drop and recreate database:
echo "[3/6] Dropping and recreating database..."
docker exec windmill_db psql -U postgres \
-c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = 'windmill' AND pid <> pg_backend_pid();"
docker exec windmill_db psql -U postgres \
-c "DROP DATABASE IF EXISTS windmill; CREATE DATABASE windmill OWNER windmill;"
# Step 4: Restore from backup:
echo "[4/6] Restoring PostgreSQL database..."
cat "$BACKUP_FILE" | docker exec -i windmill_db pg_restore \
-U windmill \
-d windmill \
--no-owner \
--no-privileges \
--exit-on-error
echo "Database restored successfully"
# Step 5: Start services:
echo "[5/6] Starting Windmill services..."
docker compose -f docker-compose.node1.yml up -d windmill_server windmill_worker windmill_worker_native
# Wait for server to be healthy:
for i in {1..24}; do
  if curl -sf http://localhost:8000/api/version > /dev/null 2>&1; then
    echo "Windmill server is healthy"
    break
  fi
  echo "Waiting for Windmill server... (${i}/24)"
  sleep 5
done
# Step 6: Verify restore:
echo "[6/6] Verifying restore..."
JOB_COUNT=$(docker exec windmill_db psql -U windmill windmill \
-t -c "SELECT COUNT(*) FROM v_completed_job WHERE created_at > NOW() - INTERVAL '24 hours';" | tr -d ' \n')
SCRIPT_COUNT=$(docker exec windmill_db psql -U windmill windmill \
-t -c "SELECT COUNT(*) FROM script;" | tr -d ' \n')
USER_COUNT=$(docker exec windmill_db psql -U windmill windmill \
-t -c "SELECT COUNT(*) FROM password;" | tr -d ' \n')
END_TIME=$(date +%s)
ELAPSED=$(( END_TIME - START_TIME ))
echo ""
echo "=== Restore Complete ==="
echo "Duration: ${ELAPSED} seconds"
echo "Scripts restored: ${SCRIPT_COUNT}"
echo "Users restored: ${USER_COUNT}"
echo "Recent jobs (24h): ${JOB_COUNT}"
echo "Windmill URL: https://windmill.yourdomain.com"
echo "Verify login and run a test script before declaring restore complete."
Tips, Gotchas, and Troubleshooting
Split-Brain After Network Partition (Both Nodes Think They're Primary)
# Windmill's server nodes are stateless — split-brain isn't really possible
# because ALL state lives in PostgreSQL, not the nodes themselves
# If nodes can't reach PostgreSQL, they simply fail health checks and stop serving
# However, if you see inconsistent job results or duplicate job execution:
# Check for zombie jobs:
docker exec -i windmill_db psql -U windmill windmill << 'EOF'
SELECT id, workspace_id, script_path, running, started_at
FROM queue
WHERE running = true
AND started_at < NOW() - INTERVAL '10 minutes';
EOF
# Windmill marks long-running jobs as zombie and re-queues them
# Check the ZOMBIE_JOB_TIMEOUT env var — default is 5 minutes
# If zombie job cleanup is too aggressive (jobs get re-queued when still running):
# Increase timeout for long-running jobs:
# ZOMBIE_JOB_TIMEOUT=1800 # 30 minutes
# Force cleanup of stuck zombie jobs:
docker exec windmill_db psql -U windmill windmill \
-c "UPDATE queue SET running = false WHERE running = true AND started_at < NOW() - INTERVAL '30 minutes';"
Custom Worker Image Not Pulling from Private Registry
# If Docker worker can't pull from your internal registry:
# 1. Log in to the registry on the host:
docker login git.yourdomain.com
# 2. Verify credentials are stored:
cat ~/.docker/config.json | jq '.auths | keys'
# 3. Mount Docker config into the worker container:
# In docker-compose.yml for the custom worker:
  worker_datascience:
    image: git.yourdomain.com/platform/windmill-datascience:1.0
    volumes:
      - /root/.docker:/root/.docker:ro  # Mount host Docker credentials
      - /var/run/docker.sock:/var/run/docker.sock
# 4. For Gitea registry, ensure the token has 'read:packages' permission:
# Gitea → Settings → Applications → Generate Token
# Check: read:package
# 5. Test pull from inside a running worker:
docker exec windmill_worker docker pull git.yourdomain.com/platform/windmill-datascience:1.0
# Should succeed — if not, credentials aren't mounted correctly
PostgreSQL Connection Pool Exhaustion Under Load
# In HA with 2 server nodes + multiple worker groups, connection count can spike
# Monitor active PostgreSQL connections:
docker exec windmill_db psql -U windmill windmill \
-c "SELECT count(*), state, wait_event_type, wait_event
FROM pg_stat_activity
GROUP BY state, wait_event_type, wait_event
ORDER BY count DESC;"
# If connections are near max_connections (PostgreSQL's default is 100):
# Add PgBouncer as a connection pooler between Windmill and PostgreSQL:
# Add to docker-compose.yml:
  pgbouncer:
    image: edoburu/pgbouncer:latest
    container_name: pgbouncer
    restart: unless-stopped
    environment:
      DATABASE_URL: postgresql://windmill:${PG_PASSWORD}@postgres:5432/windmill
      POOL_MODE: transaction
      MAX_CLIENT_CONN: 500
      DEFAULT_POOL_SIZE: 25
      SERVER_RESET_QUERY: DISCARD ALL
    ports:
      - "6432:5432"
# Update Windmill to connect via PgBouncer:
# DATABASE_URL: postgresql://windmill:${PG_PASSWORD}@pgbouncer:5432/windmill
# Note: Windmill's server uses prepared statements, which are incompatible
# with transaction pooling. Use POOL_MODE=session for the Windmill server's
# connection and reserve transaction mode for worker connections
Pro Tips
- Run the DR restore test on a staging environment quarterly — not a "let's see if the backup files exist" check, but a full restore to a blank server with timing. Document the actual RTO. If it takes 45 minutes and your SLA says 30, fix it before it matters.
- Use PgBouncer even if you don't need it now — adding a connection pooler when you're already experiencing connection exhaustion means doing it under pressure. Deploying it proactively when you set up HA is 30 minutes of work that prevents a 3am incident.
- Tag your custom Docker images with both a content hash and a date — windmill-datascience:2026-04-09-abc123 tells you when it was built and what commit it contains; :latest tells you nothing useful during an incident.
- Keep the superadmin token in Vaultwarden, not in your Windmill workspace — the superadmin token grants access to all workspaces. It should live in your password manager (Vaultwarden), not in a Windmill variable where workspace admins could access it.
- Test the governance script's stale script detection against known examples — create a test script with an old modification date and verify the governance report flags it. False negatives in audit scripts are worse than no audit script, because they create false confidence.
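For the stale-script check specifically, the predicate from windmill-governance.py can be lifted into a pure function and exercised with synthetic timestamps in a couple of minutes:

```python
from datetime import datetime, timezone, timedelta

def is_stale(created_at_iso: str, days: int = 180) -> bool:
    """Mirrors the governance script's staleness predicate."""
    created = datetime.fromisoformat(created_at_iso.replace("Z", "+00:00"))
    return created < datetime.now(timezone.utc) - timedelta(days=days)

# Synthetic fixtures: one clearly stale, one fresh
old = (datetime.now(timezone.utc) - timedelta(days=365)).isoformat()
new = (datetime.now(timezone.utc) - timedelta(days=10)).isoformat()
assert is_stale(old) and not is_stale(new)
```

If these assertions ever fail after a refactor, the governance report is lying to you, which is exactly the false confidence the pro tip warns about.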
Wrapping Up
The four Windmill guides together cover the complete enterprise deployment lifecycle: deployment and basics, Git sync and production patterns, AI workflows and scaling, and this guide's HA clustering, multi-workspace governance, custom runtimes, and disaster recovery.
The operational maturity work in this guide — tested failover, documented restore procedures, governance automation, connection pooling — is what separates a Windmill deployment that your organization actually depends on from one that's convenient until something goes wrong. Do the HA and DR work before you need it, not after.
Need Enterprise Windmill Infrastructure Designed and Deployed?
Designing HA Windmill with proper failover, multi-workspace governance for multiple business units, custom runtime infrastructure, and tested disaster recovery — the sysbrix team builds Windmill deployments that engineering organizations can stake business-critical automations on, not just development convenience tools.
Talk to Us →