Network teams usually outgrow spreadsheets and wiki pages long before they have time to stand up a reliable source of truth. A common failure mode is simple: infrastructure grows, hand-maintained data drifts, and automation jobs begin making decisions based on stale assumptions. This guide shows a production-oriented path to deploy NetBox on Kubernetes with Helm while keeping PostgreSQL external for cleaner scaling, clearer backup ownership, and safer upgrades. You will set up secrets, persistent storage, ingress, health checks, and validation workflows that keep NetBox dependable in day-to-day operations.
Architecture and flow overview
This deployment model separates responsibilities so each component can be operated with the right lifecycle. Kubernetes runs the NetBox web and worker workloads, while PostgreSQL runs as a managed external service (or a separately operated HA cluster) with its own backup and maintenance policy. Redis remains in-cluster for caching and queue needs. Ingress terminates TLS and forwards traffic to the NetBox service.
The practical flow is: define namespace and baseline policies, provision secrets, deploy Redis and NetBox via Helm values, run database migrations, validate UI and API behavior, then baseline monitoring and backup checks. That sequence reduces first-day risk because each stage has a clear rollback point.
Prerequisites
- Kubernetes cluster (1.25+) with kubectl access and cluster-admin or delegated namespace admin rights.
- Helm 3 installed locally or in your CI runner.
- An external PostgreSQL instance (recommended 14+) reachable from the cluster.
- A DNS record for your NetBox endpoint (for example, netbox.example.com).
- TLS strategy: cert-manager + ACME or a pre-provisioned TLS secret.
- Storage class for persistent volumes where needed.
Step-by-step deployment
1) Create namespace and baseline objects
kubectl create namespace netbox
kubectl -n netbox create configmap deployment-context --from-literal=owner=platform-team --from-literal=service=netbox
Keeping a tiny context ConfigMap sounds trivial, but it helps incident responders quickly identify ownership when they are triaging alerts at 2 AM. Small conventions like this reduce mean-time-to-recovery in real environments.
2) Create secrets for NetBox and database connectivity
export NETBOX_SECRET_KEY="$(openssl rand -base64 48 | tr -d '\n')"
export POSTGRES_HOST="postgres-prod.internal"
export POSTGRES_DB="netbox"
export POSTGRES_USER="netbox_app"
export POSTGRES_PASSWORD="REPLACE_WITH_REAL_SECRET"
kubectl -n netbox create secret generic netbox-secrets --from-literal=secret_key="$NETBOX_SECRET_KEY" --from-literal=db_host="$POSTGRES_HOST" --from-literal=db_name="$POSTGRES_DB" --from-literal=db_user="$POSTGRES_USER" --from-literal=db_password="$POSTGRES_PASSWORD"
Do not hard-code these values in Git. In production, replace this direct secret creation with your preferred secret manager integration (External Secrets Operator, Vault, or Sealed Secrets) so rotations are auditable and repeatable.
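As one concrete option, the manual secret creation above can be replaced with an ExternalSecret object synced by the External Secrets Operator. This is an illustrative sketch only: it assumes the operator is installed and that a ClusterSecretStore named vault-backend exists; the remote key paths are placeholders for your own backend layout.

```yaml
# Illustrative ExternalSecret: syncs netbox-secrets from an external backend.
# "vault-backend" and the netbox/prod paths are placeholders.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: netbox-secrets
  namespace: netbox
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault-backend
  target:
    name: netbox-secrets
  data:
    - secretKey: secret_key
      remoteRef:
        key: netbox/prod
        property: secret_key
    - secretKey: db_password
      remoteRef:
        key: netbox/prod
        property: db_password
```

With this in place, rotation happens in the backend and the operator re-syncs the Kubernetes secret on the refresh interval, which keeps rotations auditable.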
3) Add chart repository and prepare values
helm repo add netbox-community https://netbox-community.github.io/netbox-chart/
helm repo update
Save the following as values.yaml:

release:
  name: netbox

ingress:
  enabled: true
  className: nginx
  hosts:
    - host: netbox.example.com
      paths:
        - /
  tls:
    - secretName: netbox-tls
      hosts:
        - netbox.example.com

# Keep PostgreSQL external; the chart's bundled database stays disabled.
postgresql:
  enabled: false

redis:
  enabled: true

externalDatabase:
  host: postgres-prod.internal
  port: 5432
  database: netbox
  username: netbox_app
  existingSecretName: netbox-secrets
  existingSecretPasswordKey: db_password

superuser:
  enabled: true
  existingSecret: netbox-admin

extraConfig:
  - values:
      # Placeholder only: wire SECRET_KEY from the netbox-secrets secret
      # using your chart version's secret options, not a literal value.
      SECRET_KEY: "__from_secret__"
      ALLOWED_HOSTS:
        - netbox.example.com
      METRICS_ENABLED: true

resources:
  requests:
    cpu: 500m
    memory: 1Gi
  limits:
    cpu: "2"
    memory: 2Gi

persistence:
  enabled: true
  accessMode: ReadWriteOnce
  size: 20Gi
The key production choice above is postgresql.enabled: false. You avoid coupling database lifecycle to application release cadence, which makes upgrades safer and backup ownership clearer.
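Before the first deploy, it is worth confirming the cluster can actually reach the external database. A minimal pre-flight sketch, assuming the host, user, and postgres:16 image tag shown are placeholders for your environment:

```shell
# Turn pg_isready output into an unambiguous verdict for CI or runbook logs.
classify_pg_status() {
  case "$1" in
    *"accepting connections"*) echo "PASS" ;;
    *"rejecting connections"*) echo "FAIL: server reachable but rejecting connections" ;;
    *)                         echo "FAIL: no response from host" ;;
  esac
}

# Probe from a throwaway pod inside the netbox namespace, so the check
# exercises the same network path the application will use.
out="$(kubectl -n netbox run pg-preflight --rm -i --restart=Never \
  --image=postgres:16 -- \
  pg_isready -h postgres-prod.internal -p 5432 -U netbox_app 2>&1)"
classify_pg_status "$out"
```

Running the probe in-cluster (rather than from your workstation) catches NetworkPolicy and DNS issues that a laptop-side check would miss.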
4) Create admin secret and deploy chart
kubectl -n netbox create secret generic netbox-admin --from-literal=username=admin --from-literal=email=admin@example.com --from-literal=password='REPLACE_STRONG_PASSWORD' --from-literal=api_token='REPLACE_LONG_RANDOM_TOKEN'
helm upgrade --install netbox netbox-community/netbox -n netbox -f values.yaml --wait --timeout 10m
The --wait flag gives you immediate deployment feedback instead of silent partial rollouts. If this step fails, fix health checks now before trying any content or object imports.
5) Check migrations and startup health
kubectl -n netbox get pods
kubectl -n netbox logs deploy/netbox --tail=120
# If needed for manual migration checks:
kubectl -n netbox exec deploy/netbox -- /opt/netbox/netbox/manage.py migrate --check
Most chart versions handle migrations automatically, but explicit checks help during version jumps or custom plugin upgrades. Validate this before exposing NetBox broadly to internal users.
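Because manage.py migrate --check exits non-zero when unapplied migrations exist, it is easy to gate an upgrade pipeline on migration state. A small sketch (the helper function is our own naming, not part of NetBox):

```shell
# Map the migrate --check exit code to a human-readable pipeline verdict.
report_migrations() {
  if [ "$1" -eq 0 ]; then
    echo "migrations: up to date"
  else
    echo "migrations: pending (or check failed)"
  fi
}

kubectl -n netbox exec deploy/netbox -- /opt/netbox/netbox/manage.py migrate --check
report_migrations "$?"
```

In CI you would fail the job on the pending case instead of just printing, so an upgrade never proceeds against a half-migrated schema.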
6) Configure ingress and DNS validation
kubectl -n netbox get ingress
kubectl -n netbox describe ingress netbox
# Validate DNS from a controlled host
dig +short netbox.example.com
curl -I https://netbox.example.com
If DNS is correct but HTTPS fails, check certificate provisioning events before changing app configs. Many first-run incidents are ingress/TLS issues, not NetBox issues.
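A quick way to separate the two failure classes is to distinguish TLS/connection errors from HTTP-level errors explicitly, so you debug the right layer first. A sketch, with the hostname illustrative:

```shell
# Extract the status code from the first line of a curl -I response.
http_status() { awk 'NR==1 {print $2}'; }

if out="$(curl -sI --max-time 10 https://netbox.example.com 2>&1)"; then
  # TLS handshake succeeded; any problem now is at the HTTP layer.
  echo "HTTP status: $(printf '%s\n' "$out" | http_status)"
else
  echo "TLS or connection failure:"
  printf '%s\n' "$out"
  # If cert-manager is in use, its resources usually explain stuck issuance.
  kubectl -n netbox get certificate,certificaterequest 2>/dev/null
fi
```

A certificate stuck in a pending CertificateRequest (for example, an unreachable ACME HTTP-01 challenge) is a far more common first-run cause than anything in the NetBox configuration.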
Configuration and secrets handling best practices
For production, enforce a secret lifecycle policy: secrets should be versioned in your secret backend, rotated on schedule, and rotated immediately after incident response or team changes. Avoid placing secret values in plain values files or CI logs.
Keep environment-specific values in separate files (values-dev.yaml, values-stage.yaml, values-prod.yaml) and require pull-request review for all changes to ingress, auth, plugin lists, and resource limits. This provides an operational audit trail and lowers configuration drift risk.
For plugin-heavy environments, pin chart and application versions intentionally. Test upgrades in a non-production namespace with a sanitized database snapshot. Upgrade rehearsals catch migration edge cases, plugin API incompatibilities, and worker queue regressions before they impact production.
Finally, define data retention and backup restore objectives explicitly. A backup is only useful when restore procedures are practiced. Run quarterly restore tests and document who owns each step, including DNS cutover and post-restore integrity checks.
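A restore drill can be scripted so it is repeatable rather than heroic. The sketch below assumes placeholder hosts and database names; adapt them to your staging environment:

```shell
# Quarterly restore rehearsal: dump production, restore into a scratch
# database, then run a basic integrity query. All names are placeholders.
DUMP="netbox-$(date +%F).dump"

# Custom-format dump so pg_restore can restore it selectively if needed.
pg_dump -h postgres-prod.internal -U netbox_app -Fc netbox > "$DUMP"

createdb -h postgres-staging.internal -U netbox_app netbox_restore_test
pg_restore -h postgres-staging.internal -U netbox_app \
  -d netbox_restore_test "$DUMP"

# Minimal integrity check: the restored schema should contain tables.
psql -h postgres-staging.internal -U netbox_app -d netbox_restore_test -tAc \
  "SELECT count(*) FROM information_schema.tables WHERE table_schema = 'public';"
```

Record the wall-clock time of each rehearsal; that number is your real recovery time objective, regardless of what the policy document says.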
Verification checklist
- Login page and dashboard load over HTTPS without mixed-content warnings.
- You can create, edit, and delete a test object (for example, a Device Role).
- Background jobs complete and queue backlog remains stable.
- API token authentication works from a controlled automation host.
- PostgreSQL connection counts and query latency remain within expected range.
- Backup job status is green and the most recent snapshot is restorable.
# Simple API smoke test
export NB_TOKEN='REPLACE_API_TOKEN'
curl -sS https://netbox.example.com/api/dcim/sites/ -H "Authorization: Token ${NB_TOKEN}" -H "Accept: application/json" | jq '.count'
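For unattended use, it helps to validate the response instead of printing a bare value, so a missing token or a malformed payload fails loudly. A sketch (extract_count is our own helper name):

```shell
# Pull .count out of the API response, substituting a sentinel when absent.
extract_count() { jq -r '.count // "missing"'; }

resp="$(curl -sS https://netbox.example.com/api/dcim/sites/ \
  -H "Authorization: Token ${NB_TOKEN}" -H "Accept: application/json")"
count="$(printf '%s' "$resp" | extract_count)"

# Anything that is not a plain non-negative integer is a failure.
case "$count" in
  ''|missing|*[!0-9]*) echo "API smoke test FAILED (count=$count)" ;;
  *)                   echo "API smoke test OK: $count sites" ;;
esac
```

Wiring this into a scheduled job gives you an early signal when token rotation or an upgrade silently breaks automation clients.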
Common issues and fixes
Pods crash-loop after upgrade
Usually this is a migration mismatch, invalid plugin config, or missing secret key. Check startup logs first, then compare running chart values to the expected release bundle. Roll back quickly if needed and test upgrade again in staging.
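Comparing deployed values against the file you expected to deploy is a quick way to spot drift after a failed or partial upgrade. A sketch; note that helm get values may order keys differently than your file, so treat the diff as a review aid rather than an exact equality test:

```shell
# Dump the user-supplied values Helm actually deployed for this release.
helm -n netbox get values netbox > /tmp/deployed-values.yaml

# Compare against the values file under version control.
if diff -u values.yaml /tmp/deployed-values.yaml > /tmp/values.diff; then
  echo "deployed values match values.yaml"
else
  echo "values drift detected:"
  cat /tmp/values.diff
fi
```

If drift is confirmed, helm rollback to the last known-good revision is usually faster and safer than editing values on a broken release.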
Ingress returns 502/504
Confirm service and endpoints are healthy, then verify ingress backend target and timeout settings. If upstream connections are timing out, inspect resource pressure and worker readiness probes.
Intermittent DB connection errors
Validate network policy egress rules, DNS resolver stability, and PostgreSQL max connection settings. Consider a connection pooler if worker bursts are causing spikes.
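If you reach for a pooler, PgBouncer is the usual choice. The fragment below is illustrative only: hosts, ports, and pool sizes are placeholders, and session pooling is shown because it is the conservative default for Django-based applications like NetBox; validate transaction pooling carefully before adopting it.

```ini
; Illustrative pgbouncer.ini fragment -- all values are placeholders.
[databases]
netbox = host=postgres-prod.internal port=5432 dbname=netbox

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = scram-sha-256
pool_mode = session
max_client_conn = 200
default_pool_size = 20
```

Point the chart's externalDatabase.host and port at the pooler instead of PostgreSQL directly, and size default_pool_size against your database's max_connections budget.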
Slow UI under normal load
Profile query-heavy pages, tune PostgreSQL indexes where needed, and verify persistent volume IOPS. UI slowness is often database latency disguised as app latency.
Secrets drift between environments
Move to a centralized secret manager and enforce policy checks in CI so deployments fail when required keys are missing or malformed.
FAQ
Should I run PostgreSQL inside the same Helm release for speed?
For production, no. Keep PostgreSQL external so database upgrades, backups, and failover are handled independently from app rollouts.
Can I start with one NetBox replica and scale later?
Yes. Start with one replica while validating plugins and workload patterns, then scale web and worker deployments after baseline monitoring is in place.
What is the safest way to rotate NetBox secrets?
Rotate through your secret backend, deploy to staging first, and roll production during a low-risk window. Validate logins, API auth, and background tasks after rotation.
How do I make upgrades predictable?
Pin chart/app versions, test on a staging copy of production data, and use a written runbook with explicit rollback criteria.
Do I need dedicated monitoring for NetBox?
Yes. Track pod readiness, response latency, queue depth, database health, ingress errors, and backup success to catch failure early.
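With METRICS_ENABLED: true in the values file, NetBox exposes a Prometheus /metrics endpoint. If you run the Prometheus Operator, a ServiceMonitor along these lines can scrape it; the label selector is an assumption and must match your chart's actual service labels (verify with kubectl -n netbox get svc --show-labels):

```yaml
# Illustrative ServiceMonitor -- adjust the selector and port name to
# match the labels and port your chart version actually creates.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: netbox
  namespace: netbox
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: netbox
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
```

Pair the scrape with alerts on readiness, 5xx rate at the ingress, and queue depth so you catch degradation before users report it.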
What is the minimum backup policy I should enforce?
Daily snapshots plus tested restore drills. A backup policy without restore drills is incomplete and risky.
Can this setup support automation-heavy environments?
Yes, as long as API rate, worker capacity, and database throughput are tuned for your automation volume and object growth.
Related internal guides
If you are building an end-to-end internal platform, these guides can help you standardize adjacent services and deployment patterns:
- Production Guide: Deploy Metabase with Docker Compose + Nginx + PostgreSQL on Ubuntu
- Production Guide: Deploy Uptime Kuma with Rootless Podman + systemd + Caddy on Ubuntu
- Production Guide: Deploy AFFiNE with Docker Compose + Caddy + PostgreSQL on Ubuntu
Talk to us
If you want support designing or hardening your NetBox platform, we can help with architecture, migration planning, and production readiness.