Troubleshooting

Make sure the TLS secret custom-ingress-cert exists in the srdp namespace (kubectl get secret custom-ingress-cert -n srdp).
For production, Traefik uses Let's Encrypt via the ACME TLS challenge; verify the certResolver and ACME email are set in values-prod.yaml.
For local development, regenerate the certificate with mkcert if the hostnames or IP changed.
Confirm your hosts file maps auth.<domain>, marimo.<domain>, quarto.<domain>, and dagster.<domain> to the Traefik IP.
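The hosts-file check can be scripted. The following is a minimal sketch, not part of the project tooling: the function name missing_hosts is hypothetical, and it assumes the four subdomains listed above and a standard hosts-file layout (IP first, then names).

```bash
# Print each SRDP hostname that the hosts file does not map to the given IP.
# Usage: missing_hosts <domain> <traefik-ip> [hosts-file]
# The ( ... ) body runs in a subshell so the variables do not leak into your shell.
missing_hosts() (
  domain=$1
  ip=$2
  hosts_file=${3:-/etc/hosts}
  for prefix in auth marimo quarto dagster; do
    name="$prefix.$domain"
    # Look for a line whose first field is the IP and that lists the
    # hostname as a whole field before any trailing "#" comment.
    if ! awk -v ip="$ip" -v name="$name" '
        $1 == ip {
          for (i = 2; i <= NF; i++) {
            if ($i ~ /^#/) break
            if ($i == name) found = 1
          }
        }
        END { exit !found }' "$hosts_file"; then
      echo "$name"
    fi
  done
)
```

For example, missing_hosts <domain> <traefik-ip> prints nothing when all four names resolve to Traefik, and otherwise prints the names you still need to add.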

Check storage and DB connectivity: kubectl describe pod <name> -n srdp.
PostgreSQL runs in-cluster; verify the service is up and its PVC is bound:
Local (standalone): service db-postgresql, PVC data-db-postgresql-0
Production (replication): service db-postgresql-primary, PVC data-db-postgresql-primary-0
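As a sketch of that check, the helpers below wrap the two kubectl lookups. The function names are hypothetical; the service and PVC names are the ones listed above, and the srdp namespace is assumed.

```bash
# Print the phase of a PVC in the srdp namespace ("Bound" when healthy).
pvc_phase() {
  kubectl get pvc "$1" -n srdp -o jsonpath='{.status.phase}'
}

# Succeed only if the PostgreSQL service exists and its PVC is Bound.
# Usage: check_postgres <service> <pvc>
check_postgres() (
  kubectl get svc "$1" -n srdp >/dev/null || { echo "service $1 not found"; exit 1; }
  phase=$(pvc_phase "$2")
  [ "$phase" = "Bound" ] || { echo "PVC $2 is ${phase:-missing}, expected Bound"; exit 1; }
  echo "ok: $1 / $2"
)
```

Locally you would call check_postgres db-postgresql data-db-postgresql-0; in production, check_postgres db-postgresql-primary data-db-postgresql-primary-0.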

Traefik needs a LoadBalancer-capable environment. Verify your Scaleway account quotas and that the service type is LoadBalancer (see values-prod.yaml).
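To see whether the cluster actually provisioned a load balancer, you can list the LoadBalancer services and their external IPs. This is a sketch with hypothetical function names. It filters by service type instead of assuming the Traefik service name, and it assumes kubectl prints an empty IP while the address is still pending.

```bash
# List LoadBalancer services in srdp as "<name><TAB><external-ip>".
lb_services() {
  kubectl get svc -n srdp \
    -o jsonpath='{range .items[?(@.spec.type=="LoadBalancer")]}{.metadata.name}{"\t"}{.status.loadBalancer.ingress[0].ip}{"\n"}{end}'
}

# Print the names of LoadBalancer services that have no external IP yet.
lb_pending() {
  lb_services | awk -F '\t' '$1 != "" && $2 == "" { print $1 }'
}
```

If lb_services prints nothing, no service in the namespace has type LoadBalancer. If lb_pending keeps printing a name, the environment has not assigned an address, which is where quotas are worth checking.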

Ensure global.domain, zitadel.zitadel.configmapConfig.ExternalDomain, and the oauth2-proxy.extraArgs cookie/whitelist domains all match the URL you are using.
Re-check the Zitadel client ID/secret and redirect URIs.

Import the mkcert root CA (printed during mkcert -install) or trust kubernetes/certs/selfsigned.crt locally while developing.

Let's Encrypt frequently refuses to issue certificates for nip.io hostnames, and it requires public reachability on ports 80/443. Open those ports on the load balancer/security group, or temporarily point Traefik to the staging CA until production issuance succeeds.

If the Postgres DB already contains Zitadel data, the login-client PAT will not be recreated. Use a fresh database (or drop the existing schema) before re-running the chart.

Cause: zitadel-db.primary.initdb.scripts only run on first PostgreSQL initialization. If you reused an old PVC, the dagster role/database may be missing.
Quick fix (keeps existing data) — adjust the pod name and password for your environment:
Local: kubectl -n srdp exec -i db-postgresql-0 -- bash -lc "export PGPASSWORD='srdpTest123'; psql -h 127.0.0.1 -U postgres -d postgres"
Production: kubectl -n srdp exec -i db-postgresql-primary-0 -- bash -lc "export PGPASSWORD='<your-postgres-password>'; psql -h 127.0.0.1 -U postgres -d postgres"
Then run:
- CREATE ROLE dagster LOGIN PASSWORD '<your-dagster-password>'; (or ALTER ROLE ... if it exists)
- CREATE DATABASE dagster OWNER dagster; (if missing)
- GRANT ALL PRIVILEGES ON DATABASE dagster TO dagster;
Finally, restart the Dagster deployments: kubectl -n srdp rollout restart deploy/srdp-dagster-webserver deploy/srdp-dagster-daemon
Clean reset option (local dev): run just local-delete to remove the PVCs, then just local-deploy so the init scripts recreate the databases from scratch.
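If you prefer a repeatable version of the quick fix, the SQL can be generated and piped into psql. The sketch below is an idempotent variant of the statements above, not the chart's own init script. The function name dagster_sql is hypothetical, it relies on psql's \gexec meta-command, and it assumes the password does not contain the sequence $$.

```bash
# Emit SQL that creates (or updates) the dagster role and creates the
# dagster database only if it is missing.
# Usage: dagster_sql <dagster-password>
dagster_sql() (
  # Double any single quotes so the password is a valid SQL string literal.
  pw=$(printf '%s' "$1" | sed "s/'/''/g")
  cat <<SQL
DO \$\$
BEGIN
  IF NOT EXISTS (SELECT FROM pg_roles WHERE rolname = 'dagster') THEN
    CREATE ROLE dagster LOGIN PASSWORD '$pw';
  ELSE
    ALTER ROLE dagster LOGIN PASSWORD '$pw';
  END IF;
END
\$\$;
SELECT 'CREATE DATABASE dagster OWNER dagster'
WHERE NOT EXISTS (SELECT FROM pg_database WHERE datname = 'dagster')\gexec
GRANT ALL PRIVILEGES ON DATABASE dagster TO dagster;
SQL
)
```

Pipe the output into the same exec command shown above, for example locally: dagster_sql '<your-dagster-password>' | kubectl -n srdp exec -i db-postgresql-0 -- bash -lc "export PGPASSWORD='srdpTest123'; psql -h 127.0.0.1 -U postgres -d postgres". CREATE DATABASE cannot run inside a DO block, which is why that step uses \gexec instead.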

Cause: The default imagePullPolicy is IfNotPresent. If you rebuild an image with the same tag (e.g. v1.0), Kubernetes will keep using the cached version.
Fix: Bump the image tag (e.g. v1.0 → v1.1) in both the build command and values-prod.yaml, then redeploy. Alternatively, set imagePullPolicy: Always in your values file, but this is slower for routine deployments.
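Two small helpers can make the tag bump less error-prone. This is a sketch with hypothetical function names; it assumes tags of the form vMAJOR.MINOR as in the example above and deployments in the srdp namespace.

```bash
# Bump the minor component of a vMAJOR.MINOR tag, e.g. v1.0 becomes v1.1.
# Fails (non-zero exit, no output) for tags that do not match that form.
next_tag() {
  echo "$1" | awk -F. '/^v[0-9]+\.[0-9]+$/ { printf "%s.%d\n", $1, $2 + 1; ok = 1 } END { exit !ok }'
}

# Print the image references a deployment in srdp is configured to run,
# so you can confirm the new tag was rolled out.
deployed_images() {
  kubectl -n srdp get deploy "$1" -o jsonpath='{.spec.template.spec.containers[*].image}'
}
```

After redeploying, deployed_images srdp-dagster-webserver should show the bumped tag; if it still shows the old one, the values file was not updated.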

Symptom: tofu destroy fails with "Private Network must be empty to be deleted".
Cause: Kapsule creates a Scaleway-managed Load Balancer when a type: LoadBalancer service (Traefik) is deployed via Helm. This LB is not tracked by OpenTofu, so it remains attached to the private network after the cluster is deleted.
Fix: The updated just prod-destroy handles this automatically. If you already hit this error, delete the LB via the Scaleway API:
Load credentials: source ./secrets.sh
List load balancers to find the LB ID: curl -s -H "X-Auth-Token: $SCW_SECRET_KEY" "https://api.scaleway.com/lb/v1/zones/nl-ams-1/lbs" | python3 -m json.tool
Delete the LB and release its IP: curl -X DELETE -H "X-Auth-Token: $SCW_SECRET_KEY" "https://api.scaleway.com/lb/v1/zones/nl-ams-1/lbs/<LB_ID>?release_ip=true"
Wait ~30 seconds, then retry tofu destroy.
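Picking the LB ID out of the pretty-printed JSON by eye is easy to get wrong. The helper below is a sketch that reduces the list response to one line per load balancer. The function name lb_ids is hypothetical, and it assumes the response carries a top-level "lbs" array whose items have "id" and "name" fields; check the actual output if it prints nothing.

```bash
# Read the list response on stdin and print "<id><TAB><name>" per load balancer.
# Assumption: the JSON has a top-level "lbs" array with "id" and "name" keys.
lb_ids() {
  python3 -c '
import json, sys

for lb in json.load(sys.stdin).get("lbs", []):
    print(lb.get("id", ""), lb.get("name", ""), sep="\t")
'
}
```

Use it in place of python3 -m json.tool in the list call: curl -s -H "X-Auth-Token: $SCW_SECRET_KEY" "https://api.scaleway.com/lb/v1/zones/nl-ams-1/lbs" | lb_ids, then substitute the printed ID for <LB_ID> in the delete call.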

The Traefik PVC is ReadWriteOnce; if an old pod still holds it, new pods stay in Init with a multi-attach warning. Delete the old Traefik pod (or the PVC if needed) so the new pod can mount /data.

Symptom: Your Zitadel project and apps are missing after restarting the containers.
Cause: You ran docker compose down -v, which removes all persistent volumes.
Solution: Only use docker compose down to stop the containers. Do not use the -v flag unless you intend to reset all data.