Troubleshooting¶
Common issues and their fixes, grouped by component.
CTFHive app (ctfapp)¶
ModuleNotFoundError: No module named 'provisioner' when starting the control plane¶
Symptom: The CTFHive control plane crashes at import time with
ModuleNotFoundError: No module named 'provisioner'.
Cause: The provisioner package lives at the repository root, but when
gunicorn wsgi:app or flask run is invoked from inside
CTF_Saas_CTRL_Pane/, the repo root is not on sys.path.
Fix: The ctrlapp/__init__.py app factory bootstraps sys.path
automatically at import time, so the normal entrypoints work without manual
intervention:
import os
import sys

# Three dirname() calls: ctrlapp/ -> CTF_Saas_CTRL_Pane/ -> repository root.
_REPO_ROOT = os.path.dirname(
    os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
)
if _REPO_ROOT not in sys.path:
    sys.path.insert(0, _REPO_ROOT)
This is idempotent. If you still see the error, ensure you are launching from the correct working directory:
# From the repo root:
gunicorn "CTF_Saas_CTRL_Pane.ctrlapp:create_app()"
# OR from inside CTF_Saas_CTRL_Pane/:
cd CTF_Saas_CTRL_Pane
gunicorn "ctrlapp:create_app()"
For pytest, the conftest.py in CTF_Saas_CTRL_Pane/ adds the repo root to
sys.path, so run the control-plane tests from the CTF_Saas_CTRL_Pane/
directory. Running pytest from the repo root without a conftest.py that
handles the path can reproduce the error.
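The path handling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual conftest.py: the helper names `repo_root_from` and `ensure_on_sys_path` are hypothetical, and the sketch assumes conftest.py sits exactly one directory below the repository root.

```python
import os
import sys


def repo_root_from(conftest_path: str) -> str:
    """Return the directory one level above the one containing conftest.py."""
    package_dir = os.path.dirname(os.path.abspath(conftest_path))
    return os.path.dirname(package_dir)


def ensure_on_sys_path(path: str) -> bool:
    """Insert path at the front of sys.path once; return True if it was added."""
    if path in sys.path:
        return False
    sys.path.insert(0, path)
    return True


# Inside a real conftest.py the call would look like:
#     ensure_on_sys_path(repo_root_from(__file__))
```

Because the insert is guarded by a membership check, importing the file repeatedly leaves a single entry on sys.path.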
Redis is down — app behaviour¶
Symptom: Warning messages in logs; rate limiting may not work correctly; flag lookup is slower.
Cause: REDIS_URL points to an unreachable Redis instance.
Behaviour: The app degrades gracefully:
- Caching falls back to SimpleCache (in-memory, per-process, no TTL sharing).
- Rate limiting falls back to in-memory counters.
- Flag lookup falls back to the database (TeamFlag rows), then on-the-fly re-derivation.
Fix for development: Start a Redis instance locally and point REDIS_URL at it.
Fix for production: Ensure REDIS_URL is correct and Redis is reachable
before starting the app. Do not run CTFHive in production without Redis — the
in-memory rate-limit fallback allows workers × limit requests (see below).
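To confirm whether the address in REDIS_URL is reachable at all, a quick TCP probe can help. The sketch below is a diagnostic aid only; the function names are hypothetical, and it checks that the port accepts connections, not that Redis authenticates or answers PING.

```python
import socket
from urllib.parse import urlparse


def parse_redis_url(url: str) -> tuple[str, int, int]:
    """Split a redis:// URL into (host, port, db), applying Redis defaults."""
    parsed = urlparse(url)
    host = parsed.hostname or "localhost"
    port = parsed.port or 6379
    db_part = parsed.path.lstrip("/")
    db = int(db_part) if db_part else 0
    return host, port, db


def redis_port_open(url: str, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to the Redis host and port succeeds."""
    host, port, _ = parse_redis_url(url)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `redis_port_open(os.environ["REDIS_URL"])` from the same host or container as the app shows whether the fallback behaviour is caused by networking or by something further up the stack.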
Rate limiting is not working under multiple Gunicorn workers¶
Symptom: Clients can submit significantly more requests than configured
limits suggest. For example, LOGIN_RATE_LIMIT=10 per minute allows 40
attempts per minute with 4 workers.
Cause: RATELIMIT_STORAGE_URI=memory:// (the default) stores counters
in-process memory. Each Gunicorn worker has its own counter, so the effective
limit is workers × configured_limit.
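The workers × configured_limit arithmetic can be reproduced with a toy model. This is an illustrative simulation, not the limiter CTFHive uses: `InProcessLimiter` and `accepted_requests` are hypothetical names, and requests are assumed to be spread evenly across workers.

```python
from collections import Counter


class InProcessLimiter:
    """Toy fixed-window limiter whose counters live in one process only."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.hits = Counter()

    def allow(self, client: str) -> bool:
        if self.hits[client] >= self.limit:
            return False
        self.hits[client] += 1
        return True


def accepted_requests(workers: int, limit: int, attempts: int) -> int:
    """Count accepted requests when attempts go round-robin over workers."""
    limiters = [InProcessLimiter(limit) for _ in range(workers)]
    return sum(
        limiters[i % workers].allow("client-a") for i in range(attempts)
    )
```

With these definitions, `accepted_requests(4, 10, 100)` returns 40, while `accepted_requests(1, 10, 100)` returns 10, which matches the behaviour a shared counter store would restore.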
Fix: Point RATELIMIT_STORAGE_URI at a Redis backend, for example
redis://localhost:6379/1, so that all workers share one set of counters.
Using a different Redis DB number than REDIS_URL avoids key collisions,
although there is no technical requirement to separate them.
Labs/containers are not spawning¶
Symptom: Clicking "Start Lab" returns an error or silently does nothing. Container instances do not appear in the admin panel.
Cause (most common): LAB_ENABLED is false (the default).
Fix: Set LAB_ENABLED=true. Also ensure the process running the CTFHive app has permission to access the Docker socket. If running in Docker Compose, mount the socket (typically /var/run/docker.sock) into the app container.
Security
Mounting the Docker socket gives the app root-equivalent access to the
host. In production, use the forge-dockerd-proxy mTLS proxy (see
Architecture docs) instead of exposing the socket directly.
Other causes:
- DISPATCH_INTERNAL_URL is unreachable → set DISPATCH_USE_REMOTE=false to use the local Docker fallback.
- LAB_REQUIRE_PINNED_IMAGES=true and the challenge image is not digest-pinned (image@sha256:...) → pin the image or set LAB_REQUIRE_PINNED_IMAGES=false during development.
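To check ahead of time whether a challenge image reference is digest-pinned, a small helper along these lines can be used. It is a sketch under the assumption that "pinned" means the reference ends in `@sha256:` followed by 64 hexadecimal characters; how CTFHive itself validates references is not shown here, and `is_digest_pinned` is a hypothetical name.

```python
import re

# A digest-pinned reference ends in "@sha256:" followed by 64 hex characters.
_DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image_ref: str) -> bool:
    """Return True if the image reference ends with a full sha256 digest."""
    return _DIGEST_RE.search(image_ref) is not None
```

A tag such as `web:latest` fails the check because a tag can be moved to a different image, whereas a digest identifies exactly one image.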
Email features are silently disabled¶
Symptom: Email verification links are never sent; password reset emails do not arrive. No error is raised in the app.
Cause: MAILTRAP_API_KEY is empty (the default). The email service checks
for the API key before attempting delivery and skips sending silently.
Fix: Set MAILTRAP_API_KEY to a valid Mailtrap API key.
Alternatively, disable email-dependent features during development.
Changing ADMIN_KEY breaks all flags and the audit chain¶
Symptom: After rotating ADMIN_KEY, all flag submissions fail. The audit
chain verify_chain() returns (False, <id>) for every row.
Cause: ADMIN_KEY is the HMAC key for both flag derivation
(HMAC-SHA3-256) and the application audit chain (HMAC-SHA256). Changing it
means:
- Every previously-derived flag is no longer reproducible.
- Every historical audit chain signature cannot be re-verified.
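The first consequence follows directly from how keyed hashes behave. The sketch below uses a hypothetical derivation (HMAC-SHA3-256 over a team secret and a challenge identifier); the real input layout and output format of the flag engine are assumptions, and only the choice of HMAC-SHA3-256 comes from the description above.

```python
import hashlib
import hmac


def derive_flag(admin_key: bytes, team_secret: bytes, challenge_id: str,
                prefix: str = "FLAG") -> str:
    """Hypothetical derivation: HMAC-SHA3-256 keyed with the admin key."""
    message = team_secret + b":" + challenge_id.encode()
    digest = hmac.new(admin_key, message, hashlib.sha3_256).hexdigest()
    # Truncating to 32 hex characters is an illustrative choice only.
    return f"{prefix}{{{digest[:32]}}}"


old_flag = derive_flag(b"old-admin-key", b"team-secret", "chal-1")
new_flag = derive_flag(b"new-admin-key", b"team-secret", "chal-1")
```

Here `old_flag` and `new_flag` differ even though the team secret and challenge are unchanged, so a flag stored or handed out before the rotation no longer matches what the engine derives afterwards.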
Fix:
For flags — flush Redis and regenerate all TeamFlag rows before reopening the event:
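# 1. Flush the Redis flag cache (adjust DB number as needed).
#    --scan iterates incrementally instead of blocking the server like KEYS;
#    xargs -r skips DEL when no keys match.
redis-cli -n 0 --scan --pattern "team_flags:*" | xargs -r redis-cli -n 0 DEL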
# 2. Delete all TeamFlag rows
flask shell -c "from ctfapp.extensions import db; from ctfapp.models.submission import TeamFlag; db.session.query(TeamFlag).delete(); db.session.commit()"
# 3. Re-generate flags for all active principals
flask shell -c "
from ctfapp.extensions import db
from ctfapp.models.principal import Principal
from ctfapp.services.flag_engine import generate_flags_for_principal
for p in Principal.query.filter_by(active=True).all():
    generate_flags_for_principal(p)
db.session.commit()
"
For the audit chain — the historical chain is permanently broken after key
rotation. Archive the old log for the record and accept that
verify_chain() will not validate entries signed with the old key. Future
entries will form a new valid chain from the current _last_sig.
CTFHive control plane¶
Control-plane docker build produces an empty image¶
Symptom: docker build -f CTF_Saas_CTRL_Pane/Dockerfile . succeeds but
the image contains no application code.
Cause: CTF_Saas_CTRL_Pane/Dockerfile and
CTF_Saas_CTRL_Pane/Dockerfile.dev are empty files (0 bytes). They were
created as stubs and have not yet been written.
Workaround: Run the control plane directly via Gunicorn:
cd CTF_Saas_CTRL_Pane
uv run gunicorn "ctrlapp:create_app()" \
    --bind 0.0.0.0:5001 \
    --workers 2 \
    --timeout 60
Alternatively, use flask run for development.
Status
Dockerfiles for the control plane are not yet implemented. Track progress or contribute at the repository.
ModuleNotFoundError: No module named 'provisioner' (control plane context)¶
See the CTFHive app section above. The fix is identical.
Provisioner¶
Provisioner dry-run vs real provisioning¶
Symptom: Running a provision does nothing — no Linode server is created, no DNS record appears.
Cause: LINODE_API_TOKEN is empty (the default). The ProvisionService
selects FakeExecutor when the token is absent, which logs steps but does
not make real API calls.
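The selection logic described above might look roughly like this. It is a sketch, not the ProvisionService source: only `FakeExecutor` and `LINODE_API_TOKEN` are named in this document, while `LinodeExecutor` and `select_executor` are hypothetical stand-ins.

```python
import os


class FakeExecutor:
    """Logs each step instead of calling the Linode API."""

    live = False


class LinodeExecutor:
    """Hypothetical placeholder for an executor that makes real API calls."""

    live = True

    def __init__(self, token: str) -> None:
        self.token = token


def select_executor(env=None):
    """Return a live executor only when a non-empty token is configured."""
    env = os.environ if env is None else env
    token = env.get("LINODE_API_TOKEN", "").strip()
    return LinodeExecutor(token) if token else FakeExecutor()
```

Printing `select_executor().live` before a run is a cheap way to confirm whether the current environment would touch real infrastructure.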
Fix: Set a real Linode API token to enable live provisioning:
LINODE_API_TOKEN=your-linode-api-token
LINODE_REGION=us-east
LINODE_PLAN=g6-standard-2
LINODE_IMAGE=linode/debian12
Always test with FakeExecutor (empty token, or pass --dry-run when the
CLI supports it) before provisioning real infrastructure.
Audit chain verification fails for provisioner log¶
Symptom: AuditLog.verify_chain() returns (False, N) for an existing
log file.
Causes and fixes:
| Cause | Fix |
|---|---|
| PROVISION_AUDIT_SECRET changed | The chain is permanently broken for entries signed with the old secret. Archive the old file and start a new chain. |
| Log file was manually edited | Any modification to a JSONL line changes the signed body, breaking the chain from that line onward. Do not edit log files. |
| Log file truncated or corrupted | Partial writes during a crash can corrupt the last line. verify_chain() returns the 0-based index of the first bad line. Lines before that index are still valid. |
To identify the first bad line:
import os

from provisioner.audit import AuditLog

log = AuditLog(
    path="runs/tenant-abc/audit.jsonl",
    secret=os.environ["PROVISION_AUDIT_SECRET"],
)
# verify_chain() returns (ok, index); index is the 0-based first bad line.
ok, bad_idx = log.verify_chain()
if not ok:
    print(f"Chain broken at line index {bad_idx}")
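To see why a changed secret or a single edited line invalidates the chain from that point on, consider the following self-contained model of a signed chain. It is a sketch under assumptions: the provisioner's actual record layout, canonicalisation, and hash function are not shown in this document, and HMAC-SHA256 over the previous signature plus the JSON body is used here purely for illustration.

```python
import hashlib
import hmac
import json


def sign(secret: bytes, prev_sig: str, body: dict) -> str:
    """HMAC-SHA256 over the previous signature plus the canonical JSON body."""
    payload = prev_sig.encode() + json.dumps(body, sort_keys=True).encode()
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


def append(chain: list, secret: bytes, body: dict) -> None:
    """Append an entry whose signature covers the previous entry's signature."""
    prev_sig = chain[-1]["sig"] if chain else ""
    chain.append({"body": body, "sig": sign(secret, prev_sig, body)})


def verify(chain: list, secret: bytes):
    """Return (True, None), or (False, index) for the first failing entry."""
    prev_sig = ""
    for idx, entry in enumerate(chain):
        expected = sign(secret, prev_sig, entry["body"])
        if not hmac.compare_digest(expected, entry["sig"]):
            return False, idx
        prev_sig = entry["sig"]
    return True, None
```

In this model, editing the body of entry 1 makes `verify` report index 1, and verifying with a different secret fails at index 0, which mirrors the first two rows of the table above.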
General¶
App boots but all pages return 500¶
Check: Run with APP_ENV=development and DEBUG=true to see the
traceback. Common causes:
- Missing database tables: run migrations with flask db upgrade or flask shell -c "from ctfapp.extensions import db; db.create_all()".
- Unreachable database: check DATABASE_URL and that PostgreSQL is running.
- Bad ENCRYPTION_KEY: if the key changed after data was written, decryption will raise exceptions. See the key-rotation section under environment variables.
Flag submissions always return "wrong"¶
Checklist:
- ADMIN_KEY has not changed since the challenge was first seeded.
- The principal's team_secret has not been re-generated (only happens on Principal row re-creation).
- Challenge.flag_prefix matches what was used to derive the stored flag.
- Redis is available; if not, the fallback path should still work but check for exceptions in the logs.
- The submitted string has no leading/trailing whitespace (the engine calls .strip() but browser autofill can occasionally add non-breaking spaces).
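When invisible characters are suspected, normalising a submitted string before comparing it by hand can help isolate the problem. The helper below is a debugging sketch, not the flag engine's behaviour: `normalize_submission` is a hypothetical name, and NFKC normalisation can alter visually similar characters, so it may be unsuitable if flags legitimately contain such characters.

```python
import unicodedata

# Zero-width characters are not whitespace, so str.strip() leaves them in place.
_INVISIBLE = {"\u200b", "\u200c", "\u200d", "\ufeff"}


def normalize_submission(raw: str) -> str:
    """NFKC-normalize, drop zero-width characters, then strip whitespace."""
    text = unicodedata.normalize("NFKC", raw)
    text = "".join(ch for ch in text if ch not in _INVISIBLE)
    return text.strip()
```

Comparing `repr(raw)` with `repr(normalize_submission(raw))` makes any hidden characters in a copied flag visible.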
CSRF errors on form submission¶
Cause: WTF_CSRF_TIME_LIMIT (default 3600 seconds) exceeded, or the page
was cached and the CSRF token is stale.
Fix: Hard-refresh the page. For production, ensure session cookies are
correctly scoped and SECRET_KEY has not changed mid-session.