Provisioning a tenant¶
The provisioner/ package automates the full setup of a CTFHive edge node:
system update, package installation, security hardening, service bring-up, TLS
configuration, and health verification — all over SSH, all audited.
Cloud provider: Linode only
The only implemented cloud backend is provisioner/cloud/linode.py
(LinodeProvider). CLAUDE.md refers to Hetzner Cloud — no Hetzner
implementation exists. The Linode v4 API client is what you use today.
The CloudProvider base class (provisioner/cloud/base.py) is the
extension point for adding other providers.
For control-plane context on when provisioning is triggered (Stripe webhook → tenant lifecycle), see Control plane overview.
For the audit chain used to record every provisioning command, see Audit chain reference.
Core concepts¶
TenantSpec — what you provision¶
TenantSpec (in provisioner/models.py) is the Pydantic v2 model that
describes a complete tenant. Build one from validated webhook data before
calling the orchestrator.
| Field | Type | Notes |
|---|---|---|
| name | str | Human-readable name; becomes the DNS slug |
| base_domain | str | e.g. ctfhive.us |
| tier | "solo" \| "standard" \| "pro" | Informational for now; gates future feature flags |
| servers | list[ServerSpec] | At least one server required |
| admin_email | str | Caddy's ACME contact address |
| authorized_key | str | Operator SSH public key (ed25519/rsa/ecdsa) |
| ssh_port | int | Default 22; phase 3 moves sshd to this port |
| server_secret | str | ≥16 chars; used for per-team flag derivation HMAC |
| registry_user | str | Default "forge" |
| registry_password | str | ≥8 chars |
| wall_clock_cap_seconds | int | Default 300; range 60–1800; abort + rollback if exceeded |
TenantSpec.slug is derived automatically from name by slugify_name().
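The field constraints and the slug derivation can be illustrated with a small standard-library sketch. This is not the Pydantic model from provisioner/models.py: `slugify_name_sketch` and `check_tenant_limits` are hypothetical stand-ins, and the exact slug rules (ASCII folding, hyphen separators) are an assumption. Only the numeric limits come from the table above.

```python
import re
import unicodedata


def slugify_name_sketch(name: str) -> str:
    """Hypothetical slug rule: ASCII-fold, lowercase, hyphen-separate."""
    # Fold accented characters to their closest ASCII form.
    folded = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    # Collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", folded.lower()).strip("-")
    if not slug:
        raise ValueError("name produces an empty slug")
    return slug


def check_tenant_limits(
    server_secret: str,
    registry_password: str,
    wall_clock_cap_seconds: int = 300,
) -> list[str]:
    """Return a list of violations of the limits documented in the table."""
    problems = []
    if len(server_secret) < 16:
        problems.append("server_secret must be at least 16 characters")
    if len(registry_password) < 8:
        problems.append("registry_password must be at least 8 characters")
    if not 60 <= wall_clock_cap_seconds <= 1800:
        problems.append("wall_clock_cap_seconds must be between 60 and 1800")
    return problems
```

Under these assumed rules, `slugify_name_sketch("Acme Training")` gives `acme-training`. The real `slugify_name()` may differ in how it treats punctuation or length.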
ServerSpec — a target node¶
Each server in TenantSpec.servers is a ServerSpec:
| Field | Type | Notes |
|---|---|---|
| role | "app" \| "registry" \| "vpn" \| "edge" \| "all-in-one" | Controls which packages and services are installed |
| ssh | SSHCredentials | Connection details |
| label | str | Human label for logs and audit records |
| provider_id | str \| None | Cloud instance ID (e.g. Linode instance ID) |
| region | str \| None | Cloud region |
| plan | str \| None | Cloud plan / type |
SSHCredentials:
| Field | Default | Notes |
|---|---|---|
| host | (required) | Hostname or IP |
| port | 22 | Initial SSH port (before phase 3 moves it) |
| user | "root" | |
| private_key_path | None | Path to private key file |
| password | None | Password auth; key auth is strongly preferred |
Role package matrix¶
Phase 2 installs role-scoped packages. Here is what each role gets:
| Role | Docker | Caddy | WireGuard |
|---|---|---|---|
| app | Yes | Yes | No |
| registry | Yes | No | No |
| vpn | No | No | Yes |
| edge | Yes | Yes | Yes |
| all-in-one | Yes | Yes | Yes |
The phase pipeline¶
The orchestrator (provisioner/orchestrator.py:provision_tenant()) runs
phases over each server in role-dependency order:
vpn → registry → edge → all-in-one → app.
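As a rough sketch of how this ordering might be expressed, the snippet below sorts servers by role rank and expands each one into its phase sequence. The dict shape for servers and the `plan` function are hypothetical; the role order, the phase names, and the roles that receive the TLS step are taken from this page.

```python
ROLE_ORDER = ("vpn", "registry", "edge", "all-in-one", "app")
PHASES = ("system", "packages", "hardening", "services", "healthcheck")
# Roles that get the TLS step between services and healthcheck.
TLS_ROLES = {"app", "edge", "all-in-one"}


def plan(servers: list[dict]) -> list[tuple[str, str]]:
    """Return (server label, step) pairs in execution order."""
    rank = {role: i for i, role in enumerate(ROLE_ORDER)}
    steps = []
    # sorted() is stable, so servers sharing a role keep their input order.
    for server in sorted(servers, key=lambda s: rank[s["role"]]):
        for phase in PHASES:
            if phase == "healthcheck" and server["role"] in TLS_ROLES:
                steps.append((server["label"], "tls"))
            steps.append((server["label"], phase))
    return steps
```

A `vpn` server therefore yields five steps and an `app` server six, with the `vpn` steps scheduled first regardless of the order in which the servers were listed.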
Within each server, phases run in this fixed order:
Phase 1 — system (phase1_system.py)¶
apt-get update + dist-upgrade + autoremove. Brings the freshly-booted
image fully up to date before anything is installed. Idempotent.
Phase 2 — packages (phase2_packages.py)¶
Installs the base package set (ca-certificates, curl, gnupg, ufw,
fail2ban, apache2-utils, jq, git) plus role-scoped extras. Docker and
Caddy are installed via guarded one-liners (command -v docker || ...) so
re-runs are no-ops.
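One way to picture the role-scoped selection and the guarded install pattern is the sketch below. The component labels (`docker`, `caddy`, `wireguard`) mirror the role matrix and are not necessarily the apt package names; `components_for` and `guarded` are hypothetical helpers, and the install command is left as a caller-supplied placeholder because this page does not spell it out.

```python
BASE_PACKAGES = (
    "ca-certificates", "curl", "gnupg", "ufw",
    "fail2ban", "apache2-utils", "jq", "git",
)

# Mirrors the role package matrix above.
ROLE_COMPONENTS = {
    "app": {"docker", "caddy"},
    "registry": {"docker"},
    "vpn": {"wireguard"},
    "edge": {"docker", "caddy", "wireguard"},
    "all-in-one": {"docker", "caddy", "wireguard"},
}


def components_for(role: str) -> list[str]:
    """Role-scoped extras, sorted for deterministic output."""
    return sorted(ROLE_COMPONENTS[role])


def guarded(binary: str, install_command: str) -> str:
    """Build a one-liner that skips the install when the binary exists."""
    return f"command -v {binary} || {install_command}"
```

Because the guard short-circuits when `command -v` succeeds, running the same one-liner a second time does nothing, which is what makes the phase safe to repeat.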
Phase 3 — hardening (phase3_hardening.py)¶
- SSH: writes /etc/ssh/sshd_config.d/99-forge.conf (non-standard port, password auth off, key auth only, MaxAuthTries 3), runs sshd -t to validate before systemctl reload ssh. A validation failure aborts before the reload so you cannot be locked out.
- UFW: default deny incoming, then allows the SSH port first, then role-specific ports (80/tcp, 443/tcp, 51820/udp as applicable).
- sysctl: writes /etc/sysctl.d/99-forge.conf (rp_filter, syncookies, no redirects/source-routing, kptr_restrict=2, ptrace_scope=1, ASLR).
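The two ordering guarantees in this phase (validate before reload, SSH port before anything else) can be sketched as follows. The `run` callable stands in for an executor and is assumed to return an exit code. The mapping of ports to roles is an assumption derived from the role matrix (Caddy roles get 80/443, WireGuard roles get 51820/udp), not a statement of what phase3_hardening.py does.

```python
from typing import Callable

CADDY_ROLES = {"app", "edge", "all-in-one"}
WIREGUARD_ROLES = {"vpn", "edge", "all-in-one"}


def ufw_commands(role: str, ssh_port: int) -> list[str]:
    """Deny by default, open SSH first, then the role's own ports."""
    commands = ["ufw default deny incoming", f"ufw allow {ssh_port}/tcp"]
    if role in CADDY_ROLES:
        commands += ["ufw allow 80/tcp", "ufw allow 443/tcp"]
    if role in WIREGUARD_ROLES:
        commands.append("ufw allow 51820/udp")
    return commands


def reload_sshd_safely(run: Callable[[str], int]) -> None:
    """Reload sshd only after the new config passes validation."""
    if run("sshd -t") != 0:
        # Abort here: the running sshd keeps its old, working config.
        raise RuntimeError("sshd -t rejected the config; reload skipped")
    if run("systemctl reload ssh") != 0:
        raise RuntimeError("systemctl reload ssh failed")
```

Raising before the reload is the important part: a broken drop-in file never reaches the running daemon, so the existing session remains usable for a fix.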
Phase 4 — services (phase4_services.py)¶
Brings up the services each role needs:
- VPN (roles: vpn, edge, all-in-one): generates a WireGuard server keypair once (test -f wg0.key || wg genkey), writes /etc/wireguard/wg0.conf (Address 10.13.13.1/24, ListenPort 51820, SaveConfig false), and enables wg-quick@wg0.
- Registry (roles: registry, edge, all-in-one): creates /opt/forge/registry/auth and /opt/forge/registry/data, generates an htpasswd file for registry_user/registry_password, and runs distribution/distribution:3 on 127.0.0.1:5000.
- App (roles: app, all-in-one): writes /opt/forge/app/.env with SITE_URL, SECRET_KEY, REGISTRY_HOST/USER/PASS, LAB_ENABLED=true, etc., then runs docker compose --env-file .env up -d.
TLS step — between services and healthcheck¶
After phase 4 completes on any server whose role is app, edge, or
all-in-one, apply_caddy_tls() (provisioner/tls.py) runs before the
healthcheck:
- Renders /etc/caddy/Caddyfile routing:
    - {slug}.{base_domain} → 127.0.0.1:8000 (app)
    - registry.{slug}.{base_domain} → 127.0.0.1:5000 (registry)
    - *.{slug}.{base_domain} → 127.0.0.1:8000 with DNS-01 (Cloudflare plugin; requires CF_API_TOKEN in the server environment)
- caddy validate --config /etc/caddy/Caddyfile (aborts if the config is invalid).
- systemctl reload caddy.
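For orientation, here is a hedged sketch of what a renderer for those three routes could produce. `render_caddyfile` is a hypothetical function, not the template used by provisioner/tls.py. The Caddyfile directives shown (`email`, `reverse_proxy`, `tls` with `dns cloudflare`) are standard Caddy syntax, with the DNS provider requiring the Cloudflare plugin to be compiled in.

```python
def render_caddyfile(slug: str, base_domain: str, admin_email: str) -> str:
    """Render a Caddyfile with the app, registry, and wildcard routes."""
    host = f"{slug}.{base_domain}"
    blocks = [
        # Global options: ACME contact address.
        "{\n\temail " + admin_email + "\n}",
        # Tenant app.
        host + " {\n\treverse_proxy 127.0.0.1:8000\n}",
        # Private registry.
        "registry." + host + " {\n\treverse_proxy 127.0.0.1:5000\n}",
        # Wildcard certificates need DNS-01; the token is read from the
        # server environment at runtime, never written into the file.
        "*." + host + " {\n"
        "\ttls {\n\t\tdns cloudflare {env.CF_API_TOKEN}\n\t}\n"
        "\treverse_proxy 127.0.0.1:8000\n}",
    ]
    return "\n\n".join(blocks) + "\n"
```

Referencing `{env.CF_API_TOKEN}` instead of interpolating the token keeps the secret out of both the rendered file and any record of the file write.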
Phase 5 — healthcheck (phase5_healthcheck.py)¶
Runs role-appropriate probes on the server itself (using curl, openssl,
wg):
| Probe | Roles | Check |
|---|---|---|
| docker | app, registry, edge, all-in-one | docker info exits 0 |
| app_https | app, all-in-one | curl https://{fqdn}/ returns 2xx |
| app_tls | app, all-in-one | Certificate is valid and not expired |
| registry_v2 | registry, edge, all-in-one | GET https://registry.{slug}.{base}/v2/ with credentials |
| wireguard | vpn, edge, all-in-one | wg show wg0 exits 0 |
Any failed probe sets PhaseResult.status = FAILED, which triggers
ProvisionAborted and rollback.
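The probe selection and the "any failure fails the phase" rule might be modelled like this. The probe names and role sets come from the table; `probes_for`, `run_healthcheck`, and the `run_probe` callback are hypothetical names used only for illustration.

```python
from typing import Callable

PROBE_ROLES = {
    "docker": {"app", "registry", "edge", "all-in-one"},
    "app_https": {"app", "all-in-one"},
    "app_tls": {"app", "all-in-one"},
    "registry_v2": {"registry", "edge", "all-in-one"},
    "wireguard": {"vpn", "edge", "all-in-one"},
}


def probes_for(role: str) -> list[str]:
    """Probes that apply to a role, in table order."""
    return [name for name, roles in PROBE_ROLES.items() if role in roles]


def run_healthcheck(role: str, run_probe: Callable[[str], bool]) -> tuple[bool, list[str]]:
    """Run every applicable probe; one failure fails the whole phase."""
    failed = [name for name in probes_for(role) if not run_probe(name)]
    return (not failed, failed)
```

Running all probes before deciding, instead of stopping at the first failure, means the error report lists every unhealthy component in one pass.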
Executors: FakeExecutor vs SSH¶
The orchestrator is transport-agnostic. You choose the executor via
executor_factory.
FakeExecutor (default for tests and dry runs)¶
In-memory, zero network. Every command returns exit code 0 by default;
you can script specific patterns to return failures. All commands and file
writes are recorded on executor.commands and executor.files for
assertions.
Use FakeExecutor in unit tests and in --dry-run mode.
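To make the behaviour concrete, the following is a minimal stand-in with the same observable properties: default exit code 0, scriptable failures, and recorded commands and files. It is not the real FakeExecutor. The class name, `script()`, `run()`, `write_file()`, and the use of glob patterns are all assumptions; only the `commands` and `files` attributes are named on this page.

```python
import fnmatch
from dataclasses import dataclass, field


@dataclass
class SketchFakeExecutor:
    """In-memory executor: nothing leaves the process."""

    commands: list = field(default_factory=list)
    files: dict = field(default_factory=dict)
    _scripted: list = field(default_factory=list)

    def script(self, pattern: str, exit_code: int) -> None:
        """Make commands matching a glob pattern return exit_code."""
        self._scripted.append((pattern, exit_code))

    def run(self, command: str) -> int:
        self.commands.append(command)
        for pattern, exit_code in self._scripted:
            if fnmatch.fnmatchcase(command, pattern):
                return exit_code
        return 0  # Every unscripted command succeeds.

    def write_file(self, path: str, content: str) -> None:
        self.files[path] = content
```

A test can then script a failure such as `executor.script("sshd -t*", 1)` and assert that no reload command appears in `executor.commands` afterwards.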
SSHRemoteExecutor (live provisioning)¶
Wraps scripts/ssh_client.SSHClientWrapper (paramiko). Built by
provisioner/ssh.py:ssh_executor_factory().
Single connection per provision run
The factory opens one SSH connection per server at the start. That
connection stays alive for all five phases. Phase 3 moves sshd to a
non-standard port and reloads — the existing channel stays alive through
this, so you do not reconnect mid-run. Re-provisioning an already-hardened
host requires passing the non-standard port as initial_port.
The audit chain¶
Every command executed through any executor is appended to an AuditLog
(provisioner/audit.py). Each entry carries:
- The HMAC-SHA256 signature of the previous entry (prev_sig) and this entry (sig).
- The redacted, canonically serialised command record.
Secrets (password, server_secret, registry_password, token, secret,
private_key) are redacted at two points: by the executor before the record is
built, and by the audit log's backstop redaction before writing.
Verify the chain after a run:
from provisioner.audit import AuditLog
audit = AuditLog("data/provision-audit.jsonl", secret=spec.server_secret)
ok, bad_idx = audit.verify_chain()
# ok=True, bad_idx=None means the log is intact
See Audit chain reference for the full chain specification.
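For intuition about why tampering is detectable, here is a self-contained sketch of an HMAC-linked log. It is not the format used by provisioner/audit.py: the empty-string genesis value, the sorted-key JSON serialisation, the redaction marker, and all function names are assumptions. The field names `prev_sig` and `sig`, the key list, and the `(ok, bad_idx)` return shape follow this page.

```python
import hashlib
import hmac
import json

SENSITIVE = ("password", "server_secret", "registry_password",
             "token", "secret", "private_key")


def redact(record: dict) -> dict:
    """Mask values whose key name contains a sensitive marker."""
    return {
        key: "[REDACTED]" if any(s in key.lower() for s in SENSITIVE) else value
        for key, value in record.items()
    }


def sign(secret: bytes, prev_sig: str, record: dict) -> str:
    # Canonical form: sorted keys, no optional whitespace.
    body = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hmac.new(secret, (prev_sig + body).encode(), hashlib.sha256).hexdigest()


def append_entry(chain: list, secret: bytes, record: dict) -> None:
    prev_sig = chain[-1]["sig"] if chain else ""
    clean = redact(record)
    chain.append({"prev_sig": prev_sig, "record": clean,
                  "sig": sign(secret, prev_sig, clean)})


def verify_chain(chain: list, secret: bytes):
    """Return (True, None) if intact, else (False, index of first bad entry)."""
    prev_sig = ""
    for idx, entry in enumerate(chain):
        expected = sign(secret, prev_sig, entry["record"])
        if entry["prev_sig"] != prev_sig or not hmac.compare_digest(entry["sig"], expected):
            return False, idx
        prev_sig = entry["sig"]
    return True, None
```

Because each signature covers the previous one, editing or deleting any entry invalidates that entry and everything after it unless the attacker also holds the secret.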
The Linode cloud provider¶
provisioner/cloud/linode.py:LinodeProvider wraps the Linode v4 REST API:
| Method | Description |
|---|---|
| create_server(label, region, plan, image, authorized_keys, root_password) | POST /linode/instances |
| get_server(server_id) | GET /linode/instances/{id} |
| delete_server(server_id) | DELETE /linode/instances/{id} |
| wait_until_running(server_id, timeout, interval) | Polls until status is running and primary_ip is set |
Set LINODE_API_TOKEN in the environment or pass token= to the constructor.
The HTTP transport is injectable for unit testing (pass a fake _Transport).
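The polling behaviour and the value of injectable dependencies can be sketched without touching the network. The function below is a hypothetical stand-in, not LinodeProvider.wait_until_running: the dict shape returned by `get_server`, the default timeout and interval, and the injected `sleep` and `clock` parameters are assumptions made so the loop can be tested deterministically.

```python
import time
from typing import Callable


def wait_until_running(
    get_server: Callable[[str], dict],
    server_id: str,
    timeout: float = 300.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Poll until the server is running and has a primary IP."""
    deadline = clock() + timeout
    while True:
        server = get_server(server_id)
        # Both conditions must hold: a running instance without an
        # address is not yet reachable over SSH.
        if server.get("status") == "running" and server.get("primary_ip"):
            return server
        if clock() >= deadline:
            raise TimeoutError(f"server {server_id} not running after {timeout}s")
        sleep(interval)
```

In a unit test, `get_server` can be a lambda over a canned sequence of states and `sleep` a no-op, which plays the same role as passing a fake transport to the provider.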
End-to-end with scripts/stage_deploy_linode.py¶
The staging runner is the reference implementation for a complete create → provision → verify → (optionally) destroy cycle.
Dry run first (always)¶
No Linode API calls, no SSH, no charges. Uses FakeExecutor and a stub
CloudServer.
uv run python scripts/stage_deploy_linode.py --dry-run \
--name "Acme Training" \
--base-domain ctfhive.us
This runs the full phase pipeline and verifies the audit chain. Fix any failures here before spending money on a real instance.
Live staging run¶
export LINODE_API_TOKEN=...
uv run python scripts/stage_deploy_linode.py \
--name "Acme Training" \
--base-domain ctfhive.us \
--authorized-key ~/.ssh/id_ed25519.pub \
--ssh-key ~/.ssh/id_ed25519 \
--region us-east \
--plan g6-standard-2 \
--image linode/debian12 \
--ssh-port 2222 \
--cap 300 \
--audit-log data/provision-audit.jsonl
If the provision succeeds, audit.verify_chain() is printed. Add --destroy
to delete the Linode instance afterward (useful for CI pipelines).
Key flags¶
| Flag | Default | Description |
|---|---|---|
| --name | (required) | Tenant name → DNS slug |
| --base-domain | (required) | e.g. ctfhive.us |
| --admin-email | ops@ctfhive.us | ACME contact |
| --tier | standard | solo, standard, or pro |
| --ssh-port | 2222 | Port sshd is moved to by phase 3 |
| --cap | 300 | Wall-clock cap in seconds |
| --audit-log | data/provision-audit.jsonl | Path for the JSONL audit log |
| --dry-run | false | Offline rehearsal |
| --token | $LINODE_API_TOKEN | Linode API token |
| --region | us-east | Linode region |
| --plan | g6-standard-2 | Linode type (Standard 4 GB) |
| --image | linode/debian12 | Base image |
| --authorized-key | (required for live) | Path to operator SSH public key |
| --ssh-key | (required for live) | Path to SSH private key |
| --destroy | false | Delete the instance after a successful run |
What happens on failure¶
If any phase fails or the wall-clock cap is exceeded, the orchestrator calls
the rollback hook (if provided). In stage_deploy_linode.py, rollback deletes
the Linode instance and prints a warning if deletion itself fails. The
ProvisionResult.status is set to ROLLED_BACK and the error is recorded.
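The failure path can be summarised in a short sketch. This is an illustration under stated assumptions, not the orchestrator's code: `provision` and its dict result are hypothetical, and the cap is only checked between phases here, whereas the real orchestrator may interrupt a phase in progress. The status names match the ProvisionResult values documented on this page.

```python
import time
from typing import Callable, Optional


def provision(
    phases: list,
    cap_seconds: float,
    rollback: Optional[Callable[[], None]] = None,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Run (name, callable) phases; roll back on failure or timeout."""
    started = clock()
    try:
        for name, phase in phases:
            if clock() - started > cap_seconds:
                raise TimeoutError(f"wall-clock cap exceeded before {name}")
            if not phase():
                raise RuntimeError(f"phase {name} failed")
    except (TimeoutError, RuntimeError) as exc:
        if rollback is None:
            return {"status": "FAILED", "error": str(exc)}
        try:
            rollback()
        except Exception as rollback_exc:
            # A failed rollback is reported, not raised, so the
            # original error is still returned to the caller.
            print(f"warning: rollback failed: {rollback_exc}")
        return {"status": "ROLLED_BACK", "error": str(exc)}
    return {"status": "ACTIVE", "error": None}
```

With no hook supplied the result is `FAILED`; with a hook it is `ROLLED_BACK` even when the hook itself raises, which matches the warning-only behaviour described for the staging script.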
Legacy: scripts/provision_registry_vps.py¶
Legacy script — not integrated
scripts/provision_registry_vps.py is an older standalone script that sets
up a registry node using raw paramiko commands and certbot (not the phase
pipeline). It is not integrated with provisioner/ and does not benefit
from the audit chain or rollback logic. Do not use it for new deployments.
It is retained only for historical reference.
ProvisionResult status values¶
| Status | Meaning |
|---|---|
| PENDING | Not yet started |
| RUNNING | In progress |
| ACTIVE | All phases passed, healthchecks green |
| FAILED | A phase failed but no rollback was attempted |
| ROLLED_BACK | A phase failed, rollback hook was called |