Provisioning a tenant¶

The provisioner/ package automates the full setup of a CTFHive edge node: system update, package installation, security hardening, service bring-up, TLS configuration, and health verification — all over SSH, all audited.

Cloud provider: Linode only

The only implemented cloud backend is provisioner/cloud/linode.py (LinodeProvider). CLAUDE.md refers to Hetzner Cloud, but no Hetzner implementation exists; the Linode v4 API client is what you use today. The CloudProvider base class (provisioner/cloud/base.py) is the extension point for adding other providers.

For control-plane context on when provisioning is triggered (Stripe webhook → tenant lifecycle), see Control plane overview.

For the audit chain used to record every provisioning command, see Audit chain reference.

Core concepts¶

`TenantSpec` — what you provision¶

TenantSpec (in provisioner/models.py) is the Pydantic v2 model that describes a complete tenant. Build one from validated webhook data before calling the orchestrator.

Field	Type	Notes
`name`	`str`	Human-readable name; becomes the DNS slug
`base_domain`	`str`	e.g. `ctfhive.us`
`tier`	`"solo"` \| `"standard"` \| `"pro"`	Informational for now; intended to gate future feature flags
`servers`	`list[ServerSpec]`	At least one server required
`admin_email`	`str`	Caddy's ACME contact address
`authorized_key`	`str`	Operator SSH public key (ed25519/rsa/ecdsa)
`ssh_port`	`int`	Default `22`; phase 3 moves sshd to this port
`server_secret`	`str`	≥16 chars; used for per-team flag derivation HMAC
`registry_user`	`str`	Default `"forge"`
`registry_password`	`str`	≥8 chars
`wall_clock_cap_seconds`	`int`	Default `300`; range 60–1800; abort + rollback if exceeded

TenantSpec.slug is derived automatically from name by slugify_name().
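
The constraints in the table and the slug derivation can be sketched with the standard library alone. This is a minimal stand-in, not the real Pydantic model: `TenantSpecSketch` is a hypothetical name, only the constrained fields are included, and the slug rule shown (lowercase, runs of non-alphanumerics collapsed to `-`) is an assumption about what slugify_name() does.

```python
import re
from dataclasses import dataclass


def slugify_name(name: str) -> str:
    """Assumed slug rule: lowercase, non-alphanumeric runs become a single '-'."""
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
    if not slug:
        raise ValueError("name must contain at least one letter or digit")
    return slug


@dataclass
class TenantSpecSketch:
    """Hypothetical stand-in carrying only the constrained TenantSpec fields."""

    name: str
    base_domain: str
    server_secret: str
    registry_password: str
    ssh_port: int = 22
    wall_clock_cap_seconds: int = 300

    def __post_init__(self) -> None:
        # Limits taken from the field table above.
        if len(self.server_secret) < 16:
            raise ValueError("server_secret must be at least 16 characters")
        if len(self.registry_password) < 8:
            raise ValueError("registry_password must be at least 8 characters")
        if not 60 <= self.wall_clock_cap_seconds <= 1800:
            raise ValueError("wall_clock_cap_seconds must be within 60-1800")

    @property
    def slug(self) -> str:
        return slugify_name(self.name)
```

With this sketch, a tenant named "Acme Training" gets the slug `acme-training`, and a 5-character server_secret is rejected at construction time.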

`ServerSpec` — a target node¶

Each server in TenantSpec.servers is a ServerSpec:

Field	Type	Notes
`role`	`"app"` \| `"registry"` \| `"vpn"` \| `"edge"` \| `"all-in-one"`	Controls which packages and services are installed
`ssh`	`SSHCredentials`	Connection details
`label`	`str`	Human label for logs and audit records
`provider_id`	`str \| None`	Cloud instance ID (e.g. Linode instance ID)
`region`	`str \| None`	Cloud region
`plan`	`str \| None`	Cloud plan / type

SSHCredentials:

Field	Default	Notes
`host`	—	Hostname or IP
`port`	`22`	Initial SSH port (before phase 3 moves it)
`user`	`"root"`
`private_key_path`	`None`	Path to private key file
`password`	`None`	Password auth; key auth is strongly preferred

Role package matrix¶

Phase 2 installs role-scoped packages. Here is what each role gets:

Role	Docker	Caddy	WireGuard
`app`	Yes	Yes	—
`registry`	Yes	—	—
`vpn`	—	—	Yes
`edge`	Yes	Yes	Yes
`all-in-one`	Yes	Yes	Yes
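
The matrix reads naturally as a lookup table. The sketch below restates it as data; `ROLE_EXTRAS` and `extras_for` are illustrative names, not identifiers from phase2_packages.py, and the strings are labels for the matrix columns rather than exact apt package names.

```python
# Role-scoped extras, transcribed from the matrix above.
ROLE_EXTRAS: dict[str, frozenset[str]] = {
    "app": frozenset({"docker", "caddy"}),
    "registry": frozenset({"docker"}),
    "vpn": frozenset({"wireguard"}),
    "edge": frozenset({"docker", "caddy", "wireguard"}),
    "all-in-one": frozenset({"docker", "caddy", "wireguard"}),
}


def extras_for(role: str) -> frozenset[str]:
    """Return the role-scoped extras for a role, rejecting unknown roles."""
    try:
        return ROLE_EXTRAS[role]
    except KeyError:
        raise ValueError(f"unknown role: {role!r}") from None
```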

The phase pipeline¶

The orchestrator (provisioner/orchestrator.py:provision_tenant()) runs phases over each server in role-dependency order: vpn → registry → edge → all-in-one → app.
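
A minimal sketch of that ordering, assuming servers are represented as plain dicts with a `role` key (the real code works on ServerSpec objects, and `order_servers` is a hypothetical helper name):

```python
# Role-dependency order stated above.
ROLE_ORDER = ("vpn", "registry", "edge", "all-in-one", "app")
_RANK = {role: index for index, role in enumerate(ROLE_ORDER)}


def order_servers(servers: list[dict]) -> list[dict]:
    """Sort servers by role rank; sorted() is stable, so ties keep input order."""
    return sorted(servers, key=lambda server: _RANK[server["role"]])
```

Because the sort is stable, two servers with the same role are provisioned in the order they appear in the list.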

Within each server, phases run in this fixed order:

Phase 1 — `system` (`phase1_system.py`)¶

apt-get update + dist-upgrade + autoremove. Brings the freshly-booted image fully up to date before anything is installed. Idempotent.

Phase 2 — `packages` (`phase2_packages.py`)¶

Installs the base package set (ca-certificates, curl, gnupg, ufw, fail2ban, apache2-utils, jq, git) plus role-scoped extras. Docker and Caddy are installed via guarded one-liners (command -v docker || ...) so re-runs are no-ops.
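
The guard pattern can be sketched as a small command builder. `guarded` is a hypothetical helper, and the install command in the usage line is a placeholder, since the actual installer one-liners are not reproduced here.

```python
import shlex


def guarded(binary: str, install_cmd: str) -> str:
    """Build a shell one-liner that runs install_cmd only if binary is missing.

    `command -v` exits 0 when the binary is on PATH, so the right-hand side
    of `||` is skipped on re-runs.
    """
    return f"command -v {shlex.quote(binary)} >/dev/null 2>&1 || ({install_cmd})"


# Placeholder install command, for illustration only.
print(guarded("docker", "echo 'install docker here'"))
```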

Phase 3 — `hardening` (`phase3_hardening.py`)¶

SSH: writes /etc/ssh/sshd_config.d/99-forge.conf (non-standard port, password auth off, key auth only, MaxAuthTries 3), runs sshd -t to validate before systemctl reload ssh. A validation failure aborts before the reload so you cannot be locked out.
UFW: default deny incoming, then allows the SSH port first, then role-specific ports (80/tcp, 443/tcp, 51820/udp as applicable).
sysctl: writes /etc/sysctl.d/99-forge.conf (rp_filter, syncookies, no redirects/source-routing, kptr_restrict=2, ptrace_scope=1, ASLR).
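
The validate-before-reload sequence for SSH might look like the sketch below. It assumes an executor exposing a `run(command) -> exit code` callable and a `write_file(path, content)` callable; both signatures and the function names are illustrative, and the drop-in shows only the four settings named above.

```python
from typing import Callable


def render_sshd_dropin(ssh_port: int) -> str:
    """Render the settings listed above as sshd_config directives."""
    return "\n".join(
        [
            f"Port {ssh_port}",
            "PasswordAuthentication no",
            "PubkeyAuthentication yes",
            "MaxAuthTries 3",
        ]
    ) + "\n"


def apply_sshd_config(
    run: Callable[[str], int],
    write_file: Callable[[str, str], None],
    ssh_port: int,
) -> None:
    """Write the drop-in, validate it, and reload only if validation passes."""
    write_file("/etc/ssh/sshd_config.d/99-forge.conf", render_sshd_dropin(ssh_port))
    if run("sshd -t") != 0:
        # Abort before the reload so the running sshd keeps its old config.
        raise RuntimeError("sshd -t rejected the new config; not reloading")
    if run("systemctl reload ssh") != 0:
        raise RuntimeError("systemctl reload ssh failed")
```

The key property is the ordering: a bad config raises before `systemctl reload ssh` is ever issued.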

Phase 4 — `services` (`phase4_services.py`)¶

Brings up the services each role needs:

VPN (roles: vpn, edge, all-in-one): generates a WireGuard server keypair once (test -f wg0.key || wg genkey), writes /etc/wireguard/wg0.conf (Address 10.13.13.1/24, ListenPort 51820, SaveConfig false), and enables wg-quick@wg0.
Registry (roles: registry, edge, all-in-one): creates /opt/forge/registry/auth and /opt/forge/registry/data, generates an htpasswd file for registry_user/registry_password, and runs distribution/distribution:3 on 127.0.0.1:5000.
App (roles: app, all-in-one): writes /opt/forge/app/.env with SITE_URL, SECRET_KEY, REGISTRY_HOST/USER/PASS, LAB_ENABLED=true, etc., then runs docker compose --env-file .env up -d.
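
As an illustration of the VPN step, here is a sketch that renders a wg0.conf with the values listed above. `render_wg0_conf` is a hypothetical name, and the real file may contain additional keys (for example peer sections) that are not shown.

```python
def render_wg0_conf(private_key: str) -> str:
    """Render a wg-quick [Interface] section with the values from this page."""
    return "\n".join(
        [
            "[Interface]",
            "Address = 10.13.13.1/24",
            "ListenPort = 51820",
            "SaveConfig = false",
            f"PrivateKey = {private_key}",
        ]
    ) + "\n"
```

Keeping the key generation guarded and separate from rendering means the config can be rewritten on every run without rotating the server keypair.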

TLS step — between services and healthcheck¶

After phase 4 completes on any server whose role is app, edge, or all-in-one, apply_caddy_tls() (provisioner/tls.py) runs before the healthcheck:

Renders /etc/caddy/Caddyfile routing:
{slug}.{base_domain} → 127.0.0.1:8000 (app)
registry.{slug}.{base_domain} → 127.0.0.1:5000 (registry)
*.{slug}.{base_domain} → 127.0.0.1:8000 with DNS-01 (Cloudflare plugin; requires CF_API_TOKEN in the server environment)
caddy validate --config /etc/caddy/Caddyfile — aborts if the config is invalid.
systemctl reload caddy.
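
A Caddyfile with those three routes could be rendered roughly as follows. This is a sketch, not the template in provisioner/tls.py: `render_caddyfile` is a hypothetical name, the global `email` option is assumed to carry admin_email, and the `dns cloudflare` line assumes the Cloudflare DNS module is compiled into the Caddy binary.

```python
def render_caddyfile(slug: str, base_domain: str, admin_email: str) -> str:
    """Render app, registry, and wildcard site blocks for one tenant."""
    fqdn = f"{slug}.{base_domain}"
    lines = [
        "{",
        f"    email {admin_email}",
        "}",
        "",
        f"{fqdn} {{",
        "    reverse_proxy 127.0.0.1:8000",
        "}",
        "",
        f"registry.{fqdn} {{",
        "    reverse_proxy 127.0.0.1:5000",
        "}",
        "",
        f"*.{fqdn} {{",
        "    tls {",
        # Wildcard certificates need DNS-01; the token is read from the environment.
        "        dns cloudflare {env.CF_API_TOKEN}",
        "    }",
        "    reverse_proxy 127.0.0.1:8000",
        "}",
    ]
    return "\n".join(lines) + "\n"
```

Validating the rendered file with `caddy validate` before the reload keeps a template mistake from taking down a running Caddy.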

Phase 5 — `healthcheck` (`phase5_healthcheck.py`)¶

Runs role-appropriate probes on the server itself (using curl, openssl, wg):

Probe	Roles	Check
`docker`	`app`, `registry`, `edge`, `all-in-one`	`docker info` exits 0
`app_https`	`app`, `all-in-one`	`curl https://{fqdn}/` returns 2xx
`app_tls`	`app`, `all-in-one`	Certificate is valid and not expired
`registry_v2`	`registry`, `edge`, `all-in-one`	`GET https://registry.{slug}.{base}/v2/` with credentials
`wireguard`	`vpn`, `edge`, `all-in-one`	`wg show wg0` exits 0

Any failed probe sets PhaseResult.status = FAILED, which triggers ProvisionAborted and rollback.
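
Probe selection by role can be expressed as a table lookup. In this sketch `PROBE_ROLES`, `probes_for`, and `phase_status` are illustrative names, and `"PASSED"` is a placeholder for whatever non-failed status PhaseResult actually uses.

```python
# Probe -> roles, transcribed from the table above (insertion order preserved).
PROBE_ROLES: dict[str, frozenset[str]] = {
    "docker": frozenset({"app", "registry", "edge", "all-in-one"}),
    "app_https": frozenset({"app", "all-in-one"}),
    "app_tls": frozenset({"app", "all-in-one"}),
    "registry_v2": frozenset({"registry", "edge", "all-in-one"}),
    "wireguard": frozenset({"vpn", "edge", "all-in-one"}),
}


def probes_for(role: str) -> list[str]:
    """Return the probes that apply to a role, in table order."""
    return [probe for probe, roles in PROBE_ROLES.items() if role in roles]


def phase_status(results: dict[str, bool]) -> str:
    """A single failed probe fails the whole phase."""
    return "PASSED" if all(results.values()) else "FAILED"
```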

Executors: FakeExecutor vs SSH¶

The orchestrator is transport-agnostic. You choose the executor via executor_factory.

`FakeExecutor` (default for tests and dry runs)¶

In-memory, zero network. Every command returns exit code 0 by default; you can script specific patterns to return failures. All commands and file writes are recorded on executor.commands and executor.files for assertions.

Use FakeExecutor in unit tests and in --dry-run mode.
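
To make the behaviour concrete, here is a minimal stand-in with the same shape. It is not the real FakeExecutor: the class name and the `script`, `run`, and `write_file` method names are assumptions; only the `commands` and `files` attributes come from the description above.

```python
import re
from dataclasses import dataclass, field


@dataclass
class FakeExecutorSketch:
    """In-memory executor: records everything, succeeds unless scripted otherwise."""

    commands: list = field(default_factory=list)
    files: dict = field(default_factory=dict)
    _scripted: list = field(default_factory=list)

    def script(self, pattern: str, exit_code: int) -> None:
        """Make any command matching the regex return exit_code."""
        self._scripted.append((re.compile(pattern), exit_code))

    def run(self, command: str) -> int:
        self.commands.append(command)
        for pattern, exit_code in self._scripted:
            if pattern.search(command):
                return exit_code
        return 0  # default: success

    def write_file(self, path: str, content: str) -> None:
        self.files[path] = content
```

A test can then script `sshd -t` to fail and assert that no reload command was recorded afterwards.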

`SSHRemoteExecutor` (live provisioning)¶

Wraps scripts/ssh_client.SSHClientWrapper (paramiko). Built by provisioner/ssh.py:ssh_executor_factory().

Single connection per provision run

The factory opens one SSH connection per server at the start. That connection stays alive for all five phases. Phase 3 moves sshd to a non-standard port and reloads — the existing channel stays alive through this, so you do not reconnect mid-run. Re-provisioning an already-hardened host requires passing the non-standard port as initial_port:

factory = ssh_executor_factory(audit, initial_port=spec.ssh_port)

The audit chain¶

Every command executed through any executor is appended to an AuditLog (provisioner/audit.py). Each entry carries:

The HMAC-SHA256 signature of the previous entry (prev_sig) and this entry (sig).
The redacted, canonically serialised command record.
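
The chaining idea can be sketched with the standard library. This is an illustration of an HMAC-linked log, not the AuditLog implementation: the exact bytes that are signed, the canonical serialisation, and the empty-string genesis value are all assumptions.

```python
import hashlib
import hmac
import json


def _sign(secret: bytes, prev_sig: str, record: dict) -> str:
    """HMAC-SHA256 over the previous signature plus a canonical JSON record."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(secret, prev_sig.encode() + payload, hashlib.sha256).hexdigest()


def append_entry(chain: list, secret: bytes, record: dict) -> None:
    """Append a record, linking it to the signature of the previous entry."""
    prev_sig = chain[-1]["sig"] if chain else ""
    chain.append(
        {"prev_sig": prev_sig, "record": record, "sig": _sign(secret, prev_sig, record)}
    )


def verify_chain(chain: list, secret: bytes):
    """Return (True, None) if intact, else (False, index of first bad entry)."""
    prev_sig = ""
    for index, entry in enumerate(chain):
        expected = _sign(secret, entry["prev_sig"], entry["record"])
        linked = entry["prev_sig"] == prev_sig
        if not linked or not hmac.compare_digest(entry["sig"], expected):
            return False, index
        prev_sig = entry["sig"]
    return True, None
```

Because each signature covers the previous one, editing or deleting an entry in the middle invalidates that entry or the one after it, and the verifier reports the first index where the chain breaks.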

Secrets (password, server_secret, registry_password, token, secret, private_key) are redacted at two points: by the executor before the record is built, and by the audit log's backstop redaction before writing.
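
A redaction pass along these lines might look like the sketch below. The matching rule (case-insensitive substring match on key names), the `***REDACTED***` marker, and both function names are assumptions; the key list is the one given above.

```python
SENSITIVE_KEYS = (
    "password",
    "server_secret",
    "registry_password",
    "token",
    "secret",
    "private_key",
)
MARKER = "***REDACTED***"


def redact(value):
    """Recursively mask values whose key name contains a sensitive word."""
    if isinstance(value, dict):
        return {
            key: MARKER
            if any(word in str(key).lower() for word in SENSITIVE_KEYS)
            else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


def scrub_command(command: str, secrets: list) -> str:
    """Mask known secret values that appear inside a command line."""
    for secret in secrets:
        if secret:  # skip empty strings, which would match everywhere
            command = command.replace(secret, MARKER)
    return command
```

Redaction written this way is idempotent, which is what makes a second backstop pass in the audit log safe: running it on an already redacted record changes nothing.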

Verify the chain after a run:

from provisioner.audit import AuditLog
audit = AuditLog("data/provision-audit.jsonl", secret=spec.server_secret)
ok, bad_idx = audit.verify_chain()
# ok=True, bad_idx=None means the log is intact

See Audit chain reference for the full chain specification.

The Linode cloud provider¶

provisioner/cloud/linode.py:LinodeProvider wraps the Linode v4 REST API:

Method	Description
`create_server(label, region, plan, image, authorized_keys, root_password)`	POST `/linode/instances`
`get_server(server_id)`	GET `/linode/instances/{id}`
`delete_server(server_id)`	DELETE `/linode/instances/{id}`
`wait_until_running(server_id, timeout, interval)`	Polls until status is `running` and `primary_ip` is set

Set LINODE_API_TOKEN in the environment or pass token= to the constructor. The HTTP transport is injectable for unit testing (pass a fake _Transport).
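
The polling behaviour of wait_until_running() can be sketched independently of the HTTP layer. The version below is an assumption-laden stand-in: it takes a `get_server` callable that returns a dict with `status` and `primary_ip` keys, and it accepts injectable `sleep` and `clock` functions so it can be tested without waiting.

```python
import time
from typing import Callable


def wait_until_running(
    get_server: Callable[[str], dict],
    server_id: str,
    timeout: float = 300.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Poll until the server is running and has an IP, or raise TimeoutError."""
    deadline = clock() + timeout
    while True:
        server = get_server(server_id)
        if server.get("status") == "running" and server.get("primary_ip"):
            return server
        if clock() >= deadline:
            raise TimeoutError(f"server {server_id} not running after {timeout}s")
        sleep(interval)
```

Requiring both conditions matters because an instance can report `running` before its address is usable by the SSH executor.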

End-to-end with `scripts/stage_deploy_linode.py`¶

The staging runner is the reference implementation for a complete create → provision → verify → (optionally) destroy cycle.

Dry run first (always)¶

No Linode API calls, no SSH, no charges. Uses FakeExecutor and a stub CloudServer.

uv run python scripts/stage_deploy_linode.py --dry-run \
  --name "Acme Training" \
  --base-domain ctfhive.us

This runs the full phase pipeline and verifies the audit chain. Fix any failures here before spending money on a real instance.

Live staging run¶

export LINODE_API_TOKEN=...

uv run python scripts/stage_deploy_linode.py \
  --name "Acme Training" \
  --base-domain ctfhive.us \
  --authorized-key ~/.ssh/id_ed25519.pub \
  --ssh-key ~/.ssh/id_ed25519 \
  --region us-east \
  --plan g6-standard-2 \
  --image linode/debian12 \
  --ssh-port 2222 \
  --cap 300 \
  --audit-log data/provision-audit.jsonl

If the provision succeeds, the script prints the result of audit.verify_chain(). Add --destroy to delete the Linode instance afterward (useful for CI pipelines).

Key flags¶

Flag	Default	Description
`--name`	(required)	Tenant name → DNS slug
`--base-domain`	(required)	e.g. `ctfhive.us`
`--admin-email`	`ops@ctfhive.us`	ACME contact
`--tier`	`standard`	`solo`, `standard`, or `pro`
`--ssh-port`	`2222`	Port sshd is moved to by phase 3
`--cap`	`300`	Wall-clock cap in seconds
`--audit-log`	`data/provision-audit.jsonl`	Path for the JSONL audit log
`--dry-run`	`false`	Offline rehearsal
`--token`	`$LINODE_API_TOKEN`	Linode API token
`--region`	`us-east`	Linode region
`--plan`	`g6-standard-2`	Linode type (Standard 4 GB)
`--image`	`linode/debian12`	Base image
`--authorized-key`	(required for live)	Path to operator SSH public key
`--ssh-key`	(required for live)	Path to SSH private key
`--destroy`	`false`	Delete the instance after a successful run

What happens on failure¶

If any phase fails or the wall-clock cap is exceeded, the orchestrator calls the rollback hook, if one was provided. In stage_deploy_linode.py, the rollback hook deletes the Linode instance and prints a warning if the deletion itself fails. When the hook has been called, ProvisionResult.status is set to ROLLED_BACK and the error is recorded; without a hook, the status is FAILED.
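
The failure handling can be sketched as a small driver loop. This is a simplified illustration, not provision_tenant(): `run_with_rollback` is a hypothetical name, phases are plain callables returning True on success, and the cap is checked only between phases rather than interrupting a phase in progress.

```python
import time
from typing import Callable, Optional


def run_with_rollback(
    phases: list,
    cap_seconds: float,
    rollback: Optional[Callable[[], None]] = None,
    clock: Callable[[], float] = time.monotonic,
):
    """Run (name, callable) phases in order; return (status, error)."""
    started = clock()
    error = None
    for name, phase in phases:
        if clock() - started > cap_seconds:
            error = f"wall-clock cap of {cap_seconds}s exceeded before phase {name!r}"
            break
        if not phase():
            error = f"phase {name!r} failed"
            break
    if error is None:
        return "ACTIVE", None
    if rollback is None:
        return "FAILED", error  # nothing to roll back with
    try:
        rollback()
    except Exception as exc:  # a failing rollback is reported, not raised
        error = f"{error}; rollback also failed: {exc}"
    return "ROLLED_BACK", error
```

Passing the rollback as an optional hook keeps the driver independent of any cloud provider: the staging script supplies instance deletion, while a dry run can supply nothing.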

Legacy: `scripts/provision_registry_vps.py`¶

Legacy script — not integrated

scripts/provision_registry_vps.py is an older standalone script that sets up a registry node using raw paramiko commands and certbot (not the phase pipeline). It is not integrated with provisioner/ and does not benefit from the audit chain or rollback logic. Do not use it for new deployments. It is retained only for historical reference.

ProvisionResult status values¶

Status	Meaning
`PENDING`	Not yet started
`RUNNING`	In progress
`ACTIVE`	All phases passed, healthchecks green
`FAILED`	A phase failed but no rollback was attempted
`ROLLED_BACK`	A phase failed, rollback hook was called
