Provisioning a tenant

The provisioner/ package automates the full setup of a CTFHive edge node: system update, package installation, security hardening, service bring-up, TLS configuration, and health verification — all over SSH, all audited.

Cloud provider: Linode only

The only implemented cloud backend is provisioner/cloud/linode.py (LinodeProvider). CLAUDE.md refers to Hetzner Cloud — no Hetzner implementation exists. The Linode v4 API client is what you use today. The CloudProvider base class (provisioner/cloud/base.py) is the extension point for adding other providers.

For control-plane context on when provisioning is triggered (Stripe webhook → tenant lifecycle), see Control plane overview.

For the audit chain used to record every provisioning command, see Audit chain reference.


Core concepts

TenantSpec — what you provision

TenantSpec (in provisioner/models.py) is the Pydantic v2 model that describes a complete tenant. Build one from validated webhook data before calling the orchestrator.

| Field | Type | Notes |
| --- | --- | --- |
| name | str | Human-readable name; becomes the DNS slug |
| base_domain | str | e.g. ctfhive.us |
| tier | "solo" \| "standard" \| "pro" | Informational for now; gates future feature flags |
| servers | list[ServerSpec] | At least one server required |
| admin_email | str | Caddy's ACME contact address |
| authorized_key | str | Operator SSH public key (ed25519/rsa/ecdsa) |
| ssh_port | int | Default 22; phase 3 moves sshd to this port |
| server_secret | str | ≥16 chars; used for per-team flag derivation HMAC |
| registry_user | str | Default "forge" |
| registry_password | str | ≥8 chars |
| wall_clock_cap_seconds | int | Default 300; range 60–1800; abort + rollback if exceeded |

TenantSpec.slug is derived automatically from name by slugify_name().
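The exact rules of slugify_name() are not documented here. As a rough illustration only, a DNS-safe slug could be derived as in the sketch below; the function body is an assumption and the real implementation may treat punctuation, Unicode, or length differently.

```python
import re


def slugify_name(name: str) -> str:
    """Hypothetical stand-in for the provisioner's slugify_name(); real rules may differ."""
    # Lowercase, then collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower())
    # A DNS label may not start or end with a hyphen and is limited to 63 characters.
    return slug.strip("-")[:63].strip("-")


print(slugify_name("Acme Training"))
```

Under these assumed rules, "Acme Training" becomes acme-training, which would give the tenant the host name acme-training.ctfhive.us when base_domain is ctfhive.us.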

ServerSpec — a target node

Each server in TenantSpec.servers is a ServerSpec:

| Field | Type | Notes |
| --- | --- | --- |
| role | "app" \| "registry" \| "vpn" \| "edge" \| "all-in-one" | Controls which packages and services are installed |
| ssh | SSHCredentials | Connection details |
| label | str | Human label for logs and audit records |
| provider_id | str \| None | Cloud instance ID (e.g. Linode instance ID) |
| region | str \| None | Cloud region |
| plan | str \| None | Cloud plan / type |

SSHCredentials:

| Field | Default | Notes |
| --- | --- | --- |
| host | | Hostname or IP |
| port | 22 | Initial SSH port (before phase 3 moves it) |
| user | "root" | |
| private_key_path | None | Path to private key file |
| password | None | Password auth; key auth is strongly preferred |

Role package matrix

Phase 2 installs role-scoped packages. Here is what each role gets:

| Role | Docker | Caddy | WireGuard |
| --- | --- | --- | --- |
| app | Yes | Yes | |
| registry | Yes | | |
| vpn | | | Yes |
| edge | Yes | Yes | Yes |
| all-in-one | Yes | Yes | Yes |
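The matrix can be expressed as a simple lookup. The sketch below is an illustration, not the contents of phase2_packages.py: it reads a blank cell as "not installed" (app gets Docker and Caddy, registry gets Docker, vpn gets WireGuard), and the names ROLE_EXTRAS and extras_for are hypothetical. The strings are component labels, not necessarily apt package names.

```python
# Role -> extra components, transcribed from the matrix above.
# Assumption: an empty cell means the component is not installed for that role.
ROLE_EXTRAS: dict[str, set[str]] = {
    "app": {"docker", "caddy"},
    "registry": {"docker"},
    "vpn": {"wireguard"},
    "edge": {"docker", "caddy", "wireguard"},
    "all-in-one": {"docker", "caddy", "wireguard"},
}


def extras_for(role: str) -> list[str]:
    """Return the role-scoped components in a stable, sorted order."""
    if role not in ROLE_EXTRAS:
        raise ValueError(f"unknown role: {role!r}")
    return sorted(ROLE_EXTRAS[role])


for role in ROLE_EXTRAS:
    print(role, extras_for(role))
```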

The phase pipeline

The orchestrator (provisioner/orchestrator.py:provision_tenant()) runs phases over each server in role-dependency order: vpn → registry → edge → all-in-one → app.
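One way to picture the role-dependency ordering (vpn, registry, edge, all-in-one, app) is a stable sort keyed on role rank. This is a sketch under assumptions: servers are plain dicts here rather than ServerSpec objects, and order_servers is a hypothetical helper, not a function from orchestrator.py.

```python
# Provisioning order by role, as stated in the text.
ROLE_ORDER = ["vpn", "registry", "edge", "all-in-one", "app"]


def order_servers(servers: list[dict]) -> list[dict]:
    """Sort servers by role rank; sorted() is stable, so same-role servers keep their order."""
    rank = {role: index for index, role in enumerate(ROLE_ORDER)}
    return sorted(servers, key=lambda server: rank[server["role"]])


servers = [
    {"label": "app-1", "role": "app"},
    {"label": "vpn-1", "role": "vpn"},
    {"label": "reg-1", "role": "registry"},
    {"label": "app-2", "role": "app"},
]
print([server["label"] for server in order_servers(servers)])
```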

Within each server, phases run in this fixed order:

Phase 1 — system (phase1_system.py)

apt-get update + dist-upgrade + autoremove. Brings the freshly-booted image fully up to date before anything is installed. Idempotent.

Phase 2 — packages (phase2_packages.py)

Installs the base package set (ca-certificates, curl, gnupg, ufw, fail2ban, apache2-utils, jq, git) plus role-scoped extras. Docker and Caddy are installed via guarded one-liners (command -v docker || ...) so re-runs are no-ops.
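The guarded one-liner pattern can be sketched as a small string builder. The helper name guarded and the install command passed to it are placeholders for illustration; the actual Docker and Caddy install commands used by phase 2 are not shown in this page.

```python
import shlex


def guarded(binary: str, install_cmd: str) -> str:
    """Build a shell line that runs install_cmd only when binary is not already on PATH."""
    # `command -v` exits 0 when the binary exists, so `||` skips the install on re-runs.
    return f"command -v {shlex.quote(binary)} >/dev/null 2>&1 || {{ {install_cmd}; }}"


# The install command below is a placeholder, not the provisioner's real installer.
print(guarded("docker", "echo 'install docker here'"))
```

Because the check runs on the target host each time, a second provisioning run produces the same command but the install half never executes.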

Phase 3 — hardening (phase3_hardening.py)

  • SSH: writes /etc/ssh/sshd_config.d/99-forge.conf (non-standard port, password auth off, key auth only, MaxAuthTries 3), runs sshd -t to validate before systemctl reload ssh. A validation failure aborts before the reload so you cannot be locked out.
  • UFW: default deny incoming, then allows the SSH port first, then role-specific ports (80/tcp, 443/tcp, 51820/udp as applicable).
  • sysctl: writes /etc/sysctl.d/99-forge.conf (rp_filter, syncookies, no redirects/source-routing, kptr_restrict=2, ptrace_scope=1, ASLR).
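The validate-before-reload guard in the SSH step is the part that prevents lockouts, so it is worth a sketch. The drop-in below uses standard sshd_config directives matching the settings listed above; the run and write_file callables are hypothetical stand-ins for whatever the executor exposes, and run is assumed to return the command's exit code.

```python
from typing import Callable

SSHD_DROPIN = "/etc/ssh/sshd_config.d/99-forge.conf"


def render_sshd_dropin(ssh_port: int) -> str:
    """Render a drop-in with the settings named in the text (port, key-only auth, MaxAuthTries)."""
    lines = [
        f"Port {ssh_port}",
        "PasswordAuthentication no",
        "PubkeyAuthentication yes",
        "MaxAuthTries 3",
    ]
    return "\n".join(lines) + "\n"


def apply_sshd_hardening(
    run: Callable[[str], int],
    write_file: Callable[[str, str], None],
    ssh_port: int,
) -> None:
    write_file(SSHD_DROPIN, render_sshd_dropin(ssh_port))
    # Validate first: a bad config must never reach the running daemon.
    if run("sshd -t") != 0:
        raise RuntimeError("sshd -t rejected the new config; not reloading")
    run("systemctl reload ssh")


# Demo with a fake transport whose validation step fails.
calls: list[str] = []
files: dict[str, str] = {}


def fake_run(cmd: str) -> int:
    calls.append(cmd)
    return 1 if cmd == "sshd -t" else 0


try:
    apply_sshd_hardening(fake_run, files.__setitem__, 2222)
except RuntimeError as exc:
    print(exc)
print(calls)
```

In the demo the reload command never appears in calls, which is the property that keeps a broken config from taking effect.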

Phase 4 — services (phase4_services.py)

Brings up the services each role needs:

  • VPN (roles: vpn, edge, all-in-one): generates a WireGuard server keypair once (test -f wg0.key || wg genkey), writes /etc/wireguard/wg0.conf (Address 10.13.13.1/24, ListenPort 51820, SaveConfig false), and enables wg-quick@wg0.
  • Registry (roles: registry, edge, all-in-one): creates /opt/forge/registry/auth and /opt/forge/registry/data, generates an htpasswd file for registry_user/registry_password, and runs distribution/distribution:3 on 127.0.0.1:5000.
  • App (roles: app, all-in-one): writes /opt/forge/app/.env with SITE_URL, SECRET_KEY, REGISTRY_HOST/USER/PASS, LAB_ENABLED=true, etc., then runs docker compose --env-file .env up -d.
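For the VPN bullet, the rendered wg0.conf might look like the output of the sketch below. The Address, ListenPort, and SaveConfig values come from the text; the function name render_wg0_conf and the placeholder key are assumptions, and the real template may contain additional settings.

```python
def render_wg0_conf(private_key: str) -> str:
    """Render a minimal wg-quick [Interface] section with the values given in the text."""
    return (
        "[Interface]\n"
        "Address = 10.13.13.1/24\n"
        "ListenPort = 51820\n"
        f"PrivateKey = {private_key}\n"
        "SaveConfig = false\n"
    )


# Placeholder only; on a real host the key is produced once by `wg genkey`.
print(render_wg0_conf("<server-private-key>"))
```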

TLS step — between services and healthcheck

After phase 4 completes on any server whose role is app, edge, or all-in-one, apply_caddy_tls() (provisioner/tls.py) runs before the healthcheck:

  1. Renders /etc/caddy/Caddyfile routing:
       • {slug}.{base_domain} → 127.0.0.1:8000 (app)
       • registry.{slug}.{base_domain} → 127.0.0.1:5000 (registry)
       • *.{slug}.{base_domain} → 127.0.0.1:8000 with DNS-01 (Cloudflare plugin; requires CF_API_TOKEN in the server environment)
  2. caddy validate --config /etc/caddy/Caddyfile — aborts if the config is invalid.
  3. systemctl reload caddy.
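A Caddyfile with the three routes described in this step could be rendered roughly as follows. This is a sketch, not the template in provisioner/tls.py: the function name and layout are assumptions, and the wildcard block assumes a Caddy build that includes the Cloudflare DNS plugin.

```python
def render_caddyfile(slug: str, base_domain: str, admin_email: str) -> str:
    """Render app, registry, and wildcard routes for one tenant (illustrative layout)."""
    fqdn = f"{slug}.{base_domain}"
    lines = [
        "{",
        f"    email {admin_email}",  # ACME contact address
        "}",
        "",
        f"{fqdn} {{",
        "    reverse_proxy 127.0.0.1:8000",
        "}",
        "",
        f"registry.{fqdn} {{",
        "    reverse_proxy 127.0.0.1:5000",
        "}",
        "",
        f"*.{fqdn} {{",
        "    tls {",
        # A wildcard certificate needs the DNS-01 challenge; the token is read from the environment.
        "        dns cloudflare {env.CF_API_TOKEN}",
        "    }",
        "    reverse_proxy 127.0.0.1:8000",
        "}",
        "",
    ]
    return "\n".join(lines)


print(render_caddyfile("acme-training", "ctfhive.us", "ops@ctfhive.us"))
```

Caddy prefers the most specific matching host, so the registry block takes precedence over the wildcard block for registry.{slug}.{base_domain}.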

Phase 5 — healthcheck (phase5_healthcheck.py)

Runs role-appropriate probes on the server itself (using curl, openssl, wg):

| Probe | Roles | Check |
| --- | --- | --- |
| docker | app, registry, edge, all-in-one | docker info exits 0 |
| app_https | app, all-in-one | curl https://{fqdn}/ returns 2xx |
| app_tls | app, all-in-one | Certificate is valid and not expired |
| registry_v2 | registry, edge, all-in-one | GET https://registry.{slug}.{base}/v2/ with credentials |
| wireguard | vpn, edge, all-in-one | wg show wg0 exits 0 |

Any failed probe sets PhaseResult.status = FAILED, which triggers ProvisionAborted and rollback.
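Probe selection by role can be sketched as a lookup over the probe table. The mapping is transcribed from the table; the helper names probes_for and failed_probes are hypothetical, and the real phase wraps the outcome in a PhaseResult rather than returning plain lists.

```python
# Probe name -> roles it applies to, in the order the table lists them.
PROBE_ROLES: dict[str, set[str]] = {
    "docker": {"app", "registry", "edge", "all-in-one"},
    "app_https": {"app", "all-in-one"},
    "app_tls": {"app", "all-in-one"},
    "registry_v2": {"registry", "edge", "all-in-one"},
    "wireguard": {"vpn", "edge", "all-in-one"},
}


def probes_for(role: str) -> list[str]:
    """Return the probes that apply to a role, preserving table order."""
    return [name for name, roles in PROBE_ROLES.items() if role in roles]


def failed_probes(results: dict[str, bool]) -> list[str]:
    """Names of probes that did not pass; any entry here means the phase has failed."""
    return [name for name, passed in results.items() if not passed]


print(probes_for("edge"))
print(failed_probes({"docker": True, "wireguard": False}))
```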


Executors: FakeExecutor vs SSH

The orchestrator is transport-agnostic. You choose the executor via executor_factory.

FakeExecutor (default for tests and dry runs)

In-memory, zero network. Every command returns exit code 0 by default; you can script specific patterns to return failures. All commands and file writes are recorded on executor.commands and executor.files for assertions.

Use FakeExecutor in unit tests and in --dry-run mode.
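To show the idea rather than the real class, here is a minimal in-memory executor with scripted failures. Only the commands and files attributes are named in this page; the class name SketchExecutor and the methods script, run, and write_file are assumptions, so check the actual FakeExecutor for its real interface.

```python
import re
from dataclasses import dataclass, field


@dataclass
class SketchExecutor:
    """Illustrative in-memory executor; not the provisioner's FakeExecutor."""

    commands: list[str] = field(default_factory=list)
    files: dict[str, str] = field(default_factory=dict)
    scripted: list[tuple[re.Pattern, int]] = field(default_factory=list)

    def script(self, pattern: str, exit_code: int) -> None:
        """Make any command matching pattern return exit_code."""
        self.scripted.append((re.compile(pattern), exit_code))

    def run(self, command: str) -> int:
        self.commands.append(command)  # record every command for later assertions
        for pattern, exit_code in self.scripted:
            if pattern.search(command):
                return exit_code
        return 0  # success by default

    def write_file(self, path: str, content: str) -> None:
        self.files[path] = content


executor = SketchExecutor()
executor.script(r"^sshd -t", 1)
print(executor.run("apt-get update"), executor.run("sshd -t"))
```

A test can then assert on executor.commands to confirm that, for example, no reload was issued after the scripted validation failure.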

SSHRemoteExecutor (live provisioning)

Wraps scripts/ssh_client.SSHClientWrapper (paramiko). Built by provisioner/ssh.py:ssh_executor_factory().

Single connection per provision run

The factory opens one SSH connection per server at the start. That connection stays alive for all five phases. Phase 3 moves sshd to a non-standard port and reloads — the existing channel stays alive through this, so you do not reconnect mid-run. Re-provisioning an already-hardened host requires passing the non-standard port as initial_port:

factory = ssh_executor_factory(audit, initial_port=spec.ssh_port)

The audit chain

Every command executed through any executor is appended to an AuditLog (provisioner/audit.py). Each entry carries:

  • The HMAC-SHA256 signature of the previous entry (prev_sig) and this entry (sig).
  • The redacted, canonically serialised command record.

Secrets (password, server_secret, registry_password, token, secret, private_key) are redacted at two points: by the executor before the record is built, and by the audit log's backstop redaction before writing.

Verify the chain after a run:

from provisioner.audit import AuditLog
audit = AuditLog("data/provision-audit.jsonl", secret=spec.server_secret)
ok, bad_idx = audit.verify_chain()
# ok=True, bad_idx=None means the log is intact

See Audit chain reference for the full chain specification.


The Linode cloud provider

provisioner/cloud/linode.py:LinodeProvider wraps the Linode v4 REST API:

| Method | Description |
| --- | --- |
| create_server(label, region, plan, image, authorized_keys, root_password) | POST /linode/instances |
| get_server(server_id) | GET /linode/instances/{id} |
| delete_server(server_id) | DELETE /linode/instances/{id} |
| wait_until_running(server_id, timeout, interval) | Polls until status is running and primary_ip is set |

Set LINODE_API_TOKEN in the environment or pass token= to the constructor. The HTTP transport is injectable for unit testing (pass a fake _Transport).
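The polling behaviour of wait_until_running can be sketched without touching the network by injecting the lookup function. This is a simplified stand-in, not LinodeProvider's code: servers are plain dicts with status and primary_ip keys, and the sleep and clock parameters are added here only to make the loop testable.

```python
import time
from typing import Callable


def wait_until_running(
    get_server: Callable[[int], dict],
    server_id: int,
    timeout: float = 300.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Poll get_server until the instance is running with an IP, or raise TimeoutError."""
    deadline = clock() + timeout
    while True:
        server = get_server(server_id)
        if server.get("status") == "running" and server.get("primary_ip"):
            return server
        if clock() >= deadline:
            raise TimeoutError(f"server {server_id} not running after {timeout}s")
        sleep(interval)


# Demo with canned responses instead of real API calls (192.0.2.0/24 is a documentation range).
responses = iter([
    {"status": "provisioning", "primary_ip": None},
    {"status": "booting", "primary_ip": "192.0.2.10"},
    {"status": "running", "primary_ip": "192.0.2.10"},
])
server = wait_until_running(lambda _id: next(responses), 123, interval=0, sleep=lambda _s: None)
print(server)
```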


End-to-end with scripts/stage_deploy_linode.py

The staging runner is the reference implementation for a complete create → provision → verify → (optionally) destroy cycle.

Dry run first (always)

No Linode API calls, no SSH, no charges. Uses FakeExecutor and a stub CloudServer.

uv run python scripts/stage_deploy_linode.py --dry-run \
  --name "Acme Training" \
  --base-domain ctfhive.us

This runs the full phase pipeline and verifies the audit chain. Fix any failures here before spending money on a real instance.

Live staging run

export LINODE_API_TOKEN=...

uv run python scripts/stage_deploy_linode.py \
  --name "Acme Training" \
  --base-domain ctfhive.us \
  --authorized-key ~/.ssh/id_ed25519.pub \
  --ssh-key ~/.ssh/id_ed25519 \
  --region us-east \
  --plan g6-standard-2 \
  --image linode/debian12 \
  --ssh-port 2222 \
  --cap 300 \
  --audit-log data/provision-audit.jsonl

If provisioning succeeds, the script prints the result of audit.verify_chain(). Add --destroy to delete the Linode instance afterward (useful for CI pipelines).

Key flags

| Flag | Default | Description |
| --- | --- | --- |
| --name | (required) | Tenant name → DNS slug |
| --base-domain | (required) | e.g. ctfhive.us |
| --admin-email | ops@ctfhive.us | ACME contact |
| --tier | standard | solo, standard, or pro |
| --ssh-port | 2222 | Port sshd is moved to by phase 3 |
| --cap | 300 | Wall-clock cap in seconds |
| --audit-log | data/provision-audit.jsonl | Path for the JSONL audit log |
| --dry-run | false | Offline rehearsal |
| --token | $LINODE_API_TOKEN | Linode API token |
| --region | us-east | Linode region |
| --plan | g6-standard-2 | Linode type (Standard 4 GB) |
| --image | linode/debian12 | Base image |
| --authorized-key | (required for live) | Path to operator SSH public key |
| --ssh-key | (required for live) | Path to SSH private key |
| --destroy | false | Delete the instance after a successful run |

What happens on failure

If any phase fails or the wall-clock cap is exceeded, the orchestrator calls the rollback hook (if provided). In stage_deploy_linode.py, rollback deletes the Linode instance and prints a warning if deletion itself fails. The ProvisionResult.status is set to ROLLED_BACK and the error is recorded.
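The failure path can be sketched as a loop with a deadline and an optional rollback hook. This is a simplification under stated assumptions: the function name provision_with_cap is hypothetical, phases are plain callables returning True on success, the result is a dict rather than a ProvisionResult, and the cap is only checked between phases, whereas the real orchestrator may enforce it differently.

```python
import time
from typing import Callable, Optional


def provision_with_cap(
    phases: list[tuple[str, Callable[[], bool]]],
    cap_seconds: float,
    rollback: Optional[Callable[[], None]] = None,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Run phases in order; on failure or cap overrun, call the rollback hook if one exists."""
    deadline = clock() + cap_seconds
    try:
        for name, phase in phases:
            if clock() > deadline:
                raise TimeoutError(f"wall-clock cap exceeded before phase {name}")
            if not phase():
                raise RuntimeError(f"phase {name} failed")
    except (TimeoutError, RuntimeError) as exc:
        if rollback is None:
            # No hook supplied: report the failure without attempting cleanup.
            return {"status": "FAILED", "error": str(exc)}
        try:
            rollback()
        except Exception as rollback_exc:
            # Mirror the staging script: warn, but still report the original error.
            print(f"warning: rollback failed: {rollback_exc}")
        return {"status": "ROLLED_BACK", "error": str(exc)}
    return {"status": "ACTIVE", "error": None}


deleted: list[str] = []
result = provision_with_cap(
    [("system", lambda: True), ("packages", lambda: False)],
    cap_seconds=300,
    rollback=lambda: deleted.append("instance"),
)
print(result, deleted)
```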


Legacy: scripts/provision_registry_vps.py

Legacy script — not integrated

scripts/provision_registry_vps.py is an older standalone script that sets up a registry node using raw paramiko commands and certbot (not the phase pipeline). It is not integrated with provisioner/ and does not benefit from the audit chain or rollback logic. Do not use it for new deployments. It is retained only for historical reference.


ProvisionResult status values

| Status | Meaning |
| --- | --- |
| PENDING | Not yet started |
| RUNNING | In progress |
| ACTIVE | All phases passed, healthchecks green |
| FAILED | A phase failed but no rollback was attempted |
| ROLLED_BACK | A phase failed, rollback hook was called |