The smoke runbook caught a real bug: ConsumeEnrollmentToken was
inserting into host_credentials (FK -> hosts) inside the same tx as
the token burn, but the host row didn't exist yet — CreateHost
runs in the *next* statement. The agent saw a generic 401 with no
clue why.
Fix: drop the host_credentials insert from ConsumeEnrollmentToken;
the HTTP handler now does Consume -> CreateHost ->
SetHostCredentials. SetHostCredentials failure is logged loudly
but doesn't fail the enrol — operator recovers via PUT
/api/hosts/{id}/repo-credentials.
Adds slog.Warn lines on both 401 paths in handleAgentEnroll so the
underlying cause is visible in server logs (the wire response stays
generic to avoid leaking which step failed).
Test: TestEnrollmentTransfersRepoCreds rewritten to mirror the new
order (consume -> create host -> SetHostCredentials).
Runbook (docs/e2e-smoke.md): rest-server moved off 8000 (commonly
in use); URLs use trailing slash on the rest path; clarified that
secrets_key is minted on first agent start, not at enrol time.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Operator-minted enrollment tokens now carry the repo URL/username/
password as one AEAD blob bound (via additional-data) to the token
hash. ConsumeEnrollmentToken re-encrypts under host_id and writes a
host_credentials row in the same tx as token-burn, so the binding
moves with the credential.
PUT /api/hosts/{id}/repo-credentials lets an operator edit creds
post-enrollment; merges with the existing blob, audits, and pushes
config.update if the agent is connected.
WS handler grows an OnHello hook that the HTTP layer wires to send
the host's decrypted creds as a config.update immediately after the
hello succeeds — synchronously, so a racing command.run lands after
the agent has its repo password.
Schema: 0002_host_credentials.sql adds enc_repo_creds to
enrollment_tokens and a host_credentials table (PK = host_id, FK
ON DELETE CASCADE).
Tests: round-trip token → consume → host_credentials with AAD swap
detection; no-creds path stays compatible.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-hosted deployments already terminate TLS at Caddy/Traefik/nginx;
making the server do TLS too means double cert config, dual ACME
plumbing, and an untested code path. Drop RM_TLS_CERT/RM_TLS_KEY,
remove TLSEnabled() and the ListenAndServeTLS branch.
Replace the cookie's "Secure if TLS-in-process" check with a new
RM_COOKIE_SECURE flag (default true). Local HTTP-only testing sets
RM_COOKIE_SECURE=false; production is always behind a TLS proxy and
the cookie stays Secure.
Default port :8443 → :8080. docker-compose binds 127.0.0.1 only and
populates RM_TRUSTED_PROXY. spec.md §4.1/§10.1 rewritten with a
Caddyfile snippet and a hard "do not expose RM_LISTEN publicly"
warning. enrollResponse keeps cert_pin_sha256 in the shape but the
server can't introspect a cert it doesn't terminate — operator
pastes the proxy's hash into -cert-pin at install time.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lands the protocol layer end-to-end: an agent can be enrolled
through the operator UI, store credentials, dial back to the server
over WS, complete the protocol_version handshake, and stay
connected with periodic heartbeats.
Server side:
- P1-09 ws.Hub: one Conn per host_id, last-write-wins eviction,
json envelope writer with a write mutex, reader, error envelopes.
- P1-09 ws.AgentHandler: bearer-auth, accept upgrade, hello-stage
(10s deadline, protocol_version checked against
api.MinAgentProtocolVersion → ErrProtocolTooOld with help URL on
reject), main read loop, defer hub register/unregister.
- P1-10 POST /api/agents/enroll consumes a one-time token, mints a
persistent agent bearer (sha-256 stored), creates a host row.
- P1-10 POST /api/enrollment-tokens (operator, session-auth)
issues a 1h one-time token.
- P1-11 hello upserts agent_version + restic_version +
protocol_version on the host row, flips status to online.
- P1-12 heartbeat touches last_seen_at; background sweeper marks
hosts offline after 90s without one.
- store: hosts table accessors, host_schedule_version,
enrollment_tokens FK on consumed_host dropped (audit-only field;
the token gets burned before the host row exists).
Agent side:
- P1-13 internal/agent/config: yaml at /etc/restic-manager/agent.yaml,
atomic Save (tmp+fsync+rename), Enrolled() helper.
- P1-15 internal/agent/wsclient: dial with bearer + optional
TLS cert pinning (sha-256 of leaf), exponential backoff with
jitter (1s → 60s cap), heartbeat goroutine, fatal handling for
ErrProtocolTooOld.
- P1-15 wsclient.Enroll: HTTP POST /api/agents/enroll with sysinfo.
- P1-17 internal/agent/sysinfo: hostname/OS/arch/restic-version
collection. restic detected by `restic version` parse; absent
restic doesn't block startup.
- cmd/agent: -enroll-server / -enroll-token flags drive first-run
enrollment then exit (so the install script can hand off to
systemd to run the persistent service).
End-to-end smoke verified: bootstrap → login → issue token →
enroll → run agent → server logs `ws agent connected` with the
right host_id and protocol_version 1.
All tests still pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>