Commit Graph

189 Commits

Author SHA1 Message Date
steve 413d0bdb1b restic: don't fall back to parent's HOME when picking the cache dir
Agent runs as root (HOME=/root from systemd) with ProtectHome=
read-only, so restic's `mkdir /root/.cache/restic` fails on the
first call. Backups still completed (restic falls back to no-cache)
but every job log started with a noisy red "unable to open cache"
warning.

Default to /var/lib/restic-manager unconditionally — that's already
in the unit's ReadWritePaths and survives ProtectHome. ExtraEnv
overrides still win for tests / unusual setups.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 13:43:10 +01:00
steve 047c1d1912 agent unit: drop SystemCallFilter — was killing restic with SIGSYS
Allow-list filter @system-service excludes some syscalls Go's
runtime + restic's file scanner reach for; init job died
immediately with "bad system call (core dumped)". CapabilityBounding
already constrains what root can do; the Protect*/Restrict* toggles
still cover network / kernel / mount / namespace. Net effect on the
threat model is negligible vs the operational cost.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 13:40:43 +01:00
steve c46024c03a Add CLAUDE.md with project-specific rules
Three rules to date:

* After every make build, restage the agent binary + install
  assets into /tmp/rm-smoke/data/ and replace the running agent
  on this dev box. Plain `make build` doesn't reach either, and
  forgetting has bitten the smoke env twice today (stale agent
  without mergeRestCreds; stale unit without User=root).

* Migrations: prefer ALTER TABLE DROP/RENAME COLUMN (SQLite
  3.35+) over the rebuild dance. With foreign_keys=ON in the DSN,
  DROP TABLE on a parent with ON DELETE CASCADE children wipes
  every dependent table — and PRAGMA foreign_keys=OFF inside a
  migration is a no-op (PRAGMA can only change outside a tx).

* Don't slog restic's merged URL. The user:pass@-embedded form
  exists only inside envSlice() at exec time; if any URL needs
  to be operator-visible, route it through restic.RedactURL.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 13:33:20 +01:00
steve f692ad592c restic: treat 'config file already exists' on init as soft success
Re-running restic init on a repo that's already initialised exits
non-zero with "Fatal: ... config file already exists". Semantically
that's a no-op, not a failure — the repo IS initialised, the
caller's intent is satisfied. Sniff stderr for the magic string
and swallow the exit code in that case, emitting an event line
so the operator-facing log says what happened.

Caught while smoke-testing P2-04.5: I'd init'd the repo manually
during a debug session, then the operator clicking the UI's
Init-repo button would hit this and the host's repo_initialised_at
would never flip.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 13:22:01 +01:00
steve c5777122db Add-host: default repo username to hostname; always show htpasswd snippet
The pending page suppressed the htpasswd snippet when repo_username
was blank — but with --private-repos the username is required for
auth, and operators routinely leave the field blank assuming the
system will pick something sensible.

* handleUIAddHostPost defaults repo_username to the typed hostname
  when blank. Matches what --private-repos expects (URL path
  segment == username).
* pending_host.html: snippet now renders whenever a password is
  present (always true after the generate-on-blank logic landed
  earlier).
* Form help-text updated to describe the default explicitly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 13:08:23 +01:00
steve c1f85da55f Add-host: durable pending page + polled awaiting-agent panel
Two issues from a smoke session:
1. The awaiting-agent panel never refreshed — operator had to go
   back to the dashboard to see the host had connected.
2. Generated passwords were displayed only on the POST response.
   Navigating away (or even an accidental tab close) lost them
   permanently, so the operator couldn't update the rest-server's
   htpasswd.

Both are the same fix: convert the POST-rendered transient
"result state" into a durable GET page at /hosts/pending/{token}.

* New route GET /hosts/pending/{token} renders the install-command +
  htpasswd snippet view. Password is decrypted from the (still-
  encrypted-at-rest) token row on every render — operator can
  refresh, bookmark, navigate away and come back. Once the agent
  enrols, the page redirects to /hosts/{id}; once the token
  expires, redirect to /hosts/new.
* New route GET /hosts/pending/{token}/awaiting returns a polled
  HTML fragment that the pending page swaps in every 2s via HTMX.
  States: awaiting (keep polling) | connected (show "Open host →"
  + "View schedules" CTAs, polling stops) | expired (mint-new
  link, polling stops). Polling stops naturally because only the
  awaiting state's wrapper carries the hx-trigger attribute.
* POST /hosts/new now 303-redirects to /hosts/pending/{token}
  on success; validation errors keep re-rendering the form with
  banner.

Supporting changes:
* New store helper Store.GetEnrollmentTokenStatus(tokenHash) for
  the polling endpoint — returns {expires_at, consumed_at,
  consumed_host} in one round-trip without dragging in the
  attachments-decryption path.
* New ui.Renderer.RenderPartial(w, name, data) for HTMX fragment
  responses (no layout wrap). Picks an arbitrary page's template
  set as the lookup point — every page parses the full common-
  paths list, so they all see every partial.
* add_host.html stripped to form-only; pending_host.html owns the
  result-state UI; awaiting_agent.html is the polled partial.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 12:59:24 +01:00
steve 8fb1c100fd P2-04.5: kill host.default_paths in favour of manual schedules
Two independent path lists for "what does this host back up?" was
a real divergence footgun — operator types one set at Add-host time
and a different set into a schedule, both end up in the same repo,
the snapshot history looks fine until restore. Resolution: drop
host.default_paths entirely; add a `manual` flag on schedules.
A manual schedule has paths/excludes/tags/retention like any other
but no cron — it fires only via per-schedule Run-now. Single source
of truth for what gets backed up.

Schema (migration 0007):
* schedules.manual INTEGER NOT NULL DEFAULT 0.
* For every host with non-empty default_paths, seed a manual
  schedule with those paths and bump host_schedule_version.
* ALTER TABLE hosts DROP COLUMN default_paths.
* ALTER TABLE enrollment_tokens RENAME COLUMN default_paths
  TO initial_paths.

Original draft of this migration rebuilt hosts via the
create-new + drop-old + rename-new pattern. With foreign_keys=ON
(set in the connection DSN), DROP TABLE on the parent fired
ON DELETE CASCADE on every child of hosts(id) — schedules /
jobs / snapshots / host_credentials all wiped on the smoke env
when I tried it. SQLite 3.35+ supports column-level ALTERs
directly, so we skip the rebuild dance and avoid the cascade
trap. Six lines of SQL instead of sixty, no FK risk.

Run-now rewiring:
* New `dispatchScheduleNow(hostID, scheduleID, conn?)` helper
  unifies the agent-driven path (cron fire → schedule.fire →
  OnScheduleFire callback) and the UI-driven path (operator
  clicks Run-now on a schedule row). Conn arg is optional; nil
  falls back to Hub.Send.
* New POST /hosts/{id}/schedules/{sid}/run endpoint — per-row
  Run-now button on the schedules list.
* Dashboard's per-host Run-now (handleUIRunBackup) now picks the
  host's only enabled manual schedule, falls back to the only
  enabled schedule, else returns "pick one in Schedules tab".
  Keeps one-click for the common case.

Agent:
* Scheduler skips manual schedules in cron build (silent — they're
  a normal data shape, not an error).
* Wire Schedule struct gains Manual flag.
* Schedule.fire flow unchanged — the agent only ever fires
  non-manual schedules anyway.

UI:
* Add-host form retitled "Initial schedule · manual" so the
  operator knows the paths become an editable schedule under
  the Schedules tab. Result page calls out the manual schedule
  + points at Host > Schedules.
* Schedule edit form: "Manual schedule" checkbox at the top of
  the When section; toggling it hides/shows the cron field via
  inline JS. Server-side validator skips the cron requirement
  when manual=true.
* Schedule list shows a "manual" tag under the status pill and
  renders the When column as "— run-now only —" for manual rows.
  Each row gets a Run-now button when the schedule is enabled
  and the host is online.

Tests + go test ./... green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 12:26:06 +01:00
steve c6237d4004 P2-04: schedule editor UI
Closes the schedule foundations slice — operator can now drive the
plumbing P2-01..03 landed without touching the JSON API.

* New routes:
  - GET  /hosts/{id}/schedules          (list)
  - GET  /hosts/{id}/schedules/new      (create form)
  - POST /hosts/{id}/schedules/new      (create)
  - GET  /hosts/{id}/schedules/{sid}/edit (edit form)
  - POST /hosts/{id}/schedules/{sid}/edit (update)
  - POST /hosts/{id}/schedules/{sid}/delete (delete, confirm-then-redirect)

* List view (web/templates/pages/schedules_list.html):
  status, cron, paths, retention summary, tags, edit/delete buttons.
  Header shows "version N · agent in sync" or "agent at vM" when the
  push hasn't been ack'd yet — backed by host_schedule_version +
  applied_schedule_version. Empty-state CTA points at /schedules/new.

* Create/edit form (web/templates/pages/schedule_edit.html, shared):
  cron expression with five quick-pick presets (daily 3am / every 6h
  / @hourly / weekly Sun / monthly 1st), paths textarea (one per
  line), excludes textarea, tags (comma-separated), retention as six
  numeric fields (mirrors restic's --keep-* flags one-for-one),
  bandwidth caps, enabled toggle. Side panel explains the
  reconciliation flow so the operator knows what saving actually
  does. Validation errors re-render with operator's input intact.

* internal/server/http/ui_schedules.go owns the handlers; reuses
  the same validateSchedule + pushScheduleSetAsync used by the JSON
  API path. Each save audit-logs schedule.created / schedule.updated
  / schedule.deleted (matching the JSON API actions).

* store.RetentionPolicy gains a Summary() method ("last=7, d=14,
  w=4" or "—"). Used by the list view's table cell so templates
  don't have to do any conditional retention rendering.

* Two new template helpers: list (string varargs → []string, used
  for the cron preset row) and joinComma (sibling to joinDot for
  the rare list that wants commas). RetentionPolicy.Summary covers
  the schedule-list case but the helpers are general.

* host_detail.html secondary tabs row converted from inert <div>s
  into <a> links. Snapshots active by default; Schedules now points
  at the new page. Jobs/Repo/Settings remain inert until their
  P2 owners ship.

Hooks UI deferred to P2-15 (lands with the hook execution path).
Single-kind UI (backup only) by design — other kinds get a UI when
their job dispatch lands in P2-05..08.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 11:44:40 +01:00
steve 608962441b P2-02 (agent side) + P2-03: agent scheduler + schedule.fire dispatch
Closes the schedule reconciliation loop end-to-end.

* New `internal/agent/scheduler` package wraps robfig/cron/v3 with
  the lifecycle the agent needs:
  - Apply(ScheduleSetPayload, Sender) stops the prior cron (waiting
    for in-flight entries to return), rebuilds from scratch, starts,
    and emits schedule.ack with the version we just applied.
  - Disabled entries skipped silently; bad cron exprs (which
    shouldn't reach us — the server validates — but defensive)
    log a warn and skip.
  - On each cron tick the entry sends a new schedule.fire envelope
    to the server with {schedule_id, scheduled_at}. The scheduler
    itself never builds CommandRunPayloads — server is the source
    of truth for jobs.
  - tx is swapped on every Apply, so reconnect is handled
    naturally: cron entries that fire against a dropped tx log
    "no active connection" and skip the tick.
  - Stop() is idempotent and waits for the cron's in-flight
    workers via cron.Stop().Done().

* New wire message api.MsgScheduleFire + api.ScheduleFirePayload
  for the agent → server "I just fired locally" RPC.

* Server-side dispatch (schedule_push.go: dispatchScheduledJob):
  looks up the schedule by id, validates ownership + that it's
  enabled, builds args from kind (paths for backup; other kinds
  are still arg-less in Phase 2 and grow as those job kinds land
  in P2-05..08), persists a jobs row with actor_kind=schedule +
  scheduled_id, and writes command.run back on the same conn so
  the agent runs through its existing dispatch path.

* store.CreateJob now writes scheduled_id. This column was in the
  schema since 0001 but never populated — the original P1 path
  only had operator-driven jobs, so actor_kind was always 'user'
  and scheduled_id was always nil.

* cmd/agent/main.go integration: dispatcher gains a
  *scheduler.Scheduler; the MsgScheduleSet case now hands the
  payload to scheduler.Apply (in a goroutine so the WS read loop
  keeps draining other messages).

* WS dispatcher gains OnScheduleFire alongside OnScheduleAck.

* Tests:
  - scheduler unit tests (4): ack-on-apply, cron tick fires
    schedule.fire envelope, disabled entries don't fire, replace-
    prior-state stops the old cron.
  - Server-side end-to-end: schedule.fire → command.run with the
    right job_id / kind / args, plus jobs row with actor_kind=
    "schedule" and scheduled_id linking back to the schedule.

Persistence of next-fire times across agent restarts is
deliberately deferred. A missed fire window during downtime
simply fires once on reconnect — that's the desirable behaviour
(the operator wants the missed backup to run, not be silently
skipped because we lost track of when it was due).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 11:29:12 +01:00
steve a086b0eb75 P2-02 (server side): schedule reconciliation push + ack handling
Server is now the source of truth for the agent's cron set.

* Helpers in schedule_push.go:
  - loadScheduleSetPayload reads the host's schedules + canonical
    version into the wire shape.
  - pushScheduleSetOnConn writes directly to a just-handshaken conn
    (avoids racing against Hub.Register on a brand-new connection).
  - pushScheduleSetAsync is the post-CRUD flavour — no-op when the
    host is offline (the next reconnect's on-hello path catches it
    up, so a missed push is non-fatal).
  - applyScheduleAck records what version the agent has confirmed.

* onAgentHello restructured: was returning early when the host had
  no repo credentials, which made the schedule push unreachable for
  fresh hosts. Split into pushRepoCredsOnHello (silent no-op on
  ErrNotFound) + pushScheduleSetOnConn (always runs). Empty schedule
  list is a valid push: tells the agent to drop stale cron entries.

* WS dispatcher gains an OnScheduleAck hook on HandlerDeps; the
  http server wires it to applyScheduleAck. MsgScheduleAck moves
  out of the "TODO(P2)" group into a real case that decodes the
  payload and forwards to the callback.

* Schedule CRUD handlers each fire pushScheduleSetAsync after the
  audit-log write so the agent picks up changes within seconds.

Tests cover:
  - On-hello push of an already-created schedule, agent acks,
    applied_schedule_version flips on the host row.
  - Connect-then-CRUD: empty initial push (version 0), then a
    follow-on push at version 1 after the operator creates a
    schedule via REST.

Agent-side `schedule.set` handler (parse, replace local cron,
emit `schedule.ack`) is the remainder of P2-02 and lands with
P2-03's local scheduler.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 11:22:06 +01:00
steve aa9fc330fc P2-01: schedule schema + CRUD API
The `schedules` table was already laid down in migration 0001; this
slice adds the Go-side data model, store CRUD with atomic version
bumps, and REST endpoints.

* `store.Schedule` + `RetentionPolicy` + `ScheduleOptions` typed
  views (the wire form on the agent side keeps retention/options
  as raw JSON since the agent just forwards them to restic).
* Store CRUD: CreateSchedule / GetSchedule / ListSchedulesByHost /
  UpdateSchedule / DeleteSchedule. Each mutation bumps
  `host_schedule_version` atomically in the same tx via UPSERT on
  `host_schedule_version`. SetHostAppliedScheduleVersion records
  what the agent has confirmed via schedule.ack (P2-02 will use it).
* REST endpoints under /api/hosts/{id}/schedules + /{sid}:
  GET (list, with the version envelope so callers can detect
  drift), POST (create), PUT (update — kind is immutable), DELETE.
* Validation: cron expressions parse via robfig/cron/v3 (same
  parser the agent will use, so anything that validates here will
  fire there); kind ∈ {backup, forget, prune, check} (init/unlock
  are operator-only one-shot kinds, not schedulable); backup
  schedules require ≥1 path; hooks rejected on non-backup kinds
  (spec §14.3).
* All mutations audit-logged.
* Tests: store-level CRUD + version-bump invariants; REST happy
  path (create→list→update→delete with version progression); REST
  validation table covers each rejection code.

newTestServerWithHub now sets BootstrapToken so the schedules
handler tests can use the existing login flow without a parallel
test-server constructor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 11:12:58 +01:00
steve c8ead66f08 P1 polish: agent-as-root, init-repo flow, rest creds passthrough, UX fixes
Cohesive batch from a smoke-test session against a real rest-server.
Themed bullets:

* Agent runs as root, sandboxed via systemd. CapabilityBoundingSet
  drops to CAP_DAC_READ_SEARCH + restore caps; ProtectSystem=strict
  with ReadWritePaths confined to /etc + /var/lib/restic-manager;
  NoNewPrivileges blocks escalation. Install script no longer
  creates a service user. spec.md §4.2 / §14.1 / §14.3 explain the
  rationale (matches UrBackup / Veeam / Bareos defaults; trying to
  back up "everything" as an unprivileged user creates silent skips
  on /home, /root, /var/lib/* with no upside vs the threat model
  the agent already implies).

* Init-repo end-to-end. New JobKind="init" wired through agent
  runner, restic.Env.RunInit, server dispatcher, and a UI button
  (red "Initialise repo" in the run-now panel). hosts.repo_initialised_at
  flips on init success, on backup success, or on a non-empty
  snapshots.report. The "Run now" / "Init" / "Retry" branching now
  drives both the dashboard host row and the host-detail panel.
  Migrations 0004 (column), 0005 (jobs.kind CHECK widened — using
  the safe create-new-then-rename pattern; first version corrupted
  job_logs.job_id FK), 0006 (cleans up job_logs FK on already-
  affected DBs).

* rest-server creds embedded at exec time only. restic.Env gains
  RepoUsername; mergeRestCreds() builds the user:pass@-prefixed URL
  inside envSlice() and never assigns it back to the struct, so
  nothing slog-able ever sees the cleartext form. RedactURL helper
  for any future surface that needs to log a URL safely. Both
  helpers tested.

* Add-host UX. Repo password is now optional — server mints a
  24-byte URL-safe random one and surfaces it once, alongside an
  htpasswd snippet ("echo PASS | htpasswd -B -i ... USERNAME") so
  the operator pastes one command on the rest-server host and one
  on the endpoint. Result page also links the install snippet at
  /install/install.sh (was /install.sh — 404'd before) and pipes
  to bash (not sh — script uses set -o pipefail and other
  bashisms; on Debian/Ubuntu sh is dash).

* Late-subscriber race in JobHub. A fast-failing job could finish
  (DB write + Broadcast) before the browser's HX-Redirect → page
  load → WS-connect path completed, so the JS sat forever waiting
  on a job.finished that already passed. JobHub split into
  Register + Send + Run; handleJobStream now subscribes first,
  re-fetches the job, and sends a synthetic job.finished if the
  state is already terminal.

* HTMX error visibility. New toast partial listens to
  htmx:responseError and surfaces the response body as a
  bottom-right toast — every server-side validation error now
  becomes visible without per-handler JS wiring. Also handles
  custom rm:toast events for future server-pushed notifications
  via the HX-Trigger header. Themed via existing CSS vars.

* Dashboard rows are now whole-row clickable to host detail
  (CSS card-link pattern: absolute-positioned anchor + .row-action
  z-index restoration so the action button stays clickable).
  "View →" on a running job links to /jobs/<id> rather than
  /hosts/<id> since the row click already covers the host page.

* "Run first" / "Run first backup" → "Run now" everywhere for
  consistency.

* runbook (docs/e2e-smoke.md) updated — live-log streaming step
  now reflects P1-26; mentions the browser-driven Run-now flow.

* _diag/dump-creds — moved out of cmd/ so go build doesn't pick
  it up; .gitignore now excludes /_diag/ entirely.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 11:02:12 +01:00
steve 8aa635f0c1 P1 polish: Host.default_paths interim + restic env hygiene + job_id JS quoting
Two fixes that close the loop on dashboard run-now and harden the
agent's restic invocation.

Default paths (interim until P2-01 schedules):
  - 0003 migration adds default_paths TEXT NOT NULL DEFAULT '[]'
    to hosts and to enrollment_tokens.
  - Operator types paths in the Add-host form (textarea, one per
    line). They ride on the enrol_token row alongside the
    encrypted creds (paths aren't secret — plain JSON column).
  - On consume, ConsumeEnrollmentToken still just burns the token;
    the new GetEnrollmentTokenAttachments returns both the
    re-bindable creds and the path list in one round trip, the
    handler transfers them onto the new host row inside CreateHost.
  - The dashboard's Run-now and host-detail's "Run backup now"
    button now read Host.DefaultPaths and pass them to dispatchJob.
    A host with no default paths returns 400 with a friendly
    "no paths set" message instead of dispatching a doomed
    `restic backup` with no positional args.
  - Doc comments explicitly call this out as a Phase 1 interim —
    schedules supersede.

Restic env hygiene:
  - envSlice() previously omitted HOME / XDG_CACHE_HOME, which
    bit the smoke runs whenever the agent was launched outside
    systemd (restic refused to start: "neither $XDG_CACHE_HOME
    nor $HOME are defined"). Now both are set explicitly: prefer
    Env.ExtraEnv overrides, fall back to the agent process's own
    HOME, and finally to /var/lib/restic-manager.
  - Comment makes the env policy explicit: parent's RESTIC_* /
    AWS_* / B2_* env is filtered out by design — control-plane
    is the unambiguous source of truth.

JS bug fix in the live log page:
  - {{$job.ID | printf "%q"}} produced a literal-quoted JS string,
    which then went into the WS URL as ".../jobs/"<ID>"/stream"
    → 404. Switched to '{{$job.ID}}' inside the literal so
    html/template's auto-escape does the right thing. Verified
    end-to-end: dashboard "Run now" → live progress + log lines
    arrive over the WS → succeeded pill renders.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 22:35:33 +01:00
steve e6729a5a3d P1-26: live job log viewer + WS browser fan-out hub
Closes the P1-21 remainder.

internal/server/ws/jobhub.go — new JobHub. Per-job_id set of
subscribers; each gets a 64-deep buffered channel with a writer
goroutine. Broadcast is non-blocking: if a subscriber is slow,
its channel fills and messages are dropped for that subscriber
only — the agent's read loop is never blocked by a stuck browser.

The agent dispatchAgentMessage path mirrors job.started /
job.progress / log.stream / job.finished envelopes onto the hub
in addition to its existing persistence work. The wire shape is
the same end-to-end, so client-side JS switches on env.type the
same way Go code does.

GET /api/jobs/{id}/stream is the browser endpoint. Auth via
session cookie (HTTP layer); upgrade; subscribe; pump until
context closes.

GET /jobs/{id} renders the live log page. Three states (queued/
running/succeeded/failed) drive the header pill, the progress
bar block, the failure summary panel, and the action button
(Cancel job while running, Back to host afterwards). Already-
persisted log lines are server-rendered on initial load; new
lines arrive over the WS and append to #log-stream. Auto-scrolls
unless the user scrolls up (a "⇢ Follow" pill re-attaches).
On job.finished the page reloads after 600ms to pick up the
final-state header rendered server-side.

POST /hosts/{id}/run-backup now sets HX-Redirect → /jobs/{job_id}
on success so HTMX lands the operator straight on the live log.
For non-HTMX callers (curl / plain form post) it 303s to the
same target.

store.ListJobLogs returns persisted log lines for initial render
on page load.

Browser-verified end-to-end: enrol → run a real backup against a
sibling restic/rest-server → live progress + 11 log lines stream
in → succeeded pill + final stats land after page reload.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 21:45:56 +01:00
steve cc9dcff816 P1-25: host detail page (snapshots tab default)
GET /hosts/{id} renders the v1 host detail layout:

  - persistent header: status dot (pulse if a job is in flight),
    monospace name, tags, plus a metadata strip (os/arch, agent
    version, restic version, "last seen Xs ago" or "online · last
    heartbeat …").
  - vitals strip: four tiles for last backup (status + relative
    time), repo size, snapshot count, open alerts.
  - sub-tabs: Snapshots is active; Jobs / Repo / Settings are
    visible but inert until P2.
  - snapshot table: short id, time (absolute), paths joined with
    " · ", size, file count, restore button (disabled — wires up
    in P3).
  - right rail: run-now stack (backup live, forget/prune/check/
    unlock disabled with the Phase tag), danger-zone remove panel
    (also disabled for now).

Empty state: when a host has no snapshots yet, the table replaces
itself with a "no snapshots yet" prompt that includes the run-now
button (provided the agent is online).

Pagination cap of 50 most-recent snapshots; full pagination lands
when fleet sizes demand it.

Template helpers grew: comma() now accepts int / int32 / int64 so
templates don't fight Go's type inference; joinDot() concatenates
a []string with " · "; absTime() formats time.Time as
YYYY-MM-DD HH:MM:SS; the existing relTime() already accepts T or
*T after P1-27.

Browser-verified end-to-end with seeded fixture data.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 20:20:21 +01:00
steve 9795492f2e P1-27: Add host flow — form + minted-token result page
GET /hosts/new renders the focused two-column form (hostname,
tags, repo URL/username/password). POST /hosts/new validates,
mints a one-time token via the new mintEnrollmentToken helper —
shared with the existing JSON /api/enrollment-tokens endpoint —
and re-renders the same page in result state showing:

  - the install command with RM_SERVER + RM_TOKEN filled in (and
    an inline copy-to-clipboard button),
  - an "awaiting agent connection" panel with the hostname
    pre-filled,
  - a troubleshooting list pointing at the most common reasons
    the agent doesn't appear,
  - back-to-dashboard / add-another-host links.

publicURL() resolves RM_BASE_URL first, falling back to scheme +
Host on the inbound request — useful for local smoke without a
proxy.

Browser-verified end-to-end: form submit → token minted → install
command renders with the right values from the form input.

template fn formatRelTime now accepts time.Time *or* *time.Time
so templates can pass either without fighting Go's lack of an
address-of operator.

Deferred: download-preconfigured-installer (a templated .sh with
the values baked in) — copy-paste covers v1; nice-to-have later.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 20:16:54 +01:00
steve 86f7c17d9d P1-24: live dashboard — fleet summary tiles + host table
Server-rendered HTML view backed by:
  - new store.FleetSummary aggregating host counts + repo bytes +
    snapshot total + open alerts + last-24h job rollup in two queries.
  - GET /api/hosts (JSON list of hosts in the dashboard projection).
  - GET /api/fleet/summary (JSON aggregate, same shape as above).

The HTML page (web/templates/pages/dashboard.html) renders the four
summary tiles + host table directly from store data — no separate
fetch. Per-row state colour comes from .host-row.{degraded,failed,
offline} which paint a 3px left edge so problem hosts are scannable
without reading. HTMX is loaded into the base layout so per-row
"Run now" buttons can hx-post to /hosts/{id}/run-backup, a thin
HTML wrapper that funnels into a new dispatchJob helper shared
with the JSON /api/hosts/{id}/jobs endpoint.

Empty state (zero hosts) collapses to the "no hosts yet" prompt
with the + Add host CTA — matches the v1 mockup.

Template helpers (internal/server/ui/funcs.go) added for byte
formatting (412 GB / 3.7 TB), relative time (3m ago / 2d ago), and
comma grouping (1,847). Pure Go, no template-magic dependency.

Browser-verified end-to-end with seeded fixture data: five hosts
across all four states render with correct dots, accents, last-
backup pills, sizes, snapshot counts, alerts, tags, and the right
action button (Run now / Retry / Run first / View → / offline).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 19:29:11 +01:00
steve 55242caf58 P1-23 / P1-28: base layout, login, session-aware nav + Tailwind build
P1-28: Tailwind standalone CLI wired into the Makefile. `make tailwind`
downloads the pinned v3.4.17 binary into bin/tailwindcss (gitignored),
builds web/styles/input.css → web/static/css/styles.css. `make build`
now runs the CSS pass first; `make tailwind-watch` for dev. Output is
embedded in the binary via web.FS — single static binary, no Node.

The CSS source carries every component class the v1 mockups defined
(status dots, buttons, host row, log viewer, progress bar, fields,
chips, snippet panel, empty state) so screens that land later can
just reach for them.

P1-23: html/template tree at web/templates with two layouts (base
with chrome, chromeless for login + bootstrap), one nav partial, and
two pages (dashboard placeholder, login). internal/server/ui parses
the tree at startup; ui_handlers.go in the http package wires:

  GET  /         dashboard (303 → /login when unauthed)
  GET  /login    sign-in form
  POST /login    consume form, mint session cookie, 303 → /
  POST /logout   drop cookie, 303 → /login
  GET  /static/* embedded Tailwind bundle

The HTML login flow shares store/session logic with /api/auth/login
via a new authenticateAndSession helper — same security guarantees,
two surface representations (HTML form / JSON).

Verified end-to-end: bootstrap → form-login → authed dashboard →
sign-out → 303 cycle works in the browser; Tailwind output emits
only the component classes referenced in the live templates (9.6kB
minified).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 19:19:06 +01:00
steve 8b7b1479a1 design: extend v1 to login / add-host / host-detail / job-log + lock components
Five hi-fi screens completing the Phase 1 surface, all in v1's dark
operator-console register.

  v1-login          Sparse centred card. Sign-in + first-error variant.
                    No marketing chrome; build version sits in footer
                    so a returning operator can spot agent drift.

  v1-add-host       Focused two-column page (form left, contextual
                    "what happens next" right) — not a modal. Two
                    states: form (state A) and minted-token result
                    with install command (state B). Backed by
                    POST /api/enrollment-tokens (P1-32).

  v1-host-detail    Persistent header (status dot, mono name, tags,
                    primary CTAs, vitals strip) over four sub-tabs
                    (Snapshots / Jobs / Repo / Settings). Snapshots
                    is the default — the thing 90% of operators
                    want when they click a host name. Right rail
                    holds Recent activity, run-now stack, and a
                    danger-zone panel.

  v1-job-log        WS-streamed log view. Three states: running (live
                    progress bar + auto-scroll cursor), succeeded
                    (summary stats + final lines), failed (error
                    panel + tail). Backed by WS /api/jobs/{id}/stream
                    (P1-21 remainder).

  v1-components     The load-bearing reference. 14 sections covering
                    tokens (colour + type scale), status, buttons,
                    form fields, tags, tabs, host row, log viewer,
                    progress bar, stat tile, modal, toast, install
                    snippet, empty-state pattern. Every CSS class is
                    real and copy-able into the Go template build.

This locks the visual register before P1-23 onwards. Each Phase 1
template gets a {{define}} matching a section in v1-components.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 19:05:39 +01:00
steve cca525a04d design: v1 polish — row accents, wider last-backup col, empty state
- Single .host-row CSS rule replaces 13 inline grid-template-columns
  copies; column widths bumped so "backup running…" doesn't wrap.
- Faint left-edge accent for degraded / failed / offline rows so
  problem hosts are scannable without reading.
- Empty-state hero added: top-bar + nav still present (Dashboard
  active, others dimmed) but body collapses to a calm "no hosts yet"
  prompt with the install command as the load-bearing affordance.
  Prerequisite note keeps the deliberate "restic must already be
  installed" decision visible to first-time operators.

This is the artefact P1-23/24/27 will template against.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 18:48:15 +01:00
steve afce98f105 design: three hi-fi dashboard directions for review
Three deliberately differentiated takes on the dashboard so we can
lock the visual register before the UI work starts (P1-23 onwards).

  v1 — Operator console (Linear/Datadog dark register).
       Dense table, monospace numerics, restrained colour, pulsing
       status dot only when a job is running. The natural fit for
       the audience and the most defensible choice.

  v2 — Editorial calm (Stripe/Notion light register).
       Serif hero headline that humanises the data, cards with
       breathing room in a 2-up grid, demoted "quiet hosts" strip,
       subtle rust accent. Reads as trustworthy infrastructure.

  v3 — Print spec (Tufte/aerospace monospace register).
       Pure monospace, near-monochrome, status as typeset glyphs
       (●▶▲○✗) so the screen survives greyscale. "Requires
       attention" block groups problem hosts at the top; activity
       tail reads like a real log. Most polarising; highest
       craft ceiling.

Each file is self-contained (Tailwind via CDN + Google Fonts) and
includes a philosophy preamble + the dashboard hero + a component
vocabulary section so we can read the system, not just one screen.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 18:39:57 +01:00
steve 9798a2b5fe agent: log accept/complete on backup jobs; audit: populate host.enrolled payload
Two warts surfaced during the smoke run:

- Agent was silent between "config.update applied" and "job
  finished" — operators tailing journalctl saw no acknowledgement
  that a command.run had landed. Adds Info logs at job-accept
  ({job_id, paths}) and at successful completion.

- The host.enrolled audit row had an empty {} payload. Now
  carries {hostname, os, arch, has_repo_creds} so an audit-log
  reader can answer "what got enrolled and did the operator
  bundle creds with the token" without joining back to hosts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 18:24:56 +01:00
steve 44feb708bc fix: enrollment FK race + log-when-rejected; runbook fixes from dry-run
The smoke runbook caught a real bug: ConsumeEnrollmentToken was
inserting into host_credentials (FK -> hosts) inside the same tx as
the token burn, but the host row didn't exist yet — CreateHost
runs in the *next* statement. The agent saw a generic 401 with no
clue why.

Fix: drop the host_credentials insert from ConsumeEnrollmentToken;
the HTTP handler now does Consume -> CreateHost ->
SetHostCredentials. SetHostCredentials failure is logged loudly
but doesn't fail the enrol — operator recovers via PUT
/api/hosts/{id}/repo-credentials.

Adds slog.Warn lines on both 401 paths in handleAgentEnroll so the
underlying cause is visible in server logs (the wire response stays
generic to avoid leaking which step failed).

Test: TestEnrollmentTransfersRepoCreds rewritten to mirror the new
order (consume -> create host -> SetHostCredentials).

Runbook (docs/e2e-smoke.md): rest-server moved off 8000 (commonly
in use); URLs use trailing slash on the rest path; clarified that
secrets_key is minted on first agent start, not at enrol time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 14:01:59 +01:00
steve 6cfbdfc7ab P1-34: e2e smoke runbook + redacted GET /repo-credentials
Adds docs/e2e-smoke.md — an ~5-minute runbook that walks the full
P1 happy path against a sibling restic/rest-server: bootstrap
admin, mint token with repo creds, enrol an agent, watch the
config.update push land, run a backup, confirm the snapshot, edit
creds and watch the second push fire. Per the design discussion
this is a runbook (not a Go integration test); the Playwright
version lands in P5-06.

GET /api/hosts/{id}/repo-credentials returns the redacted view —
{repo_url, repo_username, has_password} — so the UI can pre-fill
the edit form without ever pulling the password out of the AEAD
blob.

Marks P1-32 / P1-33 / P1-34 done in tasks.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 13:49:34 +01:00
steve 27086783da P1-33: agent-side encrypted secrets store + push-on-update
New internal/agent/secrets package: AEAD blob at
/var/lib/restic-manager/secrets.enc, atomic write (os.CreateTemp +
Sync + Rename), 0600. Key lives in agent.yaml as base64
(SecretsKey) — same trust boundary as the bearer token, minted on
first start via EnsureSecretsKey.

cmd/agent: dispatcher reads creds fresh from secrets.Load() on
each job rather than from in-memory config. config.update merges
the push with what's on disk and persists, so a daemon restart
keeps the latest values. Legacy plaintext repo_url/repo_password
in agent.yaml are silently migrated into secrets.enc on next start
and stripped from the YAML on the following save.

Tests: round-trip + wrong-key rejection + atomic-write
post-condition for secrets; key idempotence + legacy-field
parse/clear for config.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 12:41:28 +01:00
steve b3b89045f2 P1-32: server-side encrypted repo creds + push-on-hello
Operator-minted enrollment tokens now carry the repo URL/username/
password as one AEAD blob bound (via additional-data) to the token
hash. ConsumeEnrollmentToken re-encrypts under host_id and writes a
host_credentials row in the same tx as token-burn, so the binding
moves with the credential.

PUT /api/hosts/{id}/repo-credentials lets an operator edit creds
post-enrollment; merges with the existing blob, audits, and pushes
config.update if the agent is connected.

WS handler grows an OnHello hook that the HTTP layer wires to send
the host's decrypted creds as a config.update immediately after the
hello succeeds — synchronously, so a racing command.run lands after
the agent has its repo password.

Schema: 0002_host_credentials.sql adds enc_repo_creds to
enrollment_tokens and a host_credentials table (PK = host_id, FK
ON DELETE CASCADE).

Tests: round-trip token → consume → host_credentials with AAD swap
detection; no-creds path stays compatible.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 12:38:35 +01:00
steve 8d8150ee6e spec/tasks: pull repo-credential plumbing into Phase 1
Adds P1-32/33/34: encrypted repo creds carried on the enrollment token,
agent-side AEAD secrets file, end-to-end smoke. spec.md §4.2 and §7.3
rewritten to describe the full flow (server-issued at token time,
pushed via config.update on hello, persisted encrypted on the agent)
and to make the encrypted-file-now / OS-keyring-Phase-2 split
explicit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 12:32:53 +01:00
steve 51bbb555d4 tasks: add P2-18 announce-and-approve, expand P1-27 with preconfigured installer
P2-18 captures the keypair + fingerprint-comparison enrollment flow
as a Phase 2 alternative to the token model. Includes guards
(rate limit, pending cap, hostname-collision flagging) and explicit
acceptance criteria.

P1-27 grows to mint encrypted repo creds alongside the token and
expose a one-click preconfigured-installer download from the
"Add host" form (cf. UrBackup Internet-mode push installer).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 12:31:28 +01:00
steve 8d5282a180 P1-22: snapshot listing via restic snapshots --json
Agent calls restic snapshots --json after each successful backup
(60s timeout, separate from the backup ctx) and ships the projection
over the existing snapshots.report WS envelope. Failure here is
logged but doesn't fail the job — the next successful backup catches
the projection up.

Server-side ReplaceHostSnapshots is delete-then-insert plus a
hosts.snapshot_count update in one transaction so the dashboard's
per-host count stays consistent with the projection. New read
endpoint GET /api/hosts/{id}/snapshots returns the cached list with
a refreshed_at marker so the UI can show staleness when an agent
has been offline.

Schema: dropped the unused snapshots.repo_id FK (repos as a
first-class entity is P2 work), added short_id and refreshed_at
columns, switched the time index to DESC for the most-recent-first
list query. api.Snapshot gains short_id; size_bytes/file_count come
from the embedded summary block on restic 0.16+ and stay zero on
older clients.

Tests cover round-trip, authoritative replacement after forget+prune
shrinkage, and empty-after-wipe.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 11:20:57 +01:00
steve 811157b4ce server: drop in-process TLS — HTTP-only behind reverse proxy
Self-hosted deployments already terminate TLS at Caddy/Traefik/nginx;
making the server do TLS too means double cert config, dual ACME
plumbing, and an untested code path. Drop RM_TLS_CERT/RM_TLS_KEY,
remove TLSEnabled() and the ListenAndServeTLS branch.

Replace the cookie's "Secure if TLS-in-process" check with a new
RM_COOKIE_SECURE flag (default true). Local HTTP-only testing sets
RM_COOKIE_SECURE=false; production is always behind a TLS proxy and
the cookie stays Secure.

Default port :8443 → :8080. docker-compose binds 127.0.0.1 only and
populates RM_TRUSTED_PROXY. spec.md §4.1/§10.1 rewritten with a
Caddyfile snippet and a hard "do not expose RM_LISTEN publicly"
warning. enrollResponse keeps cert_pin_sha256 in the shape but the
server can't introspect a cert it doesn't terminate — operator
pastes the proxy's hash into -cert-pin at install time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 11:20:41 +01:00
steve 80a57b3b84 tasks.md: mark Phase 1 progress
Captures the state landed in this session:

Done (P1-01..03, P1-05, P1-06, P1-08..16, P1-17..20, P1-29):
  HTTP server, store + schema, crypto, first-run bootstrap,
  every API type with wire-shape tests, WS transport,
  enrollment + hello + heartbeat round-trip, agent config +
  service unit + WS client + sysinfo, restic wrapper, job
  lifecycle store + run-now endpoint, agent runner.

Partial (P1-04, P1-07, P1-21, P1-31):
  CSRF middleware lives with the UI work; audit middleware
  sweep lives with rest of API; live job-log fan-out needs
  the per-job browser hub; signed agent binaries deferred to
  Phase 5.

Open (P1-22..28):
  Snapshot listing, full UI suite (login, dashboard, host
  detail, live job log, add-host, Tailwind build).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 00:46:16 +01:00
steve a7c6a6e09c phase 1: run-now backup — restic wrapper, job lifecycle, end-to-end
Lands the operator → server → agent → restic → server roundtrip for
on-demand backups. The flow:

  POST /api/hosts/{id}/jobs {kind:"backup",args:["/path"]}
    → server creates a queued Job row
    → server emits command.run over WS to the host's agent
    → agent dispatcher spawns runner.RunBackup in a goroutine
    → runner spawns `restic backup --json`, parses each line
    → forwards: job.started, log.stream (every line), job.progress
      (throttled to 1/sec), job.finished (with summary stats blob)
    → server WS handler persists those into jobs / job_logs

P1-16 internal/restic: thin Locate + Env wrapper that runs `restic
  backup --json`, scans stdout/stderr, parses BackupStatus +
  BackupSummary, calls back into a LineHandler so the agent can fan
  out to log.stream + job.progress. Treats exit code 3 as
  "succeeded with issues" (matches restic's contract).

P1-18 store: jobs accessors (CreateJob, MarkJobStarted,
  MarkJobFinished, AppendJobLog, GetJob).

P1-19 server: POST /api/hosts/{id}/jobs creates the Job row,
  validates kind, dispatches via Hub.Send, audit-logs the action.

P1-20 agent runner: wraps restic.RunBackup with throttled progress
  emission. Sender abstraction was added to wsclient.Handler so
  background goroutines can keep replying after dispatch returns.

P1-21 server WS: dispatchAgentMessage now persists job.started,
  job.finished, log.stream into the database. Browser fan-out for
  live tailing lands with the UI work.

Agent gets repo_url + repo_password from agent.yaml in plaintext
for now (mode 0600, owned by service user); spec.md §7.3's keyring
storage moves there in P2. config.update over WS overrides the
in-memory copy (does not persist).

Build clean; all tests pass. End-to-end with a real restic still
needs a host that has restic installed — wire shape verified by
the existing hello/heartbeat round-trip test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 00:45:04 +01:00
steve 24ab071702 phase 1: agent install path — systemd unit, install.sh, asset endpoints
P1-14 deploy/install/restic-manager-agent.service: standard systemd
  unit with the usual hardening switches (NoNewPrivileges, Protect*,
  RestrictRealtime, MemoryDenyWriteExecute). Restart=always with a
  5s backoff. Runs as a dedicated unprivileged restic-manager-agent
  user; the install script creates it.

P1-29 deploy/install/install.sh: arch detection (amd64/arm64), pulls
  the agent binary from /agent/binary, creates the service user
  + dirs (/etc/restic-manager, /var/lib/restic-manager), runs
  enrollment via `agent -enroll-server -enroll-token`, lays down
  the systemd unit, enables and starts it.

  Honours the spec's "detect, don't auto-disable" rule for existing
  schedulers: scans systemd timers, /etc/cron.d/*, /etc/cron.daily/*,
  root crontab for restic-named entries and prints them with the
  exact disable command — operator decides.

P1-31 server endpoints to ship the agent installation payload:
  GET /agent/binary?os=linux&arch=amd64 → serves
    <DataDir>/agent-binaries/restic-manager-agent-linux-amd64
  GET /install/<file>                   → serves
    <DataDir>/install/<file>
  Both endpoints reject path traversal and return 404 if the file
  isn't published. Operators drop the binaries + service unit into
  these directories at release time. Signed-bundle verification is
  deferred to Phase 5 OSS readiness.

All tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 00:40:36 +01:00
steve 9cc0caff1e phase 1: WS transport, enrollment, agent that hellos and heartbeats
Lands the protocol layer end-to-end: an agent can be enrolled
through the operator UI, store credentials, dial back to the server
over WS, complete the protocol_version handshake, and stay
connected with periodic heartbeats.

Server side:
- P1-09 ws.Hub: one Conn per host_id, last-write-wins eviction,
  json envelope writer with a write mutex, reader, error envelopes.
- P1-09 ws.AgentHandler: bearer-auth, accept upgrade, hello-stage
  (10s deadline, protocol_version checked against
  api.MinAgentProtocolVersion → ErrProtocolTooOld with help URL on
  reject), main read loop, defer hub register/unregister.
- P1-10 POST /api/agents/enroll consumes a one-time token, mints a
  persistent agent bearer (sha-256 stored), creates a host row.
- P1-10 POST /api/enrollment-tokens (operator, session-auth)
  issues a 1h one-time token.
- P1-11 hello upserts agent_version + restic_version +
  protocol_version on the host row, flips status to online.
- P1-12 heartbeat touches last_seen_at; background sweeper marks
  hosts offline after 90s without one.
- store: hosts table accessors, host_schedule_version,
  enrollment_tokens FK on consumed_host dropped (audit-only field;
  the token gets burned before the host row exists).

Agent side:
- P1-13 internal/agent/config: yaml at /etc/restic-manager/agent.yaml,
  atomic Save (tmp+fsync+rename), Enrolled() helper.
- P1-15 internal/agent/wsclient: dial with bearer + optional
  TLS cert pinning (sha-256 of leaf), exponential backoff with
  jitter (1s → 60s cap), heartbeat goroutine, fatal handling for
  ErrProtocolTooOld.
- P1-15 wsclient.Enroll: HTTP POST /api/agents/enroll with sysinfo.
- P1-17 internal/agent/sysinfo: hostname/OS/arch/restic-version
  collection. restic detected by `restic version` parse; absent
  restic doesn't block startup.
- cmd/agent: -enroll-server / -enroll-token flags drive first-run
  enrollment then exit (so the install script can hand off to
  systemd to run the persistent service).

End-to-end smoke verified: bootstrap → login → issue token →
enroll → run agent → server logs `ws agent connected` with the
right host_id and protocol_version 1.

All tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 00:39:00 +01:00
steve df2c584b23 phase 1: HTTP server + first-run bootstrap
P1-01 chi router, slog request log, graceful shutdown via signal
  context. Health endpoint, /api/auth/login, /api/auth/logout,
  /api/bootstrap. Background sweeper for expired sessions and
  enrollment tokens (15 min cadence).

P1-04 (sessions half) HttpOnly Secure-when-TLS cookie carrying a
  base64url token; server stores SHA-256(token) so a stolen DB
  doesn't yield credentials. Unknown user and bad password collapse
  to the same 401 response code so a probe can't enumerate names.

P1-05 first-run admin bootstrap. On a fresh DB the server mints a
  one-time token and prints it to stderr inside a banner. The
  /api/bootstrap handler accepts {token, username, password},
  creates the first admin, then becomes a 409 forever.

P1-07 (partial) audit hooks fire on auth.login and auth.bootstrap.
  Full middleware-driven coverage lands with the rest of the API.

internal/server/config: env > YAML > defaults. RM_LISTEN /
  RM_DATA_DIR / RM_BASE_URL / RM_TLS_CERT / RM_TLS_KEY /
  RM_SECRET_KEY_FILE / RM_TRUSTED_PROXY (CIDR list, validated).

End-to-end smoke test passes: server boots on a fresh dir,
prints the bootstrap token, POST /api/bootstrap creates the admin,
POST /api/auth/login returns 200 with a session cookie.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 00:28:18 +01:00
steve f55747a281 phase 1 foundations: api types, store, crypto, auth
Lands the bottom three layers of Phase 1:

P1-08 internal/api: protocol_version + envelope + every WS message
  shape from spec.md §6.2 (Hello, Heartbeat, Job*, Schedule*, etc).
  Wire-format tests pin the JSON shape so a rename here breaks
  tests instead of silently breaking the agent.

P1-02 + P1-03 internal/store: SQLite via modernc.org/sqlite,
  embed.FS + a tiny version table for hand-rolled migrations.
  0001_initial.sql covers every table from spec.md §5 plus
  enrollment_tokens and host_schedule_version. Typed accessors
  for users / sessions / enrollment / audit. WAL + foreign_keys
  + busy_timeout on by default.

P1-06 internal/crypto: XChaCha20-Poly1305 AEAD wrapper with
  per-message random nonce. Key file lifecycle (generate +
  refuse-to-overwrite, load with size validation). Optional
  additionalData binds ciphertext to the row that owns it.

P1-04 internal/auth (partial — passwords + tokens; sessions
  middleware lands with the HTTP handlers): argon2id following
  RFC 9106 (64 MiB / t=3 / p=4 / 32B), constant-time verify.
  HashToken stores SHA-256 of session/agent/enrollment tokens
  so a stolen DB doesn't hand over credentials.

Build floor moves to Go 1.25 (modernc.org/sqlite v1.50+ requires
it); CI + Dockerfile + README updated. Markdown lint diagnostics
on tasks.md cleared.

All packages tested. ~70 new tests pass in <1s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 00:24:40 +01:00
steve c821ec1fe0 spec/tasks: address pre-Phase-1 design feedback
Doc-only changes captured before any Phase 1 code lands.

spec.md:
- §4.1 nhooyr.io/websocket → github.com/coder/websocket (the
  maintained fork; the original is unmaintained)
- §4.1 RM_LISTEN documented as source of truth for the bind port;
  add RM_TRUSTED_PROXY env var for X-Forwarded-* handling behind
  Caddy/Traefik
- §4.2 Phase 1 ships Linux only; Windows binaries continue to build
  in CI to keep the codebase portable, but service integration +
  installer move to Phase 2
- §4.2 self-update via apt/choco, not bespoke signed binaries
- §5 add Host.protocol_version + Host.applied_schedule_version
- §6.2 lock protocol_version handshake semantics (clean error on
  mismatch, not weird JSON parse failures)
- §6.2 schedule reconciliation when server unreachable: agent keeps
  firing last-known-good indefinitely; server's view canonical on
  reconnect; UI surfaces drift via applied_schedule_version
- §6.2 schedule.set carries schedule_version; new schedule.ack
  agent→server message
- §10.1 cross-reference RM_LISTEN ↔ compose port mapping
- §14.3 hooks rejected at validation on non-backup schedule kinds

tasks.md:
- P1-14 / P1-30 (Windows service + install.ps1) → Phase 2 as
  P2-16 / P2-17
- P1-29 install.sh detects existing restic timers/cron and prints
  disable commands, doesn't auto-disable
- Phase 1 acceptance: drop Windows from end-to-end criterion,
  require windows cross-compile in CI
- P4-01 rewritten: package-manager-based update delivery
- P5-08 removed (duplicate of P4-08 Prometheus /metrics)
- Various references updated

No Go code changes; build still clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 00:12:55 +01:00
steve 25aa001135 phase 0: project bootstrap
P0-01 Go module + cmd/server + cmd/agent skeletons + internal/ tree
P0-02 LICENSE (PolyForm NC 1.0.0), README, CONTRIBUTING
P0-03 golangci-lint, pre-commit, .editorconfig, .gitignore
P0-04 Gitea Actions CI: test (race+coverage), lint, cross-platform build matrix
P0-05 Dockerfile.server (multi-stage, distroless/static), docker-compose.yml
P0-06 Makefile with build/test/lint/fmt/run/release targets

build, vet, test, and cross-compile to linux/{amd64,arm64} + windows/amd64
all verified locally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 00:03:59 +01:00
steve ab02869d82 initial setup ready 2026-04-30 23:55:52 +01:00