restic-manager/tasks.md
steve 160d788bae
P2-04: schedule editor UI
Closes the schedule foundations slice — operator can now drive the
plumbing P2-01..03 landed without touching the JSON API.

* New routes:
  - GET  /hosts/{id}/schedules          (list)
  - GET  /hosts/{id}/schedules/new      (create form)
  - POST /hosts/{id}/schedules/new      (create)
  - GET  /hosts/{id}/schedules/{sid}/edit (edit form)
  - POST /hosts/{id}/schedules/{sid}/edit (update)
  - POST /hosts/{id}/schedules/{sid}/delete (delete, confirm-then-redirect)

* List view (web/templates/pages/schedules_list.html):
  status, cron, paths, retention summary, tags, edit/delete buttons.
  Header shows "version N · agent in sync" or "agent at vM" when the
  push hasn't been ack'd yet — backed by host_schedule_version +
  applied_schedule_version. Empty-state CTA points at /schedules/new.

* Create/edit form (web/templates/pages/schedule_edit.html, shared):
  cron expression with five quick-pick presets (daily 3am / every 6h
  / @hourly / weekly Sun / monthly 1st), paths textarea (one per
  line), excludes textarea, tags (comma-separated), retention as six
  numeric fields (mirrors restic's --keep-* flags one-for-one),
  bandwidth caps, enabled toggle. Side panel explains the
  reconciliation flow so the operator knows what saving actually
  does. Validation errors re-render with operator's input intact.

* internal/server/http/ui_schedules.go owns the handlers; reuses
  the same validateSchedule + pushScheduleSetAsync used by the JSON
  API path. Each save audit-logs schedule.created / schedule.updated
  / schedule.deleted (matching the JSON API actions).

* store.RetentionPolicy gains a Summary() method ("last=7, d=14,
  w=4" or "—"). Used by the list view's table cell so templates
  don't have to do any conditional retention rendering.

* Two new template helpers: list (string varargs → []string, used
  for the cron preset row) and joinComma (sibling to joinDot for
  the rare list that wants commas). RetentionPolicy.Summary covers
  the schedule-list case but the helpers are general.

* host_detail.html secondary tabs row converted from inert <div>s
  into <a> links. Snapshots active by default; Schedules now points
  at the new page. Jobs/Repo/Settings remain inert until their
  P2 owners ship.

Hooks UI deferred to P2-15 (lands with the hook execution path).
Single-kind UI (backup only) by design — other kinds get a UI when
their job dispatch lands in P2-05..08.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 11:44:40 +01:00


restic-manager — Tasks

Tasks are grouped by phase. Each task has an ID for cross-referencing, an estimated size (S/M/L), and acceptance criteria.

Sizes: S = under a day, M = 1–3 days, L = 3–7 days.


Phase 0 — Project bootstrap

  • P0-01 (S) Initialize Go module, cmd/server, cmd/agent, baseline internal/ packages
  • P0-02 (S) Add LICENSE (PolyForm Noncommercial 1.0.0), README stub, CONTRIBUTING placeholder
  • P0-03 (S) Set up golangci-lint, gofumpt, goimports; pre-commit config
  • P0-04 (S) Gitea Actions: build matrix (linux amd64/arm64, windows amd64), unit tests, lint
  • P0-05 (S) Dockerfile.server (multi-stage, distroless), deploy/docker-compose.yml
  • P0-06 (S) Makefile / taskfile.yml with common targets (build, test, run, release)

Phase 1 — MVP: enrollment, visibility, on-demand backup

Server foundations

  • P1-01 (M) HTTP server scaffolding (chi, structured logging via slog, graceful shutdown)
  • P1-02 (M) SQLite store layer (modernc.org/sqlite) + migrations (hand-rolled, embed.FS)
  • P1-03 (M) Schema for users, sessions, hosts, repos, credentials, jobs, job_logs, snapshots, audit_log
  • [~] P1-04 (M) Auth: argon2id password hashing, login/logout, session cookies; CSRF middleware deferred to P1-23 (UI work) — REST clients use bearer/session-only flows
  • P1-05 (S) First-run admin bootstrap (printed one-time setup token in server logs)
  • P1-06 (M) Secret encryption helper (AEAD with key from RM_SECRET_KEY_FILE)
  • [~] P1-07 (M) Audit log writer; middleware sweep for every state-changing endpoint lands when the rest of the API surface does — login / bootstrap / host.enrolled / job.run_now currently audited
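
The secret encryption helper in P1-06 is described only as "AEAD with key from RM_SECRET_KEY_FILE". A minimal sketch of that idea follows, assuming AES-256-GCM as the AEAD and a nonce-prefixed blob layout; the task list confirms neither. The names `seal`, `open`, and `newGCM` are hypothetical, not the project's API.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// newGCM builds an AES-GCM AEAD from a 32-byte key.
// Assumption: the project may use a different AEAD construction.
func newGCM(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// seal encrypts plaintext and returns nonce||ciphertext.
func seal(key, plaintext []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	// Passing nonce as dst appends the ciphertext after the nonce.
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// open reverses seal and fails if the blob was tampered with.
func open(key, blob []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	if len(blob) < gcm.NonceSize() {
		return nil, errors.New("blob shorter than nonce")
	}
	nonce, ct := blob[:gcm.NonceSize()], blob[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ct, nil)
}

func main() {
	// Assumption: a real server would read this key from the file named by
	// RM_SECRET_KEY_FILE; a random key stands in for it here.
	key := make([]byte, 32)
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	blob, err := seal(key, []byte("repo-password"))
	if err != nil {
		panic(err)
	}
	plain, err := open(key, blob)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(plain))
}
```

A fresh random nonce per call means the same secret encrypts to a different blob each time, and any modification of the stored blob makes `open` return an error instead of corrupted plaintext.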

Agent ↔ server protocol

  • P1-08 (M) Define shared API types in internal/api (envelopes, every WS message + protocol_version constants; JSON-shape tests pin the wire)
  • P1-09 (L) WebSocket transport (github.com/coder/websocket), framed JSON envelopes, RPC correlation IDs, exponential-backoff reconnect on the agent side
  • P1-10 (M) Enrollment flow: POST /api/agents/enroll with one-time token → returns persistent bearer. Cert pin field stays in the response shape but is left empty: the server is HTTP-only behind a reverse proxy, so the operator pastes the proxy's cert hash into the install command rather than having the server introspect a cert it doesn't terminate.
  • P1-11 (M) Agent registration on connect (hello upserts agent_version/restic_version/protocol_version, flips status online, protocol_too_old rejection has clean error envelope)
  • P1-12 (S) Heartbeat handler (touches last_seen_at; background sweeper marks hosts offline after 90s without one)

Agent foundations

  • P1-13 (M) Agent config file (/etc/restic-manager/agent.yaml, atomic save: tmp+fsync+rename); Windows path deferred to Phase 2
  • P1-14 (M) Service integration: systemd unit (sandboxed: NoNewPrivileges, Protect*, MemoryDenyWriteExecute) — Linux only in Phase 1
  • P1-15 (M) Outbound WS client (github.com/coder/websocket) with reconnect (1s → 60s exponential + jitter), optional cert pin (sha-256 of leaf), heartbeats, protocol_version in hello
  • P1-16 (M) Restic wrapper: locate via PATH or override, run with --json, scan stdout/stderr, parse BackupStatus + BackupSummary, exit-code 3 treated as success-with-issues
  • P1-17 (S) Host metadata collection (OS, arch, hostname, restic version, agent version, protocol_version)
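
P1-15 specifies a reconnect delay that grows from 1s to 60s exponentially, with jitter. One way to compute it is sketched below. The doubling factor and the "random point in the upper half" jitter are assumptions, since the task list only says "exponential + jitter"; `backoff` and `withJitter` are hypothetical names.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

const (
	baseDelay = 1 * time.Second  // first retry delay from P1-15
	maxDelay  = 60 * time.Second // ceiling from P1-15
)

// backoff returns the un-jittered delay before reconnect attempt n
// (0-based): baseDelay doubled n times, capped at maxDelay.
func backoff(attempt int) time.Duration {
	d := baseDelay
	for i := 0; i < attempt && d < maxDelay; i++ {
		d *= 2
	}
	if d > maxDelay {
		d = maxDelay
	}
	return d
}

// withJitter picks a random delay in [d/2, d] so that a fleet of agents
// does not reconnect in lockstep after a server restart.
func withJitter(d time.Duration) time.Duration {
	half := d / 2
	return half + time.Duration(rand.Int63n(int64(half)+1))
}

func main() {
	for attempt := 0; attempt < 8; attempt++ {
		d := backoff(attempt)
		fmt.Printf("attempt %d: base %v, jittered %v\n", attempt, d, withJitter(d))
	}
}
```

A real client would reset the attempt counter to zero after a successful connection so a later drop starts again at 1s.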

Run-now backup

  • P1-18 (L) Job lifecycle: queued → running → succeeded/failed/cancelled, persisted with stats blob and error
  • P1-19 (M) Server endpoint POST /api/hosts/{id}/jobs to dispatch a backup command (validates kind, checks online, audit-logs)
  • P1-20 (M) Agent executes restic backup, streams stdout/stderr + parsed JSON events back as job.progress (1Hz throttle) / log.stream
  • [~] P1-21 (M) Server persists log stream to job_logs ✓; WS /api/jobs/{id}/stream for live browser tailing still TODO — needs the per-job fan-out hub
  • P1-22 (S) Snapshot listing: agent calls restic snapshots --json after each successful backup and ships the projection over snapshots.report. Server ReplaceHostSnapshots atomically swaps the per-host list and updates hosts.snapshot_count in the same tx. Read endpoint: GET /api/hosts/{id}/snapshots. Tests cover round-trip, authoritative-replace semantics, and empty-after-prune. Schema dropped the unused repo_id FK from snapshots (repos as a first-class entity is P2 work).
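
The job lifecycle in P1-18 (queued → running → succeeded/failed/cancelled) can be enforced with a small transition table. The sketch below models only the arrows written in the task; whether a queued job can be cancelled directly is not stated, so that edge is left out as an assumption. The type and function names are hypothetical.

```go
package main

import "fmt"

// JobState mirrors the lifecycle states listed in P1-18.
type JobState string

const (
	Queued    JobState = "queued"
	Running   JobState = "running"
	Succeeded JobState = "succeeded"
	Failed    JobState = "failed"
	Cancelled JobState = "cancelled"
)

// next lists the allowed successor states. States with no entry are terminal.
var next = map[JobState][]JobState{
	Queued:  {Running},
	Running: {Succeeded, Failed, Cancelled},
}

// CanTransition reports whether a job may move from one state to another.
func CanTransition(from, to JobState) bool {
	for _, s := range next[from] {
		if s == to {
			return true
		}
	}
	return false
}

// IsTerminal reports whether a state has no successors in the table.
func IsTerminal(s JobState) bool {
	return len(next[s]) == 0
}

func main() {
	fmt.Println(CanTransition(Queued, Running))    // true
	fmt.Println(CanTransition(Succeeded, Running)) // false
	fmt.Println(IsTerminal(Failed))                // true
}
```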

UI (HTMX + Tailwind)

  • P1-23 (M) Base layout, login page, session-aware nav
  • P1-24 (M) Dashboard: fleet summary tiles + host table (status dot + row accent + os/arch + last backup + repo size + snapshots + alerts + tags + run-now). Backed by GET /api/hosts + GET /api/fleet/summary (JSON) and a server-rendered HTML view. Empty state hands the operator the install command. HTMX Run now button posts to /hosts/{id}/run-backup.
  • P1-25 (M) Host detail page (/hosts/{id}): persistent header (status dot + mono name + tags + OS/arch/agent/restic/last-seen), vitals strip (last backup / repo size / snapshots / open alerts), sub-tabs (Snapshots active; Jobs/Repo/Settings tabs visible but inert until P2), snapshot table (cap 50, pagination later), right-rail run-now stack (backup live; forget/prune/check/unlock disabled with P2 hints) and a danger-zone delete panel.
  • P1-26 (M) Live job log viewer + WS browser fan-out hub (closes the P1-21 remainder). Browser opens /api/jobs/{id}/stream; agent-emitted job.started/job.progress/log.stream/job.finished are mirrored to subscribers. Per-subscriber buffered channel + non-blocking broadcast keeps a slow browser from blocking the agent's read loop. Page renders running / succeeded / failed states; auto-scrolls until the operator scrolls up; reloads on job.finished to show the final header. "Run now" sets HX-Redirect so the operator lands on the live log.
  • [~] P1-27 (M) "Add host" flow: form takes hostname + repo URL/username/password, mints token (TTL 1h), re-renders the same page in result-state with the install command (RM_SERVER + RM_TOKEN filled in), copy button, and an awaiting-agent panel. Encrypted repo creds ride on the token row (P1-32) and get pushed to the agent on first WS connect (P1-33). Deferred: one-click "download preconfigured installer" install-<hostname>.sh (cf. UrBackup Internet-mode push installer) — copy-paste covers it for v1.
  • P1-28 (S) Tailwind build via tailwindcss standalone binary (no Node). The Makefile downloads pinned v3.4.17 into bin/tailwindcss and builds web/styles/input.css → web/static/css/styles.css, which is embedded into the binary via web.FS. make build runs Tailwind first.
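
The fan-out design in P1-26 (a buffered channel per subscriber plus a non-blocking broadcast) is what keeps a slow browser from stalling the agent's read loop. A minimal sketch of that pattern follows. The `Hub` type is hypothetical, and dropping messages for a full subscriber is an assumption; the real hub might disconnect the slow subscriber instead.

```go
package main

import (
	"fmt"
	"sync"
)

// Hub fans one job's events out to any number of browser subscribers.
type Hub struct {
	mu   sync.Mutex
	subs map[chan string]struct{}
}

func NewHub() *Hub {
	return &Hub{subs: make(map[chan string]struct{})}
}

// Subscribe registers a subscriber with its own buffer of size buf.
func (h *Hub) Subscribe(buf int) chan string {
	ch := make(chan string, buf)
	h.mu.Lock()
	h.subs[ch] = struct{}{}
	h.mu.Unlock()
	return ch
}

// Unsubscribe removes the subscriber and closes its channel.
func (h *Hub) Unsubscribe(ch chan string) {
	h.mu.Lock()
	if _, ok := h.subs[ch]; ok {
		delete(h.subs, ch)
		close(ch)
	}
	h.mu.Unlock()
}

// Broadcast offers msg to every subscriber without blocking and returns
// how many subscribers had a full buffer and missed the message.
func (h *Hub) Broadcast(msg string) (dropped int) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for ch := range h.subs {
		select {
		case ch <- msg:
		default: // buffer full: skip rather than block the sender
			dropped++
		}
	}
	return dropped
}

func main() {
	hub := NewHub()
	fast := hub.Subscribe(8)
	slow := hub.Subscribe(1) // never drained, so it fills after one message
	for i := 0; i < 3; i++ {
		fmt.Println("dropped:", hub.Broadcast(fmt.Sprintf("line %d", i)))
	}
	fmt.Println(len(fast), len(slow)) // prints 3 1
}
```

Because the `select` has a `default` branch, `Broadcast` returns immediately even when one subscriber has stopped reading, so the caller (here standing in for the agent read loop) never waits on a browser.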

Install scripts

  • P1-29 (M) install.sh (Linux): detects arch, downloads agent, installs systemd unit, enrolls. Detects existing restic timers / /etc/cron.{d,daily,hourly,weekly}/* / root crontab and prints them with the exact disable commands — does not auto-disable
  • [~] P1-31 (S) Server endpoint to serve agent binaries + install scripts ✓ (/agent/binary + /install/*); signature verification deferred to Phase 5 OSS readiness

Repo credentials (pulled forward from Phase 2)

  • P1-32 (M) Server-side encrypted repo creds carried on the enrollment token:

    • POST /api/enrollment-tokens body grows repo_url, repo_username, repo_password (all required).
    • Token row stores them as one AEAD-encrypted blob (existing crypto.AEAD); ConsumeEnrollmentToken moves the blob to a new host_credentials row keyed by host_id in the same tx.
    • PUT /api/hosts/{id}/repo-credentials (admin/operator) re-encrypts and replaces the row, emits an in-memory event to the WS hub.
    • GET /api/hosts/{id}/repo-credentials returns the redacted view (URL + username + has_password) so the UI can pre-fill the edit form. Password never leaves the server outside the WS push.
    • On WS hello, server pushes a config.update with decrypted creds before returning the connection to idle. Same path on edit-while-connected.
    • Audit-logged on create / consume / edit; payload omits the secret material.
  • P1-33 (M) Agent-side encrypted secrets store:

    • New internal/agent/secrets package: AEAD blob at /var/lib/restic-manager/secrets.enc, atomic write (tmp+fsync+rename, mode 0600).
    • Per-host 32-byte secrets key minted at enrollment, persisted in agent.yaml (already 0600 root-only — same trust boundary as the bearer; explicit comment in the file).
    • Strip repo_url / repo_password from agent.config.Config. Agent loads creds from secrets.enc at startup; config.update handler writes through to the file.
    • Dispatcher reads from the secrets store on every job rather than from in-memory config.
    • Migration path: if agent.yaml still contains repo_url/repo_password, copy them into secrets.enc on next start, then strip from the YAML on save.
  • P1-34 (S) End-to-end smoke runbook: docs/e2e-smoke.md walks through enrollment with repo creds → agent receives them via push-on-connect → run-now backup completes against a real restic/rest-server in a sibling container → host appears with snapshot count. Test-driven version (Playwright + compose) deferred to P5-06.

Phase 1 acceptance

  • One Linux host can enroll, appear in the dashboard, and a backup can be triggered from the UI with live log streaming. Snapshots list updates after success.
  • Windows binary builds cleanly in CI (.gitea/workflows/ci.yml) but is not service-tested or installer-shipped in Phase 1 — that lands in Phase 2 (P2-16, P2-17).
  • Agent ↔ server protocol_version handshake rejects mismatched versions with a clear error rather than failing on JSON parse.
  • Repo credentials never appear in plaintext on disk: server stores them AEAD-encrypted, agent stores them AEAD-encrypted, the wire carries them only inside the authenticated WS as config.update.

Phase 2 — Scheduling, retention, repo operations

  • P2-01 (M) Schedule schema + CRUD API. schedules table was already laid down in 0001; this slice adds store.Schedule/RetentionPolicy/ScheduleOptions types, CreateSchedule / GetSchedule / ListSchedulesByHost / UpdateSchedule / DeleteSchedule / GetHostScheduleVersion / SetHostAppliedScheduleVersion (mutations bump host_schedule_version atomically in-tx), and REST endpoints GET|POST /api/hosts/{id}/schedules + PUT|DELETE /api/hosts/{id}/schedules/{sid}. Validation: cron-expr parses via robfig/cron/v3 (same parser the agent will use, so anything that validates here will fire there); kind ∈ {backup, forget, prune, check} (init/unlock are operator-only); backup schedules require ≥1 path; hooks rejected on non-backup kinds (spec §14.3). Mutations audit-logged. Server + store tests cover the happy path, validation, and version bumps.
  • P2-02 (L) Server-pushed schedule reconciliation. pushScheduleSet* helpers (on-hello + async post-CRUD flavours), wiring in onAgentHello (always pushes, even when the host has no repo creds yet), pushScheduleSetAsync called from Create/Update/Delete handlers (no-op when the host is offline; on-hello catches up). MsgScheduleAck handled in the WS dispatcher: OnScheduleAck callback persists applied_schedule_version. Agent-side schedule.set handler ships in P2-03; the server side now has parity tests.
  • P2-03 (M) Agent local scheduler. New internal/agent/scheduler package wraps robfig/cron/v3. Apply(ScheduleSetPayload, Sender) stops the prior cron (waits for in-flight entries), rebuilds from scratch (skipping disabled entries + skipping bad cron exprs with a warn log), starts, and emits schedule.ack. On a tick the entry sends a new schedule.fire envelope to the server with {schedule_id, scheduled_at}. The server's OnScheduleFire callback (dispatchScheduledJob) looks up the schedule, builds args from kind, persists a jobs row with actor_kind=schedule + scheduled_id, and ships command.run back on the same conn; the agent runs the job through the existing dispatcher. Tx is swapped on every Apply so reconnect is handled naturally (cron entries that fire against a dropped tx log + skip the tick). CreateJob now writes scheduled_id; this column was in the schema since 0001 but never populated. Tests: scheduler unit tests cover ack-on-apply, cron tick → fire envelope, disabled-entries silent, replace-prior-state stops the old cron. Server-side end-to-end test covers fire → command.run with the right job_id/kind/args, plus jobs row with actor_kind=schedule + scheduled_id linking back. Deferred: persistence of next-fire times across agent restarts (a missed fire window during downtime simply fires once on reconnect, which is the desired behaviour).
  • P2-04 (M) Schedule editor UI. New "Schedules" sub-tab on host detail (header + run-now panel preserved across the snapshot/schedule pages). List view shows status, cron, paths, retention summary (store.RetentionPolicy.Summary() renders "last=7, d=14, w=4"), tags, and edit/delete buttons. The header carries a "version N · agent in sync / agent at vM" indicator backed by host_schedule_version + applied_schedule_version. Create/edit form covers cron expr (with quick-pick presets), paths textarea, excludes textarea, tags (comma-separated), retention (six numeric inputs mirroring restic's --keep-* flags), bandwidth caps, enabled toggle. Form validation re-renders with the operator's typed input still in place. Each save fires pushScheduleSetAsync so an online agent re-arms within a few seconds. Hooks UI deferred to P2-15 (lands when the hook execution path does).
  • P2-05 (M) forget command with retention policy (keep-last/daily/weekly/monthly/yearly)
  • P2-06 (M) prune command (admin-only, uses non-append-only credential)
  • P2-07 (S) check command (random subset + --read-data-subset)
  • P2-08 (S) unlock command
  • P2-09 (M) Repo stats panel: size, dedup ratio, snapshot count, last check time, lock state
  • P2-10 (S) Run-now buttons for forget/prune/check/unlock on host detail page
  • P2-11 (S) Schedule "next run" / "last run" surfaced on host card
  • P2-12 (S) Bandwidth limit fields on schedule editor (--limit-upload, --limit-download); also overridable on run-now jobs
  • P2-13 (M) Pre/post backup hooks: schema (Schedule.pre_hook, Schedule.post_hook, Host.pre_hook_default, Host.post_hook_default), encrypted at rest, admin-only edit, audit-logged
  • P2-14 (M) Agent execution of hooks: configurable shell per host, pre_hook failure aborts backup, post_hook always runs with RM_JOB_STATUS env var, stdout/stderr captured into JobLog with prefix
  • P2-15 (S) Hook editor UI on schedule + host pages, with sensible warnings (e.g. "this hook runs as the agent service user"); validation enforces hooks only on kind = backup schedules (see spec.md §14.3)
  • P2-16 (M) Windows service integration: agent runs under the Service Control Manager via golang.org/x/sys/windows/svc; install/uninstall/start/stop wired up
  • P2-17 (M) install.ps1 (Windows): downloads agent, installs as service, enrolls; detects existing scheduled tasks named *restic* and prints them for manual review
  • P2-18 (L) Announce-and-approve enrollment (second enrollment mode, alongside the token flow that ships in Phase 1):
    • Agent run with no RM_TOKEN generates a local Ed25519 keypair (persisted alongside the encrypted secrets blob), then POST /api/agents/announce with {hostname, os, arch, agent_version, restic_version, public_key}. Server stores a pending_hosts row (public_key, fingerprint = sha256(public_key), announced_from_ip, first_seen_at, last_seen_at, expires_at = now+1h). Hostname collisions with existing or other pending rows are flagged in the response so the install script can warn loudly on the endpoint terminal.
    • Agent then opens a long-poll/WS to /ws/agent/pending authenticated by signing a server-issued nonce with its private key — proves possession of the key tied to the pending row. Connection stays open; agent waits.
    • Install script prints the fingerprint on the endpoint's terminal in a copy-friendly form (e.g. SHA256:ab12…cd34) and tells the operator to compare it to the one shown in the UI before clicking accept.
    • UI: new "Pending hosts" panel on the dashboard. Admin sees fingerprint, hostname, source IP, OS/arch, time announced. Buttons: Accept (mints persistent bearer + repo creds, pushes both down the open pending socket, promotes pending row → real Host row, audit-logged) / Reject (deletes pending row, closes the socket with a clean error). Fingerprint is the load-bearing field — UI must make comparison easy (large monospace, one-click copy).
    • Server-side guards: per-source-IP rate limit on /api/agents/announce (token-bucket, e.g. 10/min); global cap on pending rows (e.g. 100); pending rows auto-expire after 1h; duplicate-hostname pending rows allowed but visually flagged in UI; accepting one does not auto-reject the others (admin sees them all and decides — defends against the "attacker announces first, real host second" race).
    • Token-based enrollment (Phase 1) remains the default and is unchanged; announce-and-approve is opt-in for interactive installs. Docs explicitly call out that the fingerprint comparison step is what makes this flow safe — without it, this is no better than trusting hostname over the wire.
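
P2-04 says store.RetentionPolicy.Summary() renders strings such as "last=7, d=14, w=4". A sketch of how such a method might be written is shown here. The field names and the `h`, `m`, and `y` labels are assumptions (only `last`, `d`, and `w` appear in the task text), and the six fields are guessed to match restic's --keep-last/hourly/daily/weekly/monthly/yearly flags.

```go
package main

import (
	"fmt"
	"strings"
)

// RetentionPolicy is a hypothetical stand-in for store.RetentionPolicy,
// with one field per restic --keep-* flag.
type RetentionPolicy struct {
	KeepLast, KeepHourly, KeepDaily, KeepWeekly, KeepMonthly, KeepYearly int
}

// Summary renders the non-zero fields in a fixed order so templates need
// no conditional logic of their own.
func (p RetentionPolicy) Summary() string {
	fields := []struct {
		label string
		n     int
	}{
		{"last", p.KeepLast},
		{"h", p.KeepHourly}, // label is an assumption
		{"d", p.KeepDaily},
		{"w", p.KeepWeekly},
		{"m", p.KeepMonthly}, // label is an assumption
		{"y", p.KeepYearly},  // label is an assumption
	}
	var parts []string
	for _, f := range fields {
		if f.n > 0 {
			parts = append(parts, fmt.Sprintf("%s=%d", f.label, f.n))
		}
	}
	if len(parts) == 0 {
		return "\u2014" // the dash placeholder the list view shows for an empty policy
	}
	return strings.Join(parts, ", ")
}

func main() {
	p := RetentionPolicy{KeepLast: 7, KeepDaily: 14, KeepWeekly: 4}
	fmt.Println(p.Summary()) // prints last=7, d=14, w=4
}
```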

Phase 2 acceptance

  • Schedules created in UI run on agents on time; retention is applied; admin can prune from UI; repo health visible per host. Pre/post hooks fire correctly (verified with a Docker stop/start example and a mysqldump example) and are rejected on non-backup schedule kinds. Bandwidth limits honoured.
  • A Windows host can enroll, appear in the dashboard, and run a backup with live log streaming — closing the cross-platform gap left by Phase 1.
  • A Linux host can enroll via announce-and-approve: operator runs the install script with no token, sees a fingerprint in the terminal, finds the matching pending row in the UI, clicks accept, and the host is fully credentialled and online without further endpoint interaction. Rejecting a pending row leaves the agent process exited cleanly with a clear log line. Rate-limit and pending-cap guards verified under a synthetic flood.
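
The announce-and-approve flow rests on two primitives from P2-18: a fingerprint defined as sha256(public_key), and a proof of key possession made by signing a server-issued nonce. The sketch below shows both with the standard library. Hex encoding of the fingerprint is an assumption based on the "SHA256:ab12…cd34" example, and all function names are hypothetical.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// newAgentKey generates the agent's local Ed25519 keypair.
func newAgentKey() (ed25519.PublicKey, ed25519.PrivateKey, error) {
	return ed25519.GenerateKey(rand.Reader)
}

// fingerprint renders sha256(public_key) in the copy-friendly form that
// the operator compares between the terminal and the UI.
func fingerprint(pub ed25519.PublicKey) string {
	sum := sha256.Sum256(pub)
	return "SHA256:" + hex.EncodeToString(sum[:])
}

// signNonce is the agent side: sign the nonce the server issued.
func signNonce(priv ed25519.PrivateKey, nonce []byte) []byte {
	return ed25519.Sign(priv, nonce)
}

// verifyNonce is the server side: check the signature against the public
// key stored on the pending row.
func verifyNonce(pub ed25519.PublicKey, nonce, sig []byte) bool {
	return ed25519.Verify(pub, nonce, sig)
}

func main() {
	pub, priv, err := newAgentKey()
	if err != nil {
		panic(err)
	}
	fmt.Println(fingerprint(pub))

	nonce := make([]byte, 32) // the server would generate and remember this
	if _, err := rand.Read(nonce); err != nil {
		panic(err)
	}
	sig := signNonce(priv, nonce)
	fmt.Println("possession proved:", verifyNonce(pub, nonce, sig))
}
```

Using a fresh nonce per connection attempt prevents a captured signature from being replayed against a later challenge.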

Phase 3 — Restore, alerts, audit

  • P3-01 (L) Restore wizard backend: snapshot tree browse via restic ls --json, path picker, target selection
  • P3-02 (L) Restore wizard UI (multi-step: host → snapshot → paths → target → confirm)
  • P3-03 (M) Restore execution: restic restore invocation, progress streaming
  • P3-04 (L) Cross-host restore: target agent receives a temporary scoped read credential for source host's repo (single-job, auto-revoked); UI supports source→target path remapping; warns when source paths need root and target service user is non-root
  • P3-05 (M) Alert engine: rule evaluation loop (failed backup, stale schedule, agent offline, check failed)
  • P3-06 (M) Notification channels: webhook, ntfy, SMTP email
  • P3-07 (S) Alert UI: list, acknowledge, resolve
  • P3-08 (S) Audit log UI with filters (user, action, target, time range)
  • P3-09 (S) diff between two snapshots in UI

Phase 3 acceptance

  • A file deleted on a host can be restored from the UI in under 2 minutes. A failed backup raises an alert via the configured channel within 60s.

Phase 4 — Update delivery, RBAC polish, OIDC

  • P4-01 (M) Update delivery via OS package managers — host an apt repo (Linux) and Chocolatey package (Windows) on gitea releases. restic-manager-agent update is a thin wrapper over apt-get install --only-upgrade restic-manager-agent / choco upgrade. Trades flexibility for a much smaller security surface than bespoke signed binaries (see spec.md §4.2)
  • P4-02 (M) Agent version reporting on dashboard: surface "agent N versions behind server"; "update all" admin action calls the package-manager wrapper on each host
  • P4-03 (M) RBAC enforcement at API layer (admin / operator / viewer)
  • P4-04 (S) User management UI (create/edit/disable, role assignment, password reset)
  • P4-05 (L) OIDC login (generic provider config, group → role mapping)
  • P4-06 (M) Repo size trend graphs (sparkline on host card, full chart on repo page)
  • P4-07 (S) Per-host tags + dashboard filtering by tag
  • P4-08 (M) Prometheus /metrics endpoint: per-host gauges (last backup timestamp, last backup status, repo size, snapshot count, agent online), server gauges (active alerts, build info), job duration histograms; protected by bearer token or IP allow-list
  • P4-09 (S) Document Prometheus integration + sample Grafana dashboard JSON

Phase 4 acceptance

  • Non-admin users see an appropriately limited UI. Agents upgrade via apt/choco with one admin-triggered action. OIDC login works against at least one provider (Authelia or Authentik). Prometheus can scrape /metrics and the sample Grafana dashboard renders with live data.

Phase 5 — OSS readiness

  • P5-01 (M) Documentation site (mdBook or similar) with install, concepts, security model, screenshots
  • P5-02 (S) CONTRIBUTING.md, CODE_OF_CONDUCT.md, issue + PR templates
  • P5-03 (S) Release automation: goreleaser for binaries + Docker image to GHCR
  • P5-04 (S) Demo screenshots / short Loom walkthrough in README
  • P5-05 (S) SECURITY.md with disclosure process
  • P5-06 (M) End-to-end test suite in CI (Playwright vs. compose stack with sibling Linux agent)
  • P5-07 (S) Reference deployment: docker-compose.yml + Caddyfile snippet showing the TLS-terminating reverse proxy in front of the HTTP-only server (also demonstrates RM_TRUSTED_PROXY)

Phase 5 acceptance

  • A stranger can read the docs and stand up a working install in under 30 minutes.

Cross-cutting / ongoing

  • X-01 Keep CHANGELOG.md updated (Keep-a-Changelog format)
  • X-02 Track restic version compatibility matrix
  • X-03 Periodic dependency updates (dependabot or renovate)
  • X-04 Threat-model review at end of each phase