Files
restic-manager/tasks.md
T
steve 229f89fee2
CI / Test (linux/amd64) (push) Has been cancelled
CI / Lint (push) Has been cancelled
CI / Build (windows/amd64) (push) Has been cancelled
CI / Build (linux/amd64) (push) Has been cancelled
CI / Build (linux/arm64) (push) Has been cancelled
P1-23 / P1-28: base layout, login, session-aware nav + Tailwind build
P1-28: Tailwind standalone CLI wired into the Makefile. `make tailwind`
downloads the pinned v3.4.17 binary into bin/tailwindcss (gitignored),
builds web/styles/input.css → web/static/css/styles.css. `make build`
now runs the CSS pass first; `make tailwind-watch` for dev. Output is
embedded in the binary via web.FS — single static binary, no Node.

The CSS source carries every component class the v1 mockups defined
(status dots, buttons, host row, log viewer, progress bar, fields,
chips, snippet panel, empty state) so screens that land later can
just reach for them.

P1-23: html/template tree at web/templates with two layouts (base
with chrome, chromeless for login + bootstrap), one nav partial, and
two pages (dashboard placeholder, login). internal/server/ui parses
the tree at startup; ui_handlers.go in the http package wires:

  GET  /         dashboard (303 → /login when unauthed)
  GET  /login    sign-in form
  POST /login    consume form, mint session cookie, 303 → /
  POST /logout   drop cookie, 303 → /login
  GET  /static/* embedded Tailwind bundle

The HTML login flow shares store/session logic with /api/auth/login
via a new authenticateAndSession helper — same security guarantees,
two surface representations (HTML form / JSON).

Verified end-to-end: bootstrap → form-login → authed dashboard →
sign-out → 303 cycle works in the browser; Tailwind output emits
only the component classes referenced in the live templates (9.6kB
minified).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 19:19:06 +01:00

18 KiB
Raw Blame History

restic-manager — Tasks

Tasks are grouped by phase. Each task has an ID for cross-referencing, an estimated size (S/M/L), and acceptance criteria.

Sizes: S = under a day, M = 13 days, L = 37 days.


Phase 0 — Project bootstrap

  • P0-01 (S) Initialize Go module, cmd/server, cmd/agent, baseline internal/ packages
  • P0-02 (S) Add LICENSE (PolyForm Noncommercial 1.0.0), README stub, CONTRIBUTING placeholder
  • P0-03 (S) Set up golangci-lint, gofumpt, goimports; pre-commit config
  • P0-04 (S) GitHub Actions Gitea Actions: build matrix (linux amd64/arm64, windows amd64), unit tests, lint
  • P0-05 (S) Dockerfile.server (multi-stage, distroless), deploy/docker-compose.yml
  • P0-06 (S) Makefile / taskfile.yml with common targets (build, test, run, release)

Phase 1 — MVP: enrollment, visibility, on-demand backup

Server foundations

  • P1-01 (M) HTTP server scaffolding (chi, structured logging via slog, graceful shutdown)
  • P1-02 (M) SQLite store layer (modernc.org/sqlite) + migrations (hand-rolled, embed.FS)
  • P1-03 (M) Schema for users, sessions, hosts, repos, credentials, jobs, job_logs, snapshots, audit_log
  • [~] P1-04 (M) Auth: argon2id password hashing, login/logout, session cookies; CSRF middleware deferred to P1-23 (UI work) — REST clients use bearer/session-only flows
  • P1-05 (S) First-run admin bootstrap (printed one-time setup token in server logs)
  • P1-06 (M) Secret encryption helper (AEAD with key from RM_SECRET_KEY_FILE)
  • [~] P1-07 (M) Audit log writer; middleware sweep for every state-changing endpoint lands when the rest of the API surface does — login / bootstrap / host.enrolled / job.run_now currently audited

Agent ↔ server protocol

  • P1-08 (M) Define shared API types in internal/api (envelopes, every WS message + protocol_version constants; JSON-shape tests pin the wire)
  • P1-09 (L) WebSocket transport (github.com/coder/websocket), framed JSON envelopes, RPC correlation IDs, exponential-backoff reconnect on the agent side
  • P1-10 (M) Enrollment flow: POST /api/agents/enroll with one-time token → returns persistent bearer. Cert pin field stays in the response shape but is left empty: the server is HTTP-only behind a reverse proxy, so the operator pastes the proxy's cert hash into the install command rather than having the server introspect a cert it doesn't terminate.
  • P1-11 (M) Agent registration on connect (hello upserts agent_version/restic_version/protocol_version, flips status online, protocol_too_old rejection has clean error envelope)
  • P1-12 (S) Heartbeat handler (touches last_seen_at; background sweeper marks hosts offline after 90s without one)

Agent foundations

  • P1-13 (M) Agent config file (/etc/restic-manager/agent.yaml, atomic save: tmp+fsync+rename); Windows path deferred to Phase 2
  • P1-14 (M) Service integration: systemd unit (sandboxed: NoNewPrivileges, Protect*, MemoryDenyWriteExecute) — Linux only in Phase 1
  • P1-15 (M) Outbound WS client (github.com/coder/websocket) with reconnect (1s → 60s exponential + jitter), optional cert pin (sha-256 of leaf), heartbeats, protocol_version in hello
  • P1-16 (M) Restic wrapper: locate via PATH or override, run with --json, scan stdout/stderr, parse BackupStatus + BackupSummary, exit-code 3 treated as success-with-issues
  • P1-17 (S) Host metadata collection (OS, arch, hostname, restic version, agent version, protocol_version)

Run-now backup

  • P1-18 (L) Job lifecycle: queued → running → succeeded/failed/cancelled, persisted with stats blob and error
  • P1-19 (M) Server endpoint POST /api/hosts/{id}/jobs to dispatch a backup command (validates kind, checks online, audit-logs)
  • P1-20 (M) Agent executes restic backup, streams stdout/stderr + parsed JSON events back as job.progress (1Hz throttle) / log.stream
  • [~] P1-21 (M) Server persists log stream to job_logs ✓; WS /api/jobs/{id}/stream for live browser tailing still TODO — needs the per-job fan-out hub
  • P1-22 (S) Snapshot listing: agent calls restic snapshots --json after each successful backup and ships the projection over snapshots.report. Server ReplaceHostSnapshots atomically swaps the per-host list and updates hosts.snapshot_count in the same tx. Read endpoint: GET /api/hosts/{id}/snapshots. Tests cover round-trip, authoritative-replace semantics, and empty-after-prune. Schema dropped the unused repo_id FK from snapshots (repos as a first-class entity is P2 work).

UI (HTMX + Tailwind)

  • P1-23 (M) Base layout, login page, session-aware nav
  • P1-24 (M) Dashboard: host cards (status dot, last backup, repo size)
  • P1-25 (M) Host detail page: snapshots tab + run-now button
  • P1-26 (M) Live job log viewer (WS-driven, auto-scroll, cancel button)
  • P1-27 (M) "Add host" flow: form takes hostname + repo URL/username/password, mints token (TTL 1h), shows the operator a copy-friendly install command and a one-click "download preconfigured installer" — a install-<hostname>.sh with RM_SERVER + RM_TOKEN already templated in (cf. UrBackup Internet-mode push installer). Encrypted repo creds ride on the token row and get pushed to the agent on first WS connect (see secrets/keyring task).
  • P1-28 (S) Tailwind build via tailwindcss standalone binary (no Node) — Makefile downloads pinned v3.4.17 into bin/tailwindcss, builds web/styles/input.cssweb/static/css/styles.css, embedded into the binary via web.FS. make build runs Tailwind first.

Install scripts

  • P1-29 (M) install.sh (Linux): detects arch, downloads agent, installs systemd unit, enrolls. Detects existing restic timers / /etc/cron.{d,daily,hourly,weekly}/* / root crontab and prints them with the exact disable commands — does not auto-disable
  • [~] P1-31 (S) Server endpoint to serve agent binaries + install scripts ✓ (/agent/binary + /install/*); signature verification deferred to Phase 5 OSS readiness

Repo credentials (pulled forward from Phase 2)

  • P1-32 (M) Server-side encrypted repo creds carried on the enrollment token:

    • POST /api/enrollment-tokens body grows repo_url, repo_username, repo_password (all required).
    • Token row stores them as one AEAD-encrypted blob (existing crypto.AEAD); ConsumeEnrollmentToken moves the blob to a new host_credentials row keyed by host_id in the same tx.
    • PUT /api/hosts/{id}/repo-credentials (admin/operator) re-encrypts and replaces the row, emits an in-memory event to the WS hub.
    • GET /api/hosts/{id}/repo-credentials returns the redacted view (URL + username + has_password) so the UI can pre-fill the edit form. Password never leaves the server outside the WS push.
    • On WS hello, server pushes a config.update with decrypted creds before returning the connection to idle. Same path on edit-while-connected.
    • Audit-logged on create / consume / edit; payload omits the secret material.
  • P1-33 (M) Agent-side encrypted secrets store:

    • New internal/agent/secrets package: AEAD blob at /var/lib/restic-manager/secrets.enc, atomic write (tmp+fsync+rename, mode 0600).
    • Per-host 32-byte secrets key minted at enrollment, persisted in agent.yaml (already 0600 root-only — same trust boundary as the bearer; explicit comment in the file).
    • Strip repo_url / repo_password from agent.config.Config. Agent loads creds from secrets.enc at startup; config.update handler writes through to the file.
    • Dispatcher reads from the secrets store on every job rather than from in-memory config.
    • Migration path: if agent.yaml still contains repo_url/repo_password, copy them into secrets.enc on next start, then strip from the YAML on save.
  • P1-34 (S) End-to-end smoke runbook: docs/e2e-smoke.md walks through enrollment with repo creds → agent receives them via push-on-connect → run-now backup completes against a real restic/rest-server in a sibling container → host appears with snapshot count. Test-driven version (Playwright + compose) deferred to P5-06.

Phase 1 acceptance

  • One Linux host can enroll, appear in the dashboard, and a backup can be triggered from the UI with live log streaming. Snapshots list updates after success.
  • Windows binary builds cleanly in CI (.gitea/workflows/ci.yml) but is not service-tested or installer-shipped in Phase 1 — that lands in Phase 2 (P2-16, P2-17).
  • Agent ↔ server protocol_version handshake rejects mismatched versions with a clear error rather than failing on JSON parse.
  • Repo credentials never appear in plaintext on disk: server stores them AEAD-encrypted, agent stores them AEAD-encrypted, the wire carries them only inside the authenticated WS as config.update.

Phase 2 — Scheduling, retention, repo operations

  • P2-01 (M) Schedule schema + CRUD API
  • P2-02 (L) Server-pushed schedule reconciliation (server is source of truth; agent applies)
  • P2-03 (M) Agent local scheduler (robfig/cron/v3); persists next-fire times across restarts
  • P2-04 (M) Schedule editor UI (paths, excludes, tags, cron, retention)
  • P2-05 (M) forget command with retention policy (keep-last/daily/weekly/monthly/yearly)
  • P2-06 (M) prune command (admin-only, uses non-append-only credential)
  • P2-07 (S) check command (random subset + --read-data-subset)
  • P2-08 (S) unlock command
  • P2-09 (M) Repo stats panel: size, dedup ratio, snapshot count, last check time, lock state
  • P2-10 (S) Run-now buttons for forget/prune/check/unlock on host detail page
  • P2-11 (S) Schedule "next run" / "last run" surfaced on host card
  • P2-12 (S) Bandwidth limit fields on schedule editor (--limit-upload, --limit-download); also overridable on run-now jobs
  • P2-13 (M) Pre/post backup hooks: schema (Schedule.pre_hook, Schedule.post_hook, Host.pre_hook_default, Host.post_hook_default), encrypted at rest, admin-only edit, audit-logged
  • P2-14 (M) Agent execution of hooks: configurable shell per host, pre_hook failure aborts backup, post_hook always runs with RM_JOB_STATUS env var, stdout/stderr captured into JobLog with prefix
  • P2-15 (S) Hook editor UI on schedule + host pages, with sensible warnings (e.g. "this hook runs as the agent service user"); validation enforces hooks only on kind = backup schedules (see spec.md §14.3)
  • P2-16 (M) Windows service integration: agent runs under the Service Control Manager via golang.org/x/sys/windows/svc; install/uninstall/start/stop wired up
  • P2-17 (M) install.ps1 (Windows): downloads agent, installs as service, enrolls; detects existing scheduled tasks named *restic* and prints them for manual review
  • P2-18 (L) Announce-and-approve enrollment (second enrollment mode, alongside the token flow that ships in Phase 1):
    • Agent run with no RM_TOKEN generates a local Ed25519 keypair (persisted alongside the encrypted secrets blob), then POST /api/agents/announce with {hostname, os, arch, agent_version, restic_version, public_key}. Server stores a pending_hosts row (public_key, fingerprint = sha256(public_key), announced_from_ip, first_seen_at, last_seen_at, expires_at = now+1h). Hostname collisions with existing or other pending rows are flagged in the response so the install script can warn loudly on the endpoint terminal.
    • Agent then opens a long-poll/WS to /ws/agent/pending authenticated by signing a server-issued nonce with its private key — proves possession of the key tied to the pending row. Connection stays open; agent waits.
    • Install script prints the fingerprint on the endpoint's terminal in a copy-friendly form (e.g. SHA256:ab12…cd34) and tells the operator to compare it to the one shown in the UI before clicking accept.
    • UI: new "Pending hosts" panel on the dashboard. Admin sees fingerprint, hostname, source IP, OS/arch, time announced. Buttons: Accept (mints persistent bearer + repo creds, pushes both down the open pending socket, promotes pending row → real Host row, audit-logged) / Reject (deletes pending row, closes the socket with a clean error). Fingerprint is the load-bearing field — UI must make comparison easy (large monospace, one-click copy).
    • Server-side guards: per-source-IP rate limit on /api/agents/announce (token-bucket, e.g. 10/min); global cap on pending rows (e.g. 100); pending rows auto-expire after 1h; duplicate-hostname pending rows allowed but visually flagged in UI; accepting one does not auto-reject the others (admin sees them all and decides — defends against the "attacker announces first, real host second" race).
    • Token-based enrollment (Phase 1) remains the default and is unchanged; announce-and-approve is opt-in for interactive installs. Docs explicitly call out that the fingerprint comparison step is what makes this flow safe — without it, this is no better than trusting hostname over the wire.

Phase 2 acceptance

  • Schedules created in UI run on agents on time; retention is applied; admin can prune from UI; repo health visible per host. Pre/post hooks fire correctly (verified with a Docker stop/start example and a mysqldump example) and are rejected on non-backup schedule kinds. Bandwidth limits honoured.
  • A Windows host can enroll, appear in the dashboard, and run a backup with live log streaming — closing the cross-platform gap left by Phase 1.
  • A Linux host can enroll via announce-and-approve: operator runs the install script with no token, sees a fingerprint in the terminal, finds the matching pending row in the UI, clicks accept, and the host is fully credentialled and online without further endpoint interaction. Rejecting a pending row leaves the agent process exited cleanly with a clear log line. Rate-limit and pending-cap guards verified under a synthetic flood.

Phase 3 — Restore, alerts, audit

  • P3-01 (L) Restore wizard backend: snapshot tree browse via restic ls --json, path picker, target selection
  • P3-02 (L) Restore wizard UI (multi-step: host → snapshot → paths → target → confirm)
  • P3-03 (M) Restore execution: restic restore invocation, progress streaming
  • P3-04 (L) Cross-host restore: target agent receives a temporary scoped read credential for source host's repo (single-job, auto-revoked); UI supports source→target path remapping; warns when source paths need root and target service user is non-root
  • P3-05 (M) Alert engine: rule evaluation loop (failed backup, stale schedule, agent offline, check failed)
  • P3-06 (M) Notification channels: webhook, ntfy, SMTP email
  • P3-07 (S) Alert UI: list, acknowledge, resolve
  • P3-08 (S) Audit log UI with filters (user, action, target, time range)
  • P3-09 (S) diff between two snapshots in UI

Phase 3 acceptance

  • A file deleted on a host can be restored from the UI in under 2 minutes. A failed backup raises an alert via the configured channel within 60s.

Phase 4 — Update delivery, RBAC polish, OIDC

  • P4-01 (M) Update delivery via OS package managers — host an apt repo (Linux) and Chocolatey package (Windows) on gitea releases. restic-manager-agent update is a thin wrapper over apt-get install --only-upgrade restic-manager-agent / choco upgrade. Trades flexibility for a much smaller security surface than bespoke signed binaries (see spec.md §4.2)
  • P4-02 (M) Agent version reporting on dashboard: surface "agent N versions behind server"; "update all" admin action calls the package-manager wrapper on each host
  • P4-03 (M) RBAC enforcement at API layer (admin / operator / viewer)
  • P4-04 (S) User management UI (create/edit/disable, role assignment, password reset)
  • P4-05 (L) OIDC login (generic provider config, group → role mapping)
  • P4-06 (M) Repo size trend graphs (sparkline on host card, full chart on repo page)
  • P4-07 (S) Per-host tags + dashboard filtering by tag
  • P4-08 (M) Prometheus /metrics endpoint: per-host gauges (last backup timestamp, last backup status, repo size, snapshot count, agent online), server gauges (active alerts, build info), job duration histograms; protected by bearer token or IP allow-list
  • P4-09 (S) Document Prometheus integration + sample Grafana dashboard JSON

Phase 4 acceptance

  • Non-admin users see an appropriately limited UI. Agents upgrade via apt/choco with one admin-triggered action. OIDC login works against at least one provider (Authelia or Authentik). Prometheus can scrape /metrics and the sample Grafana dashboard renders with live data.

Phase 5 — OSS readiness

  • P5-01 (M) Documentation site (mdBook or similar) with install, concepts, security model, screenshots
  • P5-02 (S) CONTRIBUTING.md, CODE_OF_CONDUCT.md, issue + PR templates
  • P5-03 (S) Release automation: goreleaser for binaries + Docker image to GHCR
  • P5-04 (S) Demo screenshots / short Loom walkthrough in README
  • P5-05 (S) SECURITY.md with disclosure process
  • P5-06 (M) End-to-end test suite in CI (Playwright vs. compose stack with sibling Linux agent)
  • P5-07 (S) Reference deployment: docker-compose.yml + Caddyfile snippet showing the TLS-terminating reverse proxy in front of the HTTP-only server (also demonstrates RM_TRUSTED_PROXY)

Phase 5 acceptance

  • A stranger can read the docs and stand up a working install in under 30 minutes.

Cross-cutting / ongoing

  • X-01 Keep CHANGELOG.md updated (Keep-a-Changelog format)
  • X-02 Track restic version compatibility matrix
  • X-03 Periodic dependency updates (dependabot or renovate)
  • X-04 Threat-model review at end of each phase