86f7c17d9d
Server-rendered HTML view backed by:
- new store.FleetSummary aggregating host counts + repo bytes +
snapshot total + open alerts + last-24h job rollup in two queries.
- GET /api/hosts (JSON list of hosts in the dashboard projection).
- GET /api/fleet/summary (JSON aggregate, same shape as above).
The HTML page (web/templates/pages/dashboard.html) renders the four
summary tiles + host table directly from store data — no separate
fetch. Per-row state colour comes from .host-row.{degraded,failed,
offline} which paint a 3px left edge so problem hosts are scannable
without reading. HTMX is loaded into the base layout so per-row
"Run now" buttons can hx-post to /hosts/{id}/run-backup, a thin
HTML wrapper that funnels into a new dispatchJob helper shared
with the JSON /api/hosts/{id}/jobs endpoint.
Empty state (zero hosts) collapses to the "no hosts yet" prompt
with the + Add host CTA — matches the v1 mockup.
Template helpers (internal/server/ui/funcs.go) added for byte
formatting (412 GB / 3.7 TB), relative time (3m ago / 2d ago), and
comma grouping (1,847). Pure Go, no template-magic dependency.
Browser-verified end-to-end with seeded fixture data: five hosts
across all four states render with correct dots, accents, last-
backup pills, sizes, snapshot counts, alerts, tags, and the right
action button (Run now / Retry / Run first / View → / offline).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
restic-manager — Tasks
Tasks are grouped by phase. Each task has an ID for cross-referencing, an estimated size (S/M/L), and acceptance criteria.
Sizes: S = under a day, M = 1–3 days, L = 3–7 days.
Phase 0 — Project bootstrap
- P0-01 (S) Initialize Go module, `cmd/server`, `cmd/agent`, baseline `internal/` packages
- P0-02 (S) Add LICENSE (PolyForm Noncommercial 1.0.0), README stub, CONTRIBUTING placeholder
- P0-03 (S) Set up `golangci-lint`, `gofumpt`, `goimports`; pre-commit config
- P0-04 (S) ~~GitHub Actions~~ Gitea Actions: build matrix (linux amd64/arm64, windows amd64), unit tests, lint
- P0-05 (S) `Dockerfile.server` (multi-stage, distroless), `deploy/docker-compose.yml`
- P0-06 (S) Makefile / `taskfile.yml` with common targets (`build`, `test`, `run`, `release`)
Phase 1 — MVP: enrollment, visibility, on-demand backup
Server foundations
- P1-01 (M) HTTP server scaffolding (`chi`, structured logging via `slog`, graceful shutdown)
- P1-02 (M) SQLite store layer (`modernc.org/sqlite`) + migrations (hand-rolled, `embed.FS`)
- P1-03 (M) Schema for `users`, `sessions`, `hosts`, `repos`, `credentials`, `jobs`, `job_logs`, `snapshots`, `audit_log`
- [~] P1-04 (M) Auth: argon2id password hashing, login/logout, session cookies; CSRF middleware deferred to P1-23 (UI work); REST clients use bearer/session-only flows
- P1-05 (S) First-run admin bootstrap (printed one-time setup token in server logs)
- P1-06 (M) Secret encryption helper (AEAD with key from `RM_SECRET_KEY_FILE`)
- [~] P1-07 (M) Audit log writer; middleware sweep for every state-changing endpoint lands when the rest of the API surface does; login / bootstrap / host.enrolled / job.run_now are currently audited
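
As an illustration of the P1-06 helper, here is a minimal sketch of an AEAD seal/open pair. It assumes AES-256-GCM with the random nonce prepended to the ciphertext; the task list only says "AEAD", so the cipher, the blob layout, and the function names `seal` and `open` are assumptions rather than the project's actual `crypto.AEAD` API.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// newGCM builds an AES-GCM AEAD; a 32-byte key selects AES-256.
func newGCM(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// seal encrypts plaintext and returns nonce || ciphertext || tag.
func seal(key, plaintext []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	// Passing nonce as dst appends the ciphertext after the nonce.
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// open splits the nonce off the blob and authenticates + decrypts the rest.
func open(key, blob []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	if len(blob) < gcm.NonceSize() {
		return nil, errors.New("ciphertext too short")
	}
	nonce, ct := blob[:gcm.NonceSize()], blob[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ct, nil)
}

func main() {
	// Stand-in key; a real deployment would load it from the file named by RM_SECRET_KEY_FILE.
	key := make([]byte, 32)
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	blob, err := seal(key, []byte("repo-password"))
	if err != nil {
		panic(err)
	}
	plain, err := open(key, blob)
	fmt.Println(string(plain), err)
}
```

Because the nonce travels with the blob, a single opaque column is enough to store each secret.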
Agent ↔ server protocol
- P1-08 (M) Define shared API types in `internal/api` (envelopes, every WS message + `protocol_version` constants; JSON-shape tests pin the wire)
- P1-09 (L) WebSocket transport (`github.com/coder/websocket`), framed JSON envelopes, RPC correlation IDs, exponential-backoff reconnect on the agent side
- P1-10 (M) Enrollment flow: `POST /api/agents/enroll` with one-time token → returns persistent bearer. Cert pin field stays in the response shape but is left empty: the server is HTTP-only behind a reverse proxy, so the operator pastes the proxy's cert hash into the install command rather than having the server introspect a cert it doesn't terminate.
- P1-11 (M) Agent registration on connect (`hello` upserts agent_version/restic_version/protocol_version, flips status online, `protocol_too_old` rejection has clean error envelope)
- P1-12 (S) Heartbeat handler (touches `last_seen_at`; background sweeper marks hosts offline after 90s without one)
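
To make the envelope and `hello` handshake concrete, here is a small sketch of how a server might decode a `hello` and reject an outdated agent with a structured error. The field names follow the ones listed above, but the envelope layout (`type` / `id` / `payload`), the version numbers, and every error code other than `protocol_too_old` are assumptions for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// MinProtocolVersion is a hypothetical floor; the real constants live in internal/api.
const MinProtocolVersion = 2

// Envelope is an assumed framing: message type, RPC correlation ID, raw payload.
type Envelope struct {
	Type    string          `json:"type"`
	ID      string          `json:"id,omitempty"`
	Payload json.RawMessage `json:"payload,omitempty"`
}

// Hello carries the fields the server upserts on connect.
type Hello struct {
	AgentVersion    string `json:"agent_version"`
	ResticVersion   string `json:"restic_version"`
	ProtocolVersion int    `json:"protocol_version"`
}

// ProtocolError is what would be serialized into the error envelope.
type ProtocolError struct {
	Code    string
	Message string
}

func (e *ProtocolError) Error() string { return e.Code + ": " + e.Message }

// checkHello decodes a frame and enforces the protocol version floor.
func checkHello(raw []byte) (Hello, error) {
	var env Envelope
	if err := json.Unmarshal(raw, &env); err != nil {
		return Hello{}, &ProtocolError{Code: "bad_envelope", Message: err.Error()}
	}
	if env.Type != "hello" {
		return Hello{}, &ProtocolError{Code: "unexpected_message", Message: "expected hello, got " + env.Type}
	}
	var h Hello
	if err := json.Unmarshal(env.Payload, &h); err != nil {
		return Hello{}, &ProtocolError{Code: "bad_payload", Message: err.Error()}
	}
	if h.ProtocolVersion < MinProtocolVersion {
		msg := fmt.Sprintf("agent speaks v%d, server requires v%d or newer", h.ProtocolVersion, MinProtocolVersion)
		return h, &ProtocolError{Code: "protocol_too_old", Message: msg}
	}
	return h, nil
}

func main() {
	frame := []byte(`{"type":"hello","payload":{"agent_version":"0.1.0","restic_version":"0.17.0","protocol_version":1}}`)
	_, err := checkHello(frame)
	fmt.Println(err)
}
```

Checking the version before touching any other payload field is what lets a mismatch surface as a clear error instead of a JSON parse failure further down the line.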
Agent foundations
- P1-13 (M) Agent config file (`/etc/restic-manager/agent.yaml`, atomic save: tmp+fsync+rename); Windows path deferred to Phase 2
- P1-14 (M) Service integration: systemd unit (sandboxed: NoNewPrivileges, Protect*, MemoryDenyWriteExecute); Linux only in Phase 1
- P1-15 (M) Outbound WS client (`github.com/coder/websocket`) with reconnect (1s → 60s exponential + jitter), optional cert pin (sha-256 of leaf), heartbeats, `protocol_version` in hello
- P1-16 (M) Restic wrapper: locate via PATH or override, run with `--json`, scan stdout/stderr, parse `BackupStatus` + `BackupSummary`, exit-code 3 treated as success-with-issues
- P1-17 (S) Host metadata collection (OS, arch, hostname, restic version, agent version, protocol_version)
Run-now backup
- P1-18 (L) Job lifecycle: queued → running → succeeded/failed/cancelled, persisted with stats blob and error
- P1-19 (M) Server endpoint `POST /api/hosts/{id}/jobs` to dispatch a `backup` command (validates kind, checks online, audit-logs)
- P1-20 (M) Agent executes `restic backup`, streams stdout/stderr + parsed JSON events back as `job.progress` (1Hz throttle) / `log.stream`
- [~] P1-21 (M) Server persists log stream to `job_logs` ✓; WS `/api/jobs/{id}/stream` for live browser tailing still TODO (needs the per-job fan-out hub)
- P1-22 (S) Snapshot listing: agent calls `restic snapshots --json` after each successful backup and ships the projection over `snapshots.report`. Server `ReplaceHostSnapshots` atomically swaps the per-host list and updates `hosts.snapshot_count` in the same tx. Read endpoint: `GET /api/hosts/{id}/snapshots`. Tests cover round-trip, authoritative-replace semantics, and empty-after-prune. Schema dropped the unused `repo_id` FK from `snapshots` (repos as a first-class entity is P2 work).
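
The job lifecycle from P1-18 is a small state machine, and one way to enforce it is a transition table. The sketch below assumes that a queued job may also fail or be cancelled before it ever runs; that detail, along with the type and function names, is not stated in the task list.

```go
package main

import "fmt"

// JobState mirrors the lifecycle states named in P1-18.
type JobState string

const (
	Queued    JobState = "queued"
	Running   JobState = "running"
	Succeeded JobState = "succeeded"
	Failed    JobState = "failed"
	Cancelled JobState = "cancelled"
)

// transitions lists the allowed next states; terminal states have no entry.
var transitions = map[JobState][]JobState{
	Queued:  {Running, Failed, Cancelled}, // assumption: dispatch can fail or be cancelled early
	Running: {Succeeded, Failed, Cancelled},
}

// canTransition reports whether a job may move from one state to another.
func canTransition(from, to JobState) bool {
	for _, next := range transitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

// isTerminal reports whether no further transitions are possible.
func isTerminal(s JobState) bool {
	return len(transitions[s]) == 0
}

func main() {
	fmt.Println(canTransition(Queued, Running))    // allowed
	fmt.Println(canTransition(Succeeded, Running)) // rejected: succeeded is terminal
	fmt.Println(isTerminal(Failed))
}
```

Guarding every status update with a check like this keeps a late or duplicated agent message from reviving a finished job.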
UI (HTMX + Tailwind)
- P1-23 (M) Base layout, login page, session-aware nav
- P1-24 (M) Dashboard: fleet summary tiles + host table (status dot + row accent + os/arch + last backup + repo size + snapshots + alerts + tags + run-now). Backed by `GET /api/hosts` + `GET /api/fleet/summary` (JSON) and a server-rendered HTML view. Empty state hands the operator the install command. HTMX `Run now` button posts to `/hosts/{id}/run-backup`.
- P1-25 (M) Host detail page: snapshots tab + run-now button
- P1-26 (M) Live job log viewer (WS-driven, auto-scroll, cancel button)
- P1-27 (M) "Add host" flow: form takes hostname + repo URL/username/password, mints token (TTL 1h), shows the operator a copy-friendly install command and a one-click "download preconfigured installer": an `install-<hostname>.sh` with `RM_SERVER` + `RM_TOKEN` already templated in (cf. UrBackup Internet-mode push installer). Encrypted repo creds ride on the token row and get pushed to the agent on first WS connect (see secrets/keyring task).
- P1-28 (S) Tailwind build via `tailwindcss` standalone binary (no Node): Makefile downloads pinned v3.4.17 into `bin/tailwindcss`, builds `web/styles/input.css` → `web/static/css/styles.css`, embedded into the binary via `web.FS`. `make build` runs Tailwind first.
Install scripts
- P1-29 (M) `install.sh` (Linux): detects arch, downloads agent, installs systemd unit, enrolls. Detects existing restic timers / `/etc/cron.{d,daily,hourly,weekly}/*` / root crontab and prints them with the exact disable commands; does not auto-disable
- [~] P1-31 (S) Server endpoint to serve agent binaries + install scripts ✓ (`/agent/binary` + `/install/*`); signature verification deferred to Phase 5 OSS readiness
Repo credentials (pulled forward from Phase 2)
- P1-32 (M) Server-side encrypted repo creds carried on the enrollment token:
  - `POST /api/enrollment-tokens` body grows `repo_url`, `repo_username`, `repo_password` (all required).
  - Token row stores them as one AEAD-encrypted blob (existing `crypto.AEAD`); `ConsumeEnrollmentToken` moves the blob to a new `host_credentials` row keyed by `host_id` in the same tx.
  - `PUT /api/hosts/{id}/repo-credentials` (admin/operator) re-encrypts and replaces the row, emits an in-memory event to the WS hub.
  - `GET /api/hosts/{id}/repo-credentials` returns the redacted view (URL + username + `has_password`) so the UI can pre-fill the edit form. Password never leaves the server outside the WS push.
  - On WS `hello`, server pushes a `config.update` with decrypted creds before returning the connection to idle. Same path on edit-while-connected.
  - Audit-logged on create / consume / edit; payload omits the secret material.
- P1-33 (M) Agent-side encrypted secrets store:
  - New `internal/agent/secrets` package: AEAD blob at `/var/lib/restic-manager/secrets.enc`, atomic write (tmp+fsync+rename, mode 0600).
  - Per-host 32-byte secrets key minted at enrollment, persisted in `agent.yaml` (already 0600 root-only, same trust boundary as the bearer; explicit comment in the file).
  - Strip `repo_url` / `repo_password` from `agent.config.Config`. Agent loads creds from `secrets.enc` at startup; `config.update` handler writes through to the file.
  - Dispatcher reads from the secrets store on every job rather than from in-memory config.
  - Migration path: if `agent.yaml` still contains `repo_url` / `repo_password`, copy them into `secrets.enc` on next start, then strip from the YAML on save.
- P1-34 (S) End-to-end smoke runbook: `docs/e2e-smoke.md` walks through enrollment with repo creds → agent receives them via push-on-connect → run-now backup completes against a real `restic/rest-server` in a sibling container → host appears with snapshot count. Test-driven version (Playwright + compose) deferred to P5-06.
Phase 1 acceptance
- One Linux host can enroll, appear in the dashboard, and a backup can be triggered from the UI with live log streaming. Snapshots list updates after success.
- Windows binary builds cleanly in CI (`.gitea/workflows/ci.yml`) but is not service-tested or installer-shipped in Phase 1; that lands in Phase 2 (P2-16, P2-17).
- Agent ↔ server `protocol_version` handshake rejects mismatched versions with a clear error rather than failing on JSON parse.
- Repo credentials never appear in plaintext on disk: server stores them AEAD-encrypted, agent stores them AEAD-encrypted, the wire carries them only inside the authenticated WS as `config.update`.
Phase 2 — Scheduling, retention, repo operations
- P2-01 (M) Schedule schema + CRUD API
- P2-02 (L) Server-pushed schedule reconciliation (server is source of truth; agent applies)
- P2-03 (M) Agent local scheduler (`robfig/cron/v3`); persists next-fire times across restarts
- P2-04 (M) Schedule editor UI (paths, excludes, tags, cron, retention)
- P2-05 (M) `forget` command with retention policy (keep-last/daily/weekly/monthly/yearly)
- P2-06 (M) `prune` command (admin-only, uses non-append-only credential)
- P2-07 (S) `check` command (random subset + `--read-data-subset`)
- P2-08 (S) `unlock` command
- P2-09 (M) Repo stats panel: size, dedup ratio, snapshot count, last check time, lock state
- P2-10 (S) Run-now buttons for forget/prune/check/unlock on host detail page
- P2-11 (S) Schedule "next run" / "last run" surfaced on host card
- P2-12 (S) Bandwidth limit fields on schedule editor (`--limit-upload`, `--limit-download`); also overridable on run-now jobs
- P2-13 (M) Pre/post backup hooks: schema (`Schedule.pre_hook`, `Schedule.post_hook`, `Host.pre_hook_default`, `Host.post_hook_default`), encrypted at rest, admin-only edit, audit-logged
- P2-14 (M) Agent execution of hooks: configurable shell per host, `pre_hook` failure aborts backup, `post_hook` always runs with `RM_JOB_STATUS` env var, stdout/stderr captured into `JobLog` with prefix
- P2-15 (S) Hook editor UI on schedule + host pages, with sensible warnings (e.g. "this hook runs as the agent service user"); validation enforces hooks only on `kind = backup` schedules (see spec.md §14.3)
- P2-16 (M) Windows service integration: agent runs under the Service Control Manager via `golang.org/x/sys/windows/svc`; install/uninstall/start/stop wired up
- P2-17 (M) `install.ps1` (Windows): downloads agent, installs as service, enrolls; detects existing scheduled tasks named `*restic*` and prints them for manual review
- P2-18 (L) Announce-and-approve enrollment (second enrollment mode, alongside the token flow that ships in Phase 1):
  - Agent run with no `RM_TOKEN` generates a local Ed25519 keypair (persisted alongside the encrypted secrets blob), then `POST /api/agents/announce` with `{hostname, os, arch, agent_version, restic_version, public_key}`. Server stores a `pending_hosts` row (`public_key`, `fingerprint = sha256(public_key)`, `announced_from_ip`, `first_seen_at`, `last_seen_at`, `expires_at = now+1h`). Hostname collisions with existing or other pending rows are flagged in the response so the install script can warn loudly on the endpoint terminal.
  - Agent then opens a long-poll/WS to `/ws/agent/pending` authenticated by signing a server-issued nonce with its private key, which proves possession of the key tied to the pending row. Connection stays open; agent waits.
  - Install script prints the fingerprint on the endpoint's terminal in a copy-friendly form (e.g. `SHA256:ab12…cd34`) and tells the operator to compare it to the one shown in the UI before clicking accept.
  - UI: new "Pending hosts" panel on the dashboard. Admin sees fingerprint, hostname, source IP, OS/arch, time announced. Buttons: Accept (mints persistent bearer + repo creds, pushes both down the open pending socket, promotes pending row → real Host row, audit-logged) / Reject (deletes pending row, closes the socket with a clean error). Fingerprint is the load-bearing field; UI must make comparison easy (large monospace, one-click copy).
  - Server-side guards: per-source-IP rate limit on `/api/agents/announce` (token-bucket, e.g. 10/min); global cap on pending rows (e.g. 100); pending rows auto-expire after 1h; duplicate-hostname pending rows allowed but visually flagged in UI; accepting one does not auto-reject the others (admin sees them all and decides; this defends against the "attacker announces first, real host second" race).
  - Token-based enrollment (Phase 1) remains the default and is unchanged; announce-and-approve is opt-in for interactive installs. Docs explicitly call out that the fingerprint comparison step is what makes this flow safe; without it, this is no better than trusting `hostname` over the wire.
Phase 2 acceptance
- Schedules created in UI run on agents on time; retention is applied; admin can prune from UI; repo health visible per host. Pre/post hooks fire correctly (verified with a Docker stop/start example and a `mysqldump` example) and are rejected on non-backup schedule kinds. Bandwidth limits honoured.
- A Windows host can enroll, appear in the dashboard, and run a backup with live log streaming, closing the cross-platform gap left by Phase 1.
- A Linux host can enroll via announce-and-approve: operator runs the install script with no token, sees a fingerprint in the terminal, finds the matching pending row in the UI, clicks accept, and the host is fully credentialled and online without further endpoint interaction. Rejecting a pending row leaves the agent process exited cleanly with a clear log line. Rate-limit and pending-cap guards verified under a synthetic flood.
Phase 3 — Restore, alerts, audit
- P3-01 (L) Restore wizard backend: snapshot tree browse via `restic ls --json`, path picker, target selection
- P3-02 (L) Restore wizard UI (multi-step: host → snapshot → paths → target → confirm)
- P3-03 (M) Restore execution: `restic restore` invocation, progress streaming
- P3-04 (L) Cross-host restore: target agent receives a temporary scoped read credential for source host's repo (single-job, auto-revoked); UI supports source→target path remapping; warns when source paths need root and target service user is non-root
- P3-05 (M) Alert engine: rule evaluation loop (failed backup, stale schedule, agent offline, check failed)
- P3-06 (M) Notification channels: webhook, ntfy, SMTP email
- P3-07 (S) Alert UI: list, acknowledge, resolve
- P3-08 (S) Audit log UI with filters (user, action, target, time range)
- P3-09 (S) `diff` between two snapshots in UI
Phase 3 acceptance
- A file deleted on a host can be restored from the UI in under 2 minutes. A failed backup raises an alert via the configured channel within 60s.
Phase 4 — Update delivery, RBAC polish, OIDC
- P4-01 (M) Update delivery via OS package managers: host an apt repo (Linux) and Chocolatey package (Windows) on gitea releases. `restic-manager-agent update` is a thin wrapper over `apt-get install --only-upgrade restic-manager-agent` / `choco upgrade`. Trades flexibility for a much smaller security surface than bespoke signed binaries (see spec.md §4.2)
- P4-02 (M) Agent version reporting on dashboard: surface "agent N versions behind server"; "update all" admin action calls the package-manager wrapper on each host
- P4-03 (M) RBAC enforcement at API layer (admin / operator / viewer)
- P4-04 (S) User management UI (create/edit/disable, role assignment, password reset)
- P4-05 (L) OIDC login (generic provider config, group → role mapping)
- P4-06 (M) Repo size trend graphs (sparkline on host card, full chart on repo page)
- P4-07 (S) Per-host tags + dashboard filtering by tag
- P4-08 (M) Prometheus `/metrics` endpoint: per-host gauges (last backup timestamp, last backup status, repo size, snapshot count, agent online), server gauges (active alerts, build info), job duration histograms; protected by bearer token or IP allow-list
- P4-09 (S) Document Prometheus integration + sample Grafana dashboard JSON
Phase 4 acceptance
- Non-admin users see an appropriately limited UI. Agents upgrade via apt/choco with one admin-triggered action. OIDC login works against at least one provider (Authelia or Authentik). Prometheus can scrape `/metrics` and the sample Grafana dashboard renders with live data.
Phase 5 — OSS readiness
- P5-01 (M) Documentation site (mdBook or similar) with install, concepts, security model, screenshots
- P5-02 (S) `CONTRIBUTING.md`, `CODE_OF_CONDUCT.md`, issue + PR templates
- P5-03 (S) Release automation: `goreleaser` for binaries + Docker image to GHCR
- P5-04 (S) Demo screenshots / short Loom walkthrough in README
- P5-05 (S) `SECURITY.md` with disclosure process
- P5-06 (M) End-to-end test suite in CI (Playwright vs. compose stack with sibling Linux agent)
- P5-07 (S) Reference deployment: `docker-compose.yml` + Caddyfile snippet showing the TLS-terminating reverse proxy in front of the HTTP-only server (also demonstrates `RM_TRUSTED_PROXY`)
Phase 5 acceptance
- A stranger can read the docs and stand up a working install in under 30 minutes.
Cross-cutting / ongoing
- X-01 Keep CHANGELOG.md updated (Keep-a-Changelog format)
- X-02 Track restic version compatibility matrix
- X-03 Periodic dependency updates (`dependabot` or `renovate`)
- X-04 Threat-model review at end of each phase