ee16bc7ce7
Server-rendered HTML view backed by:
- new store.FleetSummary aggregating host counts + repo bytes +
snapshot total + open alerts + last-24h job rollup in two queries.
- GET /api/hosts (JSON list of hosts in the dashboard projection).
- GET /api/fleet/summary (JSON aggregate, same shape as above).
The HTML page (web/templates/pages/dashboard.html) renders the four
summary tiles + host table directly from store data — no separate
fetch. Per-row state colour comes from .host-row.{degraded,failed,
offline} which paint a 3px left edge so problem hosts are scannable
without reading. HTMX is loaded into the base layout so per-row
"Run now" buttons can hx-post to /hosts/{id}/run-backup, a thin
HTML wrapper that funnels into a new dispatchJob helper shared
with the JSON /api/hosts/{id}/jobs endpoint.
Empty state (zero hosts) collapses to the "no hosts yet" prompt
with the + Add host CTA — matches the v1 mockup.
Template helpers (internal/server/ui/funcs.go) added for byte
formatting (412 GB / 3.7 TB), relative time (3m ago / 2d ago), and
comma grouping (1,847). Pure Go, no template-magic dependency.
Browser-verified end-to-end with seeded fixture data: five hosts
across all four states render with correct dots, accents, last-
backup pills, sizes, snapshot counts, alerts, tags, and the right
action button (Run now / Retry / Run first / View → / offline).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
191 lines
18 KiB
Markdown
# restic-manager — Tasks

Tasks are grouped by phase. Each task has an ID for cross-referencing, an estimated size (S/M/L), and acceptance criteria.

Sizes: **S** = under a day, **M** = 1–3 days, **L** = 3–7 days.

---

## Phase 0 — Project bootstrap

- [x] **P0-01** (S) Initialize Go module, `cmd/server`, `cmd/agent`, baseline `internal/` packages
- [x] **P0-02** (S) Add LICENSE (PolyForm Noncommercial 1.0.0), README stub, CONTRIBUTING placeholder
- [x] **P0-03** (S) Set up `golangci-lint`, `gofumpt`, `goimports`; pre-commit config
- [x] **P0-04** (S) ~~GitHub Actions~~ Gitea Actions: build matrix (linux amd64/arm64, windows amd64), unit tests, lint
- [x] **P0-05** (S) `Dockerfile.server` (multi-stage, distroless), `deploy/docker-compose.yml`
- [x] **P0-06** (S) Makefile / ~~`taskfile.yml`~~ with common targets (`build`, `test`, `run`, `release`)

---

## Phase 1 — MVP: enrollment, visibility, on-demand backup

### Server foundations

- [x] **P1-01** (M) HTTP server scaffolding (`chi`, structured logging via `slog`, graceful shutdown)
- [x] **P1-02** (M) SQLite store layer (`modernc.org/sqlite`) + migrations (hand-rolled, `embed.FS`)
- [x] **P1-03** (M) Schema for `users`, `sessions`, `hosts`, `repos`, `credentials`, `jobs`, `job_logs`, `snapshots`, `audit_log`
- [~] **P1-04** (M) Auth: argon2id password hashing, login/logout, session cookies; **CSRF middleware deferred to P1-23 (UI work)** — REST clients use bearer/session-only flows
- [x] **P1-05** (S) First-run admin bootstrap (printed one-time setup token in server logs)
- [x] **P1-06** (M) Secret encryption helper (AEAD with key from `RM_SECRET_KEY_FILE`)
- [~] **P1-07** (M) Audit log writer; middleware sweep for every state-changing endpoint **lands when the rest of the API surface does** — login / bootstrap / host.enrolled / job.run_now currently audited

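
The secret encryption helper in P1-06 can be pictured roughly as follows. This is a minimal sketch, not the project's `crypto.AEAD`: the choice of AES-256-GCM, the nonce-prefixed blob layout, and the names `Seal` and `Open` are all assumptions made for illustration. The task list only says "AEAD" with a key read from `RM_SECRET_KEY_FILE`.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// newGCM builds an AEAD from a 32-byte key.
// AES-256-GCM is an assumption; the task list does not name a cipher.
func newGCM(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// Seal encrypts plaintext and returns nonce || ciphertext || tag.
func Seal(key, plaintext []byte) ([]byte, error) {
	aead, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, aead.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	// Passing nonce as dst makes Seal append the ciphertext after it.
	return aead.Seal(nonce, nonce, plaintext, nil), nil
}

// Open reverses Seal and fails if the blob was modified.
func Open(key, blob []byte) ([]byte, error) {
	aead, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	if len(blob) < aead.NonceSize() {
		return nil, errors.New("blob shorter than nonce")
	}
	nonce, ciphertext := blob[:aead.NonceSize()], blob[aead.NonceSize():]
	return aead.Open(nil, nonce, ciphertext, nil)
}

func main() {
	// A real server would read the key from the file named by RM_SECRET_KEY_FILE.
	key := make([]byte, 32)
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	blob, err := Seal(key, []byte("repo-password"))
	if err != nil {
		panic(err)
	}
	plain, err := Open(key, blob)
	fmt.Println(string(plain), err)
}
```

A fresh random nonce per call means the same secret never encrypts to the same blob twice, and any tampering with the stored blob surfaces as an error from `Open`.
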
### Agent ↔ server protocol

- [x] **P1-08** (M) Define shared API types in `internal/api` (envelopes, every WS message + `protocol_version` constants; JSON-shape tests pin the wire)
- [x] **P1-09** (L) WebSocket transport (`github.com/coder/websocket`), framed JSON envelopes, RPC correlation IDs, exponential-backoff reconnect on the agent side
- [x] **P1-10** (M) Enrollment flow: `POST /api/agents/enroll` with one-time token → returns persistent bearer. Cert pin field stays in the response shape but is left empty: the server is HTTP-only behind a reverse proxy, so the operator pastes the proxy's cert hash into the install command rather than having the server introspect a cert it doesn't terminate.
- [x] **P1-11** (M) Agent registration on connect (`hello` upserts agent_version/restic_version/protocol_version, flips status online, `protocol_too_old` rejection has a clean error envelope)
- [x] **P1-12** (S) Heartbeat handler (touches `last_seen_at`; background sweeper marks hosts offline after 90s without one)

### Agent foundations

- [x] **P1-13** (M) Agent config file (`/etc/restic-manager/agent.yaml`, atomic save: tmp+fsync+rename); Windows path deferred to Phase 2
- [x] **P1-14** (M) Service integration: systemd unit (sandboxed: NoNewPrivileges, Protect*, MemoryDenyWriteExecute) — Linux only in Phase 1
- [x] **P1-15** (M) Outbound WS client (`github.com/coder/websocket`) with reconnect (1s → 60s exponential + jitter), optional cert pin (sha-256 of leaf), heartbeats, `protocol_version` in hello
- [x] **P1-16** (M) Restic wrapper: locate via PATH or override, run with `--json`, scan stdout/stderr, parse `BackupStatus` + `BackupSummary`, exit-code 3 treated as success-with-issues
- [x] **P1-17** (S) Host metadata collection (OS, arch, hostname, restic version, agent version, protocol_version)

### Run-now backup

- [x] **P1-18** (L) Job lifecycle: queued → running → succeeded/failed/cancelled, persisted with stats blob and error
- [x] **P1-19** (M) Server endpoint `POST /api/hosts/{id}/jobs` to dispatch a `backup` command (validates kind, checks online, audit-logs)
- [x] **P1-20** (M) Agent executes `restic backup`, streams stdout/stderr + parsed JSON events back as `job.progress` (1Hz throttle) / `log.stream`
- [~] **P1-21** (M) Server persists log stream to `job_logs` ✓; **WS `/api/jobs/{id}/stream` for live browser tailing** still TODO — needs the per-job fan-out hub
- [x] **P1-22** (S) Snapshot listing: agent calls `restic snapshots --json` after each successful backup and ships the projection over `snapshots.report`. Server `ReplaceHostSnapshots` atomically swaps the per-host list and updates `hosts.snapshot_count` in the same tx. Read endpoint: `GET /api/hosts/{id}/snapshots`. Tests cover round-trip, authoritative-replace semantics, and empty-after-prune. Schema dropped the unused `repo_id` FK from `snapshots` (repos as a first-class entity is P2 work).

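
The job lifecycle from P1-18 can be written down as a small transition table. The sketch below is illustrative: the type and function names are hypothetical, and allowing `queued → cancelled` directly is an assumption that the task list does not state either way.

```go
package main

import "fmt"

type JobState string

const (
	Queued    JobState = "queued"
	Running   JobState = "running"
	Succeeded JobState = "succeeded"
	Failed    JobState = "failed"
	Cancelled JobState = "cancelled"
)

// allowed lists the legal next states. Terminal states have no entry.
// queued -> cancelled is an assumption, not taken from the task list.
var allowed = map[JobState][]JobState{
	Queued:  {Running, Cancelled},
	Running: {Succeeded, Failed, Cancelled},
}

// CanTransition reports whether a job may move from one state to another.
func CanTransition(from, to JobState) bool {
	for _, next := range allowed[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(CanTransition(Queued, Running))    // true
	fmt.Println(CanTransition(Succeeded, Running)) // false
}
```

Checking transitions in one place keeps a late `job.progress` message from reviving a job that has already reached a terminal state.
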
### UI (HTMX + Tailwind)

- [x] **P1-23** (M) Base layout, login page, session-aware nav
- [x] **P1-24** (M) Dashboard: fleet summary tiles + host table (status dot + row accent + os/arch + last backup + repo size + snapshots + alerts + tags + run-now). Backed by `GET /api/hosts` + `GET /api/fleet/summary` (JSON) and a server-rendered HTML view. Empty state hands the operator the install command. HTMX `Run now` button posts to `/hosts/{id}/run-backup`.
- [ ] **P1-25** (M) Host detail page: snapshots tab + run-now button
- [ ] **P1-26** (M) Live job log viewer (WS-driven, auto-scroll, cancel button)
- [ ] **P1-27** (M) "Add host" flow: form takes hostname + repo URL/username/password, mints token (TTL 1h), shows the operator a copy-friendly install command **and** a one-click "download preconfigured installer" — an `install-<hostname>.sh` with `RM_SERVER` + `RM_TOKEN` already templated in (cf. UrBackup Internet-mode push installer). Encrypted repo creds ride on the token row and get pushed to the agent on first WS connect (see secrets/keyring task).
- [x] **P1-28** (S) Tailwind build via `tailwindcss` standalone binary (no Node) — Makefile downloads pinned v3.4.17 into `bin/tailwindcss`, builds `web/styles/input.css` → `web/static/css/styles.css`, embedded into the binary via `web.FS`. `make build` runs Tailwind first.

### Install scripts

- [x] **P1-29** (M) `install.sh` (Linux): detects arch, downloads agent, installs systemd unit, enrolls. Detects existing restic timers / `/etc/cron.{d,daily,hourly,weekly}/*` / root crontab and prints them with the exact disable commands — does **not** auto-disable
- [~] **P1-31** (S) Server endpoint to serve agent binaries + install scripts ✓ (`/agent/binary` + `/install/*`); **signature verification** deferred to Phase 5 OSS readiness

### Repo credentials (pulled forward from Phase 2)

- [x] **P1-32** (M) Server-side encrypted repo creds carried on the enrollment token:
  - `POST /api/enrollment-tokens` body grows `repo_url`, `repo_username`, `repo_password` (all required).
  - Token row stores them as one AEAD-encrypted blob (existing `crypto.AEAD`); `ConsumeEnrollmentToken` moves the blob to a new `host_credentials` row keyed by `host_id` in the same tx.
  - `PUT /api/hosts/{id}/repo-credentials` (admin/operator) re-encrypts and replaces the row, emits an in-memory event to the WS hub.
  - `GET /api/hosts/{id}/repo-credentials` returns the redacted view (URL + username + `has_password`) so the UI can pre-fill the edit form. Password never leaves the server outside the WS push.
  - On WS `hello`, server pushes a `config.update` with decrypted creds **before** returning the connection to idle. Same path on edit-while-connected.
  - Audit-logged on create / consume / edit; payload omits the secret material.

- [x] **P1-33** (M) Agent-side encrypted secrets store:
  - New `internal/agent/secrets` package: AEAD blob at `/var/lib/restic-manager/secrets.enc`, atomic write (tmp+fsync+rename, mode 0600).
  - Per-host 32-byte secrets key minted at enrollment, persisted in `agent.yaml` (already 0600 root-only — same trust boundary as the bearer; explicit comment in the file).
  - Strip `repo_url` / `repo_password` from `agent.config.Config`. Agent loads creds from `secrets.enc` at startup; `config.update` handler writes through to the file.
  - Dispatcher reads from the secrets store on every job rather than from in-memory config.
  - Migration path: if `agent.yaml` still contains `repo_url`/`repo_password`, copy them into `secrets.enc` on next start, then strip from the YAML on save.

- [x] **P1-34** (S) End-to-end smoke runbook: `docs/e2e-smoke.md` walks through enrollment with repo creds → agent receives them via push-on-connect → run-now backup completes against a real `restic/rest-server` in a sibling container → host appears with snapshot count. Test-driven version (Playwright + compose) deferred to P5-06.

### Phase 1 acceptance

- One Linux host can enroll, appear in the dashboard, and a backup can be triggered from the UI with live log streaming. Snapshots list updates after success.
- Windows binary builds cleanly in CI (`.gitea/workflows/ci.yml`) but is not service-tested or installer-shipped in Phase 1 — that lands in Phase 2 (P2-16, P2-17).
- Agent ↔ server `protocol_version` handshake rejects mismatched versions with a clear error rather than failing on JSON parse.
- Repo credentials never appear in plaintext on disk: server stores them AEAD-encrypted, agent stores them AEAD-encrypted, the wire carries them only inside the authenticated WS as `config.update`.

---

## Phase 2 — Scheduling, retention, repo operations

- [ ] **P2-01** (M) Schedule schema + CRUD API
- [ ] **P2-02** (L) Server-pushed schedule reconciliation (server is source of truth; agent applies)
- [ ] **P2-03** (M) Agent local scheduler (`robfig/cron/v3`); persists next-fire times across restarts
- [ ] **P2-04** (M) Schedule editor UI (paths, excludes, tags, cron, retention)
- [ ] **P2-05** (M) `forget` command with retention policy (keep-last/daily/weekly/monthly/yearly)
- [ ] **P2-06** (M) `prune` command (admin-only, uses non-append-only credential)
- [ ] **P2-07** (S) `check` command (random subset + `--read-data-subset`)
- [ ] **P2-08** (S) `unlock` command
- [ ] **P2-09** (M) Repo stats panel: size, dedup ratio, snapshot count, last check time, lock state
- [ ] **P2-10** (S) Run-now buttons for forget/prune/check/unlock on host detail page
- [ ] **P2-11** (S) Schedule "next run" / "last run" surfaced on host card
- [ ] **P2-12** (S) Bandwidth limit fields on schedule editor (`--limit-upload`, `--limit-download`); also overridable on run-now jobs
- [ ] **P2-13** (M) Pre/post backup hooks: schema (`Schedule.pre_hook`, `Schedule.post_hook`, `Host.pre_hook_default`, `Host.post_hook_default`), encrypted at rest, admin-only edit, audit-logged
- [ ] **P2-14** (M) Agent execution of hooks: configurable shell per host, `pre_hook` failure aborts backup, `post_hook` always runs with `RM_JOB_STATUS` env var, stdout/stderr captured into `JobLog` with prefix
- [ ] **P2-15** (S) Hook editor UI on schedule + host pages, with sensible warnings (e.g. "this hook runs as the agent service user"); validation enforces hooks only on `kind = backup` schedules (see spec.md §14.3)
- [ ] **P2-16** (M) Windows service integration: agent runs under the Service Control Manager via `golang.org/x/sys/windows/svc`; install/uninstall/start/stop wired up
- [ ] **P2-17** (M) `install.ps1` (Windows): downloads agent, installs as service, enrolls; detects existing scheduled tasks named `*restic*` and prints them for manual review
- [ ] **P2-18** (L) Announce-and-approve enrollment (second enrollment mode, alongside the token flow that ships in Phase 1):
  - Agent run with no `RM_TOKEN` generates a local Ed25519 keypair (persisted alongside the encrypted secrets blob), then `POST /api/agents/announce` with `{hostname, os, arch, agent_version, restic_version, public_key}`. Server stores a `pending_hosts` row (`public_key`, `fingerprint = sha256(public_key)`, `announced_from_ip`, `first_seen_at`, `last_seen_at`, `expires_at = now+1h`). Hostname collisions with existing or other pending rows are flagged in the response so the install script can warn loudly on the endpoint terminal.
  - Agent then opens a long-poll/WS to `/ws/agent/pending` authenticated by signing a server-issued nonce with its private key — proves possession of the key tied to the pending row. Connection stays open; agent waits.
  - Install script prints the fingerprint on the endpoint's terminal in a copy-friendly form (e.g. `SHA256:ab12…cd34`) and tells the operator to compare it to the one shown in the UI before clicking accept.
  - UI: new "Pending hosts" panel on the dashboard. Admin sees fingerprint, hostname, source IP, OS/arch, time announced. Buttons: **Accept** (mints persistent bearer + repo creds, pushes both down the open pending socket, promotes pending row → real Host row, audit-logged) / **Reject** (deletes pending row, closes the socket with a clean error). Fingerprint is the load-bearing field — UI must make comparison easy (large monospace, one-click copy).
  - Server-side guards: per-source-IP rate limit on `/api/agents/announce` (token-bucket, e.g. 10/min); global cap on pending rows (e.g. 100); pending rows auto-expire after 1h; duplicate-hostname pending rows allowed but visually flagged in UI; accepting one does **not** auto-reject the others (admin sees them all and decides — defends against the "attacker announces first, real host second" race).
  - Token-based enrollment (Phase 1) remains the default and is unchanged; announce-and-approve is opt-in for interactive installs. Docs explicitly call out that the fingerprint comparison step is what makes this flow safe — without it, this is no better than trusting `hostname` over the wire.

### Phase 2 acceptance

- Schedules created in UI run on agents on time; retention is applied; admin can prune from UI; repo health visible per host. Pre/post hooks fire correctly (verified with a Docker stop/start example and a `mysqldump` example) and are rejected on non-backup schedule kinds. Bandwidth limits honoured.
- A Windows host can enroll, appear in the dashboard, and run a backup with live log streaming — closing the cross-platform gap left by Phase 1.
- A Linux host can enroll via announce-and-approve: operator runs the install script with no token, sees a fingerprint in the terminal, finds the matching pending row in the UI, clicks accept, and the host is fully credentialled and online without further endpoint interaction. Rejecting a pending row leaves the agent process exited cleanly with a clear log line. Rate-limit and pending-cap guards verified under a synthetic flood.

---

## Phase 3 — Restore, alerts, audit

- [ ] **P3-01** (L) Restore wizard backend: snapshot tree browse via `restic ls --json`, path picker, target selection
- [ ] **P3-02** (L) Restore wizard UI (multi-step: host → snapshot → paths → target → confirm)
- [ ] **P3-03** (M) Restore execution: `restic restore` invocation, progress streaming
- [ ] **P3-04** (L) Cross-host restore: target agent receives a temporary scoped read credential for source host's repo (single-job, auto-revoked); UI supports source→target path remapping; warns when source paths need root and target service user is non-root
- [ ] **P3-05** (M) Alert engine: rule evaluation loop (failed backup, stale schedule, agent offline, check failed)
- [ ] **P3-06** (M) Notification channels: webhook, ntfy, SMTP email
- [ ] **P3-07** (S) Alert UI: list, acknowledge, resolve
- [ ] **P3-08** (S) Audit log UI with filters (user, action, target, time range)
- [ ] **P3-09** (S) `diff` between two snapshots in UI

### Phase 3 acceptance

- A file deleted on a host can be restored from the UI in under 2 minutes. A failed backup raises an alert via the configured channel within 60s.

---

## Phase 4 — Update delivery, RBAC polish, OIDC

- [ ] **P4-01** (M) Update delivery via OS package managers — host an apt repo (Linux) and Chocolatey package (Windows) on Gitea releases. `restic-manager-agent update` is a thin wrapper over `apt-get install --only-upgrade restic-manager-agent` / `choco upgrade`. Trades flexibility for a much smaller security surface than bespoke signed binaries (see spec.md §4.2)
- [ ] **P4-02** (M) Agent version reporting on dashboard: surface "agent N versions behind server"; "update all" admin action calls the package-manager wrapper on each host
- [ ] **P4-03** (M) RBAC enforcement at API layer (admin / operator / viewer)
- [ ] **P4-04** (S) User management UI (create/edit/disable, role assignment, password reset)
- [ ] **P4-05** (L) OIDC login (generic provider config, group → role mapping)
- [ ] **P4-06** (M) Repo size trend graphs (sparkline on host card, full chart on repo page)
- [ ] **P4-07** (S) Per-host tags + dashboard filtering by tag
- [ ] **P4-08** (M) Prometheus `/metrics` endpoint: per-host gauges (last backup timestamp, last backup status, repo size, snapshot count, agent online), server gauges (active alerts, build info), job duration histograms; protected by bearer token or IP allow-list
- [ ] **P4-09** (S) Document Prometheus integration + sample Grafana dashboard JSON

### Phase 4 acceptance

- Non-admin users see an appropriately limited UI. Agents upgrade via apt/choco with one admin-triggered action. OIDC login works against at least one provider (Authelia or Authentik). Prometheus can scrape `/metrics` and the sample Grafana dashboard renders with live data.

---

## Phase 5 — OSS readiness

- [ ] **P5-01** (M) Documentation site (mdBook or similar) with install, concepts, security model, screenshots
- [ ] **P5-02** (S) `CONTRIBUTING.md`, `CODE_OF_CONDUCT.md`, issue + PR templates
- [ ] **P5-03** (S) Release automation: `goreleaser` for binaries + Docker image to GHCR
- [ ] **P5-04** (S) Demo screenshots / short Loom walkthrough in README
- [ ] **P5-05** (S) `SECURITY.md` with disclosure process
- [ ] **P5-06** (M) End-to-end test suite in CI (Playwright vs. compose stack with sibling Linux agent)
- [ ] **P5-07** (S) Reference deployment: `docker-compose.yml` + Caddyfile snippet showing the TLS-terminating reverse proxy in front of the HTTP-only server (also demonstrates `RM_TRUSTED_PROXY`)

### Phase 5 acceptance

- A stranger can read the docs and stand up a working install in under 30 minutes.

---

## Cross-cutting / ongoing

- [ ] **X-01** Keep CHANGELOG.md updated (Keep-a-Changelog format)
- [ ] **X-02** Track restic version compatibility matrix
- [ ] **X-03** Periodic dependency updates (`dependabot` or `renovate`)
- [ ] **X-04** Threat-model review at end of each phase