restic-manager/tasks.md
steve 595546afb9
spec/tasks: address pre-Phase-1 design feedback
Doc-only changes captured before any Phase 1 code lands.

spec.md:
- §4.1 nhooyr.io/websocket → github.com/coder/websocket (the
  maintained fork; the original is unmaintained)
- §4.1 RM_LISTEN documented as source of truth for the bind port;
  add RM_TRUSTED_PROXY env var for X-Forwarded-* handling behind
  Caddy/Traefik
- §4.2 Phase 1 ships Linux only; Windows binaries continue to build
  in CI to keep the codebase portable, but service integration +
  installer move to Phase 2
- §4.2 self-update via apt/choco, not bespoke signed binaries
- §5 add Host.protocol_version + Host.applied_schedule_version
- §6.2 lock protocol_version handshake semantics (clean error on
  mismatch, not weird JSON parse failures)
- §6.2 schedule reconciliation when server unreachable: agent keeps
  firing last-known-good indefinitely; server's view canonical on
  reconnect; UI surfaces drift via applied_schedule_version
- §6.2 schedule.set carries schedule_version; new schedule.ack
  agent→server message
- §10.1 cross-reference RM_LISTEN ↔ compose port mapping
- §14.3 hooks rejected at validation on non-backup schedule kinds

tasks.md:
- P1-14 / P1-30 (Windows service + install.ps1) → Phase 2 as
  P2-16 / P2-17
- P1-29 install.sh detects existing restic timers/cron and prints
  disable commands, doesn't auto-disable
- Phase 1 acceptance: drop Windows from end-to-end criterion,
  require windows cross-compile in CI
- P4-01 rewritten: package-manager-based update delivery
- P5-08 removed (duplicate of P4-08 Prometheus /metrics)
- Various references updated

No Go code changes; build still clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 00:12:55 +01:00

# restic-manager — Tasks
Tasks are grouped by phase. Each task has an ID for cross-referencing, an estimated size (S/M/L), and acceptance criteria.
Sizes: **S** = under a day, **M** = 1–3 days, **L** = 3–7 days.
---
## Phase 0 — Project bootstrap
- [x] **P0-01** (S) Initialize Go module, `cmd/server`, `cmd/agent`, baseline `internal/` packages
- [x] **P0-02** (S) Add LICENSE (PolyForm Noncommercial 1.0.0), README stub, CONTRIBUTING placeholder
- [x] **P0-03** (S) Set up `golangci-lint`, `gofumpt`, `goimports`; pre-commit config
- [x] **P0-04** (S) ~~GitHub Actions~~ Gitea Actions: build matrix (linux amd64/arm64, windows amd64), unit tests, lint
- [x] **P0-05** (S) `Dockerfile.server` (multi-stage, distroless), `deploy/docker-compose.yml`
- [x] **P0-06** (S) Makefile / ~~`taskfile.yml`~~ with common targets (`build`, `test`, `run`, `release`)
---
## Phase 1 — MVP: enrollment, visibility, on-demand backup
### Server foundations
- [ ] **P1-01** (M) HTTP server scaffolding (`chi`, structured logging via `slog`, graceful shutdown)
- [ ] **P1-02** (M) SQLite store layer (`modernc.org/sqlite`) + migrations (`golang-migrate` or hand-rolled)
- [ ] **P1-03** (M) Schema for `users`, `sessions`, `hosts`, `repos`, `credentials`, `jobs`, `job_logs`, `snapshots`, `audit_log`
- [ ] **P1-04** (M) Auth: argon2id password hashing, login/logout, session cookies, CSRF middleware
- [ ] **P1-05** (S) First-run admin bootstrap (printed one-time setup token in server logs)
- [ ] **P1-06** (M) Secret encryption helper (AEAD with key from `RM_SECRET_KEY_FILE`)
- [ ] **P1-07** (M) Audit log writer + middleware
### Agent ↔ server protocol
- [ ] **P1-08** (M) Define shared API types in `internal/api` (Go structs, JSON tags)
- [ ] **P1-09** (L) WebSocket transport (`github.com/coder/websocket`), framed JSON envelopes, request/response correlation, ping/pong, reconnect with backoff
- [ ] **P1-10** (M) Enrollment flow: `POST /api/agents/enroll` with one-time token → returns persistent bearer + cert pin
- [ ] **P1-11** (M) Agent registration on connect (`hello` message → upsert host record, mark online)
- [ ] **P1-12** (S) Heartbeat handler (mark host offline after 90s without heartbeat)
### Agent foundations
- [ ] **P1-13** (M) Agent config file (`/etc/restic-manager/agent.yaml`); Windows path deferred to Phase 2
- [ ] **P1-14** (M) Service integration: systemd unit (Linux only in Phase 1; Windows service entrypoint deferred to Phase 2 — see P2-16)
- [ ] **P1-15** (M) Outbound WS client (`github.com/coder/websocket`) with reconnect, server cert pinning, `protocol_version` advertisement in `hello`
- [ ] **P1-16** (M) Restic wrapper: locate `restic` binary, run with `--json`, stream parsed events
- [ ] **P1-17** (S) Host metadata collection (OS, arch, hostname, restic version, agent version, protocol_version)
### Run-now backup
- [ ] **P1-18** (L) Job lifecycle: queued → running → succeeded/failed/cancelled, persisted with logs
- [ ] **P1-19** (M) Server endpoint `POST /api/hosts/:id/jobs` to dispatch a `backup` command
- [ ] **P1-20** (M) Agent executes `restic backup`, streams stdout/stderr + parsed JSON events back as `job.progress` / `log.stream`
- [ ] **P1-21** (M) Server persists log stream to `job_logs`, exposes `WS /api/jobs/:id/stream` for live tailing
- [ ] **P1-22** (S) Snapshot listing: `restic snapshots --json`, cached projection table, refresh after each backup
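
The job lifecycle in P1-18 (queued → running → succeeded/failed/cancelled) can be enforced with a small transition table. This is a sketch: the type and function names are hypothetical, and allowing a queued job to be cancelled before it starts is an assumption the task does not state.

```go
package main

import "fmt"

type jobState string

const (
	stateQueued    jobState = "queued"
	stateRunning   jobState = "running"
	stateSucceeded jobState = "succeeded"
	stateFailed    jobState = "failed"
	stateCancelled jobState = "cancelled"
)

// transitions lists the states reachable from each non-terminal state.
// Terminal states have no entry, so nothing can leave them.
var transitions = map[jobState][]jobState{
	stateQueued:  {stateRunning, stateCancelled}, // queued → cancelled is assumed
	stateRunning: {stateSucceeded, stateFailed, stateCancelled},
}

// canTransition reports whether a job may move from one state to another.
func canTransition(from, to jobState) bool {
	for _, next := range transitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

type job struct {
	ID    string
	State jobState
}

// advance applies a transition or returns an error and leaves the job unchanged.
func (j *job) advance(to jobState) error {
	if !canTransition(j.State, to) {
		return fmt.Errorf("job %s: illegal transition %s -> %s", j.ID, j.State, to)
	}
	j.State = to
	return nil
}

func main() {
	j := &job{ID: "42", State: stateQueued}
	fmt.Println(j.advance(stateRunning))
	fmt.Println(j.advance(stateSucceeded))
	fmt.Println(j.advance(stateRunning)) // rejected: succeeded is terminal
}
```
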
### UI (HTMX + Tailwind)
- [ ] **P1-23** (M) Base layout, login page, session-aware nav
- [ ] **P1-24** (M) Dashboard: host cards (status dot, last backup, repo size)
- [ ] **P1-25** (M) Host detail page: snapshots tab + run-now button
- [ ] **P1-26** (M) Live job log viewer (WS-driven, auto-scroll, cancel button)
- [ ] **P1-27** (S) "Add host" flow: generate token, copy install command snippet
- [ ] **P1-28** (S) Tailwind build via `tailwindcss` standalone binary (no Node)
### Install scripts
- [ ] **P1-29** (M) `install.sh` (Linux): detects arch, downloads agent, installs systemd unit, enrolls. Also detects existing restic timers/cron (`systemctl list-timers --all | grep -i restic`, `crontab -l`, `/etc/cron.d/`, `/etc/cron.daily/`) and prints them with the disable commands — does **not** auto-disable, since heuristic matches could be unrelated tooling
- [ ] **P1-31** (S) Server endpoint to serve agent binaries + install scripts (signed)
### Phase 1 acceptance
- One Linux host can enroll, appear in the dashboard, and a backup can be triggered from the UI with live log streaming. Snapshots list updates after success.
- Windows binary builds cleanly in CI (`.gitea/workflows/ci.yml`) but is not service-tested or installer-shipped in Phase 1 — that lands in Phase 2 (P2-16, P2-17).
- Agent ↔ server `protocol_version` handshake rejects mismatched versions with a clear error rather than failing on JSON parse.
---
## Phase 2 — Scheduling, retention, repo operations
- [ ] **P2-01** (M) Schedule schema + CRUD API
- [ ] **P2-02** (L) Server-pushed schedule reconciliation (server is source of truth; agent applies)
- [ ] **P2-03** (M) Agent local scheduler (`robfig/cron/v3`); persists next-fire times across restarts
- [ ] **P2-04** (M) Schedule editor UI (paths, excludes, tags, cron, retention)
- [ ] **P2-05** (M) `forget` command with retention policy (keep-last/daily/weekly/monthly/yearly)
- [ ] **P2-06** (M) `prune` command (admin-only, uses non-append-only credential)
- [ ] **P2-07** (S) `check` command (random subset + `--read-data-subset`)
- [ ] **P2-08** (S) `unlock` command
- [ ] **P2-09** (M) Repo stats panel: size, dedup ratio, snapshot count, last check time, lock state
- [ ] **P2-10** (S) Run-now buttons for forget/prune/check/unlock on host detail page
- [ ] **P2-11** (S) Schedule "next run" / "last run" surfaced on host card
- [ ] **P2-12** (S) Bandwidth limit fields on schedule editor (`--limit-upload`, `--limit-download`); also overridable on run-now jobs
- [ ] **P2-13** (M) Pre/post backup hooks: schema (`Schedule.pre_hook`, `Schedule.post_hook`, `Host.pre_hook_default`, `Host.post_hook_default`), encrypted at rest, admin-only edit, audit-logged
- [ ] **P2-14** (M) Agent execution of hooks: configurable shell per host, `pre_hook` failure aborts backup, `post_hook` always runs with `RM_JOB_STATUS` env var, stdout/stderr captured into `JobLog` with prefix
- [ ] **P2-15** (S) Hook editor UI on schedule + host pages, with sensible warnings (e.g. "this hook runs as the agent service user"); validation enforces hooks only on `kind = backup` schedules (see spec.md §14.3)
- [ ] **P2-16** (M) Windows service integration: agent runs under the Service Control Manager via `golang.org/x/sys/windows/svc`; install/uninstall/start/stop wired up
- [ ] **P2-17** (M) `install.ps1` (Windows): downloads agent, installs as service, enrolls; detects existing scheduled tasks named `*restic*` and prints them for manual review
### Phase 2 acceptance
- Schedules created in UI run on agents on time; retention is applied; admin can prune from UI; repo health visible per host. Pre/post hooks fire correctly (verified with a Docker stop/start example and a `mysqldump` example) and are rejected on non-backup schedule kinds. Bandwidth limits honoured.
- A Windows host can enroll, appear in the dashboard, and run a backup with live log streaming — closing the cross-platform gap left by Phase 1.
---
## Phase 3 — Restore, alerts, audit
- [ ] **P3-01** (L) Restore wizard backend: snapshot tree browse via `restic ls --json`, path picker, target selection
- [ ] **P3-02** (L) Restore wizard UI (multi-step: host → snapshot → paths → target → confirm)
- [ ] **P3-03** (M) Restore execution: `restic restore` invocation, progress streaming
- [ ] **P3-04** (L) Cross-host restore: target agent receives a temporary scoped read credential for source host's repo (single-job, auto-revoked); UI supports source→target path remapping; warns when source paths need root and target service user is non-root
- [ ] **P3-05** (M) Alert engine: rule evaluation loop (failed backup, stale schedule, agent offline, check failed)
- [ ] **P3-06** (M) Notification channels: webhook, ntfy, SMTP email
- [ ] **P3-07** (S) Alert UI: list, acknowledge, resolve
- [ ] **P3-08** (S) Audit log UI with filters (user, action, target, time range)
- [ ] **P3-09** (S) `diff` between two snapshots in UI
### Phase 3 acceptance
- A file deleted on a host can be restored from the UI in under 2 minutes. A failed backup raises an alert via the configured channel within 60s.
---
## Phase 4 — Update delivery, RBAC polish, OIDC
- [ ] **P4-01** (M) Update delivery via OS package managers: host an apt repo (Linux) and a Chocolatey package (Windows) on Gitea releases. `restic-manager-agent update` is a thin wrapper over `apt-get install --only-upgrade restic-manager-agent` / `choco upgrade`. Trades flexibility for a much smaller security surface than bespoke signed binaries (see spec.md §4.2)
- [ ] **P4-02** (M) Agent version reporting on dashboard: surface "agent N versions behind server"; "update all" admin action calls the package-manager wrapper on each host
- [ ] **P4-03** (M) RBAC enforcement at API layer (admin / operator / viewer)
- [ ] **P4-04** (S) User management UI (create/edit/disable, role assignment, password reset)
- [ ] **P4-05** (L) OIDC login (generic provider config, group → role mapping)
- [ ] **P4-06** (M) Repo size trend graphs (sparkline on host card, full chart on repo page)
- [ ] **P4-07** (S) Per-host tags + dashboard filtering by tag
- [ ] **P4-08** (M) Prometheus `/metrics` endpoint: per-host gauges (last backup timestamp, last backup status, repo size, snapshot count, agent online), server gauges (active alerts, build info), job duration histograms; protected by bearer token or IP allow-list
- [ ] **P4-09** (S) Document Prometheus integration + sample Grafana dashboard JSON
### Phase 4 acceptance
- Non-admin users see an appropriately limited UI. Agents upgrade via apt/choco with one admin-triggered action. OIDC login works against at least one provider (Authelia or Authentik). Prometheus can scrape `/metrics` and the sample Grafana dashboard renders with live data.
---
## Phase 5 — OSS readiness
- [ ] **P5-01** (M) Documentation site (mdBook or similar) with install, concepts, security model, screenshots
- [ ] **P5-02** (S) `CONTRIBUTING.md`, `CODE_OF_CONDUCT.md`, issue + PR templates
- [ ] **P5-03** (S) Release automation: `goreleaser` for binaries + Docker image to GHCR
- [ ] **P5-04** (S) Demo screenshots / short Loom walkthrough in README
- [ ] **P5-05** (S) `SECURITY.md` with disclosure process
- [ ] **P5-06** (M) End-to-end test suite in CI (Playwright vs. compose stack with sibling Linux agent)
- [ ] **P5-07** (S) Sample `docker-compose.yml` with TLS via Caddy sidecar (also demonstrates `RM_TRUSTED_PROXY`)
### Phase 5 acceptance
- A stranger can read the docs and stand up a working install in under 30 minutes.
---
## Cross-cutting / ongoing
- [ ] **X-01** Keep CHANGELOG.md updated (Keep-a-Changelog format)
- [ ] **X-02** Track restic version compatibility matrix
- [ ] **X-03** Periodic dependency updates (`dependabot` or `renovate`)
- [ ] **X-04** Threat-model review at end of each phase