8d5282a180
Agent calls restic snapshots --json after each successful backup
(60s timeout, separate from the backup ctx) and ships the projection
over the existing snapshots.report WS envelope. Failure here is
logged but doesn't fail the job — the next successful backup catches
the projection up.
Server-side ReplaceHostSnapshots is delete-then-insert plus a
hosts.snapshot_count update in one transaction so the dashboard's
per-host count stays consistent with the projection. New read
endpoint GET /api/hosts/{id}/snapshots returns the cached list with
a refreshed_at marker so the UI can show staleness when an agent
has been offline.
Schema: dropped the unused snapshots.repo_id FK (repos as a
first-class entity is P2 work), added short_id and refreshed_at
columns, switched the time index to DESC for the most-recent-first
list query. api.Snapshot gains short_id; size_bytes/file_count come
from the embedded summary block on restic 0.16+ and stay zero on
older clients.
Tests cover round-trip, authoritative replacement after forget+prune
shrinkage, and empty-after-wipe.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
restic-manager — Tasks
Tasks are grouped by phase. Each task has an ID for cross-referencing, an estimated size (S/M/L), and acceptance criteria.
Sizes: S = under a day, M = 1–3 days, L = 3–7 days.
Phase 0 — Project bootstrap
- P0-01 (S) Initialize Go module, `cmd/server`, `cmd/agent`, baseline `internal/` packages
- P0-02 (S) Add LICENSE (PolyForm Noncommercial 1.0.0), README stub, CONTRIBUTING placeholder
- P0-03 (S) Set up `golangci-lint`, `gofumpt`, `goimports`; pre-commit config
- P0-04 (S) ~~GitHub Actions~~ Gitea Actions: build matrix (linux amd64/arm64, windows amd64), unit tests, lint
- P0-05 (S) `Dockerfile.server` (multi-stage, distroless), `deploy/docker-compose.yml`
- P0-06 (S) Makefile / `taskfile.yml` with common targets (`build`, `test`, `run`, `release`)
Phase 1 — MVP: enrollment, visibility, on-demand backup
Server foundations
- P1-01 (M) HTTP server scaffolding (`chi`, structured logging via `slog`, graceful shutdown)
- P1-02 (M) SQLite store layer (`modernc.org/sqlite`) + migrations (hand-rolled, `embed.FS`)
- P1-03 (M) Schema for `users`, `sessions`, `hosts`, `repos`, `credentials`, `jobs`, `job_logs`, `snapshots`, `audit_log`
- [~] P1-04 (M) Auth: argon2id password hashing, login/logout, session cookies; CSRF middleware deferred to P1-23 (UI work); REST clients use bearer/session-only flows
- P1-05 (S) First-run admin bootstrap (printed one-time setup token in server logs)
- P1-06 (M) Secret encryption helper (AEAD with key from `RM_SECRET_KEY_FILE`)
- [~] P1-07 (M) Audit log writer; middleware sweep for every state-changing endpoint lands when the rest of the API surface does; login / bootstrap / host.enrolled / job.run_now are currently audited
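
The secret encryption helper in P1-06 could look roughly like the sketch below. The task only says "AEAD", so the choice of AES-256-GCM, the nonce-prefixed blob layout, and the names `newAEAD`, `seal`, and `open` are assumptions made for illustration; loading the key from the file named by `RM_SECRET_KEY_FILE` is left out.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// newAEAD builds AES-256-GCM from a 32-byte key. The cipher choice is an
// assumption; the task list only specifies "AEAD".
func newAEAD(key []byte) (cipher.AEAD, error) {
	if len(key) != 32 {
		return nil, errors.New("secret key must be exactly 32 bytes")
	}
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// seal encrypts plaintext and returns nonce||ciphertext||tag, so the random
// nonce is stored alongside the data it protects.
func seal(key, plaintext []byte) ([]byte, error) {
	aead, err := newAEAD(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, aead.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return aead.Seal(nonce, nonce, plaintext, nil), nil
}

// open reverses seal and fails if the blob was truncated or tampered with.
func open(key, blob []byte) ([]byte, error) {
	aead, err := newAEAD(key)
	if err != nil {
		return nil, err
	}
	if len(blob) < aead.NonceSize() {
		return nil, errors.New("ciphertext too short")
	}
	nonce, ct := blob[:aead.NonceSize()], blob[aead.NonceSize():]
	return aead.Open(nil, nonce, ct, nil)
}

func main() {
	key := make([]byte, 32) // stand-in for the contents of the key file
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	blob, err := seal(key, []byte("repo-password"))
	if err != nil {
		panic(err)
	}
	plain, err := open(key, blob)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d-byte blob decrypts to %q\n", len(blob), plain)
}
```

Prefixing the nonce keeps each stored secret self-describing, which suits a single encrypted column per credential.
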
Agent ↔ server protocol
- P1-08 (M) Define shared API types in `internal/api` (envelopes, every WS message + `protocol_version` constants; JSON-shape tests pin the wire)
- P1-09 (L) WebSocket transport (`github.com/coder/websocket`), framed JSON envelopes, RPC correlation IDs, exponential-backoff reconnect on the agent side
- P1-10 (M) Enrollment flow: `POST /api/agents/enroll` with one-time token → returns persistent bearer. Cert pin field stays in the response shape but is left empty: the server is HTTP-only behind a reverse proxy, so the operator pastes the proxy's cert hash into the install command rather than having the server introspect a cert it doesn't terminate.
- P1-11 (M) Agent registration on connect (`hello` upserts agent_version/restic_version/protocol_version, flips status online, `protocol_too_old` rejection has clean error envelope)
- P1-12 (S) Heartbeat handler (touches `last_seen_at`; background sweeper marks hosts offline after 90s without one)
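
To make the envelope and `hello` handshake tasks above concrete, here is a minimal sketch of a framed JSON envelope with a correlation ID and a `protocol_too_old` rejection. The field names (`type`, `id`, `payload`), the `hello.ok` reply type, the `ErrorPayload` shape, and the value of `minProtocolVersion` are all hypothetical; only `hello`, `protocol_too_old`, and the three version fields come from the task list.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// minProtocolVersion is a hypothetical value chosen for illustration.
const minProtocolVersion = 2

// Envelope frames every message; ID correlates an RPC reply with its request.
type Envelope struct {
	Type    string          `json:"type"`
	ID      string          `json:"id,omitempty"`
	Payload json.RawMessage `json:"payload,omitempty"`
}

// Hello carries the versions the agent reports on connect.
type Hello struct {
	AgentVersion    string `json:"agent_version"`
	ResticVersion   string `json:"restic_version"`
	ProtocolVersion int    `json:"protocol_version"`
}

// ErrorPayload is an assumed shape for the "clean error envelope".
type ErrorPayload struct {
	Code    string `json:"code"`
	Message string `json:"message"`
}

// reply wraps a payload in an envelope that echoes the request's ID.
func reply(id, typ string, payload any) (Envelope, error) {
	raw, err := json.Marshal(payload)
	if err != nil {
		return Envelope{}, err
	}
	return Envelope{Type: typ, ID: id, Payload: raw}, nil
}

// handleHello decodes one frame and answers with either hello.ok or an
// error envelope, so an old agent sees a clear code instead of a parse failure.
func handleHello(frame []byte) (Envelope, error) {
	var env Envelope
	if err := json.Unmarshal(frame, &env); err != nil {
		return Envelope{}, fmt.Errorf("decode envelope: %w", err)
	}
	if env.Type != "hello" {
		return Envelope{}, fmt.Errorf("unexpected message type %q", env.Type)
	}
	var h Hello
	if err := json.Unmarshal(env.Payload, &h); err != nil {
		return Envelope{}, fmt.Errorf("decode hello: %w", err)
	}
	if h.ProtocolVersion < minProtocolVersion {
		return reply(env.ID, "error", ErrorPayload{
			Code:    "protocol_too_old",
			Message: fmt.Sprintf("agent speaks v%d, server requires v%d or newer", h.ProtocolVersion, minProtocolVersion),
		})
	}
	return reply(env.ID, "hello.ok", struct{}{})
}

// errorCode extracts the code from an error envelope, or "" otherwise.
func errorCode(env Envelope) string {
	if env.Type != "error" {
		return ""
	}
	var p ErrorPayload
	if err := json.Unmarshal(env.Payload, &p); err != nil {
		return ""
	}
	return p.Code
}

func main() {
	frame := []byte(`{"type":"hello","id":"c1","payload":{"agent_version":"0.1.0","restic_version":"0.16.4","protocol_version":1}}`)
	env, err := handleHello(frame)
	if err != nil {
		panic(err)
	}
	out, _ := json.Marshal(env)
	fmt.Println(string(out))
}
```
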
Agent foundations
- P1-13 (M) Agent config file (`/etc/restic-manager/agent.yaml`, atomic save: tmp+fsync+rename); Windows path deferred to Phase 2
- P1-14 (M) Service integration: systemd unit (sandboxed: NoNewPrivileges, Protect*, MemoryDenyWriteExecute); Linux only in Phase 1
- P1-15 (M) Outbound WS client (`github.com/coder/websocket`) with reconnect (1s → 60s exponential + jitter), optional cert pin (sha-256 of leaf), heartbeats, `protocol_version` in hello
- P1-16 (M) Restic wrapper: locate via PATH or override, run with `--json`, scan stdout/stderr, parse `BackupStatus` + `BackupSummary`, exit-code 3 treated as success-with-issues
- P1-17 (S) Host metadata collection (OS, arch, hostname, restic version, agent version, protocol_version)
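
The reconnect schedule in P1-15 (1s → 60s exponential + jitter) can be sketched as a pure function. The jitter strategy is not specified in the task, so this example assumes "equal jitter": half of the capped delay is fixed and the other half is scaled by a random factor. The names `backoff`, `baseDelay`, and `maxDelay` are illustrative.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

const (
	baseDelay = time.Second      // first reconnect delay before jitter
	maxDelay  = 60 * time.Second // cap from the task description
)

// backoff returns the wait before reconnect attempt n (0-based).
// jitter must be in [0, 1); callers normally pass rand.Float64().
func backoff(attempt int, jitter float64) time.Duration {
	if attempt < 0 {
		attempt = 0
	}
	d := maxDelay
	// 1s << 6 is 64s, already past the cap, so only shift for attempts 0..5.
	// This also keeps large attempt counts from overflowing the shift.
	if attempt < 6 {
		d = baseDelay << uint(attempt)
	}
	half := d / 2
	return half + time.Duration(jitter*float64(half))
}

func main() {
	for attempt := 0; attempt < 8; attempt++ {
		fmt.Printf("attempt %d: wait %v\n", attempt, backoff(attempt, rand.Float64()))
	}
}
```

Taking the jitter as a parameter keeps the function deterministic under test; the agent would reset the attempt counter after a successful connection.
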
Run-now backup
- P1-18 (L) Job lifecycle: queued → running → succeeded/failed/cancelled, persisted with stats blob and error
- P1-19 (M) Server endpoint `POST /api/hosts/{id}/jobs` to dispatch a `backup` command (validates kind, checks online, audit-logs)
- P1-20 (M) Agent executes `restic backup`, streams stdout/stderr + parsed JSON events back as `job.progress` (1Hz throttle) / `log.stream`
- [~] P1-21 (M) Server persists log stream to `job_logs` ✓; WS `/api/jobs/{id}/stream` for live browser tailing still TODO; needs the per-job fan-out hub
- P1-22 (S) Snapshot listing: agent calls `restic snapshots --json` after each successful backup and ships the projection over `snapshots.report`. Server `ReplaceHostSnapshots` atomically swaps the per-host list and updates `hosts.snapshot_count` in the same tx. Read endpoint: `GET /api/hosts/{id}/snapshots`. Tests cover round-trip, authoritative-replace semantics, and empty-after-prune. Schema dropped the unused `repo_id` FK from `snapshots` (repos as a first-class entity is P2 work).
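
As an illustration of the agent-side half of P1-22, the sketch below turns `restic snapshots --json` output into a projection with `short_id`, `size_bytes`, and `file_count`. The Go type names, the `project` function, and the fallback that derives a short ID from the first eight characters of the full ID are assumptions; the `summary` field names follow restic's JSON output as best understood and should be checked against the restic version in use. When the `summary` block is absent, the size and file count stay zero.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// resticSnapshot mirrors the subset of restic's JSON that this sketch reads.
type resticSnapshot struct {
	ID      string    `json:"id"`
	ShortID string    `json:"short_id"`
	Time    time.Time `json:"time"`
	Paths   []string  `json:"paths"`
	// Summary is only emitted by newer restic clients, hence the pointer.
	Summary *struct {
		TotalBytesProcessed int64 `json:"total_bytes_processed"`
		TotalFilesProcessed int64 `json:"total_files_processed"`
	} `json:"summary"`
}

// Snapshot is a hypothetical stand-in for the wire type shipped to the server.
type Snapshot struct {
	ID        string    `json:"id"`
	ShortID   string    `json:"short_id"`
	Time      time.Time `json:"time"`
	Paths     []string  `json:"paths"`
	SizeBytes int64     `json:"size_bytes"`
	FileCount int64     `json:"file_count"`
}

// project converts raw restic output into the list the agent would report.
func project(raw []byte) ([]Snapshot, error) {
	var in []resticSnapshot
	if err := json.Unmarshal(raw, &in); err != nil {
		return nil, fmt.Errorf("parse restic snapshots output: %w", err)
	}
	out := make([]Snapshot, 0, len(in))
	for _, s := range in {
		p := Snapshot{ID: s.ID, ShortID: s.ShortID, Time: s.Time, Paths: s.Paths}
		if p.ShortID == "" && len(s.ID) >= 8 {
			p.ShortID = s.ID[:8] // assumed fallback, not taken from the source
		}
		if s.Summary != nil {
			p.SizeBytes = s.Summary.TotalBytesProcessed
			p.FileCount = s.Summary.TotalFilesProcessed
		}
		out = append(out, p)
	}
	return out, nil
}

func main() {
	raw := []byte(`[{"id":"aaaaaaaabbbb","short_id":"aaaaaaaa","time":"2024-05-01T12:00:00Z","paths":["/etc"],"summary":{"total_bytes_processed":2048,"total_files_processed":3}}]`)
	snaps, err := project(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", snaps)
}
```
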
UI (HTMX + Tailwind)
- P1-23 (M) Base layout, login page, session-aware nav
- P1-24 (M) Dashboard: host cards (status dot, last backup, repo size)
- P1-25 (M) Host detail page: snapshots tab + run-now button
- P1-26 (M) Live job log viewer (WS-driven, auto-scroll, cancel button)
- P1-27 (S) "Add host" flow: generate token, copy install command snippet
- P1-28 (S) Tailwind build via `tailwindcss` standalone binary (no Node)
Install scripts
- P1-29 (M) `install.sh` (Linux): detects arch, downloads agent, installs systemd unit, enrolls. Detects existing restic timers / `/etc/cron.{d,daily,hourly,weekly}/*` / root crontab and prints them with the exact disable commands; does not auto-disable
- [~] P1-31 (S) Server endpoint to serve agent binaries + install scripts ✓ (`/agent/binary` + `/install/*`); signature verification deferred to Phase 5 OSS readiness
Phase 1 acceptance
- One Linux host can enroll, appear in the dashboard, and a backup can be triggered from the UI with live log streaming. Snapshots list updates after success.
- Windows binary builds cleanly in CI (`.gitea/workflows/ci.yml`) but is not service-tested or installer-shipped in Phase 1; that lands in Phase 2 (P2-16, P2-17).
- Agent ↔ server `protocol_version` handshake rejects mismatched versions with a clear error rather than failing on JSON parse.
Phase 2 — Scheduling, retention, repo operations
- P2-01 (M) Schedule schema + CRUD API
- P2-02 (L) Server-pushed schedule reconciliation (server is source of truth; agent applies)
- P2-03 (M) Agent local scheduler (`robfig/cron/v3`); persists next-fire times across restarts
- P2-04 (M) Schedule editor UI (paths, excludes, tags, cron, retention)
- P2-05 (M) `forget` command with retention policy (keep-last/daily/weekly/monthly/yearly)
- P2-06 (M) `prune` command (admin-only, uses non-append-only credential)
- P2-07 (S) `check` command (random subset + `--read-data-subset`)
- P2-08 (S) `unlock` command
- P2-09 (M) Repo stats panel: size, dedup ratio, snapshot count, last check time, lock state
- P2-10 (S) Run-now buttons for forget/prune/check/unlock on host detail page
- P2-11 (S) Schedule "next run" / "last run" surfaced on host card
- P2-12 (S) Bandwidth limit fields on schedule editor (`--limit-upload`, `--limit-download`); also overridable on run-now jobs
- P2-13 (M) Pre/post backup hooks: schema (`Schedule.pre_hook`, `Schedule.post_hook`, `Host.pre_hook_default`, `Host.post_hook_default`), encrypted at rest, admin-only edit, audit-logged
- P2-14 (M) Agent execution of hooks: configurable shell per host, `pre_hook` failure aborts backup, `post_hook` always runs with `RM_JOB_STATUS` env var, stdout/stderr captured into `JobLog` with prefix
- P2-15 (S) Hook editor UI on schedule + host pages, with sensible warnings (e.g. "this hook runs as the agent service user"); validation enforces hooks only on `kind = backup` schedules (see spec.md §14.3)
- P2-16 (M) Windows service integration: agent runs under the Service Control Manager via `golang.org/x/sys/windows/svc`; install/uninstall/start/stop wired up
- P2-17 (M) `install.ps1` (Windows): downloads agent, installs as service, enrolls; detects existing scheduled tasks named `*restic*` and prints them for manual review
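
The hook ordering described in P2-14 can be sketched with injected step functions instead of real shell commands. The `runner` type, `runJob`, and the status strings `succeeded` / `failed` passed through `RM_JOB_STATUS` are assumptions for illustration, as is the choice to report a `post_hook` error without changing the job status.

```go
package main

import (
	"errors"
	"fmt"
)

// runner executes one step (a hook or the backup itself) with extra
// environment variables. A real agent would wrap os/exec here.
type runner func(env []string) error

// errHook is a sample failure used by the demo below.
var errHook = errors.New("hook exited non-zero")

// runJob applies the ordering from the task: a failing pre_hook aborts the
// backup, and post_hook always runs and sees the outcome in RM_JOB_STATUS.
func runJob(pre, backup, post runner) (string, error) {
	status := "succeeded"
	var jobErr error
	if pre != nil {
		if err := pre(nil); err != nil {
			jobErr = fmt.Errorf("pre_hook: %w", err)
		}
	}
	if jobErr == nil {
		if err := backup(nil); err != nil {
			jobErr = fmt.Errorf("backup: %w", err)
		}
	}
	if jobErr != nil {
		status = "failed"
	}
	if post != nil {
		// Assumption: a post_hook failure is reported but does not flip the
		// status of a backup that already succeeded.
		if err := post([]string{"RM_JOB_STATUS=" + status}); err != nil && jobErr == nil {
			jobErr = fmt.Errorf("post_hook: %w", err)
		}
	}
	return status, jobErr
}

func main() {
	step := func(name string, fail bool) runner {
		return func(env []string) error {
			fmt.Println("run", name, env)
			if fail {
				return errHook
			}
			return nil
		}
	}
	status, err := runJob(step("pre_hook", true), step("backup", false), step("post_hook", false))
	fmt.Println(status, err)
}
```
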
Phase 2 acceptance
- Schedules created in UI run on agents on time; retention is applied; admin can prune from UI; repo health visible per host. Pre/post hooks fire correctly (verified with a Docker stop/start example and a `mysqldump` example) and are rejected on non-backup schedule kinds. Bandwidth limits honoured.
- A Windows host can enroll, appear in the dashboard, and run a backup with live log streaming, closing the cross-platform gap left by Phase 1.
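
One way to check that retention and bandwidth limits reach restic is to build the argument list in a pure function and assert on it. The sketch below maps a retention policy and limits onto restic's `forget` flags; the `Retention` and `Limits` types and the function names are hypothetical, and zero values are treated as "not set".

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Retention mirrors the keep-last/daily/weekly/monthly/yearly policy.
type Retention struct {
	KeepLast, KeepDaily, KeepWeekly, KeepMonthly, KeepYearly int
}

// Limits holds bandwidth caps in KiB/s, the unit restic's limit flags use.
type Limits struct {
	UploadKiB, DownloadKiB int
}

// forgetArgs builds the restic argument list; zero fields emit no flag.
func forgetArgs(r Retention, l Limits) []string {
	args := []string{"forget", "--json"}
	add := func(flag string, n int) {
		if n > 0 {
			args = append(args, flag, strconv.Itoa(n))
		}
	}
	add("--keep-last", r.KeepLast)
	add("--keep-daily", r.KeepDaily)
	add("--keep-weekly", r.KeepWeekly)
	add("--keep-monthly", r.KeepMonthly)
	add("--keep-yearly", r.KeepYearly)
	add("--limit-upload", l.UploadKiB)
	add("--limit-download", l.DownloadKiB)
	return args
}

// commandLine renders the arguments for logs and tests.
func commandLine(args []string) string {
	return "restic " + strings.Join(args, " ")
}

func main() {
	args := forgetArgs(Retention{KeepLast: 3, KeepDaily: 7, KeepWeekly: 4}, Limits{UploadKiB: 1024})
	fmt.Println(commandLine(args))
}
```
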
Phase 3 — Restore, alerts, audit
- P3-01 (L) Restore wizard backend: snapshot tree browse via `restic ls --json`, path picker, target selection
- P3-02 (L) Restore wizard UI (multi-step: host → snapshot → paths → target → confirm)
- P3-03 (M) Restore execution: `restic restore` invocation, progress streaming
- P3-04 (L) Cross-host restore: target agent receives a temporary scoped read credential for source host's repo (single-job, auto-revoked); UI supports source→target path remapping; warns when source paths need root and target service user is non-root
- P3-05 (M) Alert engine: rule evaluation loop (failed backup, stale schedule, agent offline, check failed)
- P3-06 (M) Notification channels: webhook, ntfy, SMTP email
- P3-07 (S) Alert UI: list, acknowledge, resolve
- P3-08 (S) Audit log UI with filters (user, action, target, time range)
- P3-09 (S) `diff` between two snapshots in UI
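
For the snapshot tree browse in P3-01, the agent would need to read the line-delimited JSON that `restic ls --json` prints: one snapshot header line followed by one line per node. The sketch below keeps only the node lines and reduces them to rows a path picker could use. The `Entry` type and function names are hypothetical, and the JSON field names (`struct_type`, `path`, `type`, `size`) reflect restic's output as best understood, so verify them against the restic version in use.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// node is the subset of a restic ls JSON line this sketch reads.
type node struct {
	StructType string `json:"struct_type"`
	Path       string `json:"path"`
	Type       string `json:"type"`
	Size       uint64 `json:"size"`
}

// Entry is a hypothetical row for the restore wizard's path picker.
type Entry struct {
	Path  string
	IsDir bool
	Size  uint64
}

// parseLs reads line-delimited JSON and returns one Entry per node line,
// skipping the snapshot header and blank lines.
func parseLs(r io.Reader) ([]Entry, error) {
	var out []Entry
	sc := bufio.NewScanner(r)
	sc.Buffer(make([]byte, 0, 64*1024), 1024*1024) // allow long path lines
	for sc.Scan() {
		line := sc.Bytes()
		if len(line) == 0 {
			continue
		}
		var n node
		if err := json.Unmarshal(line, &n); err != nil {
			return nil, fmt.Errorf("parse ls line: %w", err)
		}
		if n.StructType != "node" {
			continue
		}
		out = append(out, Entry{Path: n.Path, IsDir: n.Type == "dir", Size: n.Size})
	}
	return out, sc.Err()
}

// parseLsOutput is a convenience wrapper for already-buffered output.
func parseLsOutput(s string) ([]Entry, error) {
	return parseLs(strings.NewReader(s))
}

func main() {
	out := `{"struct_type":"snapshot","id":"abc"}` + "\n" +
		`{"struct_type":"node","name":"etc","type":"dir","path":"/etc"}` + "\n" +
		`{"struct_type":"node","name":"hosts","type":"file","path":"/etc/hosts","size":220}` + "\n"
	entries, err := parseLsOutput(out)
	if err != nil {
		panic(err)
	}
	for _, e := range entries {
		fmt.Printf("%-12s dir=%-5v size=%d\n", e.Path, e.IsDir, e.Size)
	}
}
```

Reading from an `io.Reader` lets the same parser consume the restic process's stdout pipe directly, without buffering a large tree in memory first.
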
Phase 3 acceptance
- A file deleted on a host can be restored from the UI in under 2 minutes. A failed backup raises an alert via the configured channel within 60s.
Phase 4 — Update delivery, RBAC polish, OIDC
- P4-01 (M) Update delivery via OS package managers: host an apt repo (Linux) and Chocolatey package (Windows) on Gitea releases. `restic-manager-agent update` is a thin wrapper over `apt-get install --only-upgrade restic-manager-agent` / `choco upgrade`. Trades flexibility for a much smaller security surface than bespoke signed binaries (see spec.md §4.2)
- P4-02 (M) Agent version reporting on dashboard: surface "agent N versions behind server"; "update all" admin action calls the package-manager wrapper on each host
- P4-03 (M) RBAC enforcement at API layer (admin / operator / viewer)
- P4-04 (S) User management UI (create/edit/disable, role assignment, password reset)
- P4-05 (L) OIDC login (generic provider config, group → role mapping)
- P4-06 (M) Repo size trend graphs (sparkline on host card, full chart on repo page)
- P4-07 (S) Per-host tags + dashboard filtering by tag
- P4-08 (M) Prometheus `/metrics` endpoint: per-host gauges (last backup timestamp, last backup status, repo size, snapshot count, agent online), server gauges (active alerts, build info), job duration histograms; protected by bearer token or IP allow-list
- P4-09 (S) Document Prometheus integration + sample Grafana dashboard JSON
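
A bearer-protected `/metrics` handler along the lines of P4-08 might look like the sketch below. It writes two of the per-host gauges in the Prometheus text exposition format using only the standard library. The metric names, the `HostMetrics` type, and the helper functions are hypothetical, and the IP allow-list alternative, `# HELP` / `# TYPE` lines, and histograms are omitted.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// HostMetrics is a hypothetical snapshot of per-host state.
type HostMetrics struct {
	Name          string
	Online        bool
	SnapshotCount int
}

// labelEscaper applies the escaping the text format requires in label values.
var labelEscaper = strings.NewReplacer(`\`, `\\`, `"`, `\"`, "\n", `\n`)

// metricsHandler serves gauges only to callers presenting the bearer token.
func metricsHandler(token string, hosts func() []HostMetrics) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		const prefix = "Bearer "
		auth := r.Header.Get("Authorization")
		ok := token != "" && strings.HasPrefix(auth, prefix) &&
			subtle.ConstantTimeCompare([]byte(auth[len(prefix):]), []byte(token)) == 1
		if !ok {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		w.Header().Set("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
		list := hosts()
		// Emit each metric as one contiguous group, as the format expects.
		for _, h := range list {
			online := 0
			if h.Online {
				online = 1
			}
			fmt.Fprintf(w, "restic_manager_agent_online{host=\"%s\"} %d\n", labelEscaper.Replace(h.Name), online)
		}
		for _, h := range list {
			fmt.Fprintf(w, "restic_manager_snapshot_count{host=\"%s\"} %d\n", labelEscaper.Replace(h.Name), h.SnapshotCount)
		}
	})
}

// sampleHosts returns fixed demo data.
func sampleHosts() []HostMetrics {
	return []HostMetrics{{Name: "web-1", Online: true, SnapshotCount: 12}}
}

// scrape performs one in-process request and returns status code and body.
func scrape(h http.Handler, authHeader string) (int, string) {
	req := httptest.NewRequest(http.MethodGet, "/metrics", nil)
	if authHeader != "" {
		req.Header.Set("Authorization", authHeader)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code, rec.Body.String()
}

func main() {
	h := metricsHandler("s3cret", sampleHosts)
	code, body := scrape(h, "Bearer s3cret")
	fmt.Println(code)
	fmt.Print(body)
}
```
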
Phase 4 acceptance
- Non-admin users see an appropriately limited UI. Agents upgrade via apt/choco with one admin-triggered action. OIDC login works against at least one provider (Authelia or Authentik). Prometheus can scrape `/metrics` and the sample Grafana dashboard renders with live data.
Phase 5 — OSS readiness
- P5-01 (M) Documentation site (mdBook or similar) with install, concepts, security model, screenshots
- P5-02 (S) `CONTRIBUTING.md`, `CODE_OF_CONDUCT.md`, issue + PR templates
- P5-03 (S) Release automation: `goreleaser` for binaries + Docker image to GHCR
- P5-04 (S) Demo screenshots / short Loom walkthrough in README
- P5-05 (S) `SECURITY.md` with disclosure process
- P5-06 (M) End-to-end test suite in CI (Playwright vs. compose stack with sibling Linux agent)
- P5-07 (S) Reference deployment: `docker-compose.yml` + Caddyfile snippet showing the TLS-terminating reverse proxy in front of the HTTP-only server (also demonstrates `RM_TRUSTED_PROXY`)
Phase 5 acceptance
- A stranger can read the docs and stand up a working install in under 30 minutes.
Cross-cutting / ongoing
- X-01 Keep CHANGELOG.md updated (Keep-a-Changelog format)
- X-02 Track restic version compatibility matrix
- X-03 Periodic dependency updates (`dependabot` or `renovate`)
- X-04 Threat-model review at end of each phase