From 7612687a14f6afb8c0d87eb5834964ac7a44307f Mon Sep 17 00:00:00 2001 From: Steve Cliff Date: Thu, 30 Apr 2026 23:55:52 +0100 Subject: [PATCH] initial setup ready --- ask.md | 8 + design/wireframes.html | 721 +++++++++++++++++++++++++++++++++++++++++ spec.md | 455 ++++++++++++++++++++++++++ tasks.md | 148 +++++++++ 4 files changed, 1332 insertions(+) create mode 100644 ask.md create mode 100644 design/wireframes.html create mode 100644 spec.md create mode 100644 tasks.md diff --git a/ask.md b/ask.md new file mode 100644 index 0000000..e31a232 --- /dev/null +++ b/ask.md @@ -0,0 +1,8 @@ +# The ask! + +I have numerous servers deployed out in a lab, mainly Linux but some Windows +All have restic installed on them +I need to build a browser-based management service that allows me to have a central single-pane-of-glass to monitor and manage all the endpoints +All endpoints will be enabled for SSH (unless other methods are better?) + +Plan out how we would go about this please? \ No newline at end of file diff --git a/design/wireframes.html b/design/wireframes.html new file mode 100644 index 0000000..cb3e302 --- /dev/null +++ b/design/wireframes.html @@ -0,0 +1,721 @@ + + + + +restic-manager · Phase 0 wireframes + + + + +
+ +
+

restic-manager · Phase 0 wireframes

+

+ Low-fidelity wireframes for Phase 1/2 screens. Purpose: confirm the data each + screen needs before the API in spec.md §6.1 and the WS messages in §6.2 are + locked in. Grayscale on purpose — visual design is deferred to Phase 5 + (and a focused hi-fi pass on the restore wizard in Phase 3). +

+

+ [GET /api/...] tags mark REST data sources. + [WS: ...] tags mark WebSocket message dependencies. + Open the “Findings” section at the bottom for spec gaps. +

+
+ + + + + + +
+ Screen 1 · Dashboard (/) + +
+ + +
user: alice (admin) · logout
+
+ +
+
+ +
+
+
Fleet status
+
10 online · 1 offline · 1 degraded
+
Last sync 12s ago
+
+
+
Storage (sum across repos)
+
2.4 TB across 12 repos
+
+18 GB last 24h
+
+
+
Open alerts
+
3 · 1 critical
+
2 unacked
+
+
+ + +
+
[ search hosts · filter by tag · status ]
+
+ + Add host +
+
+ +

Hosts

+ + +
+ + +
+
+
prod-db-01 linux/amd64
+ online +
+
+
Last backup
+
2h ago · success
+
Repo
+
412 GB · 1,284 snapshots
+
Alerts
+
+
+ View + Backup now +
+
+ + +
+
+
staging-app linux/arm64
+ degraded +
+
+
Last backup
+
9h ago · failed
+
Repo
+
88 GB · 412 snapshots
+
Alerts
+
2 · 1 critical
+
+ View + Retry +
+
+ + +
+
+
laptop-bob windows/amd64
+ offline +
+
+
Last seen
+
3d ago
+
Repo
+
142 GB · 88 snapshots
+
Alerts
+
1
+
+ View + Backup now +
+
+ + +
… more host cards (12 total in target deployment)
+
+ + +

Recent activity (fleet-wide)

+
+ + + + + + + + + + + + + +
When | Host | Kind | Status | Duration
2h ago | prod-db-01 | backup | succeeded | 00:14:22 | view
3h ago | web-02 | backup | succeeded | 00:08:11 | view
9h ago | staging-app | backup | failed | 00:01:03 | view
1d ago | prod-db-01 | check | succeeded | 00:42:17 | view
1d ago | web-01 | prune | succeeded | 00:04:55 | view
+
+
+ + + +
+
+ + + + + +
+ Screen 2 · Host detail (/hosts/:id) + +
+ + +
user: alice (admin)
+
+ +
+
+ +
+
+
+
« Dashboard / Hosts
+

prod-db-01

+
linux/amd64 · agent 0.4.2 · restic 0.17.1 · last seen 12s ago
+
+ online + tag: prod + tag: db +
+
+
+
Currently: idle
+
+ Backup now + Run check + +
+
+
+
+ + + + + +
+
+
[ filter by tag · path · date range ]
+
[ sort: newest first ]
+
+
+ + + + + + + + + + + + +
Snapshot | Time | Paths | Tags | Size | Files
3a8f1e | 2h ago | /var/lib/postgres | auto, daily | 412 GB | 1.2M | restore · diff
8c7b22 | 1d ago | /var/lib/postgres | auto, daily | 411 GB | 1.2M | restore · diff
4f0a99 | 2d ago | /var/lib/postgres, /etc | auto, weekly | 411 GB | 1.2M | restore · diff
… 1,281 more · load more
+
+
+ + +
+
Other tabs (preview, not navigated):
+ +
+ + +
+
Tab · Schedules
+ + + + + + + + + +
Kind | Cron | Paths | Retention | Enabled
backup | 0 2 * * * | /var/lib/postgres | 7d/4w/12m | [x]
forget+prune | 0 4 * * 0 | | per policy | [x]
check | 0 5 1 * * | | | [ ]
+
+ New schedule
+
+ schedule editor (expanded form) +
+
kind: [backup ▾]
+
cron: [ 0 2 * * * ]   human: every day at 02:00
+
paths: [ /var/lib/postgres ] [+ add]
+
excludes: [ *.tmp, /tmp ]
+
tags: [ auto, daily ]
+
retention: keep [7] daily, [4] weekly, [12] monthly · keep-tag [ ]
+
bandwidth: upload [ ] KB/s · download [ ] KB/s   §14.2
+
pre-hook: [ pg_dump ... ]   §14.3 admin-only
+
post-hook: [ ... ]
+
enabled: [x]
+
+
+
+ + +
+
Tab · Jobs (host-scoped)
+ + + + + + + + + + +
Started | Kind | Status | Duration | By
2h ago | backup | succeeded | 00:14:22 | schedule
1d ago | check | succeeded | 00:42:17 | schedule
2d ago | backup | cancelled | 00:00:42 | alice
3d ago | backup | failed | 00:01:09 | schedule
+
+ + +
+
Tab · Repo
+
+
URL
rest:https://restic.lab…/prod-db-01
+
Kind
rest (append-only)
+
Total size
412 GB
+
Dedup ratio
4.2×
+
Snapshots
1,284
+
Last check
1d ago · clean
+
Lock state
unlocked
+
Credential
append-only · rotated 14d ago
+
+
+ Run check + Unlock + Forget+prune (admin) +
+
+ + +
+
Tab · Settings
+
+
Tags
prod, db [+ add]
+
Default pre-hook
(empty)
+
Default post-hook
(empty)
+
Hook shell
/bin/sh
+
Default bandwidth caps
none
+
+
Enrollment
+
enrolled 42d ago · Regenerate token
+
+
+
Agent
+
0.4.2 · auto-update [x] · Force update now
+
+
+
Danger zone
+
Remove host · does not touch repo data
+
+
+
+
+
+ + + +
+
+ + + + + +
+ Screen 3 · Job detail (/jobs/:id) — running state + +
+ + +
user: alice (admin)
+
+ +
+
+ + +
+
« prod-db-01 / Jobs
+
+
+

backup · prod-db-01

+
job j_01HJ8K7 · started 4m12s ago · triggered by alice
+
+ running + schedule: nightly-pg +
+
+
+ Cancel job +
+
+
+ + +
+
+
Progress
+
38% · ~6m remaining
+
+
156 GB of 412 GB · 482k of 1.2M files
+
+
+
+
Files new
2,103
+
Files changed
418
+
Bytes added
2.4 GB
+
Throughput
42 MB/s
+
+
+
+ + +
Live log (streaming via WS)
+
+14:02:11 [agent] starting restic backup --json
+14:02:11 [agent] pre_hook: pg_dump | gzip > /tmp/dump.sql.gz
+14:02:48 [pre_hook] dump complete (1.2 GB)
+14:02:49 [restic] open repository
+14:02:50 [restic] lock repository
+14:02:50 [restic] load index files
+14:02:53 [restic] start scan
+14:02:55 [restic] start backup on /var/lib/postgres
+14:03:01 [restic] {"message_type":"status","percent_done":0.04,"total_files":1234567,"files_done":48234,"total_bytes":442000000000,"bytes_done":17600000000}
+14:04:22 [restic] {"message_type":"status","percent_done":0.18,"...}
+14:05:55 [restic] {"message_type":"status","percent_done":0.31,"...}
+14:06:23 [restic] warning: failed to lstat /var/lib/postgres/pg_wal/.lock
+14:06:24 [restic] {"message_type":"status","percent_done":0.38,"...}
+
+
+
[ ] auto-scroll   [ ] show stderr only   download full log
+
+ +
+ + +
+
+ + + + + +
+

Findings — gaps in spec.md §6 surfaced by Phase 0 wireframing

+
    +
  1. Aggregate fleet endpoint missing. Dashboard summary strip and Prometheus metrics (§14.4) both need fleet rollups. Add GET /api/fleet/summary returning host counts by status, total repo bytes, open alert counts. Cheaper than client fanout and reused by /metrics.
  2. Host list response is too thin. Domain model Host (§5) has status + last_seen_at; cards need last_backup_at, last_backup_status, repo_size_bytes, snapshot_count, open_alert_count, current_job_id. Either add columns or compute server-side and include in GET /api/hosts.
  3. Job actor not modelled. Job table tracks scheduled_id but not who (user vs schedule vs system) triggered a run-now. Dashboard "Recent activity" and Jobs tab both want this. Add Job.actor_kind + Job.actor_id — cheaper than joining AuditLog every time.
  4. WS job.progress JSON shape is undefined. §6.2 lists the message name only. Lock the shape now: {percent_done: float, files_done: int, total_files: int, bytes_done: int, total_bytes: int, eta_seconds: int, throughput_bps: int}. Keeps client + agent in lockstep before Phase 1 codes against it.
  5. Repo response needs more fields. §6.1 says size/last-check/lock state. Wireframe also wants: dedup ratio, snapshot count, credential rotation timestamp, append-only flag. Most derive from restic stats + Credential row — expose them through GET /api/hosts/:id/repo.
  6. Snapshot filtering needs server support. Tag/path/date filters belong on the server (12-host fleets are small but a single host can hold thousands of snapshots). Add query params to GET /api/hosts/:id/snapshots: ?tag=, ?path=, ?since=, ?limit=. Distinct-tag list endpoint optional — could be derived client-side at first.
  7. Job listing needs query params. Recent activity, host-scoped jobs, and the Jobs page all use GET /api/jobs. Lock down: ?host_id=, ?kind=, ?status=, ?since=, ?limit=, ?order=. Pagination too.
  8. Agent self-update endpoint not in §6.1. §4.2 describes the mechanism but no REST endpoint exists. Settings tab wants a "Force update now" button — add POST /api/hosts/:id/agent/update.
  9. Schedule retention/options JSON shape. §14.2 (bandwidth) and §14.3 (hooks) both extend Schedule. Document the canonical shape now (retention_policy, options.limit_upload, options.limit_download, pre_hook, post_hook) so the schedule editor and the agent can both target it.
  10. HTMX-vs-WS responsibility split. Decision: only the Job detail screen needs WS. Dashboard, Hosts, Snapshots use HTMX polling (10s). This avoids fan-out complexity for v1; revisit if dashboard feels stale.
+
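The job.progress shape proposed in the findings above could be pinned down as a shared Go type, roughly as follows. This is only a sketch: the type name `JobProgress` and the helper `ParseJobProgress` are hypothetical, the field names come from the finding, and the range check assumes percent_done is a fraction between 0 and 1 (as the sample log lines in Screen 3 suggest). The unit of throughput_bps is not stated in the patch and is left uninterpreted here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// JobProgress mirrors the field list proposed for the job.progress message.
// The Go names are illustrative; only the JSON keys come from the findings.
type JobProgress struct {
	PercentDone   float64 `json:"percent_done"` // assumed to be a fraction in [0,1]
	FilesDone     int64   `json:"files_done"`
	TotalFiles    int64   `json:"total_files"`
	BytesDone     int64   `json:"bytes_done"`
	TotalBytes    int64   `json:"total_bytes"`
	ETASeconds    int64   `json:"eta_seconds"`
	ThroughputBps int64   `json:"throughput_bps"` // unit not specified in the patch
}

// ParseJobProgress decodes one message and rejects an out-of-range percentage.
func ParseJobProgress(raw []byte) (JobProgress, error) {
	var p JobProgress
	if err := json.Unmarshal(raw, &p); err != nil {
		return JobProgress{}, fmt.Errorf("decode job.progress: %w", err)
	}
	if p.PercentDone < 0 || p.PercentDone > 1 {
		return JobProgress{}, fmt.Errorf("percent_done %v outside [0,1]", p.PercentDone)
	}
	return p, nil
}

func main() {
	// Sample values are made up for illustration.
	raw := []byte(`{"percent_done":0.38,"files_done":482000,"total_files":1200000,
		"bytes_done":156000000000,"total_bytes":412000000000,
		"eta_seconds":360,"throughput_bps":42000000}`)
	p, err := ParseJobProgress(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%.0f%% done, %d of %d files\n", p.PercentDone*100, p.FilesDone, p.TotalFiles)
}
```

Keeping one struct with JSON tags in a shared package would let the agent, the server, and the browser-facing stream agree on the same keys, which is the lockstep the finding asks for.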
+ +
+ + diff --git a/spec.md b/spec.md new file mode 100644 index 0000000..865a06a --- /dev/null +++ b/spec.md @@ -0,0 +1,455 @@ +# restic-manager — Specification + +## 1. Overview + +**restic-manager** is a self-hosted, browser-based, single-pane-of-glass for managing [restic](https://restic.net) backups across a fleet of Linux and Windows endpoints. It provides visibility, scheduling, ad-hoc operations, restore workflows, and alerting from one UI. + +It is built for small-to-medium fleets (initial target: ~12 endpoints) and is intentionally simple to deploy: one Docker Compose file on the control-plane host, one small agent binary on each endpoint. + +**License:** PolyForm Noncommercial 1.0.0 + +## 2. Goals & Non-Goals + +### Goals +- Central visibility into backup state for every endpoint +- Trigger any restic operation remotely (`backup`, `forget`, `prune`, `check`, `unlock`, `snapshots`, `stats`, `diff`, `restore`) +- Manage per-host backup schedules from the UI +- Live job progress streamed back to the UI +- Restore wizard (browse snapshots, pick paths, restore to original or alternate host) +- Repo health surfacing (size, dedup ratio, last check, lock state) +- Alerting on failure or staleness +- Cross-platform agent (Linux + Windows) +- Ransomware-resistant repo access via append-only credentials + +### Non-Goals (initial release) +- Replacing restic itself or providing custom repo formats +- Managing non-restic backup tools +- Multi-tenancy / SaaS deployment +- High availability of the control plane (SQLite, single-instance) +- Mobile-native apps (responsive web only) + +## 3. 
Architecture + +### 3.1 Components + +``` +┌──────────────────────────────────────────────────────────────────┐ +│ Proxmox cluster │ +│ ┌────────────────────────────────────────────────────────────┐ │ +│ │ docker compose: restic-manager │ │ +│ │ - server (Go binary, REST + WS API, embedded HTMX UI) │ │ +│ │ - SQLite volume │ │ +│ └────────────────────────────────────────────────────────────┘ │ +└────────────────────────▲─────────────────────────────────────────┘ + │ HTTPS (control plane) + │ - agent → server: status, telemetry + │ - server → agent: commands, schedules + │ +┌────────────────────────┴─────────────────────────────────────────┐ +│ Endpoints (Linux + Windows) │ +│ ┌──────────────────────┐ ┌────────────────────────────────┐ │ +│ │ restic-manager- │ │ restic CLI │ │ +│ │ agent (Go binary) │───▶│ invoked by agent │ │ +│ │ - systemd / svc │ └─────────────┬──────────────────┘ │ +│ │ - WS to server │ │ HTTPS │ +│ └──────────────────────┘ │ (data plane) │ +└─────────────────────────────────────────────┼────────────────────┘ + │ + ▼ +┌──────────────────────────────────────────────────────────────────┐ +│ Unraid │ +│ ┌────────────────────────────────────────────────────────────┐ │ +│ │ Docker: restic/rest-server │ │ +│ │ - per-host append-only credentials │ │ +│ │ - one repo per host │ │ +│ │ - storage: Unraid share │ │ +│ └────────────────────────────────────────────────────────────┘ │ +└──────────────────────────────────────────────────────────────────┘ +``` + +### 3.2 Data flow + +- **Backup data:** endpoint → restic CLI → restic REST server on Unraid → Unraid share. The control plane *never* touches backup bytes. +- **Control plane:** agent maintains an outbound WebSocket to the server. Server pushes commands and schedule changes; agent pushes status, logs, live job progress, host metadata. +- **UI:** browser → server (HTTPS, session cookies). Server fans out commands to agents, streams progress back to browser. 
+ +### 3.3 Why agent (not SSH) + +- Push model works through NAT/firewalls without inbound rules +- Native Windows support without OpenSSH service quirks +- Local scheduling survives controller restarts +- Self-contained `restic --json` parsing, no remote shell quoting hazards + +### 3.4 Why per-host repos + +- Isolates corruption / lock contention +- Append-only credentials per host = compromised endpoint can't delete other hosts' backups +- Simpler `prune` orchestration (no global lock coordination) +- Trivially easy to retire a host (delete its repo + credential) + +## 4. Components in detail + +### 4.1 Server + +- **Language:** Go 1.22+ +- **Storage:** SQLite (via `modernc.org/sqlite`, no CGo) +- **HTTP:** `net/http` + `chi` router +- **WebSocket:** `nhooyr.io/websocket` +- **UI:** HTMX + Tailwind, server-rendered Go templates, no Node build step +- **Distribution:** single static binary, packaged in a Docker image; published `docker-compose.yml` +- **Config:** YAML or env vars (`RM_LISTEN`, `RM_DATA_DIR`, `RM_BASE_URL`, `RM_TLS_CERT`, `RM_TLS_KEY`) +- **TLS:** terminate TLS in-process (cert from Caddy/Traefik sidecar acceptable; agents require HTTPS) + +### 4.2 Agent + +- **Language:** Go (cross-compiled for `linux/amd64`, `linux/arm64`, `windows/amd64`) +- **Service integration:** systemd unit (Linux), Windows service via `golang.org/x/sys/windows/svc` +- **Footprint goal:** ≤ 15 MB binary, ≤ 50 MB RSS idle +- **Persistence:** local config file + small state DB (BoltDB or JSON) for queued reports if server is unreachable +- **Restic invocation:** spawns `restic` with `--json`, parses streamed output, forwards to server in real time +- **Self-update:** server publishes signed agent binary; agent downloads, verifies signature, swaps binary, restarts service + +### 4.3 Restic REST server (Unraid) + +- Run `restic/rest-server` Docker container +- `--append-only` enabled +- `--private-repos` enabled (each user only sees their own subpath) +- htpasswd file with one 
user per host +- Storage path mapped to Unraid share + +## 5. Domain model + +``` +Host + id, name, os, arch, agent_version, restic_version, + enrolled_at, last_seen_at, status (online/offline/degraded), + repo_id (FK), tags, + current_job_id (FK nullable), + last_backup_at, last_backup_status (succeeded|failed|cancelled|null), + repo_size_bytes, snapshot_count, open_alert_count + # Last six fields are denormalised projections, refreshed on + # job.finished, snapshots.report, repo.stats, and alert state changes. + +Repo + id, name, url, kind (rest|s3|local), credential_id (FK), + password_secret_id (FK), + size_bytes, snapshot_count, dedup_ratio, + last_check_at, last_check_status, lock_state (locked|unlocked), + append_only (bool), credential_rotated_at + # Bottom block is a cached projection from `restic stats` + + # Credential row, refreshed by repo.stats agent messages. + +Credential + id, kind, username, secret_ref (encrypted), + rotated_at + +Schedule + id, host_id (FK), kind (backup|forget|prune|check), + cron_expr, paths (json), excludes (json), tags (json), + retention_policy (json), options (json), pre_hook, post_hook, + enabled + # retention_policy: {keep_last, keep_hourly, keep_daily, keep_weekly, + # keep_monthly, keep_yearly, keep_tag: [...]} + # options: {limit_upload_kbps, limit_download_kbps} + # pre_hook/post_hook: see §14.3 (encrypted at rest) + +Job + id, host_id (FK), kind, status (queued|running|succeeded|failed|cancelled), + scheduled_id (FK nullable), + actor_kind (user|schedule|system), actor_id (nullable), + started_at, finished_at, + exit_code, stats (json), error + +JobLog + job_id (FK), seq, ts, stream (stdout|stderr|event), payload + +Snapshot (cached projection from `restic snapshots --json`) + id (restic id), host_id (FK), repo_id (FK), + time, hostname, paths, tags, size_bytes, file_count + +Alert + id, host_id (FK nullable), kind, severity, message, + created_at, acknowledged_at, resolved_at + +User + id, username, password_hash, 
role (admin|operator|viewer), + created_at, last_login_at + +Session + id, user_id (FK), created_at, expires_at, ip, ua + +AuditLog + id, user_id (FK nullable), actor (user|agent|system), + action, target_kind, target_id, ts, payload (json) +``` + +## 6. API surface (control plane) + +### 6.1 UI/REST (browser → server) + +``` +POST /api/auth/login +POST /api/auth/logout + +GET /api/fleet/summary (aggregate: host counts by status, + total bytes, open alerts; reused by /metrics) + +GET /api/hosts ?tag=&status=&limit=&offset= + (returns Host rows incl. denormalised + last_backup_*, repo_size_bytes, + snapshot_count, open_alert_count, + current_job_id) +GET /api/hosts/:id +DELETE /api/hosts/:id +POST /api/hosts/:id/enrollment-token (regenerate) +POST /api/hosts/:id/agent/update (force agent self-update; see §4.2) + +GET /api/hosts/:id/snapshots ?tag=&path=&since=&until=&limit=&offset= +GET /api/hosts/:id/repo (full Repo projection) +POST /api/hosts/:id/jobs (run-now: backup/forget/prune/check/unlock) +POST /api/hosts/:id/restore (restore wizard submit) + +GET /api/hosts/:id/schedules +POST /api/hosts/:id/schedules +PUT /api/schedules/:id +DELETE /api/schedules/:id + +GET /api/jobs ?host_id=&kind=&status=&since=&until= + &limit=&offset=&order=desc +GET /api/jobs/:id +GET /api/jobs/:id/logs (paginated: ?after_seq=&limit=) +WS /api/jobs/:id/stream (live progress; see §6.2 for shape) +POST /api/jobs/:id/cancel + +GET /api/repos +GET /api/repos/:id + +GET /api/alerts +POST /api/alerts/:id/ack + +GET /api/audit +GET /api/users (admin) +POST /api/users (admin) +``` + +**Realtime strategy:** only `/api/jobs/:id/stream` uses WS. All other screens +(dashboard, hosts, snapshots) refresh via HTMX polling (~10s cadence). Revisit +if dashboard staleness becomes a problem in practice. + +### 6.2 Agent ↔ Server + +Single authenticated WebSocket per agent. Bidirectional JSON-RPC-ish messages. 
+ +**Agent → server:** +- `hello` (host metadata, agent version, restic version, OS) +- `heartbeat` (every 30s) +- `job.started` (job_id, kind, started_at) +- `job.progress` (job_id, percent_done, files_done, total_files, + bytes_done, total_bytes, eta_seconds, throughput_bps) +- `job.finished` (job_id, status, exit_code, stats, error, finished_at) +- `snapshots.report` (full list after each successful backup) +- `repo.stats` (size_bytes, snapshot_count, dedup_ratio, last_check_at, + last_check_status, lock_state) +- `log.stream` (live stdout/stderr lines while job running; + {job_id, seq, ts, stream: stdout|stderr|event, payload}) + +**Server → agent:** +- `command.run` (kind, args) +- `command.cancel` (job_id) +- `schedule.set` (full schedule list, agent reconciles local cron) +- `config.update` +- `agent.update` (new version available, URL + signature) + +The server fans `job.progress` and `log.stream` for a given job to all +browsers subscribed to `WS /api/jobs/:id/stream` (§6.1) without +transformation, so the schema is shared end-to-end. + +### 6.3 Enrollment + +1. Operator clicks "Add host" → server generates one-time token (TTL 1h) +2. Operator runs install script on endpoint with token +3. Agent calls `POST /api/agents/enroll` with token + host metadata +4. Server issues persistent agent credential (bearer token + TLS pin) and host record +5. Agent stores credential, opens WS connection + +## 7. 
Security + +### 7.1 Authentication +- **Phase 1:** username + password (argon2id), HTTP-only secure session cookies, CSRF tokens on state-changing requests +- **Phase 2:** OIDC (Authelia, Keycloak, Authentik) +- **Agents:** bearer token over TLS; pin server cert fingerprint at enrollment time + +### 7.2 Authorization (Phase 1: simple roles) +- **admin:** everything +- **operator:** trigger jobs, edit schedules, restore +- **viewer:** read-only + +### 7.3 Secret handling +- Restic repo passwords and REST-server credentials encrypted at rest in SQLite using a server-side key (loaded from env or file at startup) +- Pushed to agents only over the authenticated WS, only when needed for a job +- Agent stores them in OS keyring where available (Windows DPAPI, Linux Secret Service / fallback to encrypted file with restricted perms) + +### 7.4 Repo protection +- Restic REST server runs with `--append-only` for routine backups +- A separate non-append-only credential exists for `forget`/`prune` operations, used only when explicitly invoked from the UI by an admin/operator and audited + +### 7.5 Audit +- Every state-changing UI action and every server→agent command logged with user, target, timestamp, and payload + +## 8. UI + +Stack: HTMX + Tailwind + Go html/templates. No SPA framework. Server-rendered, progressive enhancement. + +**Pages:** +- **Login** +- **Dashboard:** fleet overview (host cards: status, last backup, repo size, alerts) +- **Host detail:** tabs for Snapshots / Schedules / Jobs / Repo / Settings +- **Job detail:** live log streaming via WS, cancel button +- **Restore wizard:** host → snapshot → paths → target → confirm +- **Repos:** aggregate view across hosts +- **Alerts:** list, acknowledge +- **Settings:** users (admin), notification channels, agent download +- **Audit log** + +## 9. 
Alerting + +- **Triggers:** backup failed, backup hasn't run in N hours past its schedule, repo `check` failed, agent offline > N minutes, repo size growth anomaly +- **Channels (Phase 1):** webhook, ntfy, email (SMTP) +- **Channels (Phase 2+):** Discord, Slack, Pushover + +## 10. Deployment + +### 10.1 Control plane (Proxmox host or LXC) + +`docker-compose.yml`: +```yaml +services: + restic-manager: + image: ghcr.io//restic-manager:latest + restart: unless-stopped + ports: + - "8443:8443" + volumes: + - ./data:/data + - ./certs:/certs:ro + environment: + - RM_DATA_DIR=/data + - RM_LISTEN=:8443 + - RM_BASE_URL=https://restic.lab.example + - RM_TLS_CERT=/certs/fullchain.pem + - RM_TLS_KEY=/certs/privkey.pem + - RM_SECRET_KEY_FILE=/data/secret.key +``` + +### 10.2 Restic REST server (Unraid) + +Standard `restic/rest-server` container, `--append-only`, `--private-repos`, htpasswd mounted, data path on the share. + +### 10.3 Agent install + +- **Linux:** `curl -fsSL https://restic.lab.example/install.sh | sudo RM_TOKEN=xxx sh` +- **Windows:** `iwr https://restic.lab.example/install.ps1 | iex` (with `$env:RM_TOKEN`) +- Installer drops binary + service unit, calls enroll endpoint, starts service + +## 11. Testing strategy + +- **Unit tests:** restic JSON parsing, schedule reconciliation, retention policy logic +- **Integration tests:** spin up real `restic` + `rest-server` in Docker, exercise full backup/snapshot/restore flows +- **End-to-end:** Playwright against a compose-up'd stack with one Linux agent in a sibling container +- **Cross-platform agent CI:** build matrix `linux/amd64`, `linux/arm64`, `windows/amd64`; smoke test on Windows runner + +## 12. 
Repository layout + +``` +restic-manager/ +├── cmd/ +│ ├── server/ +│ └── agent/ +├── internal/ +│ ├── api/ # shared API types +│ ├── server/ +│ │ ├── http/ +│ │ ├── ws/ +│ │ └── ui/ # templates, handlers +│ ├── agent/ +│ │ ├── service/ # systemd / windows service glue +│ │ ├── runner/ # restic invocation +│ │ └── scheduler/ +│ ├── restic/ # restic CLI wrapper, --json parsing +│ ├── store/ # sqlite layer +│ ├── crypto/ # secret encryption +│ └── auth/ +├── web/ +│ ├── templates/ +│ └── static/ +├── deploy/ +│ ├── docker-compose.yml +│ ├── Dockerfile.server +│ └── install/ +│ ├── install.sh +│ └── install.ps1 +├── docs/ +├── LICENSE # PolyForm Noncommercial 1.0.0 +├── README.md +├── spec.md +└── tasks.md +``` + +## 13. Phased delivery + +- **Phase 1 (MVP):** server skeleton, agent skeleton, enrollment, host list, snapshot list, on-demand backup, live job log +- **Phase 2:** schedules, retention, run-now for `forget`/`prune`/`check`/`unlock`, repo stats +- **Phase 3:** restore wizard, alerts (webhook/ntfy/email), audit log +- **Phase 4:** agent self-update, OIDC, multi-user/RBAC polish, repo trends +- **Phase 5:** OSS readiness — docs site, contribution guide, screenshot tour + +## 14. Confirmed extensions (in scope) + +These were originally listed as open questions and have been confirmed for inclusion. Slotted into phases below. + +### 14.1 Cross-host restore + +Restore a snapshot taken on host A onto host B (e.g. recover a dead box onto a fresh one, clone a workload onto a sibling host, restore a developer's home dir onto a new laptop). + +- **Credential model:** target host's agent receives a temporary, server-issued read credential for the source host's repo, scoped to a single restore job and revoked immediately after +- **Path remapping:** UI allows rewriting source paths to target paths (e.g. 
`/home/alice` → `/home/alice-new`) +- **Permissions:** restore runs as the agent's service user; UI surfaces a warning when source paths require root and target service user is non-root +- **Phase:** 3 (with the restore wizard) + +### 14.2 Bandwidth limiting + +Per-host upload/download caps for backup, restore, and prune jobs. + +- Exposed on the schedule editor as optional `--limit-upload` / `--limit-download` (KB/s) +- Also overridable on run-now jobs via the UI +- Persisted in `Schedule.options` (JSON blob) so the schema stays stable +- **Phase:** 2 (with scheduling) + +### 14.3 Pre/post backup hooks + +Per-host shell commands run before and after a backup job. Use cases: `mysqldump`/`pg_dump` to a staging path, stop/start Docker containers, quiesce a service, post-backup notifications. + +- **Schema:** `Schedule.pre_hook` and `Schedule.post_hook` (string, optional). For more complex cases, `Host.pre_hook_default` / `Host.post_hook_default` apply to all schedules on that host unless overridden +- **Execution:** agent runs hooks via the host's default shell (`/bin/sh` Linux, `cmd.exe` or PowerShell Windows — host-configurable) +- **Failure semantics:** `pre_hook` non-zero exit aborts the backup and marks the job failed. `post_hook` runs on both success and failure (with `RM_JOB_STATUS` env var); its own exit code is recorded but does not change the backup job's final status +- **Stdout/stderr:** captured into `JobLog` like restic output, prefixed `pre_hook:` / `post_hook:` +- **Security:** hooks are stored encrypted; only admins can edit them; every edit audit-logged +- **Phase:** 2 (with scheduling) + +### 14.4 Prometheus `/metrics` endpoint + +Standard Prometheus exposition on `/metrics`, protected by either bearer token or IP allow-list. 
+ +- **Metrics (per host):** + - `restic_manager_last_backup_timestamp_seconds{host=...}` + - `restic_manager_last_backup_status{host=...}` (1=success, 0=failure) + - `restic_manager_repo_size_bytes{host=...}` + - `restic_manager_snapshot_count{host=...}` + - `restic_manager_agent_online{host=...}` (1/0) + - `restic_manager_job_duration_seconds_bucket{kind=...,host=...}` (histogram) +- **Server-level:** `restic_manager_jobs_total{kind=...,status=...}`, `restic_manager_alerts_active`, `restic_manager_build_info` +- **Phase:** 4 (alongside repo trend charts — both rely on the same time-series data) + +## 15. Future considerations (not yet committed) + +- Read-only share links for snapshot listings (auditor view) — out of scope for personal/lab use; revisit if multi-tenant or org use cases emerge diff --git a/tasks.md b/tasks.md new file mode 100644 index 0000000..8199606 --- /dev/null +++ b/tasks.md @@ -0,0 +1,148 @@ +# restic-manager — Tasks + +Tasks are grouped by phase. Each task has an ID for cross-referencing, an estimated size (S/M/L), and acceptance criteria. + +Sizes: **S** = under a day, **M** = 1–3 days, **L** = 3–7 days. 
+
+---
+
+## Phase 0 — Project bootstrap
+
+- [ ] **P0-01** (S) Initialize Go module, `cmd/server`, `cmd/agent`, baseline `internal/` packages
+- [ ] **P0-02** (S) Add LICENSE (PolyForm Noncommercial 1.0.0), README stub, CONTRIBUTING placeholder
+- [ ] **P0-03** (S) Set up `golangci-lint`, `gofumpt`, `goimports`; pre-commit config
+- [ ] **P0-04** (S) GitHub Actions: build matrix (linux amd64/arm64, windows amd64), unit tests, lint
+- [ ] **P0-05** (S) `Dockerfile.server` (multi-stage, distroless), `deploy/docker-compose.yml`
+- [ ] **P0-06** (S) Makefile / `taskfile.yml` with common targets (`build`, `test`, `run`, `release`)
+
+---
+
+## Phase 1 — MVP: enrollment, visibility, on-demand backup
+
+### Server foundations
+- [ ] **P1-01** (M) HTTP server scaffolding (`chi`, structured logging via `slog`, graceful shutdown)
+- [ ] **P1-02** (M) SQLite store layer (`modernc.org/sqlite`) + migrations (`golang-migrate` or hand-rolled)
+- [ ] **P1-03** (M) Schema for `users`, `sessions`, `hosts`, `repos`, `credentials`, `jobs`, `job_logs`, `snapshots`, `audit_log`
+- [ ] **P1-04** (M) Auth: argon2id password hashing, login/logout, session cookies, CSRF middleware
+- [ ] **P1-05** (S) First-run admin bootstrap (printed one-time setup token in server logs)
+- [ ] **P1-06** (M) Secret encryption helper (AEAD with key from `RM_SECRET_KEY_FILE`)
+- [ ] **P1-07** (M) Audit log writer + middleware
+
+### Agent ↔ server protocol
+- [ ] **P1-08** (M) Define shared API types in `internal/api` (Go structs, JSON tags)
+- [ ] **P1-09** (L) WebSocket transport (`nhooyr.io/websocket`), framed JSON envelopes, request/response correlation, ping/pong, reconnect with backoff
+- [ ] **P1-10** (M) Enrollment flow: `POST /api/agents/enroll` with one-time token → returns persistent bearer + cert pin
+- [ ] **P1-11** (M) Agent registration on connect (`hello` message → upsert host record, mark online)
+- [ ] **P1-12** (S) Heartbeat handler (mark host offline after 90s without heartbeat)
+
+### Agent foundations
+- [ ] **P1-13** (M) Agent config file (`/etc/restic-manager/agent.yaml` / `%PROGRAMDATA%\restic-manager\agent.yaml`)
+- [ ] **P1-14** (M) Service integration: systemd unit + Windows service entrypoint
+- [ ] **P1-15** (M) Outbound WS client with reconnect, server cert pinning
+- [ ] **P1-16** (M) Restic wrapper: locate `restic` binary, run with `--json`, stream parsed events
+- [ ] **P1-17** (S) Host metadata collection (OS, arch, hostname, restic version, agent version)
+
+### Run-now backup
+- [ ] **P1-18** (L) Job lifecycle: queued → running → succeeded/failed/cancelled, persisted with logs
+- [ ] **P1-19** (M) Server endpoint `POST /api/hosts/:id/jobs` to dispatch a `backup` command
+- [ ] **P1-20** (M) Agent executes `restic backup`, streams stdout/stderr + parsed JSON events back as `job.progress` / `log.stream`
+- [ ] **P1-21** (M) Server persists log stream to `job_logs`, exposes `WS /api/jobs/:id/stream` for live tailing
+- [ ] **P1-22** (S) Snapshot listing: `restic snapshots --json`, cached projection table, refresh after each backup
+
+### UI (HTMX + Tailwind)
+- [ ] **P1-23** (M) Base layout, login page, session-aware nav
+- [ ] **P1-24** (M) Dashboard: host cards (status dot, last backup, repo size)
+- [ ] **P1-25** (M) Host detail page: snapshots tab + run-now button
+- [ ] **P1-26** (M) Live job log viewer (WS-driven, auto-scroll, cancel button)
+- [ ] **P1-27** (S) "Add host" flow: generate token, copy install command snippet
+- [ ] **P1-28** (S) Tailwind build via `tailwindcss` standalone binary (no Node)
+
+### Install scripts
+- [ ] **P1-29** (M) `install.sh` (Linux): detects arch, downloads agent, installs systemd unit, enrolls
+- [ ] **P1-30** (M) `install.ps1` (Windows): downloads agent, installs as service, enrolls
+- [ ] **P1-31** (S) Server endpoint to serve agent binaries + install scripts (signed)
+
+### Phase 1 acceptance
+- One Linux + one Windows host can enroll, appear in the dashboard, and a backup can be triggered from the UI with live log streaming. Snapshots list updates after success.
+
+---
+
+## Phase 2 — Scheduling, retention, repo operations
+
+- [ ] **P2-01** (M) Schedule schema + CRUD API
+- [ ] **P2-02** (L) Server-pushed schedule reconciliation (server is source of truth; agent applies)
+- [ ] **P2-03** (M) Agent local scheduler (`robfig/cron/v3`); persists next-fire times across restarts
+- [ ] **P2-04** (M) Schedule editor UI (paths, excludes, tags, cron, retention)
+- [ ] **P2-05** (M) `forget` command with retention policy (keep-last/daily/weekly/monthly/yearly)
+- [ ] **P2-06** (M) `prune` command (admin-only, uses non-append-only credential)
+- [ ] **P2-07** (S) `check` command (random subset + `--read-data-subset`)
+- [ ] **P2-08** (S) `unlock` command
+- [ ] **P2-09** (M) Repo stats panel: size, dedup ratio, snapshot count, last check time, lock state
+- [ ] **P2-10** (S) Run-now buttons for forget/prune/check/unlock on host detail page
+- [ ] **P2-11** (S) Schedule "next run" / "last run" surfaced on host card
+- [ ] **P2-12** (S) Bandwidth limit fields on schedule editor (`--limit-upload`, `--limit-download`); also overridable on run-now jobs
+- [ ] **P2-13** (M) Pre/post backup hooks: schema (`Schedule.pre_hook`, `Schedule.post_hook`, `Host.pre_hook_default`, `Host.post_hook_default`), encrypted at rest, admin-only edit, audit-logged
+- [ ] **P2-14** (M) Agent execution of hooks: configurable shell per host, `pre_hook` failure aborts backup, `post_hook` always runs with `RM_JOB_STATUS` env var, stdout/stderr captured into `JobLog` with prefix
+- [ ] **P2-15** (S) Hook editor UI on schedule + host pages, with sensible warnings (e.g. "this hook runs as the agent service user")
+
+### Phase 2 acceptance
+- Schedules created in UI run on agents on time; retention is applied; admin can prune from UI; repo health visible per host. Pre/post hooks fire correctly (verified with a Docker stop/start example and a `mysqldump` example). Bandwidth limits honoured.
+
+---
+
+## Phase 3 — Restore, alerts, audit
+
+- [ ] **P3-01** (L) Restore wizard backend: snapshot tree browse via `restic ls --json`, path picker, target selection
+- [ ] **P3-02** (L) Restore wizard UI (multi-step: host → snapshot → paths → target → confirm)
+- [ ] **P3-03** (M) Restore execution: `restic restore` invocation, progress streaming
+- [ ] **P3-04** (L) Cross-host restore: target agent receives a temporary scoped read credential for source host's repo (single-job, auto-revoked); UI supports source→target path remapping; warns when source paths need root and target service user is non-root
+- [ ] **P3-05** (M) Alert engine: rule evaluation loop (failed backup, stale schedule, agent offline, check failed)
+- [ ] **P3-06** (M) Notification channels: webhook, ntfy, SMTP email
+- [ ] **P3-07** (S) Alert UI: list, acknowledge, resolve
+- [ ] **P3-08** (S) Audit log UI with filters (user, action, target, time range)
+- [ ] **P3-09** (S) `diff` between two snapshots in UI
+
+### Phase 3 acceptance
+- A file deleted on a host can be restored from the UI in under 2 minutes. A failed backup raises an alert via the configured channel within 60s.
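
A single pass of the P3-05 rule evaluation loop might look roughly like the Go sketch below. Every name in it (`HostStatus`, `Alert`, `Evaluate`, `demoHosts`) is hypothetical and not taken from the spec. The 90s offline threshold mirrors P1-12, while the "stale after two missed intervals" factor is an assumption made only for illustration. The `check failed` rule is left out to keep the sketch short.

```go
package main

import (
	"fmt"
	"time"
)

// HostStatus is a hypothetical projection of what the server knows about a host.
type HostStatus struct {
	Name          string
	LastHeartbeat time.Time
	LastBackupOK  bool          // outcome of the most recent backup job
	LastBackupEnd time.Time     // when that job finished
	ScheduleEvery time.Duration // expected backup interval; 0 means unscheduled
}

// Alert names the host and the rule that fired.
type Alert struct {
	Host string
	Rule string
}

const (
	offlineAfter = 90 * time.Second // same threshold as the P1-12 heartbeat handler
	staleFactor  = 2                // assumed: stale after two missed intervals
)

// Evaluate runs one pass over the fleet and returns every rule that fires.
func Evaluate(now time.Time, hosts []HostStatus) []Alert {
	var alerts []Alert
	for _, h := range hosts {
		if now.Sub(h.LastHeartbeat) > offlineAfter {
			alerts = append(alerts, Alert{h.Name, "agent_offline"})
		}
		if !h.LastBackupOK {
			alerts = append(alerts, Alert{h.Name, "backup_failed"})
		}
		if h.ScheduleEvery > 0 && now.Sub(h.LastBackupEnd) > staleFactor*h.ScheduleEvery {
			alerts = append(alerts, Alert{h.Name, "stale_schedule"})
		}
	}
	return alerts
}

// demoNow and demoHosts are made-up sample data for the example run.
var demoNow = time.Date(2026, 5, 1, 12, 0, 0, 0, time.UTC)

func demoHosts() []HostStatus {
	day := 24 * time.Hour
	return []HostStatus{
		{"web01", demoNow.Add(-10 * time.Second), true, demoNow.Add(-time.Hour), day},
		{"db01", demoNow.Add(-5 * time.Second), false, demoNow.Add(-30 * time.Minute), day},
		{"win01", demoNow.Add(-10 * time.Minute), true, demoNow.Add(-72 * time.Hour), day},
	}
}

func main() {
	for _, a := range Evaluate(demoNow, demoHosts()) {
		fmt.Printf("%s: %s\n", a.Host, a.Rule)
	}
}
```

In a real server this pass would presumably be driven by a ticker with a period well under the 60s acceptance target, and firing alerts would be deduplicated before reaching the notification channels.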
+
+---
+
+## Phase 4 — Self-update, RBAC polish, OIDC
+
+- [ ] **P4-01** (L) Agent self-update: signed binary published by server, agent downloads, verifies, swaps, restarts
+- [ ] **P4-02** (M) Agent version reporting on dashboard; "update all" admin action
+- [ ] **P4-03** (M) RBAC enforcement at API layer (admin / operator / viewer)
+- [ ] **P4-04** (S) User management UI (create/edit/disable, role assignment, password reset)
+- [ ] **P4-05** (L) OIDC login (generic provider config, group → role mapping)
+- [ ] **P4-06** (M) Repo size trend graphs (sparkline on host card, full chart on repo page)
+- [ ] **P4-07** (S) Per-host tags + dashboard filtering by tag
+- [ ] **P4-08** (M) Prometheus `/metrics` endpoint: per-host gauges (last backup timestamp, last backup status, repo size, snapshot count, agent online), server gauges (active alerts, build info), job duration histograms; protected by bearer token or IP allow-list
+- [ ] **P4-09** (S) Document Prometheus integration + sample Grafana dashboard JSON
+
+### Phase 4 acceptance
+- Non-admin users see an appropriately limited UI. Agents update themselves with one click. OIDC login works against at least one provider (Authelia or Authentik). Prometheus can scrape `/metrics` and the sample Grafana dashboard renders with live data.
+
+---
+
+## Phase 5 — OSS readiness
+
+- [ ] **P5-01** (M) Documentation site (mdBook or similar) with install, concepts, security model, screenshots
+- [ ] **P5-02** (S) `CONTRIBUTING.md`, `CODE_OF_CONDUCT.md`, issue + PR templates
+- [ ] **P5-03** (S) Release automation: `goreleaser` for binaries + Docker image to GHCR
+- [ ] **P5-04** (S) Demo screenshots / short Loom walkthrough in README
+- [ ] **P5-05** (S) `SECURITY.md` with disclosure process
+- [ ] **P5-06** (M) End-to-end test suite in CI (Playwright vs. compose stack with sibling Linux agent)
+- [ ] **P5-07** (S) Sample `docker-compose.yml` with TLS via Caddy sidecar
+- [ ] **P5-08** (S) Optional Prometheus `/metrics` endpoint
+
+### Phase 5 acceptance
+- A stranger can read the docs and stand up a working install in under 30 minutes.
+
+---
+
+## Cross-cutting / ongoing
+
+- [ ] **X-01** Keep CHANGELOG.md updated (Keep-a-Changelog format)
+- [ ] **X-02** Track restic version compatibility matrix
+- [ ] **X-03** Periodic dependency updates (`dependabot` or `renovate`)
+- [ ] **X-04** Threat-model review at end of each phase
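
As a companion to the job lifecycle named in P1-18 (queued → running → succeeded/failed/cancelled), here is a minimal Go sketch of how the allowed transitions could be enforced. The type and function names (`JobState`, `CanTransition`, `Job.Transition`) are hypothetical. Allowing a queued job to be cancelled before it starts is an assumption, since the task list does not say whether that path exists.

```go
package main

import "fmt"

// JobState is one of the lifecycle states listed in P1-18.
type JobState string

const (
	StateQueued    JobState = "queued"
	StateRunning   JobState = "running"
	StateSucceeded JobState = "succeeded"
	StateFailed    JobState = "failed"
	StateCancelled JobState = "cancelled"
)

// allowed maps each non-terminal state to the states it may move to.
// Assumption: a queued job may be cancelled before it ever runs.
var allowed = map[JobState][]JobState{
	StateQueued:  {StateRunning, StateCancelled},
	StateRunning: {StateSucceeded, StateFailed, StateCancelled},
}

// CanTransition reports whether moving from one state to another is legal.
func CanTransition(from, to JobState) bool {
	for _, s := range allowed[from] {
		if s == to {
			return true
		}
	}
	return false
}

// Terminal reports whether a state has no outgoing transitions.
func (s JobState) Terminal() bool { return len(allowed[s]) == 0 }

// Job is a hypothetical in-memory view of a persisted job row.
type Job struct {
	ID    string
	State JobState
}

// Transition applies a state change, rejecting illegal moves.
func (j *Job) Transition(to JobState) error {
	if !CanTransition(j.State, to) {
		return fmt.Errorf("job %s: illegal transition %s -> %s", j.ID, j.State, to)
	}
	j.State = to
	return nil
}

func main() {
	j := &Job{ID: "job-1", State: StateQueued}
	for _, next := range []JobState{StateRunning, StateSucceeded, StateRunning} {
		if err := j.Transition(next); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("now", j.State)
	}
}
```

Keeping the table in one place means the server endpoint, the agent, and the UI cancel button could all consult the same rules, though where that check actually lives is a design decision the task list leaves open.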