restic-manager — Specification
1. Overview
restic-manager is a self-hosted, browser-based single pane of glass for managing restic backups across a fleet of Linux and Windows endpoints. It provides visibility, scheduling, ad-hoc operations, restore workflows, and alerting from one UI.
It is built for small-to-medium fleets (initial target: ~12 endpoints) and is intentionally simple to deploy: one Docker Compose file on the control-plane host, one small agent binary on each endpoint.
License: PolyForm Noncommercial 1.0.0
2. Goals & Non-Goals
Goals
- Central visibility into backup state for every endpoint
- Trigger any restic operation remotely (backup, forget, prune, check, unlock, snapshots, stats, diff, restore)
- Manage per-host backup schedules from the UI
- Live job progress streamed back to the UI
- Restore wizard (browse snapshots, pick paths, restore to original or alternate host)
- Repo health surfacing (size, dedup ratio, last check, lock state)
- Alerting on failure or staleness
- Cross-platform agent (Linux + Windows)
- Ransomware-resistant repo access via append-only credentials
Non-Goals (initial release)
- Replacing restic itself or providing custom repo formats
- Managing non-restic backup tools
- Multi-tenancy / SaaS deployment
- High availability of the control plane (SQLite, single-instance)
- Mobile-native apps (responsive web only)
3. Architecture
3.1 Components
┌──────────────────────────────────────────────────────────────────┐
│ Proxmox cluster │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ docker compose: restic-manager │ │
│ │ - server (Go binary, REST + WS API, embedded HTMX UI) │ │
│ │ - SQLite volume │ │
│ └────────────────────────────────────────────────────────────┘ │
└────────────────────────▲─────────────────────────────────────────┘
│ HTTPS (control plane)
│ - agent → server: status, telemetry
│ - server → agent: commands, schedules
│
┌────────────────────────┴─────────────────────────────────────────┐
│ Endpoints (Linux + Windows) │
│ ┌──────────────────────┐ ┌────────────────────────────────┐ │
│ │ restic-manager- │ │ restic CLI │ │
│ │ agent (Go binary) │───▶│ invoked by agent │ │
│ │ - systemd / svc │ └─────────────┬──────────────────┘ │
│ │ - WS to server │ │ HTTPS │
│ └──────────────────────┘ │ (data plane) │
└─────────────────────────────────────────────┼────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ Unraid │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Docker: restic/rest-server │ │
│ │ - per-host append-only credentials │ │
│ │ - one repo per host │ │
│ │ - storage: Unraid share │ │
│ └────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────┘
3.2 Data flow
- Backup data: endpoint → restic CLI → restic REST server on Unraid → Unraid share. The control plane never touches backup bytes.
- Control plane: agent maintains an outbound WebSocket to the server. Server pushes commands and schedule changes; agent pushes status, logs, live job progress, host metadata.
- UI: browser → server (HTTPS, session cookies). Server fans out commands to agents, streams progress back to browser.
3.3 Why agent (not SSH)
- Push model works through NAT/firewalls without inbound rules
- Native Windows support without OpenSSH service quirks
- Local scheduling survives controller restarts
- Self-contained restic --json parsing, no remote shell quoting hazards
3.4 Why per-host repos
- Isolates corruption / lock contention
- Append-only credentials per host = compromised endpoint can't delete other hosts' backups
- Simpler prune orchestration (no global lock coordination)
- Trivially easy to retire a host (delete its repo + credential)
4. Components in detail
4.1 Server
- Language: Go 1.22+
- Storage: SQLite (via modernc.org/sqlite, no CGo)
- HTTP: net/http + chi router
- WebSocket: github.com/coder/websocket (the maintained fork of the unmaintained nhooyr.io/websocket; same API)
- UI: HTMX + Tailwind, server-rendered Go templates, no Node build step
- Distribution: single static binary, packaged in a Docker image; published docker-compose.yml
- Config: YAML or env vars:
  - RM_LISTEN: bind address, e.g. :8443 (source of truth for the port; the 8443 in the reference compose is just a default mapping)
  - RM_DATA_DIR, RM_BASE_URL, RM_TLS_CERT, RM_TLS_KEY, RM_SECRET_KEY_FILE
  - RM_TRUSTED_PROXY: comma-separated CIDR list of reverse proxies whose X-Forwarded-For / X-Forwarded-Proto we honour. Empty (the default) = trust no one. Set this when fronted by Caddy/Traefik.
- TLS: terminate TLS in-process (cert from Caddy/Traefik sidecar acceptable; agents require HTTPS)
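The RM_TRUSTED_PROXY rule above could be implemented roughly as follows. This is a minimal sketch, not the server's actual code: the function names parseTrustedProxies and clientIP are hypothetical, and taking the right-most X-Forwarded-For entry is an assumption about how a single trusted proxy appends its hop.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// parseTrustedProxies parses a comma-separated CIDR list such as the value of
// RM_TRUSTED_PROXY. An empty value yields an empty list: trust no one.
func parseTrustedProxies(value string) ([]netip.Prefix, error) {
	var prefixes []netip.Prefix
	for _, part := range strings.Split(value, ",") {
		part = strings.TrimSpace(part)
		if part == "" {
			continue
		}
		p, err := netip.ParsePrefix(part)
		if err != nil {
			return nil, fmt.Errorf("RM_TRUSTED_PROXY: bad CIDR %q: %w", part, err)
		}
		prefixes = append(prefixes, p.Masked())
	}
	return prefixes, nil
}

// clientIP returns the address a request should be attributed to.
// X-Forwarded-For is consulted only when the direct peer is a trusted proxy.
func clientIP(peer, forwardedFor string, trusted []netip.Prefix) string {
	addr, err := netip.ParseAddr(peer)
	if err != nil {
		return peer
	}
	addr = addr.Unmap()
	for _, p := range trusted {
		if !p.Contains(addr) {
			continue
		}
		// Assumption: use the right-most entry, the hop the trusted proxy appended.
		hops := strings.Split(forwardedFor, ",")
		last := strings.TrimSpace(hops[len(hops)-1])
		if fwd, err := netip.ParseAddr(last); err == nil {
			return fwd.String()
		}
		break
	}
	return addr.String()
}

func main() {
	trusted, err := parseTrustedProxies("10.0.0.0/8, 192.168.1.0/24")
	if err != nil {
		panic(err)
	}
	// Peer is inside 10.0.0.0/8, so the header is honoured.
	fmt.Println(clientIP("10.1.2.3", "198.51.100.9, 203.0.113.7", trusted))
	// Peer is not a trusted proxy, so the header is ignored.
	fmt.Println(clientIP("203.0.113.50", "1.2.3.4", trusted))
}
```

With an empty RM_TRUSTED_PROXY the trusted list is empty and every request is attributed to its direct peer, which matches the "trust no one" default.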
4.2 Agent
- Language: Go (cross-compiled for linux/amd64, linux/arm64, windows/amd64). Phase 1 ships Linux only; Windows binaries continue to build in CI to keep the codebase portable, but Windows service integration + signed installer + install.ps1 land in Phase 2.
- Service integration: systemd unit (Linux). Windows service via golang.org/x/sys/windows/svc (Phase 2).
- Footprint goal: ≤ 15 MB binary, ≤ 50 MB RSS idle
- Persistence: local config file + small state DB (BoltDB or JSON) for queued reports if server is unreachable
- Restic invocation: spawns restic with --json, parses streamed output, forwards to server in real time
- Updates: distributed via OS package manager: apt repo (Linux) and Chocolatey package (Windows), both pointing at Gitea releases. No bespoke signed-binary self-update; the restic-manager-agent update command is a thin wrapper over apt-get install --only-upgrade / choco upgrade. UI surfaces "agent N versions behind server" so an operator knows when to upgrade.
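As an illustration of the restic invocation bullet, the sketch below turns the "status" lines of restic backup --json into job.progress payloads (§6.2). The type and function names are hypothetical. The input is a string here to keep the example self-contained; a real agent would presumably scan the child process's stdout pipe instead. throughput_bps is left out because restic does not report it directly.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// resticStatus holds the fields of a restic "status" JSON line that
// job.progress needs. restic reports percent_done as a fraction (0..1).
type resticStatus struct {
	MessageType      string  `json:"message_type"`
	PercentDone      float64 `json:"percent_done"`
	TotalFiles       uint64  `json:"total_files"`
	FilesDone        uint64  `json:"files_done"`
	TotalBytes       uint64  `json:"total_bytes"`
	BytesDone        uint64  `json:"bytes_done"`
	SecondsRemaining uint64  `json:"seconds_remaining"`
}

// jobProgress is a hypothetical Go shape for the job.progress message.
type jobProgress struct {
	JobID       string  `json:"job_id"`
	PercentDone float64 `json:"percent_done"`
	FilesDone   uint64  `json:"files_done"`
	TotalFiles  uint64  `json:"total_files"`
	BytesDone   uint64  `json:"bytes_done"`
	TotalBytes  uint64  `json:"total_bytes"`
	ETASeconds  uint64  `json:"eta_seconds"`
}

// parseProgress scans newline-delimited restic output and keeps only the
// status lines. Non-JSON lines and other message types are skipped.
func parseProgress(jobID, output string) []jobProgress {
	var out []jobProgress
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		var st resticStatus
		if err := json.Unmarshal(sc.Bytes(), &st); err != nil || st.MessageType != "status" {
			continue
		}
		out = append(out, jobProgress{
			JobID:       jobID,
			PercentDone: st.PercentDone,
			FilesDone:   st.FilesDone,
			TotalFiles:  st.TotalFiles,
			BytesDone:   st.BytesDone,
			TotalBytes:  st.TotalBytes,
			ETASeconds:  st.SecondsRemaining,
		})
	}
	return out
}

func main() {
	sample := `{"message_type":"status","percent_done":0.5,"total_files":10,"files_done":5,"total_bytes":2048,"bytes_done":1024,"seconds_remaining":3}
{"message_type":"summary","files_new":10,"snapshot_id":"abc123"}`
	for _, p := range parseProgress("job-1", sample) {
		b, _ := json.Marshal(p)
		fmt.Println(string(b))
	}
}
```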
4.3 Restic REST server (Unraid)
- Run restic/rest-server Docker container
- --append-only enabled
- --private-repos enabled (each user only sees their own subpath)
- htpasswd file with one user per host
- Storage path mapped to Unraid share
5. Domain model
Host
id, name, os, arch, agent_version, restic_version, protocol_version,
enrolled_at, last_seen_at, status (online/offline/degraded),
repo_id (FK), tags,
current_job_id (FK nullable),
last_backup_at, last_backup_status (succeeded|failed|cancelled|null),
repo_size_bytes, snapshot_count, open_alert_count,
applied_schedule_version
# Bottom block (last_backup_*, repo_size_bytes, snapshot_count,
# open_alert_count, applied_schedule_version) are denormalised
# projections, refreshed on job.finished, snapshots.report,
# repo.stats, and alert state changes.
# applied_schedule_version is the schedule_version the agent most
# recently acknowledged via `schedule.ack` — lets the UI surface
# drift when an agent is offline.
Repo
id, name, url, kind (rest|s3|local), credential_id (FK),
password_secret_id (FK),
size_bytes, snapshot_count, dedup_ratio,
last_check_at, last_check_status, lock_state (locked|unlocked),
append_only (bool), credential_rotated_at
# Bottom block is a cached projection from `restic stats` +
# Credential row, refreshed by repo.stats agent messages.
Credential
id, kind, username, secret_ref (encrypted),
rotated_at
Schedule
id, host_id (FK), kind (backup|forget|prune|check),
cron_expr, paths (json), excludes (json), tags (json),
retention_policy (json), options (json), pre_hook, post_hook,
enabled
# retention_policy: {keep_last, keep_hourly, keep_daily, keep_weekly,
# keep_monthly, keep_yearly, keep_tag: [...]}
# options: {limit_upload_kbps, limit_download_kbps}
# pre_hook/post_hook: see §14.3 (encrypted at rest)
Job
id, host_id (FK), kind, status (queued|running|succeeded|failed|cancelled),
scheduled_id (FK nullable),
actor_kind (user|schedule|system), actor_id (nullable),
started_at, finished_at,
exit_code, stats (json), error
JobLog
job_id (FK), seq, ts, stream (stdout|stderr|event), payload
Snapshot (cached projection from `restic snapshots --json`)
id (restic id), host_id (FK), repo_id (FK),
time, hostname, paths, tags, size_bytes, file_count
Alert
id, host_id (FK nullable), kind, severity, message,
created_at, acknowledged_at, resolved_at
User
id, username, password_hash, role (admin|operator|viewer),
created_at, last_login_at
Session
id, user_id (FK), created_at, expires_at, ip, ua
AuditLog
id, user_id (FK nullable), actor (user|agent|system),
action, target_kind, target_id, ts, payload (json)
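To make the Schedule.retention_policy JSON above concrete, here is a hedged sketch of how it might map onto restic forget flags. The struct and the forgetArgs function are hypothetical; the --keep-* flags are restic's own. Refusing an empty policy is an assumption made for safety, not something the model above specifies.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strconv"
)

// retentionPolicy mirrors the documented retention_policy keys.
type retentionPolicy struct {
	KeepLast    int      `json:"keep_last"`
	KeepHourly  int      `json:"keep_hourly"`
	KeepDaily   int      `json:"keep_daily"`
	KeepWeekly  int      `json:"keep_weekly"`
	KeepMonthly int      `json:"keep_monthly"`
	KeepYearly  int      `json:"keep_yearly"`
	KeepTag     []string `json:"keep_tag"`
}

// forgetArgs builds the argument list for a restic forget invocation.
func forgetArgs(policyJSON string) ([]string, error) {
	var p retentionPolicy
	if err := json.Unmarshal([]byte(policyJSON), &p); err != nil {
		return nil, fmt.Errorf("retention_policy: %w", err)
	}
	args := []string{"forget"}
	counts := []struct {
		flag string
		n    int
	}{
		{"--keep-last", p.KeepLast},
		{"--keep-hourly", p.KeepHourly},
		{"--keep-daily", p.KeepDaily},
		{"--keep-weekly", p.KeepWeekly},
		{"--keep-monthly", p.KeepMonthly},
		{"--keep-yearly", p.KeepYearly},
	}
	for _, c := range counts {
		if c.n > 0 {
			args = append(args, c.flag, strconv.Itoa(c.n))
		}
	}
	for _, tag := range p.KeepTag {
		args = append(args, "--keep-tag", tag)
	}
	// Assumption: a policy with no keep rule is rejected rather than run.
	if len(args) == 1 {
		return nil, errors.New("retention_policy: no keep rule set")
	}
	return args, nil
}

func main() {
	args, err := forgetArgs(`{"keep_daily":7,"keep_weekly":4,"keep_tag":["pinned"]}`)
	if err != nil {
		panic(err)
	}
	fmt.Println(args)
}
```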
6. API surface (control plane)
6.1 UI/REST (browser → server)
POST /api/auth/login
POST /api/auth/logout
GET /api/fleet/summary (aggregate: host counts by status,
total bytes, open alerts; reused by /metrics)
GET /api/hosts ?tag=&status=&limit=&offset=
(returns Host rows incl. denormalised
last_backup_*, repo_size_bytes,
snapshot_count, open_alert_count,
current_job_id)
GET /api/hosts/:id
DELETE /api/hosts/:id
POST /api/hosts/:id/enrollment-token (regenerate)
POST /api/hosts/:id/agent/update (trigger agent update via the OS package manager; see §4.2)
GET /api/hosts/:id/snapshots ?tag=&path=&since=&until=&limit=&offset=
GET /api/hosts/:id/repo (full Repo projection)
POST /api/hosts/:id/jobs (run-now: backup/forget/prune/check/unlock)
POST /api/hosts/:id/restore (restore wizard submit)
GET /api/hosts/:id/schedules
POST /api/hosts/:id/schedules
PUT /api/schedules/:id
DELETE /api/schedules/:id
GET /api/jobs ?host_id=&kind=&status=&since=&until=
&limit=&offset=&order=desc
GET /api/jobs/:id
GET /api/jobs/:id/logs (paginated: ?after_seq=&limit=)
WS /api/jobs/:id/stream (live progress; see §6.2 for shape)
POST /api/jobs/:id/cancel
GET /api/repos
GET /api/repos/:id
GET /api/alerts
POST /api/alerts/:id/ack
GET /api/audit
GET /api/users (admin)
POST /api/users (admin)
Realtime strategy: only /api/jobs/:id/stream uses WS. All other screens
(dashboard, hosts, snapshots) refresh via HTMX polling (~10s cadence). Revisit
if dashboard staleness becomes a problem in practice.
6.2 Agent ↔ Server
Single authenticated WebSocket per agent. Bidirectional JSON-RPC-ish messages.
Agent → server:
- hello (host metadata, agent_version, restic_version, OS, protocol_version; see "Protocol versioning" below)
- heartbeat (every 30s)
- job.started (job_id, kind, started_at)
- job.progress (job_id, percent_done, files_done, total_files, bytes_done, total_bytes, eta_seconds, throughput_bps)
- job.finished (job_id, status, exit_code, stats, error, finished_at)
- snapshots.report (full list after each successful backup)
- repo.stats (size_bytes, snapshot_count, dedup_ratio, last_check_at, last_check_status, lock_state)
- log.stream (live stdout/stderr lines while job running; {job_id, seq, ts, stream: stdout|stderr|event, payload})
- schedule.ack (schedule_version): agent confirms it has applied a schedule push; lets the server surface "this host is N versions behind" without polling
Server → agent:
- command.run (kind, args)
- command.cancel (job_id)
- schedule.set (schedule_version, schedules: [...]): full schedule list; agent reconciles local cron and replies with schedule.ack
- config.update
- agent.update.available (new version + package source URL; informational only, agent does not self-update, see §4.2)
The server fans job.progress and log.stream for a given job to all
browsers subscribed to WS /api/jobs/:id/stream (§6.1) without
transformation, so the schema is shared end-to-end.
Protocol versioning. Agents and the server each declare an integer
protocol_version in hello. The version bumps only on breaking
wire-format changes (not human-readable software releases). The server
maintains a MinAgentProtocolVersion constant; agents below it are
disconnected with error: protocol_too_old and a URL pointing at the
upgrade instructions. Symmetrically, an agent talking to a server that
advertises a protocol_version it does not recognise refuses to
proceed and surfaces a clear log message. This avoids the failure mode
of "weird JSON parse errors when v0.3 agent meets v0.5 server."
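The handshake rule can be sketched as two small checks, one on each side. MinAgentProtocolVersion is the constant named above; its value, the upgrade URL, and the function names are all placeholders chosen for illustration.

```go
package main

import "fmt"

const (
	// Hypothetical values; the spec only says these are integers that bump
	// on breaking wire-format changes.
	ProtocolVersion         = 3
	MinAgentProtocolVersion = 2
	upgradeURL              = "https://restic.lab.example/docs/upgrade" // placeholder
)

// handshakeError is what the server sends before disconnecting an agent.
type handshakeError struct {
	Code string `json:"error"`
	URL  string `json:"url,omitempty"`
}

func (e *handshakeError) Error() string {
	return fmt.Sprintf("%s (see %s)", e.Code, e.URL)
}

// serverCheckHello runs on the server when an agent's hello arrives.
// It returns nil when the agent may proceed.
func serverCheckHello(agentVersion int) *handshakeError {
	if agentVersion < MinAgentProtocolVersion {
		return &handshakeError{Code: "protocol_too_old", URL: upgradeURL}
	}
	return nil
}

// agentCheckServer runs on the agent. understood lists the protocol versions
// this agent build can speak; anything else is refused with a clear message.
func agentCheckServer(serverVersion int, understood []int) error {
	for _, v := range understood {
		if v == serverVersion {
			return nil
		}
	}
	return fmt.Errorf("server speaks protocol_version %d, which this agent does not recognise; upgrade the agent", serverVersion)
}

func main() {
	if err := serverCheckHello(1); err != nil {
		fmt.Println("server rejects agent:", err)
	}
	fmt.Println("agent v2 accepted:", serverCheckHello(2) == nil)
	fmt.Println("agent check:", agentCheckServer(5, []int{2, 3}))
}
```

Checking the integer up front is what turns a wire-format mismatch into a named error instead of a parse failure deeper in the message loop.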
Schedule reconciliation when the server is unreachable. Agents
keep firing the last-known-good schedule pushed by the server,
indefinitely. Rationale: a missed backup because the controller is
down is a worse outcome than firing a schedule the user has since
edited. On reconnect, the server's view is canonical: the next
schedule.set overrides whatever the agent was running, the agent
replies schedule.ack with the new schedule_version, and the server
updates Host.applied_schedule_version. The UI surfaces drift
("schedule v7 pushed, agent applied v5") when an agent has been
offline.
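A minimal sketch of the reconciliation bookkeeping, assuming in-memory state on both sides. The agent and hostRecord types and the pushedVersion field are hypothetical; only schedule.set, schedule.ack, and applied_schedule_version come from the text above.

```go
package main

import "fmt"

type schedule struct {
	ID       string
	Kind     string
	CronExpr string
}

// agent holds the last-known-good schedule list. It keeps firing this list
// for as long as the server is unreachable.
type agent struct {
	appliedVersion int
	schedules      []schedule
}

// applyScheduleSet handles schedule.set: the server's list replaces the local
// one wholesale. The return value is the version to send in schedule.ack.
func (a *agent) applyScheduleSet(version int, schedules []schedule) int {
	a.schedules = append([]schedule(nil), schedules...)
	a.appliedVersion = version
	return version
}

// hostRecord is the server-side view of one host.
type hostRecord struct {
	pushedVersion          int // latest schedule_version the server has issued (assumed field)
	appliedScheduleVersion int // Host.applied_schedule_version
}

// onScheduleAck handles schedule.ack from the agent.
func (h *hostRecord) onScheduleAck(version int) {
	h.appliedScheduleVersion = version
}

// drift returns a UI message when the agent is behind, or "" when in sync.
func (h *hostRecord) drift() string {
	if h.appliedScheduleVersion == h.pushedVersion {
		return ""
	}
	return fmt.Sprintf("schedule v%d pushed, agent applied v%d", h.pushedVersion, h.appliedScheduleVersion)
}

func main() {
	a := &agent{appliedVersion: 5}
	h := &hostRecord{pushedVersion: 7, appliedScheduleVersion: 5}
	fmt.Println(h.drift()) // agent was offline while v6 and v7 were issued

	// On reconnect the server sends its canonical list and the agent acks it.
	ack := a.applyScheduleSet(7, []schedule{{ID: "s1", Kind: "backup", CronExpr: "0 2 * * *"}})
	h.onScheduleAck(ack)
	fmt.Println(h.drift() == "")
}
```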
6.3 Enrollment
- Operator clicks "Add host" → server generates one-time token (TTL 1h)
- Operator runs install script on endpoint with token
- Agent calls POST /api/agents/enroll with token + host metadata
- Server issues persistent agent credential (bearer token + TLS pin) and host record
- Agent stores credential, opens WS connection
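The one-time token in the first step could look like the sketch below, which assumes an in-memory store (the real server would presumably persist tokens in SQLite). tokenStore, issue, and redeem are hypothetical names; the 1h TTL is from the list above. The clock is passed in explicitly so the expiry logic is easy to test.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"sync"
	"time"
)

const tokenTTL = time.Hour

// demoStart is a fixed clock value used by the demo in main.
var demoStart = time.Date(2025, 1, 1, 12, 0, 0, 0, time.UTC)

// tokenStore tracks outstanding enrollment tokens and their expiry.
type tokenStore struct {
	mu     sync.Mutex
	expiry map[string]time.Time
}

func newTokenStore() *tokenStore {
	return &tokenStore{expiry: make(map[string]time.Time)}
}

// issue creates a random 256-bit token valid for tokenTTL from now.
func (s *tokenStore) issue(now time.Time) (string, error) {
	buf := make([]byte, 32)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	tok := hex.EncodeToString(buf)
	s.mu.Lock()
	defer s.mu.Unlock()
	s.expiry[tok] = now.Add(tokenTTL)
	return tok, nil
}

// redeem consumes a token. It succeeds at most once per token, and only
// before the token expires.
func (s *tokenStore) redeem(tok string, now time.Time) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	exp, ok := s.expiry[tok]
	if !ok {
		return false
	}
	delete(s.expiry, tok) // one-time: gone after the first attempt
	return now.Before(exp)
}

func main() {
	s := newTokenStore()
	tok, err := s.issue(demoStart)
	if err != nil {
		panic(err)
	}
	fmt.Println("first use:", s.redeem(tok, demoStart.Add(10*time.Minute)))
	fmt.Println("second use:", s.redeem(tok, demoStart.Add(11*time.Minute)))
}
```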
7. Security
7.1 Authentication
- Phase 1: username + password (argon2id), HTTP-only secure session cookies, CSRF tokens on state-changing requests
- Phase 2: OIDC (Authelia, Keycloak, Authentik)
- Agents: bearer token over TLS; pin server cert fingerprint at enrollment time
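One way the agent-side fingerprint pin might be wired into crypto/tls is sketched below. It assumes the pin is the hex SHA-256 of the server's leaf certificate in DER form, which this document does not specify; the function names are hypothetical. The pin check runs in addition to normal chain verification.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"crypto/tls"
	"crypto/x509"
	"encoding/hex"
	"errors"
	"fmt"
)

// fingerprint returns the hex SHA-256 of a DER-encoded certificate.
func fingerprint(der []byte) string {
	sum := sha256.Sum256(der)
	return hex.EncodeToString(sum[:])
}

// pinnedVerifier returns a callback that accepts a connection only when the
// leaf certificate matches the fingerprint recorded at enrollment time.
func pinnedVerifier(pin string) func([][]byte, [][]*x509.Certificate) error {
	return func(rawCerts [][]byte, _ [][]*x509.Certificate) error {
		if len(rawCerts) == 0 {
			return errors.New("server presented no certificate")
		}
		got := fingerprint(rawCerts[0])
		if subtle.ConstantTimeCompare([]byte(got), []byte(pin)) != 1 {
			return fmt.Errorf("server certificate fingerprint %s does not match pin", got)
		}
		return nil
	}
}

// pinnedTLSConfig builds a client config for the agent's WS connection.
func pinnedTLSConfig(pin string) *tls.Config {
	return &tls.Config{
		MinVersion:            tls.VersionTLS12,
		VerifyPeerCertificate: pinnedVerifier(pin),
	}
}

func main() {
	der := []byte("stand-in for the server certificate's DER bytes")
	pin := fingerprint(der) // would be stored alongside the bearer token
	cfg := pinnedTLSConfig(pin)
	fmt.Println("match accepted:", cfg.VerifyPeerCertificate([][]byte{der}, nil) == nil)
	fmt.Println("mismatch rejected:", cfg.VerifyPeerCertificate([][]byte{[]byte("other")}, nil) != nil)
}
```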
7.2 Authorization (Phase 1: simple roles)
- admin: everything
- operator: trigger jobs, edit schedules, restore
- viewer: read-only
7.3 Secret handling
- Restic repo passwords and REST-server credentials encrypted at rest in SQLite using a server-side key (loaded from env or file at startup)
- Pushed to agents only over the authenticated WS, only when needed for a job
- Agent stores them in OS keyring where available (Windows DPAPI, Linux Secret Service / fallback to encrypted file with restricted perms)
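The encryption-at-rest bullet does not name a cipher. The sketch below assumes AES-256-GCM with a random nonce prepended to the ciphertext, which is a common choice but an assumption here; seal and open are hypothetical names.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// newGCM builds an AEAD from the server-side key (16, 24, or 32 bytes).
func newGCM(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// seal encrypts plaintext and returns nonce || ciphertext || tag.
func seal(key, plaintext []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// open reverses seal. It fails if the blob was modified or the key is wrong.
func open(key, blob []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	n := gcm.NonceSize()
	if len(blob) < n {
		return nil, errors.New("ciphertext too short")
	}
	return gcm.Open(nil, blob[:n], blob[n:], nil)
}

func main() {
	key := make([]byte, 32) // in practice loaded from RM_SECRET_KEY_FILE
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	blob, err := seal(key, []byte("repo-password"))
	if err != nil {
		panic(err)
	}
	plain, err := open(key, blob)
	fmt.Println(string(plain), err)
}
```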
7.4 Repo protection
- Restic REST server runs with --append-only for routine backups
- A separate non-append-only credential exists for forget/prune operations, used only when explicitly invoked from the UI by an admin/operator and audited
7.5 Audit
- Every state-changing UI action and every server→agent command logged with user, target, timestamp, and payload
8. UI
Stack: HTMX + Tailwind + Go html/template. No SPA framework. Server-rendered, progressive enhancement.
Pages:
- Login
- Dashboard: fleet overview (host cards: status, last backup, repo size, alerts)
- Host detail: tabs for Snapshots / Schedules / Jobs / Repo / Settings
- Job detail: live log streaming via WS, cancel button
- Restore wizard: host → snapshot → paths → target → confirm
- Repos: aggregate view across hosts
- Alerts: list, acknowledge
- Settings: users (admin), notification channels, agent download
- Audit log
9. Alerting
- Triggers: backup failed, backup hasn't run in N hours past its schedule, repo check failed, agent offline > N minutes, repo size growth anomaly
- Channels (Phase 1): webhook, ntfy, email (SMTP)
- Channels (Phase 2+): Discord, Slack, Pushover
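The two time-based triggers reduce to simple comparisons once the expected fire time is known. The sketch below assumes the caller has already computed that time from the schedule's cron expression (cron parsing is out of scope here); the thresholds and function names are placeholders.

```go
package main

import (
	"fmt"
	"time"
)

const (
	staleAfter   = 6 * time.Hour    // placeholder for "N hours past its schedule"
	offlineAfter = 15 * time.Minute // placeholder for "agent offline > N minutes"
)

// demoFire is a fixed expected fire time used by the demo in main.
var demoFire = time.Date(2025, 1, 1, 2, 0, 0, 0, time.UTC)

// backupStale reports whether a backup that should have fired at expectedAt
// is overdue: no backup has run since then and the grace period has passed.
func backupStale(expectedAt, lastBackupAt, now time.Time, grace time.Duration) bool {
	if !lastBackupAt.Before(expectedAt) {
		return false // a backup ran at or after the expected fire time
	}
	return now.After(expectedAt.Add(grace))
}

// agentOffline reports whether the agent has been silent for longer than limit.
func agentOffline(lastSeenAt, now time.Time, limit time.Duration) bool {
	return now.Sub(lastSeenAt) > limit
}

func main() {
	lastBackup := demoFire.Add(-24 * time.Hour) // yesterday's run
	now := demoFire.Add(7 * time.Hour)
	fmt.Println("stale:", backupStale(demoFire, lastBackup, now, staleAfter))
	fmt.Println("offline:", agentOffline(demoFire, now, offlineAfter))
}
```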
10. Deployment
10.1 Control plane (Proxmox host or LXC)
docker-compose.yml:
services:
  restic-manager:
    image: ghcr.io/<owner>/restic-manager:latest
    restart: unless-stopped
    ports:
      - "8443:8443"
    volumes:
      - ./data:/data
      - ./certs:/certs:ro
    environment:
      - RM_DATA_DIR=/data
      - RM_LISTEN=:8443
      - RM_BASE_URL=https://restic.lab.example
      - RM_TLS_CERT=/certs/fullchain.pem
      - RM_TLS_KEY=/certs/privkey.pem
      - RM_SECRET_KEY_FILE=/data/secret.key
      # - RM_TRUSTED_PROXY=10.0.0.0/8   # set when fronted by a reverse proxy
RM_LISTEN is the source of truth for the server's bind address. The
8443:8443 mapping above is just the matching default; if you change
RM_LISTEN to e.g. :9443, change the right-hand side of the port
mapping to match.
10.2 Restic REST server (Unraid)
Standard restic/rest-server container, --append-only, --private-repos, htpasswd mounted, data path on the share.
10.3 Agent install
- Linux: curl -fsSL https://restic.lab.example/install.sh | sudo RM_TOKEN=xxx sh
- Windows (Phase 2, see §4.2): iwr https://restic.lab.example/install.ps1 | iex (with $env:RM_TOKEN)
- Installer drops binary + service unit, calls enroll endpoint, starts service
11. Testing strategy
- Unit tests: restic JSON parsing, schedule reconciliation, retention policy logic
- Integration tests: spin up real restic + rest-server in Docker, exercise full backup/snapshot/restore flows
- End-to-end: Playwright against a compose-up'd stack with one Linux agent in a sibling container
- Cross-platform agent CI: build matrix linux/amd64, linux/arm64, windows/amd64; smoke test on Windows runner
12. Repository layout
restic-manager/
├── cmd/
│ ├── server/
│ └── agent/
├── internal/
│ ├── api/ # shared API types
│ ├── server/
│ │ ├── http/
│ │ ├── ws/
│ │ └── ui/ # templates, handlers
│ ├── agent/
│ │ ├── service/ # systemd / windows service glue
│ │ ├── runner/ # restic invocation
│ │ └── scheduler/
│ ├── restic/ # restic CLI wrapper, --json parsing
│ ├── store/ # sqlite layer
│ ├── crypto/ # secret encryption
│ └── auth/
├── web/
│ ├── templates/
│ └── static/
├── deploy/
│ ├── docker-compose.yml
│ ├── Dockerfile.server
│ └── install/
│ ├── install.sh
│ └── install.ps1
├── docs/
├── LICENSE # PolyForm Noncommercial 1.0.0
├── README.md
├── spec.md
└── tasks.md
13. Phased delivery
- Phase 1 (MVP): server skeleton, agent skeleton, enrollment, host list, snapshot list, on-demand backup, live job log
- Phase 2: schedules, retention, run-now for forget/prune/check/unlock, repo stats
- Phase 3: restore wizard, alerts (webhook/ntfy/email), audit log
- Phase 4: package-manager-based agent update delivery (see §4.2), OIDC, multi-user/RBAC polish, repo trends
- Phase 5: OSS readiness — docs site, contribution guide, screenshot tour
14. Confirmed extensions (in scope)
These were originally listed as open questions and have been confirmed for inclusion. Slotted into phases below.
14.1 Cross-host restore
Restore a snapshot taken on host A onto host B (e.g. recover a dead box onto a fresh one, clone a workload onto a sibling host, restore a developer's home dir onto a new laptop).
- Credential model: target host's agent receives a temporary, server-issued read credential for the source host's repo, scoped to a single restore job and revoked immediately after
- Path remapping: UI allows rewriting source paths to target paths (e.g. /home/alice → /home/alice-new)
- Permissions: restore runs as the agent's service user; UI surfaces a warning when source paths require root and target service user is non-root
- Phase: 3 (with the restore wizard)
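Path remapping could be a prefix rewrite that respects path boundaries, as in this sketch. pathRule and remapPath are hypothetical names, the longest-prefix-wins rule is an assumption, and only POSIX-style paths are handled (Windows paths would need separate treatment).

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// pathRule rewrites everything under From to live under To.
type pathRule struct {
	From string
	To   string
}

// remapPath applies the longest matching rule to p. A rule matches only on a
// path boundary, so /home/alice does not match /home/alicex.
func remapPath(p string, rules []pathRule) string {
	p = path.Clean(p)
	bestLen, target, rest := -1, "", ""
	for _, r := range rules {
		from := strings.TrimSuffix(path.Clean(r.From), "/")
		if p != from && !strings.HasPrefix(p, from+"/") {
			continue
		}
		if len(from) > bestLen {
			bestLen, target, rest = len(from), r.To, p[len(from):]
		}
	}
	if bestLen < 0 {
		return p // no rule applies: restore to the original path
	}
	return path.Join(target, rest)
}

func main() {
	rules := []pathRule{{From: "/home/alice", To: "/home/alice-new"}}
	fmt.Println(remapPath("/home/alice/docs/a.txt", rules))
	fmt.Println(remapPath("/home/alicex/notes.txt", rules))
}
```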
14.2 Bandwidth limiting
Per-host upload/download caps for backup, restore, and prune jobs.
- Exposed on the schedule editor as optional --limit-upload / --limit-download (KiB/s, the unit restic uses for these flags)
- Also overridable on run-now jobs via the UI
- Persisted in Schedule.options (JSON blob) so the schema stays stable
- Phase: 2 (with scheduling)
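A sketch of how Schedule.options and a run-now override might combine into restic flags. The limit_upload_kbps / limit_download_kbps keys come from the domain model in §5 and --limit-upload / --limit-download are restic's own flags; the function name and the merge rule (override keys win, absent keys inherit) are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// scheduleOptions mirrors the documented Schedule.options keys.
// Zero means unlimited.
type scheduleOptions struct {
	LimitUploadKBps   int `json:"limit_upload_kbps"`
	LimitDownloadKBps int `json:"limit_download_kbps"`
}

// limitArgs decodes the schedule's options, layers the run-now override on
// top, and returns the restic flags to append to the command line.
// Either JSON document may be empty.
func limitArgs(optionsJSON, overrideJSON string) ([]string, error) {
	var opts scheduleOptions
	for _, doc := range []string{optionsJSON, overrideJSON} {
		if doc == "" {
			continue
		}
		// Unmarshal only touches keys present in doc, so keys missing from
		// the override keep the schedule's value.
		if err := json.Unmarshal([]byte(doc), &opts); err != nil {
			return nil, fmt.Errorf("options: %w", err)
		}
	}
	var args []string
	if opts.LimitUploadKBps > 0 {
		args = append(args, "--limit-upload", strconv.Itoa(opts.LimitUploadKBps))
	}
	if opts.LimitDownloadKBps > 0 {
		args = append(args, "--limit-download", strconv.Itoa(opts.LimitDownloadKBps))
	}
	return args, nil
}

func main() {
	args, err := limitArgs(
		`{"limit_upload_kbps":512,"limit_download_kbps":1024}`,
		`{"limit_upload_kbps":2048}`,
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(args)
}
```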
14.3 Pre/post backup hooks
Per-host shell commands run before and after a backup job. Use cases: mysqldump/pg_dump to a staging path, stop/start Docker containers, quiesce a service, post-backup notifications.
- Schema: Schedule.pre_hook and Schedule.post_hook (string, optional). For more complex cases, Host.pre_hook_default / Host.post_hook_default apply to all schedules on that host unless overridden
- Applicability: hooks are only meaningful for kind = backup schedules. The API rejects non-null pre_hook / post_hook on any other schedule kind (forget, prune, check) with a clear validation error rather than silently ignoring them. The same constraint applies to Host.pre_hook_default / Host.post_hook_default: they only fire for backup schedules on that host
- Execution: agent runs hooks via the host's default shell (/bin/sh on Linux, cmd.exe or PowerShell on Windows; host-configurable)
- Failure semantics: pre_hook non-zero exit aborts the backup and marks the job failed. post_hook runs on both success and failure (with RM_JOB_STATUS env var); its own exit code is recorded but does not change the backup job's final status
- Stdout/stderr: captured into JobLog like restic output, prefixed pre_hook: / post_hook:
- Security: hooks are stored encrypted; only admins can edit them; every edit audit-logged
- Phase: 2 (with scheduling)
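The applicability and failure rules for hooks can be captured in three small functions. This is a sketch with hypothetical names; it treats an empty string as "no hook" (the spec says non-null) and treats any non-zero restic exit code as a failed backup, both of which are simplifying assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

var errHookNotAllowed = errors.New("pre_hook/post_hook are only valid on backup schedules")

// validateHooks is the API-side check: hooks on a non-backup schedule are a
// validation error, not something to ignore silently.
func validateHooks(kind, preHook, postHook string) error {
	if kind != "backup" && (preHook != "" || postHook != "") {
		return fmt.Errorf("schedule kind %q: %w", kind, errHookNotAllowed)
	}
	return nil
}

// effectiveHook resolves which hook runs: the schedule's own hook overrides
// the host default, and host defaults never fire for non-backup kinds.
func effectiveHook(kind, scheduleHook, hostDefault string) string {
	if kind != "backup" {
		return ""
	}
	if scheduleHook != "" {
		return scheduleHook
	}
	return hostDefault
}

// jobStatus applies the failure semantics. A failing pre_hook aborts the
// backup; postExit is recorded elsewhere and deliberately not consulted.
func jobStatus(preExit, backupExit, postExit int) string {
	if preExit != 0 || backupExit != 0 {
		return "failed"
	}
	return "succeeded"
}

func main() {
	fmt.Println(validateHooks("prune", "systemctl stop app", ""))
	fmt.Println(effectiveHook("backup", "", "pg_dump mydb > /staging/db.sql"))
	fmt.Println(jobStatus(0, 0, 1)) // post_hook failed, backup still succeeded
}
```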
14.4 Prometheus /metrics endpoint
Standard Prometheus exposition on /metrics, protected by either bearer token or IP allow-list.
- Metrics (per host):
  - restic_manager_last_backup_timestamp_seconds{host=...}
  - restic_manager_last_backup_status{host=...} (1=success, 0=failure)
  - restic_manager_repo_size_bytes{host=...}
  - restic_manager_snapshot_count{host=...}
  - restic_manager_agent_online{host=...} (1/0)
  - restic_manager_job_duration_seconds_bucket{kind=...,host=...} (histogram)
- Server-level: restic_manager_jobs_total{kind=...,status=...}, restic_manager_alerts_active, restic_manager_build_info
- Phase: 4 (alongside repo trend charts; both rely on the same time-series data)
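For a sense of what /metrics would return, the sketch below hand-renders three of the per-host gauges in the Prometheus text exposition format. It uses only the standard library to stay self-contained; a real implementation might well use the official Prometheus Go client instead. hostStats and the function names are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// labelEscaper escapes label values as the text exposition format requires.
var labelEscaper = strings.NewReplacer(`\`, `\\`, `"`, `\"`, "\n", `\n`)

// gaugeLine renders one sample line for a metric with a single host label.
func gaugeLine(name, host string, value float64) string {
	return fmt.Sprintf(`%s{host="%s"} %s`, name, labelEscaper.Replace(host),
		strconv.FormatFloat(value, 'f', -1, 64))
}

// hostStats is a hypothetical projection of the Host row fields in §5.
type hostStats struct {
	Host          string
	Online        bool
	RepoSizeBytes uint64
	SnapshotCount int
}

// render groups samples by metric name, each group preceded by a TYPE line.
func render(hosts []hostStats) string {
	metrics := []struct {
		name  string
		value func(hostStats) float64
	}{
		{"restic_manager_repo_size_bytes", func(h hostStats) float64 { return float64(h.RepoSizeBytes) }},
		{"restic_manager_snapshot_count", func(h hostStats) float64 { return float64(h.SnapshotCount) }},
		{"restic_manager_agent_online", func(h hostStats) float64 {
			if h.Online {
				return 1
			}
			return 0
		}},
	}
	var b strings.Builder
	for _, m := range metrics {
		fmt.Fprintf(&b, "# TYPE %s gauge\n", m.name)
		for _, h := range hosts {
			b.WriteString(gaugeLine(m.name, h.Host, m.value(h)))
			b.WriteByte('\n')
		}
	}
	return b.String()
}

func main() {
	fmt.Print(render([]hostStats{
		{Host: "web-01", Online: true, RepoSizeBytes: 1024, SnapshotCount: 3},
		{Host: "db-01", Online: false, RepoSizeBytes: 4096, SnapshotCount: 12},
	}))
}
```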
15. Future considerations (not yet committed)
- Read-only share links for snapshot listings (auditor view) — out of scope for personal/lab use; revisit if multi-tenant or org use cases emerge