# restic-manager — Specification

## 1. Overview

**restic-manager** is a self-hosted, browser-based, single-pane-of-glass for managing [restic](https://restic.net) backups across a fleet of Linux and Windows endpoints. It provides visibility, scheduling, ad-hoc operations, restore workflows, and alerting from one UI.

It is built for small-to-medium fleets (initial target: ~12 endpoints) and is intentionally simple to deploy: one Docker Compose file on the control-plane host, one small agent binary on each endpoint.

**License:** PolyForm Noncommercial 1.0.0

## 2. Goals & Non-Goals

### Goals

- Central visibility into backup state for every endpoint
- Trigger any restic operation remotely (`backup`, `forget`, `prune`, `check`, `unlock`, `snapshots`, `stats`, `diff`, `restore`)
- Manage per-host backup schedules from the UI
- Live job progress streamed back to the UI
- Restore wizard (browse snapshots, pick paths, restore to original or alternate host)
- Repo health surfacing (size, dedup ratio, last check, lock state)
- Alerting on failure or staleness
- Cross-platform agent (Linux + Windows)
- Ransomware-resistant repo access via append-only credentials

### Non-Goals (initial release)

- Replacing restic itself or providing custom repo formats
- Managing non-restic backup tools
- Multi-tenancy / SaaS deployment
- High availability of the control plane (SQLite, single-instance)
- Mobile-native apps (responsive web only)

## 3. Architecture

### 3.1 Components

```
┌──────────────────────────────────────────────────────────────────┐
│  Proxmox cluster                                                 │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │ docker compose: restic-manager                             │  │
│  │ - server (Go binary, REST + WS API, embedded HTMX UI)      │  │
│  │ - SQLite volume                                            │  │
│  └────────────────────────────────────────────────────────────┘  │
└────────────────────────▲─────────────────────────────────────────┘
                         │ HTTPS (control plane)
                         │ - agent → server: status, telemetry
                         │ - server → agent: commands, schedules
                         │
┌────────────────────────┴─────────────────────────────────────────┐
│  Endpoints (Linux + Windows)                                     │
│   ┌──────────────────────┐    ┌────────────────────────────────┐ │
│   │ restic-manager-      │    │ restic CLI                     │ │
│   │ agent (Go binary)    │───▶│ invoked by agent               │ │
│   │ - systemd / svc      │    └─────────────┬──────────────────┘ │
│   │ - WS to server       │                  │ HTTPS              │
│   └──────────────────────┘                  │ (data plane)       │
└─────────────────────────────────────────────┼────────────────────┘
                                              │
                                              ▼
┌──────────────────────────────────────────────────────────────────┐
│  Unraid                                                          │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │ Docker: restic/rest-server                                 │  │
│  │ - per-host append-only credentials                         │  │
│  │ - one repo per host                                        │  │
│  │ - storage: Unraid share                                    │  │
│  └────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────┘
```

### 3.2 Data flow

- **Backup data:** endpoint → restic CLI → restic REST server on Unraid → Unraid share. The control plane *never* touches backup bytes.
- **Control plane:** agent maintains an outbound WebSocket to the server. Server pushes commands and schedule changes; agent pushes status, logs, live job progress, host metadata.
- **UI:** browser → server (HTTPS, session cookies). Server fans out commands to agents, streams progress back to browser.

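
The "streams progress back to browser" step could be sketched as a small in-memory fan-out keyed by job ID. This is a minimal illustration only: `Hub`, `Subscribe`, and `Publish` are hypothetical names not defined by this spec, and the choice to skip subscribers whose buffer is full is an assumption rather than a decided policy.

```go
package main

import (
	"fmt"
	"sync"
)

// Hub fans job messages out to browser subscribers, keyed by job ID.
// Hypothetical type for illustration; the spec does not name it.
type Hub struct {
	mu   sync.Mutex
	subs map[string]map[chan string]struct{}
}

func NewHub() *Hub {
	return &Hub{subs: make(map[string]map[chan string]struct{})}
}

// Subscribe registers a buffered channel for one job and returns it
// together with a function that unsubscribes and closes the channel.
func (h *Hub) Subscribe(jobID string, buf int) (<-chan string, func()) {
	ch := make(chan string, buf)
	h.mu.Lock()
	if h.subs[jobID] == nil {
		h.subs[jobID] = make(map[chan string]struct{})
	}
	h.subs[jobID][ch] = struct{}{}
	h.mu.Unlock()

	cancel := func() {
		h.mu.Lock()
		defer h.mu.Unlock()
		if _, ok := h.subs[jobID][ch]; ok {
			delete(h.subs[jobID], ch)
			close(ch)
		}
	}
	return ch, cancel
}

// Publish offers msg to every subscriber of jobID and returns how many
// accepted it. A subscriber with a full buffer is skipped (assumption),
// so one slow browser cannot block the agent's message loop.
func (h *Hub) Publish(jobID, msg string) int {
	h.mu.Lock()
	defer h.mu.Unlock()
	delivered := 0
	for ch := range h.subs[jobID] {
		select {
		case ch <- msg:
			delivered++
		default:
		}
	}
	return delivered
}

func main() {
	hub := NewHub()
	a, cancelA := hub.Subscribe("job-1", 4)
	b, cancelB := hub.Subscribe("job-1", 4)
	defer cancelA()
	defer cancelB()

	n := hub.Publish("job-1", `{"percent_done":0.5}`)
	fmt.Println(n, <-a, <-b)
}
```

Passing the payload through as an opaque string matches the idea that the server forwards agent messages to browsers unchanged; a real implementation would likely carry `[]byte` frames instead.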
### 3.3 Why agent (not SSH)

- Push model works through NAT/firewalls without inbound rules
- Native Windows support without OpenSSH service quirks
- Local scheduling survives controller restarts
- Self-contained `restic --json` parsing, no remote shell quoting hazards

### 3.4 Why per-host repos

- Isolates corruption / lock contention
- Append-only credentials per host = compromised endpoint can't delete other hosts' backups
- Simpler `prune` orchestration (no global lock coordination)
- Trivially easy to retire a host (delete its repo + credential)

## 4. Components in detail

### 4.1 Server

- **Language:** Go 1.22+
- **Storage:** SQLite (via `modernc.org/sqlite`, no CGo)
- **HTTP:** `net/http` + `chi` router
- **WebSocket:** `nhooyr.io/websocket`
- **UI:** HTMX + Tailwind, server-rendered Go templates, no Node build step
- **Distribution:** single static binary, packaged in a Docker image; published `docker-compose.yml`
- **Config:** YAML or env vars (`RM_LISTEN`, `RM_DATA_DIR`, `RM_BASE_URL`, `RM_TLS_CERT`, `RM_TLS_KEY`)
- **TLS:** terminate TLS in-process (cert from Caddy/Traefik sidecar acceptable; agents require HTTPS)

### 4.2 Agent

- **Language:** Go (cross-compiled for `linux/amd64`, `linux/arm64`, `windows/amd64`)
- **Service integration:** systemd unit (Linux), Windows service via `golang.org/x/sys/windows/svc`
- **Footprint goal:** ≤ 15 MB binary, ≤ 50 MB RSS idle
- **Persistence:** local config file + small state DB (BoltDB or JSON) for queued reports if server is unreachable
- **Restic invocation:** spawns `restic` with `--json`, parses streamed output, forwards to server in real time
- **Self-update:** server publishes signed agent binary; agent downloads, verifies signature, swaps binary, restarts service

### 4.3 Restic REST server (Unraid)

- Run `restic/rest-server` Docker container
- `--append-only` enabled
- `--private-repos` enabled (each user only sees their own subpath)
- htpasswd file with one user per host
- Storage path mapped to Unraid share

## 5. Domain model

```
Host
  id, name, os, arch, agent_version, restic_version, enrolled_at,
  last_seen_at, status (online/offline/degraded), repo_id (FK), tags,
  current_job_id (FK nullable), last_backup_at,
  last_backup_status (succeeded|failed|cancelled|null),
  repo_size_bytes, snapshot_count, open_alert_count
  # Last six fields are denormalised projections, refreshed on
  # job.finished, snapshots.report, repo.stats, and alert state changes.

Repo
  id, name, url, kind (rest|s3|local), credential_id (FK),
  password_secret_id (FK),

  size_bytes, snapshot_count, dedup_ratio, last_check_at,
  last_check_status, lock_state (locked|unlocked), append_only (bool),
  credential_rotated_at
  # Bottom block is a cached projection from `restic stats` +
  # Credential row, refreshed by repo.stats agent messages.

Credential
  id, kind, username, secret_ref (encrypted), rotated_at

Schedule
  id, host_id (FK), kind (backup|forget|prune|check), cron_expr,
  paths (json), excludes (json), tags (json),
  retention_policy (json), options (json),
  pre_hook, post_hook, enabled
  # retention_policy: {keep_last, keep_hourly, keep_daily, keep_weekly,
  #                    keep_monthly, keep_yearly, keep_tag: [...]}
  # options: {limit_upload_kbps, limit_download_kbps}
  # pre_hook/post_hook: see §14.3 (encrypted at rest)

Job
  id, host_id (FK), kind,
  status (queued|running|succeeded|failed|cancelled),
  scheduled_id (FK nullable), actor_kind (user|schedule|system),
  actor_id (nullable), started_at, finished_at, exit_code,
  stats (json), error

JobLog
  job_id (FK), seq, ts, stream (stdout|stderr|event), payload

Snapshot (cached projection from `restic snapshots --json`)
  id (restic id), host_id (FK), repo_id (FK), time, hostname,
  paths, tags, size_bytes, file_count

Alert
  id, host_id (FK nullable), kind, severity, message,
  created_at, acknowledged_at, resolved_at

User
  id, username, password_hash, role (admin|operator|viewer),
  created_at, last_login_at

Session
  id, user_id (FK), created_at, expires_at, ip, ua

AuditLog
  id, user_id (FK nullable), actor (user|agent|system), action,
  target_kind, target_id, ts, payload (json)
```

## 6. API surface (control plane)

### 6.1 UI/REST (browser → server)

```
POST   /api/auth/login
POST   /api/auth/logout

GET    /api/fleet/summary            (aggregate: host counts by status,
                                      total bytes, open alerts;
                                      reused by /metrics)

GET    /api/hosts                    ?tag=&status=&limit=&offset=
                                     (returns Host rows incl. denormalised
                                      last_backup_*, repo_size_bytes,
                                      snapshot_count, open_alert_count,
                                      current_job_id)
GET    /api/hosts/:id
DELETE /api/hosts/:id
POST   /api/hosts/:id/enrollment-token   (regenerate)
POST   /api/hosts/:id/agent/update       (force agent self-update; see §4.2)

GET    /api/hosts/:id/snapshots      ?tag=&path=&since=&until=&limit=&offset=
GET    /api/hosts/:id/repo           (full Repo projection)
POST   /api/hosts/:id/jobs           (run-now: backup/forget/prune/check/unlock)
POST   /api/hosts/:id/restore        (restore wizard submit)

GET    /api/hosts/:id/schedules
POST   /api/hosts/:id/schedules
PUT    /api/schedules/:id
DELETE /api/schedules/:id

GET    /api/jobs                     ?host_id=&kind=&status=&since=&until=
                                     &limit=&offset=&order=desc
GET    /api/jobs/:id
GET    /api/jobs/:id/logs            (paginated: ?after_seq=&limit=)
WS     /api/jobs/:id/stream          (live progress; see §6.2 for shape)
POST   /api/jobs/:id/cancel

GET    /api/repos
GET    /api/repos/:id

GET    /api/alerts
POST   /api/alerts/:id/ack

GET    /api/audit

GET    /api/users                    (admin)
POST   /api/users                    (admin)
```

**Realtime strategy:** only `/api/jobs/:id/stream` uses WS. All other screens (dashboard, hosts, snapshots) refresh via HTMX polling (~10s cadence). Revisit if dashboard staleness becomes a problem in practice.

### 6.2 Agent ↔ Server

Single authenticated WebSocket per agent. Bidirectional JSON-RPC-ish messages.

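
One way to read "JSON-RPC-ish" is a typed envelope with a payload that is decoded according to the message type. The sketch below shows that idea for two of the message types in this section. The envelope keys `type` and `payload`, and the `Encode`/`Decode` helpers, are assumptions for illustration; the spec does not fix a wire format.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Envelope is a hypothetical wire wrapper: a message type plus a payload
// that stays raw until the type is known.
type Envelope struct {
	Type    string          `json:"type"`
	Payload json.RawMessage `json:"payload,omitempty"`
}

// JobStarted carries the job.started fields listed in this section.
type JobStarted struct {
	JobID     string `json:"job_id"`
	Kind      string `json:"kind"`
	StartedAt string `json:"started_at"`
}

// Encode wraps a payload in an envelope of the given message type.
func Encode(msgType string, payload any) ([]byte, error) {
	raw, err := json.Marshal(payload)
	if err != nil {
		return nil, err
	}
	return json.Marshal(Envelope{Type: msgType, Payload: raw})
}

// Decode unwraps an envelope and decodes the payload for known types.
// Unknown types are reported as errors so protocol drift is visible.
func Decode(data []byte) (string, any, error) {
	var env Envelope
	if err := json.Unmarshal(data, &env); err != nil {
		return "", nil, err
	}
	switch env.Type {
	case "job.started":
		var m JobStarted
		if err := json.Unmarshal(env.Payload, &m); err != nil {
			return env.Type, nil, err
		}
		return env.Type, m, nil
	case "heartbeat":
		return env.Type, nil, nil // no payload needed in this sketch
	default:
		return env.Type, nil, fmt.Errorf("unknown message type %q", env.Type)
	}
}

func main() {
	frame, err := Encode("job.started", JobStarted{
		JobID: "j1", Kind: "backup", StartedAt: "2024-01-01T00:00:00Z",
	})
	if err != nil {
		panic(err)
	}
	typ, msg, err := Decode(frame)
	fmt.Println(typ, msg, err)
}
```

Keeping the payload as `json.RawMessage` lets the server relay a frame to browsers without re-encoding it, which suits the requirement that progress and log messages are forwarded without transformation.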
**Agent → server:**

- `hello` (host metadata, agent version, restic version, OS)
- `heartbeat` (every 30s)
- `job.started` (job_id, kind, started_at)
- `job.progress` (job_id, percent_done, files_done, total_files, bytes_done, total_bytes, eta_seconds, throughput_bps)
- `job.finished` (job_id, status, exit_code, stats, error, finished_at)
- `snapshots.report` (full list after each successful backup)
- `repo.stats` (size_bytes, snapshot_count, dedup_ratio, last_check_at, last_check_status, lock_state)
- `log.stream` (live stdout/stderr lines while job running; {job_id, seq, ts, stream: stdout|stderr|event, payload})

**Server → agent:**

- `command.run` (kind, args)
- `command.cancel` (job_id)
- `schedule.set` (full schedule list, agent reconciles local cron)
- `config.update`
- `agent.update` (new version available, URL + signature)

The server fans `job.progress` and `log.stream` for a given job to all browsers subscribed to `WS /api/jobs/:id/stream` (§6.1) without transformation, so the schema is shared end-to-end.

### 6.3 Enrollment

1. Operator clicks "Add host" → server generates one-time token (TTL 1h)
2. Operator runs install script on endpoint with token
3. Agent calls `POST /api/agents/enroll` with token + host metadata
4. Server issues persistent agent credential (bearer token + TLS pin) and host record
5. Agent stores credential, opens WS connection

## 7. Security

### 7.1 Authentication

- **Phase 1:** username + password (argon2id), HTTP-only secure session cookies, CSRF tokens on state-changing requests
- **Phase 2:** OIDC (Authelia, Keycloak, Authentik)
- **Agents:** bearer token over TLS; pin server cert fingerprint at enrollment time

### 7.2 Authorization (Phase 1: simple roles)

- **admin:** everything
- **operator:** trigger jobs, edit schedules, restore
- **viewer:** read-only

### 7.3 Secret handling

- Restic repo passwords and REST-server credentials encrypted at rest in SQLite using a server-side key (loaded from env or file at startup)
- Pushed to agents only over the authenticated WS, only when needed for a job
- Agent stores them in OS keyring where available (Windows DPAPI, Linux Secret Service / fallback to encrypted file with restricted perms)

### 7.4 Repo protection

- Restic REST server runs with `--append-only` for routine backups
- A separate non-append-only credential exists for `forget`/`prune` operations, used only when explicitly invoked from the UI by an admin/operator and audited

### 7.5 Audit

- Every state-changing UI action and every server→agent command logged with user, target, timestamp, and payload

## 8. UI

Stack: HTMX + Tailwind + Go html/templates. No SPA framework. Server-rendered, progressive enhancement.

**Pages:**

- **Login**
- **Dashboard:** fleet overview (host cards: status, last backup, repo size, alerts)
- **Host detail:** tabs for Snapshots / Schedules / Jobs / Repo / Settings
- **Job detail:** live log streaming via WS, cancel button
- **Restore wizard:** host → snapshot → paths → target → confirm
- **Repos:** aggregate view across hosts
- **Alerts:** list, acknowledge
- **Settings:** users (admin), notification channels, agent download
- **Audit log**

## 9. Alerting

- **Triggers:** backup failed, backup hasn't run in N hours past its schedule, repo `check` failed, agent offline > N minutes, repo size growth anomaly
- **Channels (Phase 1):** webhook, ntfy, email (SMTP)
- **Channels (Phase 2+):** Discord, Slack, Pushover

## 10. Deployment

### 10.1 Control plane (Proxmox host or LXC)

`docker-compose.yml`:

```yaml
services:
  restic-manager:
    image: ghcr.io//restic-manager:latest
    restart: unless-stopped
    ports:
      - "8443:8443"
    volumes:
      - ./data:/data
      - ./certs:/certs:ro
    environment:
      - RM_DATA_DIR=/data
      - RM_LISTEN=:8443
      - RM_BASE_URL=https://restic.lab.example
      - RM_TLS_CERT=/certs/fullchain.pem
      - RM_TLS_KEY=/certs/privkey.pem
      - RM_SECRET_KEY_FILE=/data/secret.key
```

### 10.2 Restic REST server (Unraid)

Standard `restic/rest-server` container, `--append-only`, `--private-repos`, htpasswd mounted, data path on the share.

### 10.3 Agent install

- **Linux:** `curl -fsSL https://restic.lab.example/install.sh | sudo RM_TOKEN=xxx sh`
- **Windows:** `iwr https://restic.lab.example/install.ps1 | iex` (with `$env:RM_TOKEN`)
- Installer drops binary + service unit, calls enroll endpoint, starts service

## 11. Testing strategy

- **Unit tests:** restic JSON parsing, schedule reconciliation, retention policy logic
- **Integration tests:** spin up real `restic` + `rest-server` in Docker, exercise full backup/snapshot/restore flows
- **End-to-end:** Playwright against a compose-up'd stack with one Linux agent in a sibling container
- **Cross-platform agent CI:** build matrix `linux/amd64`, `linux/arm64`, `windows/amd64`; smoke test on Windows runner

## 12. Repository layout

```
restic-manager/
├── cmd/
│   ├── server/
│   └── agent/
├── internal/
│   ├── api/              # shared API types
│   ├── server/
│   │   ├── http/
│   │   ├── ws/
│   │   └── ui/           # templates, handlers
│   ├── agent/
│   │   ├── service/      # systemd / windows service glue
│   │   ├── runner/       # restic invocation
│   │   └── scheduler/
│   ├── restic/           # restic CLI wrapper, --json parsing
│   ├── store/            # sqlite layer
│   ├── crypto/           # secret encryption
│   └── auth/
├── web/
│   ├── templates/
│   └── static/
├── deploy/
│   ├── docker-compose.yml
│   ├── Dockerfile.server
│   └── install/
│       ├── install.sh
│       └── install.ps1
├── docs/
├── LICENSE               # PolyForm Noncommercial 1.0.0
├── README.md
├── spec.md
└── tasks.md
```

## 13. Phased delivery

- **Phase 1 (MVP):** server skeleton, agent skeleton, enrollment, host list, snapshot list, on-demand backup, live job log
- **Phase 2:** schedules, retention, run-now for `forget`/`prune`/`check`/`unlock`, repo stats
- **Phase 3:** restore wizard, alerts (webhook/ntfy/email), audit log
- **Phase 4:** agent self-update, OIDC, multi-user/RBAC polish, repo trends
- **Phase 5:** OSS readiness — docs site, contribution guide, screenshot tour

## 14. Confirmed extensions (in scope)

These were originally listed as open questions and have been confirmed for inclusion. Slotted into phases below.

### 14.1 Cross-host restore

Restore a snapshot taken on host A onto host B (e.g. recover a dead box onto a fresh one, clone a workload onto a sibling host, restore a developer's home dir onto a new laptop).

- **Credential model:** target host's agent receives a temporary, server-issued read credential for the source host's repo, scoped to a single restore job and revoked immediately after
- **Path remapping:** UI allows rewriting source paths to target paths (e.g. `/home/alice` → `/home/alice-new`)
- **Permissions:** restore runs as the agent's service user; UI surfaces a warning when source paths require root and target service user is non-root
- **Phase:** 3 (with the restore wizard)

### 14.2 Bandwidth limiting

Per-host upload/download caps for backup, restore, and prune jobs.

- Exposed on the schedule editor as optional `--limit-upload` / `--limit-download` (KB/s)
- Also overridable on run-now jobs via the UI
- Persisted in `Schedule.options` (JSON blob) so the schema stays stable
- **Phase:** 2 (with scheduling)

### 14.3 Pre/post backup hooks

Per-host shell commands run before and after a backup job. Use cases: `mysqldump`/`pg_dump` to a staging path, stop/start Docker containers, quiesce a service, post-backup notifications.

- **Schema:** `Schedule.pre_hook` and `Schedule.post_hook` (string, optional). For more complex cases, `Host.pre_hook_default` / `Host.post_hook_default` apply to all schedules on that host unless overridden
- **Execution:** agent runs hooks via the host's default shell (`/bin/sh` Linux, `cmd.exe` or PowerShell Windows — host-configurable)
- **Failure semantics:** `pre_hook` non-zero exit aborts the backup and marks the job failed. `post_hook` runs on both success and failure (with `RM_JOB_STATUS` env var); its own exit code is recorded but does not change the backup job's final status
- **Stdout/stderr:** captured into `JobLog` like restic output, prefixed `pre_hook:` / `post_hook:`
- **Security:** hooks are stored encrypted; only admins can edit them; every edit audit-logged
- **Phase:** 2 (with scheduling)

### 14.4 Prometheus `/metrics` endpoint

Standard Prometheus exposition on `/metrics`, protected by either bearer token or IP allow-list.

- **Metrics (per host):**
  - `restic_manager_last_backup_timestamp_seconds{host=...}`
  - `restic_manager_last_backup_status{host=...}` (1=success, 0=failure)
  - `restic_manager_repo_size_bytes{host=...}`
  - `restic_manager_snapshot_count{host=...}`
  - `restic_manager_agent_online{host=...}` (1/0)
  - `restic_manager_job_duration_seconds_bucket{kind=...,host=...}` (histogram)
- **Server-level:** `restic_manager_jobs_total{kind=...,status=...}`, `restic_manager_alerts_active`, `restic_manager_build_info`
- **Phase:** 4 (alongside repo trend charts — both rely on the same time-series data)

## 15. Future considerations (not yet committed)

- Read-only share links for snapshot listings (auditor view) — out of scope for personal/lab use; revisit if multi-tenant or org use cases emerge