# restic-manager — Specification

## 1. Overview

**restic-manager** is a self-hosted, browser-based, single-pane-of-glass for managing [restic](https://restic.net) backups across a fleet of Linux and Windows endpoints. It provides visibility, scheduling, ad-hoc operations, restore workflows, and alerting from one UI.

It is built for small-to-medium fleets (initial target: ~12 endpoints) and is intentionally simple to deploy: one Docker Compose file on the control-plane host, one small agent binary on each endpoint.

**License:** PolyForm Noncommercial 1.0.0

## 2. Goals & Non-Goals

### Goals

- Central visibility into backup state for every endpoint
- Trigger any restic operation remotely (`backup`, `forget`, `prune`, `check`, `unlock`, `snapshots`, `stats`, `diff`, `restore`)
- Manage per-host backup schedules from the UI
- Live job progress streamed back to the UI
- Restore wizard (browse snapshots, pick paths, restore to original or alternate host)
- Repo health surfacing (size, dedup ratio, last check, lock state)
- Alerting on failure or staleness
- Cross-platform agent (Linux + Windows)
- Ransomware-resistant repo access via append-only credentials

### Non-Goals (initial release)

- Replacing restic itself or providing custom repo formats
- Managing non-restic backup tools
- Multi-tenancy / SaaS deployment
- High availability of the control plane (SQLite, single-instance)
- Mobile-native apps (responsive web only)

## 3. Architecture

### 3.1 Components

```
┌──────────────────────────────────────────────────────────────────┐
│  Proxmox cluster                                                 │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │  docker compose: restic-manager                            │  │
│  │    - server (Go binary, REST + WS API, embedded HTMX UI)   │  │
│  │    - SQLite volume                                         │  │
│  └────────────────────────────────────────────────────────────┘  │
└────────────────────────▲─────────────────────────────────────────┘
                         │ HTTPS (control plane)
                         │  - agent → server: status, telemetry
                         │  - server → agent: commands, schedules
                         │
┌────────────────────────┴─────────────────────────────────────────┐
│  Endpoints (Linux + Windows)                                     │
│  ┌──────────────────────┐    ┌────────────────────────────────┐  │
│  │ restic-manager-      │    │ restic CLI                     │  │
│  │ agent (Go binary)    │───▶│ invoked by agent               │  │
│  │ - systemd / svc      │    └─────────────┬──────────────────┘  │
│  │ - WS to server       │                  │ HTTPS               │
│  └──────────────────────┘                  │ (data plane)        │
└────────────────────────────────────────────┼─────────────────────┘
                                             │
                                             ▼
┌──────────────────────────────────────────────────────────────────┐
│  Unraid                                                          │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │  Docker: restic/rest-server                                │  │
│  │    - per-host append-only credentials                      │  │
│  │    - one repo per host                                     │  │
│  │    - storage: Unraid share                                 │  │
│  └────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────┘
```

### 3.2 Data flow

- **Backup data:** endpoint → restic CLI → restic REST server on Unraid → Unraid share. The control plane *never* touches backup bytes.
- **Control plane:** agent maintains an outbound WebSocket to the server. Server pushes commands and schedule changes; agent pushes status, logs, live job progress, host metadata.
- **UI:** browser → server (HTTPS, session cookies). Server fans out commands to agents, streams progress back to browser.
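
The fan-out in the UI bullet above (one job's progress delivered to every browser watching it) could be sketched as a small in-memory hub. This is only an illustration under assumptions: `Hub`, `Subscribe`, and `Publish` are hypothetical names, messages are plain strings rather than real WebSocket frames, and dropping messages for slow subscribers is one possible policy, not something this spec prescribes.

```go
package main

import (
	"fmt"
	"sync"
)

// Hub is a hypothetical in-memory fan-out of job events to browser subscribers.
type Hub struct {
	mu   sync.Mutex
	subs map[string]map[chan string]struct{} // job ID -> subscriber channels
}

func NewHub() *Hub {
	return &Hub{subs: make(map[string]map[chan string]struct{})}
}

// Subscribe registers a buffered channel for one job and returns it together
// with a function that unregisters and closes it.
func (h *Hub) Subscribe(jobID string, buffer int) (<-chan string, func()) {
	ch := make(chan string, buffer)
	h.mu.Lock()
	if h.subs[jobID] == nil {
		h.subs[jobID] = make(map[chan string]struct{})
	}
	h.subs[jobID][ch] = struct{}{}
	h.mu.Unlock()

	unsubscribe := func() {
		h.mu.Lock()
		defer h.mu.Unlock()
		if _, ok := h.subs[jobID][ch]; ok {
			delete(h.subs[jobID], ch)
			close(ch)
		}
	}
	return ch, unsubscribe
}

// Publish offers msg to every subscriber of jobID and returns how many
// accepted it. A subscriber whose buffer is full is skipped so that one slow
// browser cannot stall the agent-facing side (an assumed policy).
func (h *Hub) Publish(jobID, msg string) int {
	h.mu.Lock()
	defer h.mu.Unlock()
	delivered := 0
	for ch := range h.subs[jobID] {
		select {
		case ch <- msg:
			delivered++
		default: // buffer full: drop for this subscriber
		}
	}
	return delivered
}

func main() {
	hub := NewHub()
	a, stopA := hub.Subscribe("job-1", 4)
	b, stopB := hub.Subscribe("job-1", 4)
	defer stopB()

	fmt.Println(hub.Publish("job-1", `{"percent_done":0.5}`)) // 2
	fmt.Println(<-a, <-b)

	stopA() // first browser tab closes
	fmt.Println(hub.Publish("job-1", `{"percent_done":1}`)) // 1
}
```

Publishing and closing happen under the same mutex, so a message is never sent on a channel that has already been closed.
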

### 3.3 Why agent (not SSH)

- Push model works through NAT/firewalls without inbound rules
- Native Windows support without OpenSSH service quirks
- Local scheduling survives controller restarts
- Self-contained `restic --json` parsing, no remote shell quoting hazards

### 3.4 Why per-host repos

- Isolates corruption / lock contention
- Append-only credentials per host = compromised endpoint can't delete other hosts' backups
- Simpler `prune` orchestration (no global lock coordination)
- Trivially easy to retire a host (delete its repo + credential)

## 4. Components in detail

### 4.1 Server

- **Language:** Go 1.22+
- **Storage:** SQLite (via `modernc.org/sqlite`, no CGo)
- **HTTP:** `net/http` + `chi` router
- **WebSocket:** `github.com/coder/websocket` (the maintained fork of the unmaintained `nhooyr.io/websocket`; same API)
- **UI:** HTMX + Tailwind, server-rendered Go templates, no Node build step
- **Distribution:** single static binary, packaged in a Docker image; published `docker-compose.yml`
- **Config:** YAML or env vars:
  - `RM_LISTEN` — bind address, e.g. `:8080` (source of truth for the port; the `8080` in the reference compose is just a default mapping). Bind to `127.0.0.1:8080` when running behind a same-host proxy.
  - `RM_DATA_DIR`, `RM_BASE_URL`, `RM_SECRET_KEY_FILE`
  - `RM_TRUSTED_PROXY` — comma-separated CIDR list of reverse proxies whose `X-Forwarded-For` / `X-Forwarded-Proto` we honour. Empty (the default) = trust no one. Set this when fronted by Caddy/Traefik.
  - `RM_COOKIE_SECURE` — `true` (default) marks session cookies `Secure`. Only set to `false` for local HTTP-only testing.
- **TLS:** the server speaks plain HTTP and is **always** expected to sit behind a TLS-terminating reverse proxy (Caddy / Traefik / nginx). This keeps cert renewal, ACME, and SNI in the proxy where operators already manage it. Agents must reach the server over HTTPS; the cert pin (`cert_pin_sha256`) pins whatever cert the proxy serves.
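
As a rough illustration of the pinning described in the TLS bullet above, an agent could compare the SHA-256 of the certificate presented by the proxy against `cert_pin_sha256`. The spec does not say whether the pin covers the whole certificate or only its public key; this sketch assumes a hex-encoded hash of the full DER-encoded leaf certificate. `pinMatches` and `pinnedTLSConfig` are hypothetical names.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"crypto/tls"
	"crypto/x509"
	"encoding/hex"
	"errors"
	"fmt"
)

// pinMatches reports whether the SHA-256 of certDER equals the hex-encoded pin.
// Assumption: the pin is computed over the whole DER-encoded leaf certificate.
func pinMatches(certDER []byte, pinHex string) bool {
	want, err := hex.DecodeString(pinHex)
	if err != nil || len(want) != sha256.Size {
		return false
	}
	got := sha256.Sum256(certDER)
	return subtle.ConstantTimeCompare(got[:], want) == 1
}

// pinnedTLSConfig keeps Go's normal chain verification and additionally
// requires the presented leaf certificate to match the pin.
func pinnedTLSConfig(pinHex string) *tls.Config {
	return &tls.Config{
		MinVersion: tls.VersionTLS12,
		VerifyPeerCertificate: func(rawCerts [][]byte, _ [][]*x509.Certificate) error {
			if len(rawCerts) == 0 {
				return errors.New("no peer certificate presented")
			}
			if !pinMatches(rawCerts[0], pinHex) {
				return errors.New("server certificate does not match cert_pin_sha256")
			}
			return nil
		},
	}
}

func main() {
	// Placeholder bytes stand in for a real DER certificate.
	der := []byte("placeholder certificate bytes")
	sum := sha256.Sum256(der)
	pin := hex.EncodeToString(sum[:])

	fmt.Println(pinMatches(der, pin))        // true
	fmt.Println(pinMatches(der, "deadbeef")) // false: wrong length
	_ = pinnedTLSConfig(pin)                 // would be passed to the agent's HTTPS/WS dialer
}
```

A pin over the full leaf changes whenever the proxy renews its certificate, so a real design would need either re-pinning on renewal or a public-key pin; which of those applies here is not stated in this document.
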

### 4.2 Agent

- **Language:** Go (cross-compiled for `linux/amd64`, `linux/arm64`, `windows/amd64`). Phase 1 ships Linux only; Windows binaries continue to build in CI to keep the codebase portable, but Windows service integration + signed installer + install.ps1 land in Phase 2.
- **Service integration:** systemd unit (Linux). Windows service via `golang.org/x/sys/windows/svc` — Phase 2.
- **Footprint goal:** ≤ 15 MB binary, ≤ 50 MB RSS idle
- **Persistence:** local config file + small state DB (BoltDB or JSON) for queued reports if server is unreachable
- **Restic invocation:** spawns `restic` with `--json`, parses streamed output, forwards to server in real time
- **Updates:** distributed via OS package manager — apt repo (Linux) and Chocolatey package (Windows), both pointing at gitea releases. No bespoke signed-binary self-update; the `restic-manager-agent update` command is a thin wrapper over `apt-get install --only-upgrade` / `choco upgrade`. UI surfaces "agent N versions behind server" so an operator knows when to upgrade.

### 4.3 Restic REST server (Unraid)

- Run `restic/rest-server` Docker container
- `--append-only` enabled
- `--private-repos` enabled (each user only sees their own subpath)
- htpasswd file with one user per host
- Storage path mapped to Unraid share

## 5. Domain model

```
Host
  id, name, os, arch, agent_version, restic_version, protocol_version,
  enrolled_at, last_seen_at, status (online/offline/degraded),
  repo_id (FK), tags, current_job_id (FK nullable),
  last_backup_at, last_backup_status (succeeded|failed|cancelled|null),
  repo_size_bytes, snapshot_count, open_alert_count,
  applied_schedule_version
  # Bottom block (last_backup_*, repo_size_bytes, snapshot_count,
  # open_alert_count, applied_schedule_version) are denormalised
  # projections, refreshed on job.finished, snapshots.report,
  # repo.stats, and alert state changes.
  # applied_schedule_version is the schedule_version the agent most
  # recently acknowledged via `schedule.ack` — lets the UI surface
  # drift when an agent is offline.

Repo
  id, name, url, kind (rest|s3|local), credential_id (FK),
  password_secret_id (FK),
  size_bytes, snapshot_count, dedup_ratio,
  last_check_at, last_check_status, lock_state (locked|unlocked),
  append_only (bool), credential_rotated_at
  # Bottom block is a cached projection from `restic stats` +
  # Credential row, refreshed by repo.stats agent messages.

Credential
  id, kind, username, secret_ref (encrypted), rotated_at

Schedule
  id, host_id (FK), kind (backup|forget|prune|check),
  cron_expr, paths (json), excludes (json), tags (json),
  retention_policy (json), options (json),
  pre_hook, post_hook, enabled
  # retention_policy: {keep_last, keep_hourly, keep_daily, keep_weekly,
  #                    keep_monthly, keep_yearly, keep_tag: [...]}
  # options: {limit_upload_kbps, limit_download_kbps}
  # pre_hook/post_hook: see §14.3 (encrypted at rest)

Job
  id, host_id (FK), kind, status (queued|running|succeeded|failed|cancelled),
  scheduled_id (FK nullable), actor_kind (user|schedule|system),
  actor_id (nullable), started_at, finished_at, exit_code,
  stats (json), error

JobLog
  job_id (FK), seq, ts, stream (stdout|stderr|event), payload

Snapshot  (cached projection from `restic snapshots --json`)
  id (restic id), host_id (FK), repo_id (FK), time, hostname,
  paths, tags, size_bytes, file_count

Alert
  id, host_id (FK nullable), kind, severity, message,
  created_at, acknowledged_at, resolved_at

User
  id, username, password_hash, role (admin|operator|viewer),
  created_at, last_login_at

Session
  id, user_id (FK), created_at, expires_at, ip, ua

AuditLog
  id, user_id (FK nullable), actor (user|agent|system), action,
  target_kind, target_id, ts, payload (json)
```

## 6. API surface (control plane)

### 6.1 UI/REST (browser → server)

```
POST   /api/auth/login
POST   /api/auth/logout

GET    /api/fleet/summary                (aggregate: host counts by status,
                                          total bytes, open alerts; reused by /metrics)

GET    /api/hosts                        ?tag=&status=&limit=&offset=
                                         (returns Host rows incl. denormalised
                                          last_backup_*, repo_size_bytes,
                                          snapshot_count, open_alert_count,
                                          current_job_id)
GET    /api/hosts/:id
DELETE /api/hosts/:id
POST   /api/hosts/:id/enrollment-token   (regenerate)
POST   /api/hosts/:id/agent/update       (force agent self-update; see §4.2)
GET    /api/hosts/:id/snapshots          ?tag=&path=&since=&until=&limit=&offset=
GET    /api/hosts/:id/repo               (full Repo projection)
POST   /api/hosts/:id/jobs               (run-now: backup/forget/prune/check/unlock)
POST   /api/hosts/:id/restore            (restore wizard submit)
GET    /api/hosts/:id/schedules
POST   /api/hosts/:id/schedules
PUT    /api/schedules/:id
DELETE /api/schedules/:id

GET    /api/jobs                         ?host_id=&kind=&status=&since=&until=
                                         &limit=&offset=&order=desc
GET    /api/jobs/:id
GET    /api/jobs/:id/logs                (paginated: ?after_seq=&limit=)
WS     /api/jobs/:id/stream              (live progress; see §6.2 for shape)
POST   /api/jobs/:id/cancel

GET    /api/repos
GET    /api/repos/:id

GET    /api/alerts
POST   /api/alerts/:id/ack

GET    /api/audit
GET    /api/users                        (admin)
POST   /api/users                        (admin)
```

**Realtime strategy:** only `/api/jobs/:id/stream` uses WS. All other screens (dashboard, hosts, snapshots) refresh via HTMX polling (~10s cadence). Revisit if dashboard staleness becomes a problem in practice.

### 6.2 Agent ↔ Server

Single authenticated WebSocket per agent. Bidirectional JSON-RPC-ish messages.

**Agent → server:**

- `hello` (host metadata, agent_version, restic_version, OS, `protocol_version` — see "Protocol versioning" below)
- `heartbeat` (every 30s)
- `job.started` (job_id, kind, started_at)
- `job.progress` (job_id, percent_done, files_done, total_files, bytes_done, total_bytes, eta_seconds, throughput_bps)
- `job.finished` (job_id, status, exit_code, stats, error, finished_at)
- `snapshots.report` (full list after each successful backup)
- `repo.stats` (size_bytes, snapshot_count, dedup_ratio, last_check_at, last_check_status, lock_state)
- `log.stream` (live stdout/stderr lines while job running; {job_id, seq, ts, stream: stdout|stderr|event, payload})
- `schedule.ack` (schedule_version) — agent confirms it has applied a schedule push; lets the server surface "this host is N versions behind" without polling

**Server → agent:**

- `command.run` (kind, args)
- `command.cancel` (job_id)
- `schedule.set` (schedule_version, schedules: [...]) — full schedule list, agent reconciles local cron and replies with `schedule.ack`
- `config.update`
- `agent.update.available` (new version + package source URL — informational only; agent does not self-update, see §4.2)

The server fans `job.progress` and `log.stream` for a given job to all browsers subscribed to `WS /api/jobs/:id/stream` (§6.1) without transformation, so the schema is shared end-to-end.

**Protocol versioning.** Agents and the server each declare an integer `protocol_version` in `hello`. The version bumps **only** on breaking wire-format changes (not human-readable software releases). The server maintains a `MinAgentProtocolVersion` constant; agents below it are disconnected with `error: protocol_too_old` and a URL pointing at the upgrade instructions. Symmetrically, an agent talking to a server that advertises a `protocol_version` it does not recognise refuses to proceed and surfaces a clear log message. This avoids the failure mode of "weird JSON parse errors when v0.3 agent meets v0.5 server."
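
The version gate described above might look roughly like the following. It is a sketch under assumptions: the concrete version numbers are invented for illustration, the `Hello` struct carries only two of the handshake fields, and `CheckAgentHello` / `AgentAcceptsServer` are hypothetical helper names. Only `MinAgentProtocolVersion` and the `protocol_too_old` error string come from the text.

```go
package main

import (
	"errors"
	"fmt"
)

// Illustrative values only; the spec does not fix concrete numbers.
const (
	ServerProtocolVersion   = 3
	MinAgentProtocolVersion = 2
)

// ErrProtocolTooOld mirrors the `protocol_too_old` error named in the spec.
var ErrProtocolTooOld = errors.New("protocol_too_old")

// Hello carries only the handshake fields needed for this check.
type Hello struct {
	AgentVersion    string `json:"agent_version"`
	ProtocolVersion int    `json:"protocol_version"`
}

// CheckAgentHello is the server-side gate: agents below the minimum wire
// version are rejected before any other message is processed.
func CheckAgentHello(h Hello) error {
	if h.ProtocolVersion < MinAgentProtocolVersion {
		return fmt.Errorf("%w: agent speaks v%d, server requires at least v%d",
			ErrProtocolTooOld, h.ProtocolVersion, MinAgentProtocolVersion)
	}
	return nil
}

// AgentAcceptsServer is the agent-side mirror: proceed only when the server
// advertises a protocol version the agent recognises.
func AgentAcceptsServer(serverVersion int, recognised []int) bool {
	for _, v := range recognised {
		if v == serverVersion {
			return true
		}
	}
	return false
}

func main() {
	// An old agent is rejected with a wrapped protocol_too_old error.
	fmt.Println(CheckAgentHello(Hello{AgentVersion: "0.3.0", ProtocolVersion: 1}))
	// A current agent passes the gate.
	fmt.Println(CheckAgentHello(Hello{AgentVersion: "0.5.0", ProtocolVersion: 3}))
	// An agent that knows versions 2 and 3 accepts this server.
	fmt.Println(AgentAcceptsServer(ServerProtocolVersion, []int{2, 3}))
}
```

Wrapping the sentinel error lets callers test with `errors.Is(err, ErrProtocolTooOld)` while still logging which versions were involved.
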

**Schedule reconciliation when the server is unreachable.** Agents keep firing the **last-known-good** schedule pushed by the server, indefinitely. Rationale: a missed backup because the controller is down is a worse outcome than firing a schedule the user has since edited. On reconnect, the server's view is canonical: the next `schedule.set` overrides whatever the agent was running, the agent replies `schedule.ack` with the new `schedule_version`, and the server updates `Host.applied_schedule_version`. The UI surfaces drift ("schedule v7 pushed, agent applied v5") when an agent has been offline.

### 6.3 Enrollment

1. Operator clicks "Add host" → server generates one-time token (TTL 1h)
2. Operator runs install script on endpoint with token
3. Agent calls `POST /api/agents/enroll` with token + host metadata
4. Server issues persistent agent credential (bearer token + TLS pin) and host record
5. Agent stores credential, opens WS connection

## 7. Security

### 7.1 Authentication

- **Phase 1:** username + password (argon2id), HTTP-only secure session cookies, CSRF tokens on state-changing requests
- **Phase 2:** OIDC (Authelia, Keycloak, Authentik)
- **Agents:** bearer token over TLS; pin server cert fingerprint at enrollment time

### 7.2 Authorization (Phase 1: simple roles)

- **admin:** everything
- **operator:** trigger jobs, edit schedules, restore
- **viewer:** read-only

### 7.3 Secret handling

- Restic repo passwords and REST-server credentials encrypted at rest in SQLite using a server-side key (loaded from env or file at startup)
- Pushed to agents only over the authenticated WS, only when needed for a job
- Agent stores them in OS keyring where available (Windows DPAPI, Linux Secret Service / fallback to encrypted file with restricted perms)

### 7.4 Repo protection

- Restic REST server runs with `--append-only` for routine backups
- A separate non-append-only credential exists for `forget`/`prune` operations,
  used only when explicitly invoked from the UI by an
  admin/operator and audited

### 7.5 Audit

- Every state-changing UI action and every server→agent command logged with user, target, timestamp, and payload

## 8. UI

Stack: HTMX + Tailwind + Go html/templates. No SPA framework. Server-rendered, progressive enhancement.

**Pages:**

- **Login**
- **Dashboard:** fleet overview (host cards: status, last backup, repo size, alerts)
- **Host detail:** tabs for Snapshots / Schedules / Jobs / Repo / Settings
- **Job detail:** live log streaming via WS, cancel button
- **Restore wizard:** host → snapshot → paths → target → confirm
- **Repos:** aggregate view across hosts
- **Alerts:** list, acknowledge
- **Settings:** users (admin), notification channels, agent download
- **Audit log**

## 9. Alerting

- **Triggers:** backup failed, backup hasn't run in N hours past its schedule, repo `check` failed, agent offline > N minutes, repo size growth anomaly
- **Channels (Phase 1):** webhook, ntfy, email (SMTP)
- **Channels (Phase 2+):** Discord, Slack, Pushover

## 10. Deployment

### 10.1 Control plane (Proxmox host or LXC)

The server is HTTP-only by design — operators front it with their own TLS-terminating reverse proxy (Caddy / Traefik / nginx). Bind the container to localhost so the only public path is through the proxy.

`docker-compose.yml`:

```yaml
services:
  restic-manager:
    image: ghcr.io//restic-manager:latest
    restart: unless-stopped
    ports:
      - "127.0.0.1:8080:8080"
    volumes:
      - ./data:/data
    environment:
      - RM_DATA_DIR=/data
      - RM_LISTEN=:8080
      - RM_BASE_URL=https://restic.lab.example
      - RM_SECRET_KEY_FILE=/data/secret.key
      - RM_TRUSTED_PROXY=172.16.0.0/12   # CIDR of your reverse proxy
```

Reference Caddy snippet (operator's own Caddyfile, outside this repo):

```
restic.lab.example {
    encode zstd gzip
    reverse_proxy 127.0.0.1:8080
}
```

Caddy provisions and renews the cert; the agent's `cert_pin_sha256` pins **Caddy's** leaf cert (that's what the agent actually sees).

`RM_LISTEN` is the source of truth for the server's bind address.
The `8080:8080` mapping above is just the matching default; change both sides together if you pick a different port.

> ⚠️ Never expose `RM_LISTEN` directly on a public interface — the
> server has no TLS, no rate limiting, and no DDoS protection. That
> all belongs in the proxy.

### 10.2 Restic REST server (Unraid)

Standard `restic/rest-server` container, `--append-only`, `--private-repos`, htpasswd mounted, data path on the share.

### 10.3 Agent install

- **Linux:** `curl -fsSL https://restic.lab.example/install.sh | sudo RM_TOKEN=xxx sh`
- **Windows:** `iwr https://restic.lab.example/install.ps1 | iex` (with `$env:RM_TOKEN`)
- Installer drops binary + service unit, calls enroll endpoint, starts service

## 11. Testing strategy

- **Unit tests:** restic JSON parsing, schedule reconciliation, retention policy logic
- **Integration tests:** spin up real `restic` + `rest-server` in Docker, exercise full backup/snapshot/restore flows
- **End-to-end:** Playwright against a compose-up'd stack with one Linux agent in a sibling container
- **Cross-platform agent CI:** build matrix `linux/amd64`, `linux/arm64`, `windows/amd64`; smoke test on Windows runner

## 12. Repository layout

```
restic-manager/
├── cmd/
│   ├── server/
│   └── agent/
├── internal/
│   ├── api/            # shared API types
│   ├── server/
│   │   ├── http/
│   │   ├── ws/
│   │   └── ui/         # templates, handlers
│   ├── agent/
│   │   ├── service/    # systemd / windows service glue
│   │   ├── runner/     # restic invocation
│   │   └── scheduler/
│   ├── restic/         # restic CLI wrapper, --json parsing
│   ├── store/          # sqlite layer
│   ├── crypto/         # secret encryption
│   └── auth/
├── web/
│   ├── templates/
│   └── static/
├── deploy/
│   ├── docker-compose.yml
│   ├── Dockerfile.server
│   └── install/
│       ├── install.sh
│       └── install.ps1
├── docs/
├── LICENSE             # PolyForm Noncommercial 1.0.0
├── README.md
├── spec.md
└── tasks.md
```

## 13. Phased delivery

- **Phase 1 (MVP):** server skeleton, agent skeleton, enrollment, host list, snapshot list, on-demand backup, live job log
- **Phase 2:** schedules, retention, run-now for `forget`/`prune`/`check`/`unlock`, repo stats
- **Phase 3:** restore wizard, alerts (webhook/ntfy/email), audit log
- **Phase 4:** agent self-update, OIDC, multi-user/RBAC polish, repo trends
- **Phase 5:** OSS readiness — docs site, contribution guide, screenshot tour

## 14. Confirmed extensions (in scope)

These were originally listed as open questions and have been confirmed for inclusion. Slotted into phases below.

### 14.1 Cross-host restore

Restore a snapshot taken on host A onto host B (e.g. recover a dead box onto a fresh one, clone a workload onto a sibling host, restore a developer's home dir onto a new laptop).

- **Credential model:** target host's agent receives a temporary, server-issued read credential for the source host's repo, scoped to a single restore job and revoked immediately after
- **Path remapping:** UI allows rewriting source paths to target paths (e.g. `/home/alice` → `/home/alice-new`)
- **Permissions:** restore runs as the agent's service user; UI surfaces a warning when source paths require root and target service user is non-root
- **Phase:** 3 (with the restore wizard)

### 14.2 Bandwidth limiting

Per-host upload/download caps for backup, restore, and prune jobs.

- Exposed on the schedule editor as optional `--limit-upload` / `--limit-download` (KiB/s, the unit restic uses for these flags)
- Also overridable on run-now jobs via the UI
- Persisted in `Schedule.options` (JSON blob) so the schema stays stable
- **Phase:** 2 (with scheduling)

### 14.3 Pre/post backup hooks

Per-host shell commands run before and after a backup job. Use cases: `mysqldump`/`pg_dump` to a staging path, stop/start Docker containers, quiesce a service, post-backup notifications.

- **Schema:** `Schedule.pre_hook` and `Schedule.post_hook` (string, optional).
  For more complex cases, `Host.pre_hook_default` / `Host.post_hook_default` apply to all schedules on that host unless overridden
- **Applicability:** hooks are only meaningful for `kind = backup` schedules. The API rejects non-null `pre_hook` / `post_hook` on any other schedule kind (`forget`, `prune`, `check`) with a clear validation error rather than silently ignoring them. The same constraint applies to `Host.pre_hook_default` / `Host.post_hook_default`: they only fire for backup schedules on that host
- **Execution:** agent runs hooks via the host's default shell (`/bin/sh` Linux, `cmd.exe` or PowerShell Windows — host-configurable)
- **Failure semantics:** `pre_hook` non-zero exit aborts the backup and marks the job failed. `post_hook` runs on both success and failure (with `RM_JOB_STATUS` env var); its own exit code is recorded but does not change the backup job's final status
- **Stdout/stderr:** captured into `JobLog` like restic output, prefixed `pre_hook:` / `post_hook:`
- **Security:** hooks are stored encrypted; only admins can edit them; every edit audit-logged
- **Phase:** 2 (with scheduling)

### 14.4 Prometheus `/metrics` endpoint

Standard Prometheus exposition on `/metrics`, protected by either bearer token or IP allow-list.

- **Metrics (per host):**
  - `restic_manager_last_backup_timestamp_seconds{host=...}`
  - `restic_manager_last_backup_status{host=...}` (1=success, 0=failure)
  - `restic_manager_repo_size_bytes{host=...}`
  - `restic_manager_snapshot_count{host=...}`
  - `restic_manager_agent_online{host=...}` (1/0)
  - `restic_manager_job_duration_seconds_bucket{kind=...,host=...}` (histogram)
- **Server-level:** `restic_manager_jobs_total{kind=...,status=...}`, `restic_manager_alerts_active`, `restic_manager_build_info`
- **Phase:** 4 (alongside repo trend charts — both rely on the same time-series data)

## 15. Future considerations (not yet committed)

- Read-only share links for snapshot listings (auditor view) — out of scope for personal/lab use; revisit if multi-tenant or org use cases emerge