restic-manager — Specification
1. Overview
restic-manager is a self-hosted, browser-based single pane of glass for managing restic backups across a fleet of Linux and Windows endpoints. It provides visibility, scheduling, ad-hoc operations, restore workflows, and alerting from one UI.
It is built for small-to-medium fleets (initial target: ~12 endpoints) and is intentionally simple to deploy: one Docker Compose file on the control-plane host, one small agent binary on each endpoint.
License: PolyForm Noncommercial 1.0.0
2. Goals & Non-Goals
Goals
- Central visibility into backup state for every endpoint
- Trigger any restic operation remotely (`backup`, `forget`, `prune`, `check`, `unlock`, `snapshots`, `stats`, `diff`, `restore`)
- Manage per-host backup schedules from the UI
- Live job progress streamed back to the UI
- Restore wizard (browse snapshots, pick paths, restore to original or alternate host)
- Repo health surfacing (size, dedup ratio, last check, lock state)
- Alerting on failure or staleness
- Cross-platform agent (Linux + Windows)
- Ransomware-resistant repo access via append-only credentials
Non-Goals (initial release)
- Replacing restic itself or providing custom repo formats
- Managing non-restic backup tools
- Multi-tenancy / SaaS deployment
- High availability of the control plane (SQLite, single-instance)
- Mobile-native apps (responsive web only)
3. Architecture
3.1 Components
┌──────────────────────────────────────────────────────────────────┐
│ Proxmox cluster │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ docker compose: restic-manager │ │
│ │ - server (Go binary, REST + WS API, embedded HTMX UI) │ │
│ │ - SQLite volume │ │
│ └────────────────────────────────────────────────────────────┘ │
└────────────────────────▲─────────────────────────────────────────┘
│ HTTPS (control plane)
│ - agent → server: status, telemetry
│ - server → agent: commands, schedules
│
┌────────────────────────┴─────────────────────────────────────────┐
│ Endpoints (Linux + Windows) │
│ ┌──────────────────────┐ ┌────────────────────────────────┐ │
│ │ restic-manager- │ │ restic CLI │ │
│ │ agent (Go binary) │───▶│ invoked by agent │ │
│ │ - systemd / svc │ └─────────────┬──────────────────┘ │
│ │ - WS to server │ │ HTTPS │
│ └──────────────────────┘ │ (data plane) │
└─────────────────────────────────────────────┼────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ Unraid │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Docker: restic/rest-server │ │
│ │ - per-host append-only credentials │ │
│ │ - one repo per host │ │
│ │ - storage: Unraid share │ │
│ └────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────┘
3.2 Data flow
- Backup data: endpoint → restic CLI → restic REST server on Unraid → Unraid share. The control plane never touches backup bytes.
- Control plane: agent maintains an outbound WebSocket to the server. Server pushes commands and schedule changes; agent pushes status, logs, live job progress, host metadata.
- UI: browser → server (HTTPS, session cookies). Server fans out commands to agents, streams progress back to browser.
3.3 Why agent (not SSH)
- Agent-initiated outbound connection works through NAT/firewalls without inbound rules
- Native Windows support without OpenSSH service quirks
- Local scheduling survives controller restarts
- Self-contained `restic --json` parsing, no remote shell quoting hazards
3.4 Why per-host repos
- Isolates corruption / lock contention
- Append-only credentials per host = compromised endpoint can't delete other hosts' backups
- Simpler `prune` orchestration (no global lock coordination)
- Trivially easy to retire a host (delete its repo + credential)
4. Components in detail
4.1 Server
- Language: Go 1.22+
- Storage: SQLite (via `modernc.org/sqlite`, no CGo)
- HTTP: `net/http` + `chi` router
- WebSocket: `nhooyr.io/websocket`
- UI: HTMX + Tailwind, server-rendered Go templates, no Node build step
- Distribution: single static binary, packaged in a Docker image; published `docker-compose.yml`
- Config: YAML or env vars (`RM_LISTEN`, `RM_DATA_DIR`, `RM_BASE_URL`, `RM_TLS_CERT`, `RM_TLS_KEY`)
- TLS: terminate TLS in-process (cert from Caddy/Traefik sidecar acceptable; agents require HTTPS)
4.2 Agent
- Language: Go (cross-compiled for `linux/amd64`, `linux/arm64`, `windows/amd64`)
- Service integration: systemd unit (Linux), Windows service via `golang.org/x/sys/windows/svc`
- Footprint goal: ≤ 15 MB binary, ≤ 50 MB RSS idle
- Persistence: local config file + small state DB (BoltDB or JSON) for queued reports if server is unreachable
- Restic invocation: spawns `restic` with `--json`, parses streamed output, forwards to server in real time
- Self-update: server publishes signed agent binary; agent downloads, verifies signature, swaps binary, restarts service
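As an illustration of the restic invocation step, the following is a minimal sketch of decoding one line of `restic backup --json` output. The JSON field names follow restic's documented status messages; the Go type and function names (`statusLine`, `parseStatusLine`) are hypothetical, and a real agent would read from the process's stdout pipe rather than from a string.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// statusLine holds the fields of a restic "status" message that the agent
// would forward as job.progress. The type name is illustrative only.
type statusLine struct {
	MessageType string  `json:"message_type"`
	PercentDone float64 `json:"percent_done"`
	FilesDone   uint64  `json:"files_done"`
	TotalFiles  uint64  `json:"total_files"`
	BytesDone   uint64  `json:"bytes_done"`
	TotalBytes  uint64  `json:"total_bytes"`
}

// parseStatusLine decodes a single output line. It returns false for lines
// that are not valid JSON or whose message_type is not "status".
func parseStatusLine(line []byte) (statusLine, bool) {
	var s statusLine
	if err := json.Unmarshal(line, &s); err != nil {
		return statusLine{}, false
	}
	if s.MessageType != "status" {
		return statusLine{}, false
	}
	return s, true
}

func main() {
	// Stand-in for the stdout pipe of a running restic process.
	out := strings.NewReader(`{"message_type":"status","percent_done":0.5,"files_done":10,"total_files":20,"bytes_done":512,"total_bytes":1024}
not json
{"message_type":"summary","files_new":20}
`)
	sc := bufio.NewScanner(out)
	for sc.Scan() {
		if s, ok := parseStatusLine(sc.Bytes()); ok {
			fmt.Printf("%.0f%% done, %d/%d files\n", s.PercentDone*100, s.FilesDone, s.TotalFiles)
		}
	}
}
```

Parsing line by line keeps memory use flat for long-running backups and lets malformed lines be skipped without aborting the job.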
4.3 Restic REST server (Unraid)
- Run `restic/rest-server` Docker container
- `--append-only` enabled
- `--private-repos` enabled (each user only sees their own subpath)
- htpasswd file with one user per host
- Storage path mapped to Unraid share
5. Domain model
Host
id, name, os, arch, agent_version, restic_version,
enrolled_at, last_seen_at, status (online/offline/degraded),
repo_id (FK), tags,
current_job_id (FK nullable),
last_backup_at, last_backup_status (succeeded|failed|cancelled|null),
repo_size_bytes, snapshot_count, open_alert_count
# Last six fields are denormalised projections, refreshed on
# job.finished, snapshots.report, repo.stats, and alert state changes.
Repo
id, name, url, kind (rest|s3|local), credential_id (FK),
password_secret_id (FK),
size_bytes, snapshot_count, dedup_ratio,
last_check_at, last_check_status, lock_state (locked|unlocked),
append_only (bool), credential_rotated_at
# Bottom block is a cached projection from `restic stats` +
# Credential row, refreshed by repo.stats agent messages.
Credential
id, kind, username, secret_ref (encrypted),
rotated_at
Schedule
id, host_id (FK), kind (backup|forget|prune|check),
cron_expr, paths (json), excludes (json), tags (json),
retention_policy (json), options (json), pre_hook, post_hook,
enabled
# retention_policy: {keep_last, keep_hourly, keep_daily, keep_weekly,
# keep_monthly, keep_yearly, keep_tag: [...]}
# options: {limit_upload_kbps, limit_download_kbps}
# pre_hook/post_hook: see §14.3 (encrypted at rest)
Job
id, host_id (FK), kind, status (queued|running|succeeded|failed|cancelled),
schedule_id (FK nullable),
actor_kind (user|schedule|system), actor_id (nullable),
started_at, finished_at,
exit_code, stats (json), error
JobLog
job_id (FK), seq, ts, stream (stdout|stderr|event), payload
Snapshot (cached projection from `restic snapshots --json`)
id (restic id), host_id (FK), repo_id (FK),
time, hostname, paths, tags, size_bytes, file_count
Alert
id, host_id (FK nullable), kind, severity, message,
created_at, acknowledged_at, resolved_at
User
id, username, password_hash, role (admin|operator|viewer),
created_at, last_login_at
Session
id, user_id (FK), created_at, expires_at, ip, ua
AuditLog
id, user_id (FK nullable), actor (user|agent|system),
action, target_kind, target_id, ts, payload (json)
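To make the `retention_policy` shape on Schedule concrete, here is a minimal sketch of translating it into `restic forget` arguments. The `--keep-*` flags are restic's own; the Go type and function names are hypothetical, and treating zero as "unset" is an assumption of this sketch rather than something the model above specifies.

```go
package main

import (
	"fmt"
	"strconv"
)

// RetentionPolicy mirrors Schedule.retention_policy.
// Assumption: a zero count means "not set" and emits no flag.
type RetentionPolicy struct {
	KeepLast    int      `json:"keep_last"`
	KeepHourly  int      `json:"keep_hourly"`
	KeepDaily   int      `json:"keep_daily"`
	KeepWeekly  int      `json:"keep_weekly"`
	KeepMonthly int      `json:"keep_monthly"`
	KeepYearly  int      `json:"keep_yearly"`
	KeepTag     []string `json:"keep_tag"`
}

// forgetArgs builds the argument list passed to the restic binary.
func forgetArgs(p RetentionPolicy) []string {
	args := []string{"forget", "--json"}
	add := func(flag string, n int) {
		if n > 0 {
			args = append(args, flag, strconv.Itoa(n))
		}
	}
	add("--keep-last", p.KeepLast)
	add("--keep-hourly", p.KeepHourly)
	add("--keep-daily", p.KeepDaily)
	add("--keep-weekly", p.KeepWeekly)
	add("--keep-monthly", p.KeepMonthly)
	add("--keep-yearly", p.KeepYearly)
	for _, t := range p.KeepTag {
		args = append(args, "--keep-tag", t)
	}
	return args
}

func main() {
	p := RetentionPolicy{KeepDaily: 7, KeepWeekly: 4, KeepTag: []string{"pinned"}}
	fmt.Println(forgetArgs(p))
}
```

Building an argument slice (rather than a shell string) avoids quoting problems when tags contain spaces.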
6. API surface (control plane)
6.1 UI/REST (browser → server)
POST /api/auth/login
POST /api/auth/logout
GET /api/fleet/summary (aggregate: host counts by status,
total bytes, open alerts; reused by /metrics)
GET /api/hosts ?tag=&status=&limit=&offset=
(returns Host rows incl. denormalised
last_backup_*, repo_size_bytes,
snapshot_count, open_alert_count,
current_job_id)
GET /api/hosts/:id
DELETE /api/hosts/:id
POST /api/hosts/:id/enrollment-token (regenerate)
POST /api/hosts/:id/agent/update (force agent self-update; see §4.2)
GET /api/hosts/:id/snapshots ?tag=&path=&since=&until=&limit=&offset=
GET /api/hosts/:id/repo (full Repo projection)
POST /api/hosts/:id/jobs (run-now: backup/forget/prune/check/unlock)
POST /api/hosts/:id/restore (restore wizard submit)
GET /api/hosts/:id/schedules
POST /api/hosts/:id/schedules
PUT /api/schedules/:id
DELETE /api/schedules/:id
GET /api/jobs ?host_id=&kind=&status=&since=&until=
&limit=&offset=&order=desc
GET /api/jobs/:id
GET /api/jobs/:id/logs (paginated: ?after_seq=&limit=)
WS /api/jobs/:id/stream (live progress; see §6.2 for shape)
POST /api/jobs/:id/cancel
GET /api/repos
GET /api/repos/:id
GET /api/alerts
POST /api/alerts/:id/ack
GET /api/audit
GET /api/users (admin)
POST /api/users (admin)
Realtime strategy: only /api/jobs/:id/stream uses WS. All other screens
(dashboard, hosts, snapshots) refresh via HTMX polling (~10s cadence). Revisit
if dashboard staleness becomes a problem in practice.
6.2 Agent ↔ Server
Single authenticated WebSocket per agent. Bidirectional JSON-RPC-ish messages.
Agent → server:
- `hello` (host metadata, agent version, restic version, OS)
- `heartbeat` (every 30s)
- `job.started` (job_id, kind, started_at)
- `job.progress` (job_id, percent_done, files_done, total_files, bytes_done, total_bytes, eta_seconds, throughput_bps)
- `job.finished` (job_id, status, exit_code, stats, error, finished_at)
- `snapshots.report` (full list after each successful backup)
- `repo.stats` (size_bytes, snapshot_count, dedup_ratio, last_check_at, last_check_status, lock_state)
- `log.stream` (live stdout/stderr lines while job running; {job_id, seq, ts, stream: stdout|stderr|event, payload})
Server → agent:
- `command.run` (kind, args)
- `command.cancel` (job_id)
- `schedule.set` (full schedule list, agent reconciles local cron)
- `config.update`
- `agent.update` (new version available, URL + signature)
The server fans out job.progress and log.stream messages for a given job to all
browsers subscribed to WS /api/jobs/:id/stream (§6.1) without
transformation, so the schema is shared end-to-end.
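One way the "JSON-RPC-ish" messages could look on the wire is sketched below. The envelope shape (`type` plus `payload`) and the string job ID are assumptions made for illustration; only the `job.progress` field names come from the message list above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Envelope is a hypothetical wrapper: Type names the message
// (e.g. "job.progress") and Payload carries its type-specific body.
type Envelope struct {
	Type    string          `json:"type"`
	Payload json.RawMessage `json:"payload"`
}

// JobProgress carries the job.progress fields listed in this section.
// Assumption: job IDs are strings.
type JobProgress struct {
	JobID         string  `json:"job_id"`
	PercentDone   float64 `json:"percent_done"`
	FilesDone     uint64  `json:"files_done"`
	TotalFiles    uint64  `json:"total_files"`
	BytesDone     uint64  `json:"bytes_done"`
	TotalBytes    uint64  `json:"total_bytes"`
	ETASeconds    uint64  `json:"eta_seconds"`
	ThroughputBps uint64  `json:"throughput_bps"`
}

// encode wraps any message body in an Envelope.
func encode(msgType string, body any) ([]byte, error) {
	raw, err := json.Marshal(body)
	if err != nil {
		return nil, err
	}
	return json.Marshal(Envelope{Type: msgType, Payload: raw})
}

// decodeProgress unwraps an Envelope and rejects other message types.
func decodeProgress(data []byte) (JobProgress, error) {
	var env Envelope
	if err := json.Unmarshal(data, &env); err != nil {
		return JobProgress{}, err
	}
	if env.Type != "job.progress" {
		return JobProgress{}, fmt.Errorf("unexpected message type %q", env.Type)
	}
	var p JobProgress
	err := json.Unmarshal(env.Payload, &p)
	return p, err
}

func main() {
	raw, err := encode("job.progress", JobProgress{JobID: "j1", PercentDone: 0.5, BytesDone: 512, TotalBytes: 1024})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(raw))
	p, err := decodeProgress(raw)
	fmt.Println(p.JobID, p.PercentDone, err)
}
```

Keeping the payload as `json.RawMessage` lets the server forward it to browser subscribers byte-for-byte, which matches the "without transformation" requirement.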
6.3 Enrollment
- Operator clicks "Add host" → server generates one-time token (TTL 1h)
- Operator runs install script on endpoint with token
- Agent calls `POST /api/agents/enroll` with token + host metadata
- Server issues persistent agent credential (bearer token + TLS pin) and host record
- Agent stores credential, opens WS connection
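The one-time token in the first step could be handled roughly as follows. This is an in-memory sketch under stated assumptions: the real server would presumably persist tokens in SQLite, the names (`tokenStore`, `issue`, `redeem`) are hypothetical, and times are plain Unix seconds to keep the example small.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"errors"
	"fmt"
	"sync"
	"time"
)

var errInvalidToken = errors.New("enrollment token invalid or expired")

// tokenStore tracks outstanding enrollment tokens and their expiry.
type tokenStore struct {
	mu      sync.Mutex
	ttlSecs int64
	expires map[string]int64 // token -> expiry (Unix seconds)
}

func newTokenStore(ttlSecs int64) *tokenStore {
	return &tokenStore{ttlSecs: ttlSecs, expires: make(map[string]int64)}
}

// issue creates a random 256-bit token valid for ttlSecs from nowUnix.
func (s *tokenStore) issue(nowUnix int64) (string, error) {
	buf := make([]byte, 32)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	tok := hex.EncodeToString(buf)
	s.mu.Lock()
	defer s.mu.Unlock()
	s.expires[tok] = nowUnix + s.ttlSecs
	return tok, nil
}

// redeem consumes a token. It is deleted on first use, so a second
// attempt fails even inside the TTL window.
func (s *tokenStore) redeem(tok string, nowUnix int64) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	exp, ok := s.expires[tok]
	if !ok {
		return errInvalidToken
	}
	delete(s.expires, tok)
	if nowUnix > exp {
		return errInvalidToken
	}
	return nil
}

func main() {
	store := newTokenStore(3600) // TTL 1h, as in the enrollment flow
	now := time.Now().Unix()
	tok, err := store.issue(now)
	if err != nil {
		panic(err)
	}
	fmt.Println("first redeem:", store.redeem(tok, now+60))
	fmt.Println("second redeem:", store.redeem(tok, now+120))
}
```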
7. Security
7.1 Authentication
- Phase 1: username + password (argon2id), HTTP-only secure session cookies, CSRF tokens on state-changing requests
- Phase 2: OIDC (Authelia, Keycloak, Authentik)
- Agents: bearer token over TLS; pin server cert fingerprint at enrollment time
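Certificate pinning on the agent side might be sketched like this. The choice of SHA-256 over the DER-encoded leaf certificate is an assumption (the spec does not say what the fingerprint covers), and the function names are hypothetical. In a real agent, `pinMatches` could be called from the `VerifyPeerCertificate` callback of `crypto/tls.Config`, which receives the raw certificates in the same `[][]byte` form.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
)

// fingerprint returns the lowercase hex SHA-256 of a DER-encoded certificate.
func fingerprint(der []byte) string {
	sum := sha256.Sum256(der)
	return hex.EncodeToString(sum[:])
}

// pinMatches reports whether the leaf certificate (rawCerts[0]) hashes to
// the fingerprint stored at enrollment time.
func pinMatches(rawCerts [][]byte, pin string) bool {
	if len(rawCerts) == 0 {
		return false
	}
	got := fingerprint(rawCerts[0])
	return subtle.ConstantTimeCompare([]byte(got), []byte(pin)) == 1
}

func main() {
	// Placeholder bytes standing in for a DER-encoded server certificate.
	enrolled := []byte("server-cert-der-bytes")
	pin := fingerprint(enrolled)

	fmt.Println(pinMatches([][]byte{enrolled}, pin))
	fmt.Println(pinMatches([][]byte{[]byte("some-other-cert")}, pin))
}
```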
7.2 Authorization (Phase 1: simple roles)
- admin: everything
- operator: trigger jobs, edit schedules, restore
- viewer: read-only
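The three roles can be expressed as a small permission check. The action names below are hypothetical labels for the capabilities listed above; only the role names come from the spec.

```go
package main

import "fmt"

type Role string

const (
	RoleAdmin    Role = "admin"
	RoleOperator Role = "operator"
	RoleViewer   Role = "viewer"
)

// Action is an illustrative label for a class of API operations.
type Action string

const (
	ActionRead         Action = "read"
	ActionRunJob       Action = "run_job"
	ActionEditSchedule Action = "edit_schedule"
	ActionRestore      Action = "restore"
	ActionManageUsers  Action = "manage_users"
)

// operatorActions lists what an operator may do beyond reading.
var operatorActions = map[Action]bool{
	ActionRunJob:       true,
	ActionEditSchedule: true,
	ActionRestore:      true,
}

// allowed reports whether a role may perform an action.
// Unknown roles are denied everything.
func allowed(r Role, a Action) bool {
	switch r {
	case RoleAdmin:
		return true
	case RoleOperator:
		return a == ActionRead || operatorActions[a]
	case RoleViewer:
		return a == ActionRead
	default:
		return false
	}
}

func main() {
	fmt.Println(allowed(RoleOperator, ActionRestore))
	fmt.Println(allowed(RoleViewer, ActionRunJob))
}
```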
7.3 Secret handling
- Restic repo passwords and REST-server credentials encrypted at rest in SQLite using a server-side key (loaded from env or file at startup)
- Pushed to agents only over the authenticated WS, only when needed for a job
- Agent stores them in OS keyring where available (Windows DPAPI, Linux Secret Service / fallback to encrypted file with restricted perms)
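Encryption at rest with a server-side key could be sketched as below. The spec does not name a cipher; AES-256-GCM from the Go standard library is an assumption here, as are the function names. The random nonce is stored in front of the ciphertext so a single blob can go into one SQLite column.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// newGCM builds an AES-GCM AEAD from a 16-, 24-, or 32-byte key.
func newGCM(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// sealSecret encrypts plaintext and returns nonce || ciphertext.
func sealSecret(key, plaintext []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// openSecret reverses sealSecret and fails if the blob was tampered with.
func openSecret(key, sealed []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	n := gcm.NonceSize()
	if len(sealed) < n {
		return nil, errors.New("sealed secret too short")
	}
	return gcm.Open(nil, sealed[:n], sealed[n:], nil)
}

func main() {
	key := make([]byte, 32) // in practice loaded from env or file at startup
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	sealed, err := sealSecret(key, []byte("repo-password"))
	if err != nil {
		panic(err)
	}
	plain, err := openSecret(key, sealed)
	fmt.Println(string(plain), err)
}
```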
7.4 Repo protection
- Restic REST server runs with `--append-only` for routine backups
- A separate non-append-only credential exists for `forget`/`prune` operations, used only when explicitly invoked from the UI by an admin/operator and audited
7.5 Audit
- Every state-changing UI action and every server→agent command logged with user, target, timestamp, and payload
8. UI
Stack: HTMX + Tailwind + Go html/templates. No SPA framework. Server-rendered, progressive enhancement.
Pages:
- Login
- Dashboard: fleet overview (host cards: status, last backup, repo size, alerts)
- Host detail: tabs for Snapshots / Schedules / Jobs / Repo / Settings
- Job detail: live log streaming via WS, cancel button
- Restore wizard: host → snapshot → paths → target → confirm
- Repos: aggregate view across hosts
- Alerts: list, acknowledge
- Settings: users (admin), notification channels, agent download
- Audit log
9. Alerting
- Triggers: backup failed, backup hasn't run in N hours past its schedule, repo `check` failed, agent offline > N minutes, repo size growth anomaly
- Channels (Phase 1): webhook, ntfy, email (SMTP)
- Channels (Phase 2+): Discord, Slack, Pushover
10. Deployment
10.1 Control plane (Proxmox host or LXC)
docker-compose.yml:
services:
restic-manager:
image: ghcr.io/<owner>/restic-manager:latest
restart: unless-stopped
ports:
- "8443:8443"
volumes:
- ./data:/data
- ./certs:/certs:ro
environment:
- RM_DATA_DIR=/data
- RM_LISTEN=:8443
- RM_BASE_URL=https://restic.lab.example
- RM_TLS_CERT=/certs/fullchain.pem
- RM_TLS_KEY=/certs/privkey.pem
- RM_SECRET_KEY_FILE=/data/secret.key
10.2 Restic REST server (Unraid)
Standard restic/rest-server container, --append-only, --private-repos, htpasswd mounted, data path on the share.
10.3 Agent install
- Linux: `curl -fsSL https://restic.lab.example/install.sh | sudo RM_TOKEN=xxx sh`
- Windows: `iwr https://restic.lab.example/install.ps1 | iex` (with `$env:RM_TOKEN`)
- Installer drops binary + service unit, calls enroll endpoint, starts service
11. Testing strategy
- Unit tests: restic JSON parsing, schedule reconciliation, retention policy logic
- Integration tests: spin up real `restic` + `rest-server` in Docker, exercise full backup/snapshot/restore flows
- End-to-end: Playwright against a compose-up'd stack with one Linux agent in a sibling container
- Cross-platform agent CI: build matrix `linux/amd64`, `linux/arm64`, `windows/amd64`; smoke test on Windows runner
12. Repository layout
restic-manager/
├── cmd/
│ ├── server/
│ └── agent/
├── internal/
│ ├── api/ # shared API types
│ ├── server/
│ │ ├── http/
│ │ ├── ws/
│ │ └── ui/ # templates, handlers
│ ├── agent/
│ │ ├── service/ # systemd / windows service glue
│ │ ├── runner/ # restic invocation
│ │ └── scheduler/
│ ├── restic/ # restic CLI wrapper, --json parsing
│ ├── store/ # sqlite layer
│ ├── crypto/ # secret encryption
│ └── auth/
├── web/
│ ├── templates/
│ └── static/
├── deploy/
│ ├── docker-compose.yml
│ ├── Dockerfile.server
│ └── install/
│ ├── install.sh
│ └── install.ps1
├── docs/
├── LICENSE # PolyForm Noncommercial 1.0.0
├── README.md
├── spec.md
└── tasks.md
13. Phased delivery
- Phase 1 (MVP): server skeleton, agent skeleton, enrollment, host list, snapshot list, on-demand backup, live job log
- Phase 2: schedules, retention, run-now for `forget`/`prune`/`check`/`unlock`, repo stats
- Phase 3: restore wizard, alerts (webhook/ntfy/email), audit log
- Phase 4: agent self-update, OIDC, multi-user/RBAC polish, repo trends
- Phase 5: OSS readiness — docs site, contribution guide, screenshot tour
14. Confirmed extensions (in scope)
These were originally listed as open questions and have been confirmed for inclusion. Each is slotted into a phase below.
14.1 Cross-host restore
Restore a snapshot taken on host A onto host B (e.g. recover a dead box onto a fresh one, clone a workload onto a sibling host, restore a developer's home dir onto a new laptop).
- Credential model: target host's agent receives a temporary, server-issued read credential for the source host's repo, scoped to a single restore job and revoked immediately after
- Path remapping: UI allows rewriting source paths to target paths (e.g. `/home/alice` → `/home/alice-new`)
- Permissions: restore runs as the agent's service user; UI surfaces a warning when source paths require root and target service user is non-root
- Phase: 3 (with the restore wizard)
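The path remapping rule can be illustrated with a small prefix rewrite. This is a sketch for Unix-style paths only; Windows paths and the function name `remapPath` are outside what the spec defines. Matching on a full path component prevents `/home/alice` from also rewriting `/home/alice2`.

```go
package main

import (
	"fmt"
	"strings"
)

// remapPath rewrites srcPrefix to dstPrefix when path equals srcPrefix or
// lies beneath it. The bool result reports whether a rewrite happened.
func remapPath(path, srcPrefix, dstPrefix string) (string, bool) {
	src := strings.TrimSuffix(srcPrefix, "/")
	dst := strings.TrimSuffix(dstPrefix, "/")
	if path == src {
		return dst, true
	}
	if strings.HasPrefix(path, src+"/") {
		return dst + path[len(src):], true
	}
	return path, false
}

func main() {
	for _, p := range []string{"/home/alice/docs/a.txt", "/home/alice2/b.txt"} {
		out, changed := remapPath(p, "/home/alice", "/home/alice-new")
		fmt.Println(out, changed)
	}
}
```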
14.2 Bandwidth limiting
Per-host upload/download caps for backup, restore, and prune jobs.
- Exposed on the schedule editor as optional `--limit-upload` / `--limit-download` (restic interprets these values in KiB/s)
- Also overridable on run-now jobs via the UI
- Persisted in `Schedule.options` (JSON blob) so the schema stays stable
- Phase: 2 (with scheduling)
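A minimal sketch of turning the persisted `Schedule.options` blob into restic flags follows. The JSON keys match the `options` shape in the domain model, and `--limit-upload` / `--limit-download` are restic's global flags; the Go names and the "zero means no cap" rule are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// JobOptions mirrors Schedule.options.
// Assumption: a zero or missing value means "no limit".
type JobOptions struct {
	LimitUploadKbps   int `json:"limit_upload_kbps"`
	LimitDownloadKbps int `json:"limit_download_kbps"`
}

// limitArgs parses the stored JSON blob and returns the extra restic
// flags to prepend to a job's argument list.
func limitArgs(optionsJSON string) ([]string, error) {
	var o JobOptions
	if optionsJSON != "" {
		if err := json.Unmarshal([]byte(optionsJSON), &o); err != nil {
			return nil, err
		}
	}
	var args []string
	if o.LimitUploadKbps > 0 {
		args = append(args, "--limit-upload", strconv.Itoa(o.LimitUploadKbps))
	}
	if o.LimitDownloadKbps > 0 {
		args = append(args, "--limit-download", strconv.Itoa(o.LimitDownloadKbps))
	}
	return args, nil
}

func main() {
	args, err := limitArgs(`{"limit_upload_kbps": 2048, "limit_download_kbps": 4096}`)
	fmt.Println(args, err)
}
```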
14.3 Pre/post backup hooks
Per-host shell commands run before and after a backup job. Use cases: mysqldump/pg_dump to a staging path, stop/start Docker containers, quiesce a service, post-backup notifications.
- Schema: `Schedule.pre_hook` and `Schedule.post_hook` (string, optional). For more complex cases, `Host.pre_hook_default` / `Host.post_hook_default` apply to all schedules on that host unless overridden
- Execution: agent runs hooks via the host's default shell (`/bin/sh` on Linux, `cmd.exe` or PowerShell on Windows; host-configurable)
- Failure semantics: `pre_hook` non-zero exit aborts the backup and marks the job failed. `post_hook` runs on both success and failure (with `RM_JOB_STATUS` env var); its own exit code is recorded but does not change the backup job's final status
- Stdout/stderr: captured into `JobLog` like restic output, prefixed `pre_hook:` / `post_hook:`
- Security: hooks are stored encrypted; only admins can edit them; every edit audit-logged
- Phase: 2 (with scheduling)
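The failure semantics above can be sketched as a small orchestration function. Each phase is modelled as a function returning an exit code so the control flow is visible; a real agent would run the commands via `os/exec` and pass the status in the `RM_JOB_STATUS` environment variable. Two points are assumptions of this sketch, not statements of the spec: `post_hook` still runs when `pre_hook` aborts the job, and any non-zero restic exit code counts as failure.

```go
package main

import "fmt"

// result summarises one backup job run with hooks.
type result struct {
	Status       string // "succeeded" or "failed"
	BackupRan    bool
	PostHookExit int
}

// runWithHooks applies the hook semantics: a failing pre-hook skips the
// backup and fails the job; the post-hook always runs, sees the final
// status, and cannot change it.
func runWithHooks(pre func() int, backup func() int, post func(jobStatus string) int) result {
	res := result{Status: "succeeded"}
	if pre != nil && pre() != 0 {
		res.Status = "failed" // abort: backup is never started
	} else {
		res.BackupRan = true
		if backup() != 0 {
			res.Status = "failed"
		}
	}
	if post != nil {
		// Exit code is recorded only; Status is left untouched.
		res.PostHookExit = post(res.Status)
	}
	return res
}

func main() {
	res := runWithHooks(
		func() int { return 0 }, // pre_hook succeeds
		func() int { return 0 }, // backup succeeds
		func(status string) int {
			fmt.Println("RM_JOB_STATUS =", status)
			return 2 // post_hook fails, job status unaffected
		},
	)
	fmt.Printf("%+v\n", res)
}
```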
14.4 Prometheus /metrics endpoint
Standard Prometheus exposition on /metrics, protected by either bearer token or IP allow-list.
- Metrics (per host):
  - `restic_manager_last_backup_timestamp_seconds{host=...}`
  - `restic_manager_last_backup_status{host=...}` (1=success, 0=failure)
  - `restic_manager_repo_size_bytes{host=...}`
  - `restic_manager_snapshot_count{host=...}`
  - `restic_manager_agent_online{host=...}` (1/0)
  - `restic_manager_job_duration_seconds_bucket{kind=...,host=...}` (histogram)
- Server-level: `restic_manager_jobs_total{kind=...,status=...}`, `restic_manager_alerts_active`, `restic_manager_build_info`
- Phase: 4 (alongside repo trend charts; both rely on the same time-series data)
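For a sense of what `/metrics` would return, here is a standard-library sketch that renders one per-host gauge in the Prometheus text exposition format. A real server would more likely use the official Prometheus Go client; `renderGauge` is a hypothetical helper, and Go's `%q` quoting only coincides with Prometheus label escaping for simple host names like the ones shown.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// renderGauge writes a TYPE line followed by one sample per host,
// sorted by host name so the output is stable between scrapes.
func renderGauge(name string, byHost map[string]int64) string {
	hosts := make([]string, 0, len(byHost))
	for h := range byHost {
		hosts = append(hosts, h)
	}
	sort.Strings(hosts)

	var sb strings.Builder
	fmt.Fprintf(&sb, "# TYPE %s gauge\n", name)
	for _, h := range hosts {
		fmt.Fprintf(&sb, "%s{host=%q} %d\n", name, h, byHost[h])
	}
	return sb.String()
}

func main() {
	fmt.Print(renderGauge("restic_manager_agent_online", map[string]int64{"web1": 1, "nas": 0}))
	fmt.Print(renderGauge("restic_manager_snapshot_count", map[string]int64{"web1": 42, "nas": 7}))
}
```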
15. Future considerations (not yet committed)
- Read-only share links for snapshot listings (auditor view) — out of scope for personal/lab use; revisit if multi-tenant or org use cases emerge