2026-04-30 23:55:52 +01:00

restic-manager — Specification

1. Overview

restic-manager is a self-hosted, browser-based single pane of glass for managing restic backups across a fleet of Linux and Windows endpoints. It provides visibility, scheduling, ad-hoc operations, restore workflows, and alerting from one UI.

It is built for small-to-medium fleets (initial target: ~12 endpoints) and is intentionally simple to deploy: one Docker Compose file on the control-plane host, one small agent binary on each endpoint.

License: PolyForm Noncommercial 1.0.0

2. Goals & Non-Goals

Goals

  • Central visibility into backup state for every endpoint
  • Trigger any restic operation remotely (backup, forget, prune, check, unlock, snapshots, stats, diff, restore)
  • Manage per-host backup schedules from the UI
  • Live job progress streamed back to the UI
  • Restore wizard (browse snapshots, pick paths, restore to original or alternate host)
  • Repo health surfacing (size, dedup ratio, last check, lock state)
  • Alerting on failure or staleness
  • Cross-platform agent (Linux + Windows)
  • Ransomware-resistant repo access via append-only credentials

Non-Goals (initial release)

  • Replacing restic itself or providing custom repo formats
  • Managing non-restic backup tools
  • Multi-tenancy / SaaS deployment
  • High availability of the control plane (SQLite, single-instance)
  • Mobile-native apps (responsive web only)

3. Architecture

3.1 Components

┌──────────────────────────────────────────────────────────────────┐
│  Proxmox cluster                                                 │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │  docker compose: restic-manager                            │  │
│  │   - server (Go binary, REST + WS API, embedded HTMX UI)    │  │
│  │   - SQLite volume                                          │  │
│  └────────────────────────────────────────────────────────────┘  │
└────────────────────────▲─────────────────────────────────────────┘
                         │ HTTPS (control plane)
                         │  - agent → server: status, telemetry
                         │  - server → agent: commands, schedules
                         │
┌────────────────────────┴─────────────────────────────────────────┐
│  Endpoints (Linux + Windows)                                     │
│  ┌──────────────────────┐    ┌────────────────────────────────┐  │
│  │  restic-manager-     │    │  restic CLI                    │  │
│  │  agent (Go binary)   │───▶│  invoked by agent              │  │
│  │  - systemd / svc     │    └─────────────┬──────────────────┘  │
│  │  - WS to server      │                  │ HTTPS               │
│  └──────────────────────┘                  │ (data plane)        │
└─────────────────────────────────────────────┼────────────────────┘
                                              │
                                              ▼
┌──────────────────────────────────────────────────────────────────┐
│  Unraid                                                          │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │  Docker: restic/rest-server                                │  │
│  │   - per-host append-only credentials                       │  │
│  │   - one repo per host                                      │  │
│  │   - storage: Unraid share                                  │  │
│  └────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────┘

3.2 Data flow

  • Backup data: endpoint → restic CLI → restic REST server on Unraid → Unraid share. The control plane never touches backup bytes.
  • Control plane: agent maintains an outbound WebSocket to the server. Server pushes commands and schedule changes; agent pushes status, logs, live job progress, host metadata.
  • UI: browser → server (HTTPS, session cookies). Server fans out commands to agents, streams progress back to browser.

3.3 Why agent (not SSH)

  • Agent-initiated outbound connection works through NAT/firewalls without inbound rules on endpoints
  • Native Windows support without OpenSSH service quirks
  • Local scheduling survives controller restarts
  • Self-contained restic --json parsing, no remote shell quoting hazards

3.4 Why per-host repos

  • Isolates corruption / lock contention
  • Append-only credentials per host = compromised endpoint can't delete other hosts' backups
  • Simpler prune orchestration (no global lock coordination)
  • Trivially easy to retire a host (delete its repo + credential)

4. Components in detail

4.1 Server

  • Language: Go 1.22+
  • Storage: SQLite (via modernc.org/sqlite, no CGo)
  • HTTP: net/http + chi router
  • WebSocket: nhooyr.io/websocket
  • UI: HTMX + Tailwind, server-rendered Go templates, no Node build step
  • Distribution: single static binary, packaged in a Docker image; published docker-compose.yml
  • Config: YAML or env vars (RM_LISTEN, RM_DATA_DIR, RM_BASE_URL, RM_TLS_CERT, RM_TLS_KEY)
  • TLS: terminate TLS in-process (cert from Caddy/Traefik sidecar acceptable; agents require HTTPS)

4.2 Agent

  • Language: Go (cross-compiled for linux/amd64, linux/arm64, windows/amd64)
  • Service integration: systemd unit (Linux), Windows service via golang.org/x/sys/windows/svc
  • Footprint goal: ≤ 15 MB binary, ≤ 50 MB RSS idle
  • Persistence: local config file + small state DB (BoltDB or JSON) for queued reports if server is unreachable
  • Restic invocation: spawns restic with --json, parses streamed output, forwards to server in real time
  • Self-update: server publishes signed agent binary; agent downloads, verifies signature, swaps binary, restarts service

4.3 Restic REST server (Unraid)

  • Run restic/rest-server Docker container
  • --append-only enabled
  • --private-repos enabled (each user only sees their own subpath)
  • htpasswd file with one user per host
  • Storage path mapped to Unraid share

5. Domain model

Host
  id, name, os, arch, agent_version, restic_version,
  enrolled_at, last_seen_at, status (online/offline/degraded),
  repo_id (FK), tags,
  current_job_id (FK nullable),
  last_backup_at, last_backup_status (succeeded|failed|cancelled|null),
  repo_size_bytes, snapshot_count, open_alert_count
  # Last six fields are denormalised projections, refreshed on
  # job.finished, snapshots.report, repo.stats, and alert state changes.

Repo
  id, name, url, kind (rest|s3|local), credential_id (FK),
  password_secret_id (FK),
  size_bytes, snapshot_count, dedup_ratio,
  last_check_at, last_check_status, lock_state (locked|unlocked),
  append_only (bool), credential_rotated_at
  # Bottom block is a cached projection from `restic stats` +
  # Credential row, refreshed by repo.stats agent messages.

Credential
  id, kind, username, secret_ref (encrypted),
  rotated_at

Schedule
  id, host_id (FK), kind (backup|forget|prune|check),
  cron_expr, paths (json), excludes (json), tags (json),
  retention_policy (json), options (json), pre_hook, post_hook,
  enabled
  # retention_policy: {keep_last, keep_hourly, keep_daily, keep_weekly,
  #                    keep_monthly, keep_yearly, keep_tag: [...]}
  # options:          {limit_upload_kbps, limit_download_kbps}
  # pre_hook/post_hook: see §14.3 (encrypted at rest)

Job
  id, host_id (FK), kind, status (queued|running|succeeded|failed|cancelled),
  schedule_id (FK nullable),
  actor_kind (user|schedule|system), actor_id (nullable),
  started_at, finished_at,
  exit_code, stats (json), error

JobLog
  job_id (FK), seq, ts, stream (stdout|stderr|event), payload

Snapshot  (cached projection from `restic snapshots --json`)
  id (restic id), host_id (FK), repo_id (FK),
  time, hostname, paths, tags, size_bytes, file_count

Alert
  id, host_id (FK nullable), kind, severity, message,
  created_at, acknowledged_at, resolved_at

User
  id, username, password_hash, role (admin|operator|viewer),
  created_at, last_login_at

Session
  id, user_id (FK), created_at, expires_at, ip, ua

AuditLog
  id, user_id (FK nullable), actor (user|agent|system),
  action, target_kind, target_id, ts, payload (json)
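
As an illustration of how Schedule.retention_policy might be consumed, the sketch below maps the JSON keys listed above onto restic's `forget` flags (`--keep-last`, `--keep-daily`, and so on). The `RetentionPolicy` type and `ForgetArgs` method are hypothetical names. Zero values are assumed to mean "not set".

```go
package main

import (
	"fmt"
	"strconv"
)

// RetentionPolicy mirrors the retention_policy JSON blob on Schedule.
type RetentionPolicy struct {
	KeepLast    int      `json:"keep_last"`
	KeepHourly  int      `json:"keep_hourly"`
	KeepDaily   int      `json:"keep_daily"`
	KeepWeekly  int      `json:"keep_weekly"`
	KeepMonthly int      `json:"keep_monthly"`
	KeepYearly  int      `json:"keep_yearly"`
	KeepTag     []string `json:"keep_tag"`
}

// ForgetArgs builds the argument list for `restic forget`.
// Assumption: a zero count means the rule is unset and is omitted.
func (p RetentionPolicy) ForgetArgs() []string {
	args := []string{"forget", "--json"}
	add := func(flag string, n int) {
		if n > 0 {
			args = append(args, flag, strconv.Itoa(n))
		}
	}
	add("--keep-last", p.KeepLast)
	add("--keep-hourly", p.KeepHourly)
	add("--keep-daily", p.KeepDaily)
	add("--keep-weekly", p.KeepWeekly)
	add("--keep-monthly", p.KeepMonthly)
	add("--keep-yearly", p.KeepYearly)
	for _, t := range p.KeepTag {
		args = append(args, "--keep-tag", t)
	}
	return args
}

func main() {
	p := RetentionPolicy{KeepDaily: 7, KeepWeekly: 4, KeepTag: []string{"pinned"}}
	fmt.Println(p.ForgetArgs())
}
```

Passing arguments as a slice to the process spawner, rather than a joined string, avoids the shell quoting hazards noted in §3.3.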

6. API surface (control plane)

6.1 UI/REST (browser → server)

POST   /api/auth/login
POST   /api/auth/logout

GET    /api/fleet/summary                (aggregate: host counts by status,
                                          total bytes, open alerts; reused by /metrics)

GET    /api/hosts                        ?tag=&status=&limit=&offset=
                                          (returns Host rows incl. denormalised
                                          last_backup_*, repo_size_bytes,
                                          snapshot_count, open_alert_count,
                                          current_job_id)
GET    /api/hosts/:id
DELETE /api/hosts/:id
POST   /api/hosts/:id/enrollment-token   (regenerate)
POST   /api/hosts/:id/agent/update       (force agent self-update; see §4.2)

GET    /api/hosts/:id/snapshots          ?tag=&path=&since=&until=&limit=&offset=
GET    /api/hosts/:id/repo               (full Repo projection)
POST   /api/hosts/:id/jobs               (run-now: backup/forget/prune/check/unlock)
POST   /api/hosts/:id/restore            (restore wizard submit)

GET    /api/hosts/:id/schedules
POST   /api/hosts/:id/schedules
PUT    /api/schedules/:id
DELETE /api/schedules/:id

GET    /api/jobs                         ?host_id=&kind=&status=&since=&until=
                                          &limit=&offset=&order=desc
GET    /api/jobs/:id
GET    /api/jobs/:id/logs                (paginated: ?after_seq=&limit=)
WS     /api/jobs/:id/stream              (live progress; see §6.2 for shape)
POST   /api/jobs/:id/cancel

GET    /api/repos
GET    /api/repos/:id

GET    /api/alerts
POST   /api/alerts/:id/ack

GET    /api/audit
GET    /api/users   (admin)
POST   /api/users   (admin)

Realtime strategy: only /api/jobs/:id/stream uses WS. All other screens (dashboard, hosts, snapshots) refresh via HTMX polling (~10s cadence). Revisit if dashboard staleness becomes a problem in practice.

6.2 Agent ↔ Server

Single authenticated WebSocket per agent. Bidirectional JSON-RPC-ish messages.

Agent → server:

  • hello (host metadata, agent version, restic version, OS)
  • heartbeat (every 30s)
  • job.started (job_id, kind, started_at)
  • job.progress (job_id, percent_done, files_done, total_files, bytes_done, total_bytes, eta_seconds, throughput_bps)
  • job.finished (job_id, status, exit_code, stats, error, finished_at)
  • snapshots.report (full list after each successful backup)
  • repo.stats (size_bytes, snapshot_count, dedup_ratio, last_check_at, last_check_status, lock_state)
  • log.stream (live stdout/stderr lines while job running; {job_id, seq, ts, stream: stdout|stderr|event, payload})

Server → agent:

  • command.run (kind, args)
  • command.cancel (job_id)
  • schedule.set (full schedule list, agent reconciles local cron)
  • config.update
  • agent.update (new version available, URL + signature)

The server fans job.progress and log.stream for a given job to all browsers subscribed to WS /api/jobs/:id/stream (§6.1) without transformation, so the schema is shared end-to-end.
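
One possible wire shape for these messages is a small envelope with a type discriminator and a raw payload, decoded in two steps. This is a sketch only: the `type`/`payload` envelope keys and the `Envelope`, `JobProgress`, `CommandCancel`, and `decode` names are assumptions, since the spec fixes the message names and fields but not the framing.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Envelope is an assumed framing: {"type": "...", "payload": {...}}.
type Envelope struct {
	Type    string          `json:"type"`
	Payload json.RawMessage `json:"payload"`
}

// JobProgress carries the job.progress fields listed in this section.
type JobProgress struct {
	JobID         string  `json:"job_id"`
	PercentDone   float64 `json:"percent_done"`
	FilesDone     uint64  `json:"files_done"`
	TotalFiles    uint64  `json:"total_files"`
	BytesDone     uint64  `json:"bytes_done"`
	TotalBytes    uint64  `json:"total_bytes"`
	ETASeconds    uint64  `json:"eta_seconds"`
	ThroughputBps uint64  `json:"throughput_bps"`
}

// CommandCancel carries the command.cancel field.
type CommandCancel struct {
	JobID string `json:"job_id"`
}

// decode unmarshals the envelope first, then the payload into the
// concrete type selected by Type. Unknown types are an error.
func decode(raw []byte) (any, error) {
	var env Envelope
	if err := json.Unmarshal(raw, &env); err != nil {
		return nil, err
	}
	switch env.Type {
	case "job.progress":
		var p JobProgress
		err := json.Unmarshal(env.Payload, &p)
		return p, err
	case "command.cancel":
		var c CommandCancel
		err := json.Unmarshal(env.Payload, &c)
		return c, err
	default:
		return nil, fmt.Errorf("unknown message type %q", env.Type)
	}
}

func main() {
	msg, err := decode([]byte(`{"type":"job.progress","payload":{"job_id":"j1","percent_done":0.25}}`))
	fmt.Printf("%+v %v\n", msg, err)
}
```

Because the server forwards job.progress untouched, the same `JobProgress` type could live in internal/api and be shared by the agent, the server, and the browser stream handler.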

6.3 Enrollment

  1. Operator clicks "Add host" → server generates one-time token (TTL 1h)
  2. Operator runs install script on endpoint with token
  3. Agent calls POST /api/agents/enroll with token + host metadata
  4. Server issues persistent agent credential (bearer token + TLS pin) and host record
  5. Agent stores credential, opens WS connection

7. Security

7.1 Authentication

  • Phase 1: username + password (argon2id), HTTP-only secure session cookies, CSRF tokens on state-changing requests
  • Phase 2: OIDC (Authelia, Keycloak, Authentik)
  • Agents: bearer token over TLS; pin server cert fingerprint at enrollment time

7.2 Authorization (Phase 1: simple roles)

  • admin: everything
  • operator: trigger jobs, edit schedules, restore
  • viewer: read-only

7.3 Secret handling

  • Restic repo passwords and REST-server credentials encrypted at rest in SQLite using a server-side key (loaded from env or file at startup)
  • Pushed to agents only over the authenticated WS, only when needed for a job
  • Agent stores them in OS keyring where available (Windows DPAPI, Linux Secret Service / fallback to encrypted file with restricted perms)
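
The spec does not name a cipher for encryption at rest. As one plausible sketch, the functions below use AES-256-GCM from the Go standard library with a random nonce prepended to the ciphertext. The choice of AES-GCM and the `seal` / `open` names are assumptions; the key would be the server-side key loaded at startup.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// newGCM builds an AES-GCM AEAD from a 16, 24, or 32 byte key.
func newGCM(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// seal encrypts plaintext and returns nonce || ciphertext || tag.
func seal(key, plaintext []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// open reverses seal and fails if the data was modified.
func open(key, sealed []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	n := gcm.NonceSize()
	if len(sealed) < n {
		return nil, errors.New("ciphertext too short")
	}
	return gcm.Open(nil, sealed[:n], sealed[n:], nil)
}

func main() {
	key := make([]byte, 32) // demo only: all-zero key
	sealed, _ := seal(key, []byte("repo-password"))
	plain, err := open(key, sealed)
	fmt.Println(string(plain), err)
}
```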

7.4 Repo protection

  • Restic REST server runs with --append-only for routine backups
  • A separate non-append-only credential exists for forget/prune operations, used only when explicitly invoked from the UI by an admin/operator and audited

7.5 Audit

  • Every state-changing UI action and every server→agent command logged with user, target, timestamp, and payload

8. UI

Stack: HTMX + Tailwind + Go html/template. No SPA framework. Server-rendered, progressive enhancement.

Pages:

  • Login
  • Dashboard: fleet overview (host cards: status, last backup, repo size, alerts)
  • Host detail: tabs for Snapshots / Schedules / Jobs / Repo / Settings
  • Job detail: live log streaming via WS, cancel button
  • Restore wizard: host → snapshot → paths → target → confirm
  • Repos: aggregate view across hosts
  • Alerts: list, acknowledge
  • Settings: users (admin), notification channels, agent download
  • Audit log

9. Alerting

  • Triggers: backup failed, backup hasn't run in N hours past its schedule, repo check failed, agent offline > N minutes, repo size growth anomaly
  • Channels (Phase 1): webhook, ntfy, email (SMTP)
  • Channels (Phase 2+): Discord, Slack, Pushover

10. Deployment

10.1 Control plane (Proxmox host or LXC)

docker-compose.yml:

services:
  restic-manager:
    image: ghcr.io/<owner>/restic-manager:latest
    restart: unless-stopped
    ports:
      - "8443:8443"
    volumes:
      - ./data:/data
      - ./certs:/certs:ro
    environment:
      - RM_DATA_DIR=/data
      - RM_LISTEN=:8443
      - RM_BASE_URL=https://restic.lab.example
      - RM_TLS_CERT=/certs/fullchain.pem
      - RM_TLS_KEY=/certs/privkey.pem
      - RM_SECRET_KEY_FILE=/data/secret.key

10.2 Restic REST server (Unraid)

Standard restic/rest-server container, --append-only, --private-repos, htpasswd mounted, data path on the share.

10.3 Agent install

  • Linux: curl -fsSL https://restic.lab.example/install.sh | sudo RM_TOKEN=xxx sh
  • Windows: iwr https://restic.lab.example/install.ps1 | iex (with $env:RM_TOKEN)
  • Installer drops binary + service unit, calls enroll endpoint, starts service

11. Testing strategy

  • Unit tests: restic JSON parsing, schedule reconciliation, retention policy logic
  • Integration tests: spin up real restic + rest-server in Docker, exercise full backup/snapshot/restore flows
  • End-to-end: Playwright against a compose-up'd stack with one Linux agent in a sibling container
  • Cross-platform agent CI: build matrix linux/amd64, linux/arm64, windows/amd64; smoke test on Windows runner

12. Repository layout

restic-manager/
├── cmd/
│   ├── server/
│   └── agent/
├── internal/
│   ├── api/             # shared API types
│   ├── server/
│   │   ├── http/
│   │   ├── ws/
│   │   └── ui/          # templates, handlers
│   ├── agent/
│   │   ├── service/     # systemd / windows service glue
│   │   ├── runner/      # restic invocation
│   │   └── scheduler/
│   ├── restic/          # restic CLI wrapper, --json parsing
│   ├── store/           # sqlite layer
│   ├── crypto/          # secret encryption
│   └── auth/
├── web/
│   ├── templates/
│   └── static/
├── deploy/
│   ├── docker-compose.yml
│   ├── Dockerfile.server
│   └── install/
│       ├── install.sh
│       └── install.ps1
├── docs/
├── LICENSE              # PolyForm Noncommercial 1.0.0
├── README.md
├── spec.md
└── tasks.md

13. Phased delivery

  • Phase 1 (MVP): server skeleton, agent skeleton, enrollment, host list, snapshot list, on-demand backup, live job log
  • Phase 2: schedules, retention, run-now for forget/prune/check/unlock, repo stats
  • Phase 3: restore wizard, alerts (webhook/ntfy/email), audit log
  • Phase 4: agent self-update, OIDC, multi-user/RBAC polish, repo trends
  • Phase 5: OSS readiness — docs site, contribution guide, screenshot tour

14. Confirmed extensions (in scope)

These were originally listed as open questions and have been confirmed for inclusion. Each is assigned to a phase below.

14.1 Cross-host restore

Restore a snapshot taken on host A onto host B (e.g. recover a dead box onto a fresh one, clone a workload onto a sibling host, restore a developer's home dir onto a new laptop).

  • Credential model: target host's agent receives a temporary, server-issued read credential for the source host's repo, scoped to a single restore job and revoked immediately after
  • Path remapping: UI allows rewriting source paths to target paths (e.g. /home/alice → /home/alice-new)
  • Permissions: restore runs as the agent's service user; UI surfaces a warning when source paths require root and target service user is non-root
  • Phase: 3 (with the restore wizard)
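
Path remapping can be sketched as a prefix rewrite that respects path component boundaries, so that /home/alice does not also match /home/alice2. The sketch assumes forward-slash paths as stored in restic snapshots; `remapPath` is a hypothetical helper, and how the rewritten paths are handed to restic is left open here.

```go
package main

import (
	"fmt"
	"strings"
)

// remapPath rewrites src when it equals from or lies beneath it,
// replacing the from prefix with to. The boolean reports whether the
// rule applied. Trailing slashes on from and to are ignored.
func remapPath(src, from, to string) (string, bool) {
	from = strings.TrimSuffix(from, "/")
	to = strings.TrimSuffix(to, "/")
	if src != from && !strings.HasPrefix(src, from+"/") {
		return src, false
	}
	out := to + strings.TrimPrefix(src, from)
	if out == "" {
		out = "/" // mapping onto the filesystem root
	}
	return out, true
}

func main() {
	for _, p := range []string{"/home/alice/docs/a.txt", "/home/alice2/b.txt"} {
		out, ok := remapPath(p, "/home/alice", "/home/alice-new")
		fmt.Println(out, ok)
	}
}
```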

14.2 Bandwidth limiting

Per-host upload/download caps for backup, restore, and prune jobs.

  • Exposed on the schedule editor as optional --limit-upload / --limit-download (restic interprets both in KiB/s)
  • Also overridable on run-now jobs via the UI
  • Persisted in Schedule.options (JSON blob) so the schema stays stable
  • Phase: 2 (with scheduling)
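
A minimal sketch of how Schedule.options might translate into restic arguments for a backup job follows. The `JobOptions` type and `backupArgs` function are hypothetical; the flags themselves (`--limit-upload`, `--limit-download`, `--exclude`, `--tag`) are restic's own. A zero limit is assumed to mean "unlimited" and is omitted.

```go
package main

import (
	"fmt"
	"strconv"
)

// JobOptions mirrors the options JSON blob on Schedule.
type JobOptions struct {
	LimitUploadKbps   int `json:"limit_upload_kbps"`
	LimitDownloadKbps int `json:"limit_download_kbps"`
}

// backupArgs assembles the argument list for `restic backup`.
func backupArgs(paths, excludes, tags []string, o JobOptions) []string {
	args := []string{"backup", "--json"}
	if o.LimitUploadKbps > 0 {
		args = append(args, "--limit-upload", strconv.Itoa(o.LimitUploadKbps))
	}
	if o.LimitDownloadKbps > 0 {
		args = append(args, "--limit-download", strconv.Itoa(o.LimitDownloadKbps))
	}
	for _, e := range excludes {
		args = append(args, "--exclude", e)
	}
	for _, t := range tags {
		args = append(args, "--tag", t)
	}
	return append(args, paths...)
}

func main() {
	args := backupArgs(
		[]string{"/etc", "/home"},
		[]string{"*.tmp"},
		[]string{"nightly"},
		JobOptions{LimitUploadKbps: 2048},
	)
	fmt.Println(args)
}
```

A run-now override would simply pass a different `JobOptions` value for that one job without touching the stored schedule.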

14.3 Pre/post backup hooks

Per-host shell commands run before and after a backup job. Use cases: mysqldump/pg_dump to a staging path, stop/start Docker containers, quiesce a service, post-backup notifications.

  • Schema: Schedule.pre_hook and Schedule.post_hook (string, optional). Host-level defaults, Host.pre_hook_default / Host.post_hook_default, apply to every schedule on that host unless the schedule sets its own hook
  • Execution: agent runs hooks via the host's default shell (/bin/sh Linux, cmd.exe or PowerShell Windows — host-configurable)
  • Failure semantics: pre_hook non-zero exit aborts the backup and marks the job failed. post_hook runs on both success and failure (with RM_JOB_STATUS env var); its own exit code is recorded but does not change the backup job's final status
  • Stdout/stderr: captured into JobLog like restic output, prefixed pre_hook: / post_hook:
  • Security: hooks are stored encrypted; only admins can edit them; every edit audit-logged
  • Phase: 2 (with scheduling)

14.4 Prometheus /metrics endpoint

Standard Prometheus exposition on /metrics, protected by either bearer token or IP allow-list.

  • Metrics (per host):
    • restic_manager_last_backup_timestamp_seconds{host=...}
    • restic_manager_last_backup_status{host=...} (1=success, 0=failure)
    • restic_manager_repo_size_bytes{host=...}
    • restic_manager_snapshot_count{host=...}
    • restic_manager_agent_online{host=...} (1/0)
    • restic_manager_job_duration_seconds_bucket{kind=...,host=...} (histogram)
  • Server-level: restic_manager_jobs_total{kind=...,status=...}, restic_manager_alerts_active, restic_manager_build_info
  • Phase: 4 (alongside repo trend charts — both rely on the same time-series data)

15. Future considerations (not yet committed)

  • Read-only share links for snapshot listings (auditor view) — out of scope for personal/lab use; revisit if multi-tenant or org use cases emerge