steve 41a4043af3 server: drop in-process TLS — HTTP-only behind reverse proxy

Self-hosted deployments already terminate TLS at Caddy/Traefik/nginx;
making the server do TLS too means double cert config, dual ACME
plumbing, and an untested code path. Drop RM_TLS_CERT/RM_TLS_KEY,
remove TLSEnabled() and the ListenAndServeTLS branch.

Replace the cookie's "Secure if TLS-in-process" check with a new
RM_COOKIE_SECURE flag (default true). Local HTTP-only testing sets
RM_COOKIE_SECURE=false; production is always behind a TLS proxy and
the cookie stays Secure.

Default port :8443 → :8080. docker-compose binds 127.0.0.1 only and
populates RM_TRUSTED_PROXY. spec.md §4.1/§10.1 rewritten with a
Caddyfile snippet and a hard "do not expose RM_LISTEN publicly"
warning. enrollResponse keeps cert_pin_sha256 in the shape but the
server can't introspect a cert it doesn't terminate — operator
pastes the proxy's hash into -cert-pin at install time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-01 11:20:41 +01:00


restic-manager — Specification

1. Overview

restic-manager is a self-hosted, browser-based single pane of glass for managing restic backups across a fleet of Linux and Windows endpoints. It provides visibility, scheduling, ad-hoc operations, restore workflows, and alerting from one UI.

It is built for small-to-medium fleets (initial target: ~12 endpoints) and is intentionally simple to deploy: one Docker Compose file on the control-plane host, one small agent binary on each endpoint.

License: PolyForm Noncommercial 1.0.0

2. Goals & Non-Goals

Goals

Central visibility into backup state for every endpoint
Trigger any restic operation remotely (backup, forget, prune, check, unlock, snapshots, stats, diff, restore)
Manage per-host backup schedules from the UI
Live job progress streamed back to the UI
Restore wizard (browse snapshots, pick paths, restore to original or alternate host)
Repo health surfacing (size, dedup ratio, last check, lock state)
Alerting on failure or staleness
Cross-platform agent (Linux + Windows)
Ransomware-resistant repo access via append-only credentials

Non-Goals (initial release)

Replacing restic itself or providing custom repo formats
Managing non-restic backup tools
Multi-tenancy / SaaS deployment
High availability of the control plane (SQLite, single-instance)
Mobile-native apps (responsive web only)

3. Architecture

3.1 Components

┌──────────────────────────────────────────────────────────────────┐
│  Proxmox cluster                                                 │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │  docker compose: restic-manager                            │  │
│  │   - server (Go binary, REST + WS API, embedded HTMX UI)    │  │
│  │   - SQLite volume                                          │  │
│  └────────────────────────────────────────────────────────────┘  │
└────────────────────────▲─────────────────────────────────────────┘
                         │ HTTPS (control plane)
                         │  - agent → server: status, telemetry
                         │  - server → agent: commands, schedules
                         │
┌────────────────────────┴─────────────────────────────────────────┐
│  Endpoints (Linux + Windows)                                     │
│  ┌──────────────────────┐    ┌────────────────────────────────┐  │
│  │  restic-manager-     │    │  restic CLI                    │  │
│  │  agent (Go binary)   │───▶│  invoked by agent              │  │
│  │  - systemd / svc     │    └─────────────┬──────────────────┘  │
│  │  - WS to server      │                  │ HTTPS               │
│  └──────────────────────┘                  │ (data plane)        │
└─────────────────────────────────────────────┼────────────────────┘
                                              │
                                              ▼
┌──────────────────────────────────────────────────────────────────┐
│  Unraid                                                          │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │  Docker: restic/rest-server                                │  │
│  │   - per-host append-only credentials                       │  │
│  │   - one repo per host                                      │  │
│  │   - storage: Unraid share                                  │  │
│  └────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────┘

3.2 Data flow

Backup data: endpoint → restic CLI → restic REST server on Unraid → Unraid share. The control plane never touches backup bytes.
Control plane: agent maintains an outbound WebSocket to the server. Server pushes commands and schedule changes; agent pushes status, logs, live job progress, host metadata.
UI: browser → server (HTTPS, session cookies). Server fans out commands to agents, streams progress back to browser.

3.3 Why agent (not SSH)

Push model works through NAT/firewalls without inbound rules
Native Windows support without OpenSSH service quirks
Local scheduling survives controller restarts
Self-contained restic --json parsing, no remote shell quoting hazards

3.4 Why per-host repos

Isolates corruption / lock contention
Append-only credentials per host = compromised endpoint can't delete other hosts' backups
Simpler prune orchestration (no global lock coordination)
Trivially easy to retire a host (delete its repo + credential)

4. Components in detail

4.1 Server

Language: Go 1.22+
Storage: SQLite (via modernc.org/sqlite, no CGo)
HTTP: net/http + chi router
WebSocket: github.com/coder/websocket (the maintained fork of the unmaintained nhooyr.io/websocket; same API)
UI: HTMX + Tailwind, server-rendered Go templates, no Node build step
Distribution: single static binary, packaged in a Docker image; published docker-compose.yml
Config: YAML or env vars:
- RM_LISTEN — bind address, e.g. :8080 (source of truth for the port; the 8080 in the reference compose is just a default mapping). Bind to 127.0.0.1:8080 when running behind a same-host proxy.
- RM_DATA_DIR, RM_BASE_URL, RM_SECRET_KEY_FILE
- RM_TRUSTED_PROXY — comma-separated CIDR list of reverse proxies whose X-Forwarded-For / X-Forwarded-Proto we honour. Empty (the default) = trust no one. Set this when fronted by Caddy/Traefik.
- RM_COOKIE_SECURE — true (default) marks session cookies Secure. Only set to false for local HTTP-only testing.
TLS: the server speaks plain HTTP and is always expected to sit behind a TLS-terminating reverse proxy (Caddy / Traefik / nginx). This keeps cert renewal, ACME, and SNI in the proxy where operators already manage it. Agents must reach the server over HTTPS; the cert pin (cert_pin_sha256) pins whatever cert the proxy serves.
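
A minimal sketch of how RM_TRUSTED_PROXY could be evaluated, assuming the value is parsed once at startup and each request's remote address is checked before any X-Forwarded-* header is honoured. The function names parseTrustedProxies and isTrustedProxy are hypothetical and not taken from the codebase.

```go
package main

import (
	"net/netip"
	"strings"
)

// parseTrustedProxies parses a comma-separated CIDR list such as the value of
// RM_TRUSTED_PROXY. An empty string yields an empty list (trust no one).
// Hypothetical helper name.
func parseTrustedProxies(raw string) ([]netip.Prefix, error) {
	var prefixes []netip.Prefix
	for _, part := range strings.Split(raw, ",") {
		part = strings.TrimSpace(part)
		if part == "" {
			continue
		}
		p, err := netip.ParsePrefix(part)
		if err != nil {
			return nil, err
		}
		prefixes = append(prefixes, p.Masked())
	}
	return prefixes, nil
}

// isTrustedProxy reports whether remoteAddr (a "host:port" string, as found in
// http.Request.RemoteAddr) falls inside any trusted prefix. Unparsable
// addresses are treated as untrusted.
func isTrustedProxy(remoteAddr string, trusted []netip.Prefix) bool {
	ap, err := netip.ParseAddrPort(remoteAddr)
	if err != nil {
		return false
	}
	addr := ap.Addr().Unmap() // treat IPv4-mapped IPv6 as plain IPv4
	for _, p := range trusted {
		if p.Contains(addr) {
			return true
		}
	}
	return false
}
```

With the reference compose value 172.16.0.0/12, a request arriving from 172.18.0.2 would have its forwarded headers honoured, while one from a public address would not.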

4.2 Agent

Language: Go (cross-compiled for linux/amd64, linux/arm64, windows/amd64). Phase 1 ships Linux only; Windows binaries continue to build in CI to keep the codebase portable, but Windows service integration, the signed installer, and install.ps1 land in Phase 2.
Service integration: systemd unit (Linux). Windows service via golang.org/x/sys/windows/svc — Phase 2.
Footprint goal: ≤ 15 MB binary, ≤ 50 MB RSS idle
Persistence: local config file + small state DB (BoltDB or JSON) for queued reports if server is unreachable
Restic invocation: spawns restic with --json, parses streamed output, forwards to server in real time
Updates: distributed via the OS package manager, i.e. an apt repo (Linux) and a Chocolatey package (Windows), both pointing at Gitea releases. No bespoke signed-binary self-update; the restic-manager-agent update command is a thin wrapper over apt-get install --only-upgrade / choco upgrade. UI surfaces "agent N versions behind server" so an operator knows when to upgrade.
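
As an illustration of the restic invocation item above, the sketch below turns one line of `restic backup --json` output into a progress record. The JSON field names (message_type, percent_done, files_done, total_files, bytes_done, total_bytes, seconds_remaining) are the ones restic emits for status lines. The JobProgress type and parseProgressLine function are hypothetical names, and throughput is left out on the assumption that the agent derives it separately.

```go
package main

import (
	"encoding/json"
)

// resticStatus mirrors the subset of a restic status line that is needed here.
type resticStatus struct {
	MessageType      string  `json:"message_type"`
	PercentDone      float64 `json:"percent_done"` // restic reports a fraction in [0, 1]
	TotalFiles       uint64  `json:"total_files"`
	FilesDone        uint64  `json:"files_done"`
	TotalBytes       uint64  `json:"total_bytes"`
	BytesDone        uint64  `json:"bytes_done"`
	SecondsRemaining uint64  `json:"seconds_remaining"`
}

// JobProgress is a hypothetical Go shape for the progress message the agent
// forwards to the server.
type JobProgress struct {
	JobID       string  `json:"job_id"`
	PercentDone float64 `json:"percent_done"`
	FilesDone   uint64  `json:"files_done"`
	TotalFiles  uint64  `json:"total_files"`
	BytesDone   uint64  `json:"bytes_done"`
	TotalBytes  uint64  `json:"total_bytes"`
	EtaSeconds  uint64  `json:"eta_seconds"`
}

// parseProgressLine converts one output line into a JobProgress. ok is false
// for valid JSON lines that are not status messages (summary, errors, ...).
func parseProgressLine(jobID string, line []byte) (p JobProgress, ok bool, err error) {
	var s resticStatus
	if err := json.Unmarshal(line, &s); err != nil {
		return JobProgress{}, false, err
	}
	if s.MessageType != "status" {
		return JobProgress{}, false, nil
	}
	return JobProgress{
		JobID:       jobID,
		PercentDone: s.PercentDone,
		FilesDone:   s.FilesDone,
		TotalFiles:  s.TotalFiles,
		BytesDone:   s.BytesDone,
		TotalBytes:  s.TotalBytes,
		EtaSeconds:  s.SecondsRemaining,
	}, true, nil
}
```

In a real agent this would typically be fed from a bufio.Scanner reading the child process's stdout, one line at a time.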

4.3 Restic REST server (Unraid)

Run restic/rest-server Docker container
--append-only enabled
--private-repos enabled (each user only sees their own subpath)
htpasswd file with one user per host
Storage path mapped to Unraid share

5. Domain model

Host
  id, name, os, arch, agent_version, restic_version, protocol_version,
  enrolled_at, last_seen_at, status (online/offline/degraded),
  repo_id (FK), tags,
  current_job_id (FK nullable),
  last_backup_at, last_backup_status (succeeded|failed|cancelled|null),
  repo_size_bytes, snapshot_count, open_alert_count,
  applied_schedule_version
  # Bottom block (last_backup_*, repo_size_bytes, snapshot_count,
  # open_alert_count, applied_schedule_version) are denormalised
  # projections, refreshed on job.finished, snapshots.report,
  # repo.stats, and alert state changes.
  # applied_schedule_version is the schedule_version the agent most
  # recently acknowledged via `schedule.ack` — lets the UI surface
  # drift when an agent is offline.

Repo
  id, name, url, kind (rest|s3|local), credential_id (FK),
  password_secret_id (FK),
  size_bytes, snapshot_count, dedup_ratio,
  last_check_at, last_check_status, lock_state (locked|unlocked),
  append_only (bool), credential_rotated_at
  # Bottom block is a cached projection from `restic stats` +
  # Credential row, refreshed by repo.stats agent messages.

Credential
  id, kind, username, secret_ref (encrypted),
  rotated_at

Schedule
  id, host_id (FK), kind (backup|forget|prune|check),
  cron_expr, paths (json), excludes (json), tags (json),
  retention_policy (json), options (json), pre_hook, post_hook,
  enabled
  # retention_policy: {keep_last, keep_hourly, keep_daily, keep_weekly,
  #                    keep_monthly, keep_yearly, keep_tag: [...]}
  # options:          {limit_upload_kbps, limit_download_kbps}
  # pre_hook/post_hook: see §14.3 (encrypted at rest)

Job
  id, host_id (FK), kind, status (queued|running|succeeded|failed|cancelled),
  scheduled_id (FK nullable),
  actor_kind (user|schedule|system), actor_id (nullable),
  started_at, finished_at,
  exit_code, stats (json), error

JobLog
  job_id (FK), seq, ts, stream (stdout|stderr|event), payload

Snapshot  (cached projection from `restic snapshots --json`)
  id (restic id), host_id (FK), repo_id (FK),
  time, hostname, paths, tags, size_bytes, file_count

Alert
  id, host_id (FK nullable), kind, severity, message,
  created_at, acknowledged_at, resolved_at

User
  id, username, password_hash, role (admin|operator|viewer),
  created_at, last_login_at

Session
  id, user_id (FK), created_at, expires_at, ip, ua

AuditLog
  id, user_id (FK nullable), actor (user|agent|system),
  action, target_kind, target_id, ts, payload (json)
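
To make the Schedule.retention_policy blob concrete, here is a hedged sketch of how it might map onto `restic forget` arguments. The --keep-* flags are restic's own; the RetentionPolicy type and ForgetArgs method are hypothetical names, and zero values are assumed to mean "not set".

```go
package main

import "strconv"

// RetentionPolicy is a hypothetical Go mirror of Schedule.retention_policy.
type RetentionPolicy struct {
	KeepLast    int      `json:"keep_last"`
	KeepHourly  int      `json:"keep_hourly"`
	KeepDaily   int      `json:"keep_daily"`
	KeepWeekly  int      `json:"keep_weekly"`
	KeepMonthly int      `json:"keep_monthly"`
	KeepYearly  int      `json:"keep_yearly"`
	KeepTag     []string `json:"keep_tag"`
}

// ForgetArgs builds the argument list for `restic forget`. Fields left at
// zero are omitted, so restic never sees a --keep-* flag with value 0.
func (r RetentionPolicy) ForgetArgs() []string {
	args := []string{"forget"}
	add := func(flag string, n int) {
		if n > 0 {
			args = append(args, flag, strconv.Itoa(n))
		}
	}
	add("--keep-last", r.KeepLast)
	add("--keep-hourly", r.KeepHourly)
	add("--keep-daily", r.KeepDaily)
	add("--keep-weekly", r.KeepWeekly)
	add("--keep-monthly", r.KeepMonthly)
	add("--keep-yearly", r.KeepYearly)
	for _, tag := range r.KeepTag {
		args = append(args, "--keep-tag", tag)
	}
	return args
}
```

Passing the result as separate exec arguments, rather than a joined shell string, keeps tag values from being reinterpreted by a shell.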

6. API surface (control plane)

6.1 UI/REST (browser → server)

POST   /api/auth/login
POST   /api/auth/logout

GET    /api/fleet/summary                (aggregate: host counts by status,
                                          total bytes, open alerts; reused by /metrics)

GET    /api/hosts                        ?tag=&status=&limit=&offset=
                                          (returns Host rows incl. denormalised
                                          last_backup_*, repo_size_bytes,
                                          snapshot_count, open_alert_count,
                                          current_job_id)
GET    /api/hosts/:id
DELETE /api/hosts/:id
POST   /api/hosts/:id/enrollment-token   (regenerate)
POST   /api/hosts/:id/agent/update       (prompt an agent upgrade via the OS package manager; see §4.2)

GET    /api/hosts/:id/snapshots          ?tag=&path=&since=&until=&limit=&offset=
GET    /api/hosts/:id/repo               (full Repo projection)
POST   /api/hosts/:id/jobs               (run-now: backup/forget/prune/check/unlock)
POST   /api/hosts/:id/restore            (restore wizard submit)

GET    /api/hosts/:id/schedules
POST   /api/hosts/:id/schedules
PUT    /api/schedules/:id
DELETE /api/schedules/:id

GET    /api/jobs                         ?host_id=&kind=&status=&since=&until=
                                          &limit=&offset=&order=desc
GET    /api/jobs/:id
GET    /api/jobs/:id/logs                (paginated: ?after_seq=&limit=)
WS     /api/jobs/:id/stream              (live progress; see §6.2 for shape)
POST   /api/jobs/:id/cancel

GET    /api/repos
GET    /api/repos/:id

GET    /api/alerts
POST   /api/alerts/:id/ack

GET    /api/audit
GET    /api/users   (admin)
POST   /api/users   (admin)

Realtime strategy: only /api/jobs/:id/stream uses WS. All other screens (dashboard, hosts, snapshots) refresh via HTMX polling (~10s cadence). Revisit if dashboard staleness becomes a problem in practice.

6.2 Agent ↔ Server

Single authenticated WebSocket per agent. Bidirectional JSON-RPC-ish messages.

Agent → server:

hello (host metadata, agent_version, restic_version, OS, protocol_version — see "Protocol versioning" below)
heartbeat (every 30s)
job.started (job_id, kind, started_at)
job.progress (job_id, percent_done, files_done, total_files, bytes_done, total_bytes, eta_seconds, throughput_bps)
job.finished (job_id, status, exit_code, stats, error, finished_at)
snapshots.report (full list after each successful backup)
repo.stats (size_bytes, snapshot_count, dedup_ratio, last_check_at, last_check_status, lock_state)
log.stream (live stdout/stderr lines while job running; {job_id, seq, ts, stream: stdout|stderr|event, payload})
schedule.ack (schedule_version) — agent confirms it has applied a schedule push; lets the server surface "this host is N versions behind" without polling

Server → agent:

command.run (kind, args)
command.cancel (job_id)
schedule.set (schedule_version, schedules: [...]) — full schedule list, agent reconciles local cron and replies with schedule.ack
config.update
agent.update.available (new version + package source URL — informational only; agent does not self-update, see §4.2)

The server fans job.progress and log.stream for a given job to all browsers subscribed to WS /api/jobs/:id/stream (§6.1) without transformation, so the schema is shared end-to-end.

Protocol versioning. Agents and the server each declare an integer protocol_version in hello. The version bumps only on breaking wire-format changes (not human-readable software releases). The server maintains a MinAgentProtocolVersion constant; agents below it are disconnected with error: protocol_too_old and a URL pointing at the upgrade instructions. Symmetrically, an agent talking to a server that advertises a protocol_version it does not recognise refuses to proceed and surfaces a clear log message. This avoids the failure mode of "weird JSON parse errors when v0.3 agent meets v0.5 server."
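
The two checks described above could look roughly like the following. The constant value and the ErrProtocolUnknown name are assumptions for illustration; only MinAgentProtocolVersion and the protocol_too_old error string come from the text.

```go
package main

import "errors"

// MinAgentProtocolVersion is the oldest agent protocol the server accepts.
// The value 2 is a placeholder for this sketch.
const MinAgentProtocolVersion = 2

var (
	// ErrProtocolTooOld is what the server reports before disconnecting.
	ErrProtocolTooOld = errors.New("protocol_too_old")
	// ErrProtocolUnknown is a hypothetical agent-side error name.
	ErrProtocolUnknown = errors.New("protocol_unknown")
)

// checkAgentHello is the server-side check on the protocol_version carried
// in an agent's hello message.
func checkAgentHello(agentVersion int) error {
	if agentVersion < MinAgentProtocolVersion {
		return ErrProtocolTooOld
	}
	return nil
}

// checkServerHello is the agent-side check: the agent proceeds only if the
// server advertises a version the agent explicitly understands.
func checkServerHello(serverVersion int, understood []int) error {
	for _, v := range understood {
		if v == serverVersion {
			return nil
		}
	}
	return ErrProtocolUnknown
}
```

Keeping the checks as plain integer comparisons means a version mismatch is detected at hello time, before any job or schedule payload is parsed.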

Schedule reconciliation when the server is unreachable. Agents keep firing the last-known-good schedule pushed by the server, indefinitely. Rationale: a missed backup because the controller is down is a worse outcome than firing a schedule the user has since edited. On reconnect, the server's view is canonical: the next schedule.set overrides whatever the agent was running, the agent replies schedule.ack with the new schedule_version, and the server updates Host.applied_schedule_version. The UI surfaces drift ("schedule v7 pushed, agent applied v5") when an agent has been offline.

6.3 Enrollment

Operator clicks "Add host" → server generates one-time token (TTL 1h)
Operator runs install script on endpoint with token
Agent calls POST /api/agents/enroll with token + host metadata
Server issues persistent agent credential (bearer token + TLS pin) and host record
Agent stores credential, opens WS connection
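
A minimal in-memory sketch of the one-time enrollment token from the first step, assuming tokens are random, stored only as a hash, and consumed on first use. The tokenStore type and its methods are hypothetical; a real server would presumably persist this in SQLite.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"sync"
	"time"
)

// enrollmentTTL matches the 1h token lifetime described above.
const enrollmentTTL = time.Hour

// tokenStore keeps the expiry of each outstanding token, keyed by the
// SHA-256 of the token so the plaintext is never stored.
type tokenStore struct {
	mu      sync.Mutex
	expires map[[32]byte]time.Time
}

func newTokenStore() *tokenStore {
	return &tokenStore{expires: make(map[[32]byte]time.Time)}
}

// Issue creates a random token and returns it with its expiry time.
func (s *tokenStore) Issue() (token string, expiresAt time.Time, err error) {
	raw := make([]byte, 32)
	if _, err := rand.Read(raw); err != nil {
		return "", time.Time{}, err
	}
	token = hex.EncodeToString(raw)
	expiresAt = time.Now().Add(enrollmentTTL)
	s.mu.Lock()
	defer s.mu.Unlock()
	s.expires[sha256.Sum256([]byte(token))] = expiresAt
	return token, expiresAt, nil
}

// Redeem consumes a token. It returns true at most once per token, and only
// if now is before the expiry. now is a parameter so expiry can be exercised
// without sleeping.
func (s *tokenStore) Redeem(token string, now time.Time) bool {
	key := sha256.Sum256([]byte(token))
	s.mu.Lock()
	defer s.mu.Unlock()
	exp, ok := s.expires[key]
	if !ok {
		return false
	}
	delete(s.expires, key) // single use, even when the token has expired
	return now.Before(exp)
}
```

The enroll handler would call Redeem with time.Now() and issue the persistent agent credential only when it returns true.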

7. Security

7.1 Authentication

Phase 1: username + password (argon2id), session cookies marked HttpOnly and Secure (Secure is governed by RM_COOKIE_SECURE, §4.1), CSRF tokens on state-changing requests
Phase 2: OIDC (Authelia, Keycloak, Authentik)
Agents: bearer token over TLS; the agent pins the fingerprint of the certificate served by the reverse proxy (cert_pin_sha256, §4.1) at enrollment time
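
For the browser session side, a sketch of how the cookie attributes might be set, assuming RM_COOKIE_SECURE is read as a boolean string and anything unset or unparsable falls back to the safe default. The cookie name rm_session and the SameSite choice are assumptions, not part of this spec.

```go
package main

import (
	"net/http"
	"strconv"
)

// cookieSecure interprets the RM_COOKIE_SECURE value. Unset or unparsable
// values fall back to true, so a typo cannot silently weaken production.
func cookieSecure(envValue string) bool {
	if envValue == "" {
		return true
	}
	v, err := strconv.ParseBool(envValue)
	if err != nil {
		return true
	}
	return v
}

// newSessionCookie builds the session cookie. HttpOnly is always set; Secure
// follows the flag so local HTTP-only testing still works.
func newSessionCookie(sessionID string, secure bool) *http.Cookie {
	return &http.Cookie{
		Name:     "rm_session", // hypothetical cookie name
		Value:    sessionID,
		Path:     "/",
		HttpOnly: true,
		Secure:   secure,
		SameSite: http.SameSiteLaxMode, // assumption: not specified in this document
	}
}
```

A login handler would pass the result to http.SetCookie after creating the Session row.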

7.2 Authorization (Phase 1: simple roles)

admin: everything
operator: trigger jobs, edit schedules, restore
viewer: read-only

7.3 Secret handling

Restic repo passwords and REST-server credentials encrypted at rest in SQLite using a server-side key (loaded from env or file at startup)
Pushed to agents only over the authenticated WS, only when needed for a job
Agent stores them in OS keyring where available (Windows DPAPI, Linux Secret Service / fallback to encrypted file with restricted perms)
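
The spec does not name a cipher for encryption at rest. The sketch below assumes AES-256-GCM with a random nonce prepended to the ciphertext, which is one common way to do this with the Go standard library; encryptSecret and decryptSecret are hypothetical names.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
)

// encryptSecret seals plaintext with AES-GCM under the server-side key.
// The random nonce is prepended so the stored value is self-contained.
func encryptSecret(key, plaintext []byte) ([]byte, error) {
	block, err := aes.NewCipher(key) // key must be 16, 24, or 32 bytes
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// decryptSecret reverses encryptSecret. It fails if the value was tampered
// with or sealed under a different key.
func decryptSecret(key, sealed []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	if len(sealed) < gcm.NonceSize() {
		return nil, errors.New("sealed value too short")
	}
	nonce, ciphertext := sealed[:gcm.NonceSize()], sealed[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ciphertext, nil)
}
```

Because GCM is authenticated, a corrupted or swapped database value surfaces as a decryption error rather than as a wrong password handed to restic.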

7.4 Repo protection

Restic REST server runs with --append-only for routine backups
A separate non-append-only credential exists for forget/prune operations, used only when explicitly invoked from the UI by an admin/operator and audited

7.5 Audit

Every state-changing UI action and every server→agent command logged with user, target, timestamp, and payload

8. UI

Stack: HTMX + Tailwind + Go html/template. No SPA framework. Server-rendered, progressive enhancement.

Pages:

Login
Dashboard: fleet overview (host cards: status, last backup, repo size, alerts)
Host detail: tabs for Snapshots / Schedules / Jobs / Repo / Settings
Job detail: live log streaming via WS, cancel button
Restore wizard: host → snapshot → paths → target → confirm
Repos: aggregate view across hosts
Alerts: list, acknowledge
Settings: users (admin), notification channels, agent download
Audit log

9. Alerting

Triggers: backup failed, backup hasn't run in N hours past its schedule, repo check failed, agent offline > N minutes, repo size growth anomaly
Channels (Phase 1): webhook, ntfy, email (SMTP)
Channels (Phase 2+): Discord, Slack, Pushover

10. Deployment

10.1 Control plane (Proxmox host or LXC)

The server is HTTP-only by design — operators front it with their own TLS-terminating reverse proxy (Caddy / Traefik / nginx). Bind the container to localhost so the only public path is through the proxy.

docker-compose.yml:

services:
  restic-manager:
    image: ghcr.io/<owner>/restic-manager:latest
    restart: unless-stopped
    ports:
      - "127.0.0.1:8080:8080"
    volumes:
      - ./data:/data
    environment:
      - RM_DATA_DIR=/data
      - RM_LISTEN=:8080
      - RM_BASE_URL=https://restic.lab.example
      - RM_SECRET_KEY_FILE=/data/secret.key
      - RM_TRUSTED_PROXY=172.16.0.0/12   # CIDR of your reverse proxy

Reference Caddy snippet (operator's own Caddyfile, outside this repo):

restic.lab.example {
    encode zstd gzip
    reverse_proxy 127.0.0.1:8080
}

Caddy provisions and renews the cert; the agent's cert_pin_sha256 pins Caddy's leaf cert (that's what the agent actually sees).
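
How the pin is computed is not spelled out here. The sketch below assumes cert_pin_sha256 is the hex-encoded SHA-256 of the leaf certificate's DER bytes (hashing the public key instead is another common convention), and the helper names are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"strings"
)

// certPin returns the lowercase hex SHA-256 of a certificate's DER encoding.
// Assumption: the pin covers the whole certificate, not just its public key.
func certPin(der []byte) string {
	sum := sha256.Sum256(der)
	return hex.EncodeToString(sum[:])
}

// pinMatches compares the served certificate against the configured pin.
// Upper-case hex and colon separators in the pin are tolerated.
func pinMatches(der []byte, pin string) bool {
	want := strings.ToLower(strings.ReplaceAll(pin, ":", ""))
	got := certPin(der)
	return subtle.ConstantTimeCompare([]byte(got), []byte(want)) == 1
}
```

On the agent, a check like this could run inside a tls.Config VerifyPeerCertificate callback, where rawCerts[0] holds the leaf certificate's DER bytes. Note that pinning a leaf certificate means the pin changes whenever the proxy renews it.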

RM_LISTEN is the source of truth for the server's bind address. The 8080:8080 mapping above is just the matching default; change both sides together if you pick a different port.

⚠️ Never expose RM_LISTEN directly on a public interface — the server has no TLS, no rate limiting, and no DDoS protection. That all belongs in the proxy.

10.2 Restic REST server (Unraid)

Standard restic/rest-server container, --append-only, --private-repos, htpasswd mounted, data path on the share.

10.3 Agent install

Linux: curl -fsSL https://restic.lab.example/install.sh | sudo RM_TOKEN=xxx sh
Windows: iwr https://restic.lab.example/install.ps1 | iex (with $env:RM_TOKEN)
Installer drops binary + service unit, calls enroll endpoint, starts service

11. Testing strategy

Unit tests: restic JSON parsing, schedule reconciliation, retention policy logic
Integration tests: spin up real restic + rest-server in Docker, exercise full backup/snapshot/restore flows
End-to-end: Playwright against a compose-up'd stack with one Linux agent in a sibling container
Cross-platform agent CI: build matrix linux/amd64, linux/arm64, windows/amd64; smoke test on Windows runner

12. Repository layout

restic-manager/
├── cmd/
│   ├── server/
│   └── agent/
├── internal/
│   ├── api/             # shared API types
│   ├── server/
│   │   ├── http/
│   │   ├── ws/
│   │   └── ui/          # templates, handlers
│   ├── agent/
│   │   ├── service/     # systemd / windows service glue
│   │   ├── runner/      # restic invocation
│   │   └── scheduler/
│   ├── restic/          # restic CLI wrapper, --json parsing
│   ├── store/           # sqlite layer
│   ├── crypto/          # secret encryption
│   └── auth/
├── web/
│   ├── templates/
│   └── static/
├── deploy/
│   ├── docker-compose.yml
│   ├── Dockerfile.server
│   └── install/
│       ├── install.sh
│       └── install.ps1
├── docs/
├── LICENSE              # PolyForm Noncommercial 1.0.0
├── README.md
├── spec.md
└── tasks.md

13. Phased delivery

Phase 1 (MVP): server skeleton, agent skeleton, enrollment, host list, snapshot list, on-demand backup, live job log
Phase 2: schedules, retention, run-now for forget/prune/check/unlock, repo stats
Phase 3: restore wizard, alerts (webhook/ntfy/email), audit log
Phase 4: agent update flow (package-manager based, see §4.2), OIDC, multi-user/RBAC polish, repo trends
Phase 5: OSS readiness — docs site, contribution guide, screenshot tour

14. Confirmed extensions (in scope)

These were originally listed as open questions and have been confirmed for inclusion. Each is slotted into a phase below.

14.1 Cross-host restore

Restore a snapshot taken on host A onto host B (e.g. recover a dead box onto a fresh one, clone a workload onto a sibling host, restore a developer's home dir onto a new laptop).

Credential model: target host's agent receives a temporary, server-issued read credential for the source host's repo, scoped to a single restore job and revoked immediately after
Path remapping: UI allows rewriting source paths to target paths (e.g. /home/alice → /home/alice-new)
Permissions: restore runs as the agent's service user; UI surfaces a warning when source paths require root and target service user is non-root
Phase: 3 (with the restore wizard)
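
The path remapping rule could be sketched as a prefix rewrite that matches whole path components only. This assumes Linux-style forward-slash paths; Windows paths would need separate handling. remapPath is a hypothetical name.

```go
package main

import "strings"

// remapPath rewrites path when it equals fromPrefix or lies underneath it.
// Matching is on whole components, so a rule for /home/alice does not touch
// /home/alicia. The bool reports whether the rule applied.
func remapPath(path, fromPrefix, toPrefix string) (string, bool) {
	from := strings.TrimSuffix(fromPrefix, "/")
	to := strings.TrimSuffix(toPrefix, "/")
	if path == from {
		return to, true
	}
	if strings.HasPrefix(path, from+"/") {
		return to + strings.TrimPrefix(path, from), true
	}
	return path, false
}
```

With the rule /home/alice → /home/alice-new, the file /home/alice/docs/a.txt would be restored to /home/alice-new/docs/a.txt.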

14.2 Bandwidth limiting

Per-host upload/download caps for backup, restore, and prune jobs.

Exposed on the schedule editor as optional --limit-upload / --limit-download (KiB/s, the unit restic's flags use)
Also overridable on run-now jobs via the UI
Persisted in Schedule.options (JSON blob) so the schema stays stable
Phase: 2 (with scheduling)

14.3 Pre/post backup hooks

Per-host shell commands run before and after a backup job. Use cases: mysqldump/pg_dump to a staging path, stop/start Docker containers, quiesce a service, post-backup notifications.

Schema: Schedule.pre_hook and Schedule.post_hook (string, optional). For more complex cases, Host.pre_hook_default / Host.post_hook_default apply to all schedules on that host unless overridden
Applicability: hooks are only meaningful for kind = backup schedules. The API rejects non-null pre_hook / post_hook on any other schedule kind (forget, prune, check) with a clear validation error rather than silently ignoring them. The same constraint applies to Host.pre_hook_default / Host.post_hook_default: they only fire for backup schedules on that host
Execution: agent runs hooks via the host's default shell (/bin/sh Linux, cmd.exe or PowerShell Windows — host-configurable)
Failure semantics: pre_hook non-zero exit aborts the backup and marks the job failed. post_hook runs on both success and failure (with RM_JOB_STATUS env var); its own exit code is recorded but does not change the backup job's final status
Stdout/stderr: captured into JobLog like restic output, prefixed pre_hook: / post_hook:
Security: hooks are stored encrypted; only admins can edit them; every edit audit-logged
Phase: 2 (with scheduling)
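
The applicability and failure rules above can be sketched with the shell execution abstracted behind a function value. All names here are hypothetical. One assumption goes beyond the text: post_hook is also run when pre_hook aborted the backup, since it is described as running on both success and failure.

```go
package main

import "errors"

// hookRunner executes one hook command with extra environment variables and
// returns its exit code. Hypothetical abstraction over the host shell.
type hookRunner func(command string, env map[string]string) int

type hookResult struct {
	Status       string // "succeeded" or "failed"
	BackupRan    bool
	PostHookExit int // recorded only; -1 when no post_hook is configured
}

// validateHooks rejects hooks on any schedule kind other than backup.
func validateHooks(kind, preHook, postHook string) error {
	if kind != "backup" && (preHook != "" || postHook != "") {
		return errors.New("pre_hook/post_hook are only valid on backup schedules")
	}
	return nil
}

// runWithHooks applies the failure semantics: a non-zero pre_hook aborts the
// backup and fails the job; post_hook always runs and sees RM_JOB_STATUS,
// but its exit code never changes the job status.
func runWithHooks(preHook, postHook string, run hookRunner, backup func() error) hookResult {
	res := hookResult{Status: "succeeded", PostHookExit: -1}
	if preHook != "" && run(preHook, nil) != 0 {
		res.Status = "failed" // backup is skipped entirely
	} else {
		res.BackupRan = true
		if err := backup(); err != nil {
			res.Status = "failed"
		}
	}
	if postHook != "" {
		res.PostHookExit = run(postHook, map[string]string{"RM_JOB_STATUS": res.Status})
	}
	return res
}
```

Injecting the runner keeps the ordering logic testable without spawning /bin/sh or PowerShell.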

14.4 Prometheus `/metrics` endpoint

Standard Prometheus exposition on /metrics, protected by either bearer token or IP allow-list.

Metrics (per host):
- restic_manager_last_backup_timestamp_seconds{host=...}
- restic_manager_last_backup_status{host=...} (1=success, 0=failure)
- restic_manager_repo_size_bytes{host=...}
- restic_manager_snapshot_count{host=...}
- restic_manager_agent_online{host=...} (1/0)
- restic_manager_job_duration_seconds_bucket{kind=...,host=...} (histogram)
Server-level: restic_manager_jobs_total{kind=...,status=...}, restic_manager_alerts_active, restic_manager_build_info
Phase: 4 (alongside repo trend charts — both rely on the same time-series data)
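
As a rough illustration of the exposition format, the sketch below renders three of the per-host gauges by hand using only the standard library. A real implementation would more likely use the official Prometheus Go client; the hostMetrics type and renderMetrics function are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// hostMetrics is a hypothetical per-host snapshot read from the Host table.
type hostMetrics struct {
	Host          string
	RepoSizeBytes int64
	SnapshotCount int
	AgentOnline   bool
}

// labelEscaper escapes the characters the Prometheus text format requires
// to be escaped inside label values: backslash, double quote, and newline.
var labelEscaper = strings.NewReplacer(`\`, `\\`, `"`, `\"`, "\n", `\n`)

// renderMetrics emits one gauge family per metric, with hosts sorted by name
// so the output is stable between scrapes.
func renderMetrics(hosts []hostMetrics) string {
	sorted := append([]hostMetrics(nil), hosts...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Host < sorted[j].Host })
	gauges := []struct {
		name  string
		value func(hostMetrics) int64
	}{
		{"restic_manager_repo_size_bytes", func(h hostMetrics) int64 { return h.RepoSizeBytes }},
		{"restic_manager_snapshot_count", func(h hostMetrics) int64 { return int64(h.SnapshotCount) }},
		{"restic_manager_agent_online", func(h hostMetrics) int64 {
			if h.AgentOnline {
				return 1
			}
			return 0
		}},
	}
	var b strings.Builder
	for _, g := range gauges {
		fmt.Fprintf(&b, "# TYPE %s gauge\n", g.name)
		for _, h := range sorted {
			fmt.Fprintf(&b, "%s{host=\"%s\"} %d\n", g.name, labelEscaper.Replace(h.Host), g.value(h))
		}
	}
	return b.String()
}
```

Since the values come from the denormalised Host columns, a scrape under this approach would not need to touch the Job or Snapshot tables.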

15. Future considerations (not yet committed)

Read-only share links for snapshot listings (auditor view) — out of scope for personal/lab use; revisit if multi-tenant or org use cases emerge
