
# restic-manager — Specification

## 1. Overview

**restic-manager** is a self-hosted, browser-based single pane of glass for managing [restic](https://restic.net) backups across a fleet of Linux and Windows endpoints. It provides visibility, scheduling, ad-hoc operations, restore workflows, and alerting from one UI.

It is built for small-to-medium fleets (initial target: ~12 endpoints) and is intentionally simple to deploy: one Docker Compose file on the control-plane host, one small agent binary on each endpoint.

**License:** PolyForm Noncommercial 1.0.0

## 2. Goals & Non-Goals

### Goals
- Central visibility into backup state for every endpoint
- Trigger any restic operation remotely (`backup`, `forget`, `prune`, `check`, `unlock`, `snapshots`, `stats`, `diff`, `restore`)
- Manage per-host backup schedules from the UI
- Live job progress streamed back to the UI
- Restore wizard (browse snapshots, pick paths, restore to original or alternate host)
- Repo health surfacing (size, dedup ratio, last check, lock state)
- Alerting on failure or staleness
- Cross-platform agent (Linux + Windows)
- Ransomware-resistant repo access via append-only credentials

### Non-Goals (initial release)
- Replacing restic itself or providing custom repo formats
- Managing non-restic backup tools
- Multi-tenancy / SaaS deployment
- High availability of the control plane (SQLite, single-instance)
- Mobile-native apps (responsive web only)

## 3. Architecture

### 3.1 Components

```
┌──────────────────────────────────────────────────────────────────┐
│  Proxmox cluster                                                 │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │  docker compose: restic-manager                            │  │
│  │   - server (Go binary, REST + WS API, embedded HTMX UI)    │  │
│  │   - SQLite volume                                          │  │
│  └────────────────────────────────────────────────────────────┘  │
└────────────────────────▲─────────────────────────────────────────┘
                         │ HTTPS (control plane)
                         │  - agent → server: status, telemetry
                         │  - server → agent: commands, schedules
                         │
┌────────────────────────┴─────────────────────────────────────────┐
│  Endpoints (Linux + Windows)                                     │
│  ┌──────────────────────┐    ┌────────────────────────────────┐  │
│  │  restic-manager-     │    │  restic CLI                    │  │
│  │  agent (Go binary)   │───▶│  invoked by agent              │  │
│  │  - systemd / svc     │    └─────────────┬──────────────────┘  │
│  │  - WS to server      │                  │ HTTPS               │
│  └──────────────────────┘                  │ (data plane)        │
└─────────────────────────────────────────────┼────────────────────┘
                                              │
                                              ▼
┌──────────────────────────────────────────────────────────────────┐
│  Unraid                                                          │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │  Docker: restic/rest-server                                │  │
│  │   - per-host append-only credentials                       │  │
│  │   - one repo per host                                      │  │
│  │   - storage: Unraid share                                  │  │
│  └────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────┘
```

### 3.2 Data flow

- **Backup data:** endpoint → restic CLI → restic REST server on Unraid → Unraid share. The control plane *never* touches backup bytes.
- **Control plane:** agent maintains an outbound WebSocket to the server. Server pushes commands and schedule changes; agent pushes status, logs, live job progress, host metadata.
- **UI:** browser → server (HTTPS, session cookies). Server fans out commands to agents, streams progress back to browser.

### 3.3 Why agent (not SSH)

- Agent-initiated outbound connection works through NAT/firewalls without inbound rules on endpoints
- Native Windows support without OpenSSH service quirks
- Local scheduling survives controller restarts
- Self-contained `restic --json` parsing, no remote shell quoting hazards

### 3.4 Why per-host repos

- Isolates corruption / lock contention
- Append-only credentials per host = compromised endpoint can't delete other hosts' backups
- Simpler `prune` orchestration (no global lock coordination)
- Trivially easy to retire a host (delete its repo + credential)

## 4. Components in detail

### 4.1 Server

- **Language:** Go 1.22+
- **Storage:** SQLite (via `modernc.org/sqlite`, no CGo)
- **HTTP:** `net/http` + `chi` router
- **WebSocket:** `github.com/coder/websocket` (the maintained successor to
  `nhooyr.io/websocket`, now developed under Coder; same API)
- **UI:** HTMX + Tailwind, server-rendered Go templates, no Node build step
- **Distribution:** single static binary, packaged in a Docker image; published `docker-compose.yml`
- **Config:** YAML or env vars:
  - `RM_LISTEN` — bind address, e.g. `:8080` (source of truth for the port;
    the `8080` in the reference compose is just a default mapping). Bind to
    `127.0.0.1:8080` when running behind a same-host proxy.
  - `RM_DATA_DIR`, `RM_BASE_URL`, `RM_SECRET_KEY_FILE`
  - `RM_TRUSTED_PROXY` — comma-separated CIDR list of reverse proxies
    whose `X-Forwarded-For` / `X-Forwarded-Proto` we honour. Empty (the
    default) = trust no one. Set this when fronted by Caddy/Traefik.
  - `RM_COOKIE_SECURE` — `true` (default) marks session cookies `Secure`.
    Only set to `false` for local HTTP-only testing.
- **TLS:** the server speaks plain HTTP and is **always** expected to sit
  behind a TLS-terminating reverse proxy (Caddy / Traefik / nginx). This
  keeps cert renewal, ACME, and SNI in the proxy where operators already
  manage it. Agents must reach the server over HTTPS; the cert pin
  (`cert_pin_sha256`) pins whatever cert the proxy serves.
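
The `RM_TRUSTED_PROXY` rule above can be sketched as follows. This is a minimal illustration, not the server's actual code: the function names `parseTrustedProxies` and `clientIP` are hypothetical, and taking the right-most `X-Forwarded-For` entry assumes a single proxy hop in front of the server.

```go
package main

import (
	"fmt"
	"net"
	"net/netip"
	"strings"
)

// parseTrustedProxies parses a comma-separated CIDR list such as the value
// of RM_TRUSTED_PROXY. An empty value yields an empty list: trust no one.
func parseTrustedProxies(v string) ([]netip.Prefix, error) {
	var out []netip.Prefix
	for _, part := range strings.Split(v, ",") {
		part = strings.TrimSpace(part)
		if part == "" {
			continue
		}
		p, err := netip.ParsePrefix(part)
		if err != nil {
			return nil, fmt.Errorf("RM_TRUSTED_PROXY: %w", err)
		}
		out = append(out, p)
	}
	return out, nil
}

// clientIP picks the address a request is attributed to. remoteAddr is the
// TCP peer (http.Request.RemoteAddr); forwardedFor is the raw
// X-Forwarded-For header, honoured only when the peer is a trusted proxy.
func clientIP(remoteAddr, forwardedFor string, trusted []netip.Prefix) string {
	host, _, err := net.SplitHostPort(remoteAddr)
	if err != nil {
		host = remoteAddr
	}
	peer, err := netip.ParseAddr(host)
	if err != nil {
		return host
	}
	for _, p := range trusted {
		if !p.Contains(peer) {
			continue
		}
		// Use the right-most entry: the one appended by our own proxy.
		parts := strings.Split(forwardedFor, ",")
		if last := strings.TrimSpace(parts[len(parts)-1]); last != "" {
			return last
		}
		break
	}
	return host
}

func main() {
	trusted, err := parseTrustedProxies("172.16.0.0/12")
	if err != nil {
		panic(err)
	}
	fmt.Println(clientIP("172.18.0.1:40312", "203.0.113.9", trusted))
	fmt.Println(clientIP("198.51.100.7:40312", "203.0.113.9", trusted))
}
```

With an empty list the header is never consulted, which matches the "trust no one" default.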

### 4.2 Agent

- **Language:** Go (cross-compiled for `linux/amd64`, `linux/arm64`, `windows/amd64`).
  Phase 1 ships Linux only; Windows binaries continue to build in CI to keep
  the codebase portable, but Windows service integration + signed installer
  + install.ps1 land in Phase 2.
- **Service integration:** systemd unit (Linux). Windows service via
  `golang.org/x/sys/windows/svc` — Phase 2.
- **Footprint goal:** ≤ 15 MB binary, ≤ 50 MB RSS idle
- **Persistence:** local config file + small state DB (BoltDB or JSON) for queued reports if server is unreachable
- **Restic invocation:** spawns `restic` with `--json`, parses streamed output, forwards to server in real time
- **Updates:** distributed via the OS package manager: an apt repo (Linux)
  and a Chocolatey package (Windows), both pointing at Gitea releases.
  There is no bespoke signed-binary self-update; the
  `restic-manager-agent update` command is a thin wrapper over
  `apt-get install --only-upgrade` / `choco upgrade`. The UI surfaces
  "agent N versions behind server" so an operator knows when to upgrade.
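
As an illustration of the `--json` parsing mentioned above, here is a sketch that turns restic's `status` lines from `restic backup --json` into `job.progress` payloads. The type and function names (`JobProgress`, `parseStatusLine`, `streamProgress`) are assumptions for this example. restic reports `percent_done` as a fraction between 0 and 1; the sketch passes it through unchanged and leaves `throughput_bps` out, since restic's status line does not carry it.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// resticStatus holds the fields of a restic "status" line that job.progress needs.
type resticStatus struct {
	MessageType      string  `json:"message_type"`
	PercentDone      float64 `json:"percent_done"`
	TotalFiles       uint64  `json:"total_files"`
	FilesDone        uint64  `json:"files_done"`
	TotalBytes       uint64  `json:"total_bytes"`
	BytesDone        uint64  `json:"bytes_done"`
	SecondsRemaining uint64  `json:"seconds_remaining"`
}

// JobProgress is a hypothetical Go shape for the job.progress message.
type JobProgress struct {
	JobID       string  `json:"job_id"`
	PercentDone float64 `json:"percent_done"`
	FilesDone   uint64  `json:"files_done"`
	TotalFiles  uint64  `json:"total_files"`
	BytesDone   uint64  `json:"bytes_done"`
	TotalBytes  uint64  `json:"total_bytes"`
	ETASeconds  uint64  `json:"eta_seconds"`
}

// parseStatusLine converts one output line; ok is false for non-JSON lines
// and for message types other than "status" (for example "summary").
func parseStatusLine(jobID string, line []byte) (JobProgress, bool) {
	var st resticStatus
	if err := json.Unmarshal(line, &st); err != nil || st.MessageType != "status" {
		return JobProgress{}, false
	}
	return JobProgress{
		JobID:       jobID,
		PercentDone: st.PercentDone,
		FilesDone:   st.FilesDone,
		TotalFiles:  st.TotalFiles,
		BytesDone:   st.BytesDone,
		TotalBytes:  st.TotalBytes,
		ETASeconds:  st.SecondsRemaining,
	}, true
}

// streamProgress reads restic's stdout line by line and emits progress events.
func streamProgress(r io.Reader, jobID string, emit func(JobProgress)) error {
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		if p, ok := parseStatusLine(jobID, sc.Bytes()); ok {
			emit(p)
		}
	}
	return sc.Err()
}

func main() {
	out := `{"message_type":"status","percent_done":0.25,"total_files":4,"files_done":1,"total_bytes":400,"bytes_done":100}
{"message_type":"summary","files_new":4}`
	_ = streamProgress(strings.NewReader(out), "job-1", func(p JobProgress) {
		fmt.Printf("%s: %d/%d files\n", p.JobID, p.FilesDone, p.TotalFiles)
	})
}
```

A real agent would attach `streamProgress` to the stdout pipe of the spawned `restic` process and forward skipped lines as `log.stream` messages.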

### 4.3 Restic REST server (Unraid)

- Run `restic/rest-server` Docker container
- `--append-only` enabled
- `--private-repos` enabled (each user only sees their own subpath)
- htpasswd file with one user per host
- Storage path mapped to Unraid share

## 5. Domain model

```
Host
  id, name, os, arch, agent_version, restic_version, protocol_version,
  enrolled_at, last_seen_at, status (online/offline/degraded),
  repo_id (FK), tags,
  current_job_id (FK nullable),
  last_backup_at, last_backup_status (succeeded|failed|cancelled|null),
  repo_size_bytes, snapshot_count, open_alert_count,
  applied_schedule_version
  # The bottom block (last_backup_*, repo_size_bytes, snapshot_count,
  # open_alert_count, applied_schedule_version) holds denormalised
  # projections, refreshed on job.finished, snapshots.report,
  # repo.stats, and alert state changes.
  # applied_schedule_version is the schedule_version the agent most
  # recently acknowledged via `schedule.ack` — lets the UI surface
  # drift when an agent is offline.

Repo
  id, name, url, kind (rest|s3|local), credential_id (FK),
  password_secret_id (FK),
  size_bytes, snapshot_count, dedup_ratio,
  last_check_at, last_check_status, lock_state (locked|unlocked),
  append_only (bool), credential_rotated_at
  # Bottom block is a cached projection from `restic stats` +
  # Credential row, refreshed by repo.stats agent messages.

Credential
  id, kind, username, secret_ref (encrypted),
  rotated_at

Schedule
  id, host_id (FK), kind (backup|forget|prune|check),
  cron_expr, paths (json), excludes (json), tags (json),
  retention_policy (json), options (json), pre_hook, post_hook,
  enabled
  # retention_policy: {keep_last, keep_hourly, keep_daily, keep_weekly,
  #                    keep_monthly, keep_yearly, keep_tag: [...]}
  # options:          {limit_upload_kbps, limit_download_kbps}
  # pre_hook/post_hook: see §14.3 (encrypted at rest)

Job
  id, host_id (FK), kind, status (queued|running|succeeded|failed|cancelled),
  schedule_id (FK nullable),
  actor_kind (user|schedule|system), actor_id (nullable),
  started_at, finished_at,
  exit_code, stats (json), error

JobLog
  job_id (FK), seq, ts, stream (stdout|stderr|event), payload

Snapshot  (cached projection from `restic snapshots --json`)
  id (restic id), host_id (FK), repo_id (FK),
  time, hostname, paths, tags, size_bytes, file_count

Alert
  id, host_id (FK nullable), kind, severity, message,
  created_at, acknowledged_at, resolved_at

User
  id, username, password_hash, role (admin|operator|viewer),
  created_at, last_login_at

Session
  id, user_id (FK), created_at, expires_at, ip, ua

AuditLog
  id, user_id (FK nullable), actor (user|agent|system),
  action, target_kind, target_id, ts, payload (json)
```
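
To make the `retention_policy` JSON concrete, the following sketch maps it onto `restic forget` flags. The struct and the `forgetArgs` helper are hypothetical names; the `--keep-*` flags themselves are restic's own. Treating a zero value as "rule not set" is an assumption of this example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// RetentionPolicy mirrors the keys listed for Schedule.retention_policy.
type RetentionPolicy struct {
	KeepLast    int      `json:"keep_last,omitempty"`
	KeepHourly  int      `json:"keep_hourly,omitempty"`
	KeepDaily   int      `json:"keep_daily,omitempty"`
	KeepWeekly  int      `json:"keep_weekly,omitempty"`
	KeepMonthly int      `json:"keep_monthly,omitempty"`
	KeepYearly  int      `json:"keep_yearly,omitempty"`
	KeepTag     []string `json:"keep_tag,omitempty"`
}

// forgetArgs builds the argument list for `restic forget`.
// Zero-valued rules are omitted (assumption: 0 means "not set").
func forgetArgs(p RetentionPolicy) []string {
	args := []string{"forget", "--json"}
	keep := func(flag string, n int) {
		if n > 0 {
			args = append(args, flag, strconv.Itoa(n))
		}
	}
	keep("--keep-last", p.KeepLast)
	keep("--keep-hourly", p.KeepHourly)
	keep("--keep-daily", p.KeepDaily)
	keep("--keep-weekly", p.KeepWeekly)
	keep("--keep-monthly", p.KeepMonthly)
	keep("--keep-yearly", p.KeepYearly)
	for _, tag := range p.KeepTag {
		args = append(args, "--keep-tag", tag)
	}
	return args
}

func main() {
	var p RetentionPolicy
	raw := `{"keep_daily": 7, "keep_weekly": 4, "keep_tag": ["pinned"]}`
	if err := json.Unmarshal([]byte(raw), &p); err != nil {
		panic(err)
	}
	fmt.Println(forgetArgs(p))
}
```

Building an argument slice rather than a shell string keeps tags with spaces or quotes safe, in line with the "no shell quoting hazards" rationale in §3.3.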

## 6. API surface (control plane)

### 6.1 UI/REST (browser → server)

```
POST   /api/auth/login
POST   /api/auth/logout

GET    /api/fleet/summary                (aggregate: host counts by status,
                                          total bytes, open alerts; reused by /metrics)

GET    /api/hosts                        ?tag=&status=&limit=&offset=
                                          (returns Host rows incl. denormalised
                                          last_backup_*, repo_size_bytes,
                                          snapshot_count, open_alert_count,
                                          current_job_id)
GET    /api/hosts/:id
DELETE /api/hosts/:id
POST   /api/hosts/:id/enrollment-token   (regenerate)
POST   /api/hosts/:id/agent/update       (signal an available update; agent does not self-update, see §4.2)

GET    /api/hosts/:id/snapshots          ?tag=&path=&since=&until=&limit=&offset=
GET    /api/hosts/:id/repo               (full Repo projection)
POST   /api/hosts/:id/jobs               (run-now: backup/forget/prune/check/unlock)
POST   /api/hosts/:id/restore            (restore wizard submit)

GET    /api/hosts/:id/schedules
POST   /api/hosts/:id/schedules
PUT    /api/schedules/:id
DELETE /api/schedules/:id

GET    /api/jobs                         ?host_id=&kind=&status=&since=&until=
                                          &limit=&offset=&order=desc
GET    /api/jobs/:id
GET    /api/jobs/:id/logs                (paginated: ?after_seq=&limit=)
WS     /api/jobs/:id/stream              (live progress; see §6.2 for shape)
POST   /api/jobs/:id/cancel

GET    /api/repos
GET    /api/repos/:id

GET    /api/alerts
POST   /api/alerts/:id/ack

GET    /api/audit
GET    /api/users   (admin)
POST   /api/users   (admin)
```

**Realtime strategy:** in the browser-facing API, only `/api/jobs/:id/stream`
uses WS. All other screens (dashboard, hosts, snapshots) refresh via HTMX
polling (~10s cadence). Revisit if dashboard staleness becomes a problem in
practice.

### 6.2 Agent ↔ Server

Single authenticated WebSocket per agent. Bidirectional JSON-RPC-ish messages.

**Agent → server:**
- `hello` (host metadata, agent_version, restic_version, OS,
   `protocol_version` — see "Protocol versioning" below)
- `heartbeat` (every 30s)
- `job.started` (job_id, kind, started_at)
- `job.progress` (job_id, percent_done, files_done, total_files,
   bytes_done, total_bytes, eta_seconds, throughput_bps)
- `job.finished` (job_id, status, exit_code, stats, error, finished_at)
- `snapshots.report` (full list after each successful backup)
- `repo.stats` (size_bytes, snapshot_count, dedup_ratio, last_check_at,
   last_check_status, lock_state)
- `log.stream` (live stdout/stderr lines while job running;
   {job_id, seq, ts, stream: stdout|stderr|event, payload})
- `schedule.ack` (schedule_version) — agent confirms it has applied a
   schedule push; lets the server surface "this host is N versions
   behind" without polling

**Server → agent:**
- `command.run` (kind, args)
- `command.cancel` (job_id)
- `schedule.set` (schedule_version, schedules: [...]) — full schedule
   list, agent reconciles local cron and replies with `schedule.ack`
- `config.update`
- `agent.update.available` (new version + package source URL —
   informational only; agent does not self-update, see §4.2)

The server fans `job.progress` and `log.stream` for a given job to all
browsers subscribed to `WS /api/jobs/:id/stream` (§6.1) without
transformation, so the schema is shared end-to-end.
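
The spec does not fix the wire envelope for these messages. One possible shape, shown purely as a sketch, is a `type` discriminator plus a raw `payload` that is decoded only once the type is known. The `Envelope` layout and the helper names below are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Envelope is a hypothetical frame shared by both directions of the WebSocket.
type Envelope struct {
	Type    string          `json:"type"`    // e.g. "job.started", "command.run"
	Payload json.RawMessage `json:"payload"` // decoded lazily, by Type
}

// JobStarted carries the fields listed for the job.started message.
type JobStarted struct {
	JobID     string `json:"job_id"`
	Kind      string `json:"kind"`
	StartedAt string `json:"started_at"` // RFC 3339 timestamp
}

// encode wraps a typed payload into an envelope.
func encode(msgType string, payload any) ([]byte, error) {
	body, err := json.Marshal(payload)
	if err != nil {
		return nil, err
	}
	return json.Marshal(Envelope{Type: msgType, Payload: body})
}

// decode parses the envelope only; the payload stays raw.
func decode(data []byte) (Envelope, error) {
	var env Envelope
	err := json.Unmarshal(data, &env)
	return env, err
}

// payloadAs decodes the payload into the type the caller expects.
func payloadAs[T any](env Envelope) (T, error) {
	var v T
	err := json.Unmarshal(env.Payload, &v)
	return v, err
}

func main() {
	wire, err := encode("job.started", JobStarted{JobID: "j1", Kind: "backup", StartedAt: "2024-01-01T02:00:00Z"})
	if err != nil {
		panic(err)
	}
	env, err := decode(wire)
	if err != nil {
		panic(err)
	}
	switch env.Type {
	case "job.started":
		msg, _ := payloadAs[JobStarted](env)
		fmt.Println("started:", msg.JobID, msg.Kind)
	default:
		fmt.Println("unhandled message type:", env.Type)
	}
}
```

Keeping the payload as `json.RawMessage` also lets the server relay `job.progress` and `log.stream` frames to browsers without re-encoding them.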

**Protocol versioning.** Agents and the server each declare an integer
`protocol_version` in `hello`. The version bumps **only** on breaking
wire-format changes, not on every software release. The server
maintains a `MinAgentProtocolVersion` constant; agents below it are
disconnected with `error: protocol_too_old` and a URL pointing at the
upgrade instructions. Symmetrically, an agent talking to a server that
advertises a `protocol_version` it does not recognise refuses to
proceed and surfaces a clear log message. This avoids the failure mode
of "weird JSON parse errors when v0.3 agent meets v0.5 server."
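
The two checks described above might look like this. The version numbers are illustrative only, and the `agentSupported` set is an assumed way for an agent build to record which server versions it recognises.

```go
package main

import (
	"errors"
	"fmt"
)

// MinAgentProtocolVersion is the server-side floor; the value is illustrative.
const MinAgentProtocolVersion = 2

var (
	ErrProtocolTooOld        = errors.New("protocol_too_old")
	ErrUnknownServerProtocol = errors.New("server protocol_version not recognised")
)

// checkAgent runs on the server when an agent's hello arrives.
func checkAgent(agentVersion int) error {
	if agentVersion < MinAgentProtocolVersion {
		return ErrProtocolTooOld
	}
	return nil
}

// agentSupported lists the wire versions this (hypothetical) agent build understands.
var agentSupported = map[int]bool{2: true, 3: true}

// checkServer runs on the agent against the version the server advertises.
func checkServer(serverVersion int) error {
	if !agentSupported[serverVersion] {
		return ErrUnknownServerProtocol
	}
	return nil
}

func main() {
	fmt.Println(checkAgent(1))  // agent below the floor
	fmt.Println(checkAgent(3))  // accepted
	fmt.Println(checkServer(4)) // agent refuses an unknown server version
}
```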

**Schedule reconciliation when the server is unreachable.** Agents
keep firing the **last-known-good** schedule pushed by the server,
indefinitely. Rationale: a missed backup because the controller is
down is a worse outcome than firing a schedule the user has since
edited. On reconnect, the server's view is canonical: the next
`schedule.set` overrides whatever the agent was running, the agent
replies `schedule.ack` with the new `schedule_version`, and the server
updates `Host.applied_schedule_version`. The UI surfaces drift
("schedule v7 pushed, agent applied v5") when an agent has been
offline.
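
A minimal sketch of that reconciliation rule follows, with hypothetical type names. The agent keeps the last set it received, replaces it wholesale on every `schedule.set`, and the server computes drift from the acknowledged version.

```go
package main

import "fmt"

// Schedule is trimmed to the fields this example needs.
type Schedule struct {
	ID       string
	Kind     string
	CronExpr string
}

// ScheduleSet is the payload of a schedule.set push.
type ScheduleSet struct {
	Version   int
	Schedules []Schedule
}

// Agent holds the last-known-good set; a real agent would also persist it
// to disk so it survives restarts while the server is unreachable.
type Agent struct {
	lastKnownGood ScheduleSet
}

// Apply replaces the local set with the server's view (the server is
// canonical) and returns the version to report in schedule.ack.
func (a *Agent) Apply(set ScheduleSet) int {
	set.Schedules = append([]Schedule(nil), set.Schedules...) // defensive copy
	a.lastKnownGood = set
	return set.Version
}

// Active returns the schedules to keep firing, online or offline.
func (a *Agent) Active() []Schedule {
	return a.lastKnownGood.Schedules
}

// drift is the server-side view: how many versions behind the agent is.
func drift(pushed, applied int) int {
	if pushed > applied {
		return pushed - applied
	}
	return 0
}

func main() {
	a := &Agent{}
	acked := a.Apply(ScheduleSet{Version: 5, Schedules: []Schedule{
		{ID: "s1", Kind: "backup", CronExpr: "0 2 * * *"},
	}})
	// The server has since moved on to v7 while the agent was offline.
	fmt.Printf("schedule v7 pushed, agent applied v%d (drift %d)\n", acked, drift(7, acked))
	fmt.Println("still firing:", a.Active())
}
```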

### 6.3 Enrollment

1. Operator clicks "Add host" → server generates one-time token (TTL 1h)
2. Operator runs install script on endpoint with token
3. Agent calls `POST /api/agents/enroll` with token + host metadata
4. Server creates the host record and issues a persistent agent credential (bearer token + TLS pin)
5. Agent stores credential, opens WS connection
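
Steps 1 and 3 hinge on a single-use token with a TTL. The sketch below shows one way to implement that in memory; `tokenStore` and its methods are hypothetical, and storing only a SHA-256 digest of the token is a design assumption rather than something the spec requires. A real server would keep this in SQLite.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"sync"
	"time"
)

// enrollTTL matches the 1h lifetime from step 1.
const enrollTTL = time.Hour

var (
	errUnknownToken = errors.New("unknown or already used enrollment token")
	errExpiredToken = errors.New("enrollment token expired")
)

// tokenStore keeps digests of outstanding tokens with their expiry.
type tokenStore struct {
	mu      sync.Mutex
	ttl     time.Duration
	expires map[[sha256.Size]byte]time.Time
}

func newTokenStore(ttl time.Duration) *tokenStore {
	return &tokenStore{ttl: ttl, expires: make(map[[sha256.Size]byte]time.Time)}
}

// Issue creates a random token and records only its digest.
func (s *tokenStore) Issue() (string, error) {
	raw := make([]byte, 32)
	if _, err := rand.Read(raw); err != nil {
		return "", err
	}
	tok := hex.EncodeToString(raw)
	s.mu.Lock()
	defer s.mu.Unlock()
	s.expires[sha256.Sum256([]byte(tok))] = time.Now().Add(s.ttl)
	return tok, nil
}

// Redeem consumes a token: it succeeds at most once and only before expiry.
func (s *tokenStore) Redeem(tok string) error {
	key := sha256.Sum256([]byte(tok))
	s.mu.Lock()
	defer s.mu.Unlock()
	exp, ok := s.expires[key]
	if !ok {
		return errUnknownToken
	}
	delete(s.expires, key) // one-time: gone after the first attempt
	if time.Now().After(exp) {
		return errExpiredToken
	}
	return nil
}

func main() {
	store := newTokenStore(enrollTTL)
	tok, err := store.Issue()
	if err != nil {
		panic(err)
	}
	fmt.Println("first redeem: ", store.Redeem(tok))
	fmt.Println("second redeem:", store.Redeem(tok))
}
```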

## 7. Security

### 7.1 Authentication
- **Phase 1:** username + password (argon2id), HTTP-only secure session cookies, CSRF tokens on state-changing requests
- **Phase 2:** OIDC (Authelia, Keycloak, Authentik)
- **Agents:** bearer token over TLS; pin server cert fingerprint at enrollment time
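
For the agent-side pin, a sketch using only the standard library is shown below. It assumes `cert_pin_sha256` is the hex SHA-256 of the leaf certificate's DER bytes; the spec does not say whether the pin covers the whole certificate or only its public key, so treat that as an open choice. The function names are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"crypto/tls"
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

var errPinMismatch = errors.New("server certificate does not match cert_pin_sha256")

// verifyPin compares the SHA-256 of a DER-encoded certificate with the
// hex-encoded pin stored at enrollment time.
func verifyPin(certDER []byte, pinHex string) error {
	sum := sha256.Sum256(certDER)
	got := hex.EncodeToString(sum[:])
	want := strings.ToLower(strings.TrimSpace(pinHex))
	if subtle.ConstantTimeCompare([]byte(got), []byte(want)) != 1 {
		return errPinMismatch
	}
	return nil
}

// pinnedTLSConfig keeps normal chain verification and adds the pin check on
// top of it via VerifyConnection.
func pinnedTLSConfig(pinHex string) *tls.Config {
	return &tls.Config{
		MinVersion: tls.VersionTLS12,
		VerifyConnection: func(cs tls.ConnectionState) error {
			if len(cs.PeerCertificates) == 0 {
				return errors.New("server presented no certificate")
			}
			return verifyPin(cs.PeerCertificates[0].Raw, pinHex)
		},
	}
}

func main() {
	// Stand-in bytes instead of a real certificate, to keep the example short.
	pin := "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
	fmt.Println(verifyPin([]byte("abc"), pin))
	fmt.Println(verifyPin([]byte("abd"), pin))
	_ = pinnedTLSConfig(pin) // would be passed to the agent's HTTP/WebSocket client
}
```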

### 7.2 Authorization (Phase 1: simple roles)
- **admin:** everything
- **operator:** trigger jobs, edit schedules, restore
- **viewer:** read-only
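
The three roles reduce to a small permission check. In this sketch the action names are invented for illustration, and letting operators read everything a viewer can is an assumption; user management stays admin-only, following the `(admin)` markers in §6.1.

```go
package main

import "fmt"

type Role string

const (
	Admin    Role = "admin"
	Operator Role = "operator"
	Viewer   Role = "viewer"
)

// Action names are hypothetical labels for the capabilities listed above.
type Action string

const (
	ActRead         Action = "read"
	ActTriggerJob   Action = "job.trigger"
	ActEditSchedule Action = "schedule.edit"
	ActRestore      Action = "restore"
	ActManageUsers  Action = "user.manage"
)

// can reports whether a role may perform an action. Unknown roles get nothing.
func can(r Role, a Action) bool {
	switch r {
	case Admin:
		return true
	case Operator:
		switch a {
		case ActRead, ActTriggerJob, ActEditSchedule, ActRestore:
			return true
		}
	case Viewer:
		return a == ActRead
	}
	return false
}

func main() {
	for _, r := range []Role{Admin, Operator, Viewer} {
		fmt.Printf("%-8s restore=%v manage-users=%v\n", r, can(r, ActRestore), can(r, ActManageUsers))
	}
}
```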

### 7.3 Secret handling
- Restic repo passwords and REST-server credentials encrypted at rest in SQLite using a server-side key (loaded from env or file at startup)
- Pushed to agents only over the authenticated WS, only when needed for a job
- Agent stores them in OS keyring where available (Windows DPAPI, Linux Secret Service / fallback to encrypted file with restricted perms)
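
The spec does not name the cipher used for encryption at rest. As one reasonable option, the sketch below uses AES-256-GCM from the standard library, with the random nonce stored in front of the ciphertext; the function names and this layout are assumptions.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// newGCM builds an AES-GCM AEAD; a 32-byte key selects AES-256.
func newGCM(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// sealSecret encrypts plaintext and returns nonce || ciphertext.
func sealSecret(key, plaintext []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// openSecret reverses sealSecret and fails if the data was modified.
func openSecret(key, sealed []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	if len(sealed) < gcm.NonceSize() {
		return nil, errors.New("sealed secret too short")
	}
	nonce, ciphertext := sealed[:gcm.NonceSize()], sealed[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ciphertext, nil)
}

func main() {
	key := make([]byte, 32) // in practice: loaded from RM_SECRET_KEY_FILE
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	sealed, err := sealSecret(key, []byte("repo-password"))
	if err != nil {
		panic(err)
	}
	plain, err := openSecret(key, sealed)
	fmt.Println(string(plain), err)
}
```

Because GCM is authenticated, a tampered or truncated database value fails to decrypt instead of yielding garbage.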

### 7.4 Repo protection
- Restic REST server runs with `--append-only` for routine backups
- A separate non-append-only credential exists for `forget`/`prune` operations, used only when explicitly invoked from the UI by an admin/operator and audited

### 7.5 Audit
- Every state-changing UI action and every server→agent command logged with user, target, timestamp, and payload

## 8. UI

Stack: HTMX + Tailwind + Go `html/template`. No SPA framework. Server-rendered, progressive enhancement.

**Pages:**
- **Login**
- **Dashboard:** fleet overview (host cards: status, last backup, repo size, alerts)
- **Host detail:** tabs for Snapshots / Schedules / Jobs / Repo / Settings
- **Job detail:** live log streaming via WS, cancel button
- **Restore wizard:** host → snapshot → paths → target → confirm
- **Repos:** aggregate view across hosts
- **Alerts:** list, acknowledge
- **Settings:** users (admin), notification channels, agent download
- **Audit log**

## 9. Alerting

- **Triggers:** backup failed, backup hasn't run in N hours past its schedule, repo `check` failed, agent offline > N minutes, repo size growth anomaly
- **Channels (Phase 1):** webhook, ntfy, email (SMTP)
- **Channels (Phase 2+):** Discord, Slack, Pushover

## 10. Deployment

### 10.1 Control plane (Proxmox host or LXC)

The server is HTTP-only by design — operators front it with their own
TLS-terminating reverse proxy (Caddy / Traefik / nginx). Bind the
container to localhost so the only public path is through the proxy.

`docker-compose.yml`:
```yaml
services:
  restic-manager:
    image: ghcr.io/<owner>/restic-manager:latest
    restart: unless-stopped
    ports:
      - "127.0.0.1:8080:8080"
    volumes:
      - ./data:/data
    environment:
      - RM_DATA_DIR=/data
      - RM_LISTEN=:8080
      - RM_BASE_URL=https://restic.lab.example
      - RM_SECRET_KEY_FILE=/data/secret.key
      - RM_TRUSTED_PROXY=172.16.0.0/12   # CIDR of your reverse proxy
```

Reference Caddy snippet (operator's own Caddyfile, outside this repo):
```
restic.lab.example {
    encode zstd gzip
    reverse_proxy 127.0.0.1:8080
}
```
Caddy provisions and renews the cert; the agent's `cert_pin_sha256`
pins **Caddy's** leaf cert (that's what the agent actually sees).

`RM_LISTEN` is the source of truth for the server's bind address. The
`8080:8080` mapping above is just the matching default; change both
sides together if you pick a different port.

> ⚠️ Never expose the `RM_LISTEN` address directly on a public interface: the
> server has no TLS, no rate limiting, and no DDoS protection. All of that
> belongs in the proxy.

### 10.2 Restic REST server (Unraid)

Standard `restic/rest-server` container, `--append-only`, `--private-repos`, htpasswd mounted, data path on the share.

### 10.3 Agent install

- **Linux:** `curl -fsSL https://restic.lab.example/install.sh | sudo RM_TOKEN=xxx sh`
- **Windows:** `iwr https://restic.lab.example/install.ps1 | iex` (with `$env:RM_TOKEN`)
- Installer drops binary + service unit, calls enroll endpoint, starts service

## 11. Testing strategy

- **Unit tests:** restic JSON parsing, schedule reconciliation, retention policy logic
- **Integration tests:** spin up real `restic` + `rest-server` in Docker, exercise full backup/snapshot/restore flows
- **End-to-end:** Playwright against a compose-up'd stack with one Linux agent in a sibling container
- **Cross-platform agent CI:** build matrix `linux/amd64`, `linux/arm64`, `windows/amd64`; smoke test on Windows runner

## 12. Repository layout

```
restic-manager/
├── cmd/
│   ├── server/
│   └── agent/
├── internal/
│   ├── api/             # shared API types
│   ├── server/
│   │   ├── http/
│   │   ├── ws/
│   │   └── ui/          # templates, handlers
│   ├── agent/
│   │   ├── service/     # systemd / windows service glue
│   │   ├── runner/      # restic invocation
│   │   └── scheduler/
│   ├── restic/          # restic CLI wrapper, --json parsing
│   ├── store/           # sqlite layer
│   ├── crypto/          # secret encryption
│   └── auth/
├── web/
│   ├── templates/
│   └── static/
├── deploy/
│   ├── docker-compose.yml
│   ├── Dockerfile.server
│   └── install/
│       ├── install.sh
│       └── install.ps1
├── docs/
├── LICENSE              # PolyForm Noncommercial 1.0.0
├── README.md
├── spec.md
└── tasks.md
```

## 13. Phased delivery

- **Phase 1 (MVP):** server skeleton, agent skeleton, enrollment, host list, snapshot list, on-demand backup, live job log
- **Phase 2:** schedules, retention, run-now for `forget`/`prune`/`check`/`unlock`, repo stats
- **Phase 3:** restore wizard, alerts (webhook/ntfy/email), audit log
- **Phase 4:** agent update packaging (apt/Chocolatey, see §4.2), OIDC, multi-user/RBAC polish, repo trends
- **Phase 5:** OSS readiness — docs site, contribution guide, screenshot tour

## 14. Confirmed extensions (in scope)

These were originally listed as open questions and have since been confirmed for inclusion. Each is assigned to a phase below.

### 14.1 Cross-host restore

Restore a snapshot taken on host A onto host B (e.g. recover a dead box onto a fresh one, clone a workload onto a sibling host, restore a developer's home dir onto a new laptop).

- **Credential model:** target host's agent receives a temporary, server-issued read credential for the source host's repo, scoped to a single restore job and revoked immediately after
- **Path remapping:** UI allows rewriting source paths to target paths (e.g. `/home/alice` → `/home/alice-new`)
- **Permissions:** restore runs as the agent's service user; UI surfaces a warning when source paths require root and target service user is non-root
- **Phase:** 3 (with the restore wizard)

### 14.2 Bandwidth limiting

Per-host upload/download caps for backup, restore, and prune jobs.

- Exposed on the schedule editor as optional `--limit-upload` / `--limit-download` (KiB/s, the unit restic's flags use)
- Also overridable on run-now jobs via the UI
- Persisted in `Schedule.options` (JSON blob) so the schema stays stable
- **Phase:** 2 (with scheduling)
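
A sketch of how the stored limits and a run-now override could combine into restic flags. The `Options` struct follows the keys in `Schedule.options`; `effective` and `limitArgs` are hypothetical helpers, and treating `0` as "no limit" is an assumption. Values are passed to restic unchanged.

```go
package main

import (
	"fmt"
	"strconv"
)

// Options mirrors the Schedule.options JSON blob.
type Options struct {
	LimitUploadKbps   int `json:"limit_upload_kbps,omitempty"`
	LimitDownloadKbps int `json:"limit_download_kbps,omitempty"`
}

// effective overlays a run-now override onto the schedule's own options.
// Only non-zero override fields win; a nil override changes nothing.
func effective(schedule Options, override *Options) Options {
	out := schedule
	if override == nil {
		return out
	}
	if override.LimitUploadKbps > 0 {
		out.LimitUploadKbps = override.LimitUploadKbps
	}
	if override.LimitDownloadKbps > 0 {
		out.LimitDownloadKbps = override.LimitDownloadKbps
	}
	return out
}

// limitArgs renders the limits as restic flags; unset limits add nothing.
func limitArgs(o Options) []string {
	var args []string
	if o.LimitUploadKbps > 0 {
		args = append(args, "--limit-upload", strconv.Itoa(o.LimitUploadKbps))
	}
	if o.LimitDownloadKbps > 0 {
		args = append(args, "--limit-download", strconv.Itoa(o.LimitDownloadKbps))
	}
	return args
}

func main() {
	scheduled := Options{LimitUploadKbps: 1024}
	runNow := &Options{LimitUploadKbps: 4096}
	fmt.Println(limitArgs(effective(scheduled, nil)))
	fmt.Println(limitArgs(effective(scheduled, runNow)))
}
```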

### 14.3 Pre/post backup hooks

Per-host shell commands run before and after a backup job. Use cases: `mysqldump`/`pg_dump` to a staging path, stop/start Docker containers, quiesce a service, post-backup notifications.

- **Schema:** `Schedule.pre_hook` and `Schedule.post_hook` (string, optional). For host-wide defaults, `Host.pre_hook_default` / `Host.post_hook_default` apply to every backup schedule on that host unless the schedule overrides them
- **Applicability:** hooks are only meaningful for `kind = backup` schedules. The API rejects non-null `pre_hook` / `post_hook` on any other schedule kind (`forget`, `prune`, `check`) with a clear validation error rather than silently ignoring them. The same constraint applies to `Host.pre_hook_default` / `Host.post_hook_default`: they only fire for backup schedules on that host
- **Execution:** agent runs hooks via the host's default shell (`/bin/sh` Linux, `cmd.exe` or PowerShell Windows — host-configurable)
- **Failure semantics:** `pre_hook` non-zero exit aborts the backup and marks the job failed. `post_hook` runs on both success and failure (with `RM_JOB_STATUS` env var); its own exit code is recorded but does not change the backup job's final status
- **Stdout/stderr:** captured into `JobLog` like restic output, prefixed `pre_hook:` / `post_hook:`
- **Security:** hooks are stored encrypted; only admins can edit them; every edit audit-logged
- **Phase:** 2 (with scheduling)
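
The failure semantics above can be sketched as a small orchestration function. The names are hypothetical, `shRunner` covers only the Linux `/bin/sh` case, and running `post_hook` even after a `pre_hook` abort is an assumption, since the spec does not settle that case.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"os/exec"
)

// hookRunner executes one hook command with extra environment variables.
type hookRunner func(command string, env []string) (exitCode int, err error)

// shRunner runs a hook through /bin/sh (Linux default shell in the spec).
func shRunner(command string, env []string) (int, error) {
	cmd := exec.Command("/bin/sh", "-c", command)
	cmd.Env = append(os.Environ(), env...)
	err := cmd.Run()
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		return exitErr.ExitCode(), nil
	}
	if err != nil {
		return -1, err // the shell could not be started at all
	}
	return 0, nil
}

// Result records the final job status and both hook exit codes.
type Result struct {
	Status       string // "succeeded" or "failed"
	PreHookExit  int
	PostHookExit int
}

// runBackupWithHooks applies the rules: a failing pre_hook aborts the backup;
// post_hook always runs and never changes the job status.
func runBackupWithHooks(pre, post string, run hookRunner, backup func() error) Result {
	res := Result{Status: "succeeded"}
	if pre != "" {
		code, err := run(pre, nil)
		res.PreHookExit = code
		if err != nil || code != 0 {
			res.Status = "failed"
		}
	}
	if res.Status == "succeeded" {
		if err := backup(); err != nil {
			res.Status = "failed"
		}
	}
	if post != "" {
		code, _ := run(post, []string{"RM_JOB_STATUS=" + res.Status})
		res.PostHookExit = code // recorded only
	}
	return res
}

func main() {
	res := runBackupWithHooks("exit 0", `test "$RM_JOB_STATUS" = succeeded`, shRunner,
		func() error { return nil }) // stand-in for the real restic invocation
	fmt.Printf("%+v\n", res)
}
```

Passing the runner as a function keeps the orchestration testable without a shell and leaves room for a `cmd.exe` or PowerShell runner on Windows.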

### 14.4 Prometheus `/metrics` endpoint

Standard Prometheus exposition on `/metrics`, protected by either bearer token or IP allow-list.

- **Metrics (per host):**
  - `restic_manager_last_backup_timestamp_seconds{host=...}`
  - `restic_manager_last_backup_status{host=...}` (1=success, 0=failure)
  - `restic_manager_repo_size_bytes{host=...}`
  - `restic_manager_snapshot_count{host=...}`
  - `restic_manager_agent_online{host=...}` (1/0)
  - `restic_manager_job_duration_seconds{kind=...,host=...}` (histogram; exported as `_bucket`, `_sum`, and `_count` series)
- **Server-level:** `restic_manager_jobs_total{kind=...,status=...}`, `restic_manager_alerts_active`, `restic_manager_build_info`
- **Phase:** 4 (alongside repo trend charts — both rely on the same time-series data)
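
To show what the exposition looks like on the wire, here is a dependency-free sketch that renders three of the per-host gauges in the Prometheus text format. A real implementation would more likely use the official Prometheus Go client; `HostMetrics` and `renderMetrics` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// HostMetrics is a trimmed projection of the Host row.
type HostMetrics struct {
	Name          string
	RepoSizeBytes int64
	SnapshotCount int
	Online        bool
}

// labelEscaper applies the escaping the text format requires in label values.
var labelEscaper = strings.NewReplacer(`\`, `\\`, `"`, `\"`, "\n", `\n`)

// renderMetrics writes one TYPE line per metric followed by all of its
// samples, so samples of the same metric stay grouped together.
func renderMetrics(hosts []HostMetrics) string {
	var b strings.Builder
	gauge := func(name string, value func(HostMetrics) int64) {
		fmt.Fprintf(&b, "# TYPE %s gauge\n", name)
		for _, h := range hosts {
			fmt.Fprintf(&b, "%s{host=\"%s\"} %d\n", name, labelEscaper.Replace(h.Name), value(h))
		}
	}
	gauge("restic_manager_repo_size_bytes", func(h HostMetrics) int64 { return h.RepoSizeBytes })
	gauge("restic_manager_snapshot_count", func(h HostMetrics) int64 { return int64(h.SnapshotCount) })
	gauge("restic_manager_agent_online", func(h HostMetrics) int64 {
		if h.Online {
			return 1
		}
		return 0
	})
	return b.String()
}

func main() {
	fmt.Print(renderMetrics([]HostMetrics{
		{Name: "web01", RepoSizeBytes: 2048, SnapshotCount: 3, Online: true},
		{Name: "db01", RepoSizeBytes: 4096, SnapshotCount: 9, Online: false},
	}))
}
```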

## 15. Future considerations (not yet committed)

- Read-only share links for snapshot listings (auditor view) — out of scope for personal/lab use; revisit if multi-tenant or org use cases emerge