ee3ee241ea
Cohesive batch from a smoke-test session against a real rest-server.
Themed bullets:
* Agent runs as root, sandboxed via systemd. CapabilityBoundingSet
drops to CAP_DAC_READ_SEARCH + restore caps; ProtectSystem=strict
with ReadWritePaths confined to /etc + /var/lib/restic-manager;
NoNewPrivileges blocks escalation. Install script no longer
creates a service user. spec.md §4.2 / §14.1 / §14.3 explain the
rationale (matches UrBackup / Veeam / Bareos defaults; trying to
back up "everything" as an unprivileged user creates silent skips
on /home, /root, /var/lib/* with no upside vs the threat model
the agent already implies).
* Init-repo end-to-end. New JobKind="init" wired through agent
runner, restic.Env.RunInit, server dispatcher, and a UI button
(red "Initialise repo" in the run-now panel). hosts.repo_initialised_at
flips on init success, on backup success, or on a non-empty
snapshots.report. The "Run now" / "Init" / "Retry" branching now
drives both the dashboard host row and the host-detail panel.
Migrations 0004 (column), 0005 (jobs.kind CHECK widened — using
the safe create-new-then-rename pattern; first version corrupted
job_logs.job_id FK), 0006 (cleans up job_logs FK on already-
affected DBs).
* rest-server creds embedded at exec time only. restic.Env gains
RepoUsername; mergeRestCreds() builds the user:pass@-prefixed URL
inside envSlice() and never assigns it back to the struct, so
nothing slog-able ever sees the cleartext form. RedactURL helper
for any future surface that needs to log a URL safely. Both
helpers tested.
* Add-host UX. Repo password is now optional — server mints a
24-byte URL-safe random one and surfaces it once, alongside an
htpasswd snippet ("echo PASS | htpasswd -B -i ... USERNAME") so
the operator pastes one command on the rest-server host and one
on the endpoint. Result page also links the install snippet at
/install/install.sh (was /install.sh — 404'd before) and pipes
to bash (not sh — script uses set -o pipefail and other
bashisms; on Debian/Ubuntu sh is dash).
* Late-subscriber race in JobHub. A fast-failing job could finish
(DB write + Broadcast) before the browser's HX-Redirect → page
load → WS-connect path completed, so the JS sat forever waiting
on a job.finished that already passed. JobHub split into
Register + Send + Run; handleJobStream now subscribes first,
re-fetches the job, and sends a synthetic job.finished if the
state is already terminal.
* HTMX error visibility. New toast partial listens to
htmx:responseError and surfaces the response body as a
bottom-right toast — every server-side validation error now
becomes visible without per-handler JS wiring. Also handles
custom rm:toast events for future server-pushed notifications
via the HX-Trigger header. Themed via existing CSS vars.
* Dashboard rows are now whole-row clickable to host detail
(CSS card-link pattern: absolute-positioned anchor + .row-action
z-index restoration so the action button stays clickable).
"View →" on a running job links to /jobs/<id> rather than
/hosts/<id> since the row click already covers the host page.
* "Run first" / "Run first backup" → "Run now" everywhere for
consistency.
* runbook (docs/e2e-smoke.md) updated — live-log streaming step
now reflects P1-26; mentions the browser-driven Run-now flow.
* _diag/dump-creds — moved out of cmd/ so go build doesn't pick
it up; .gitignore now excludes /_diag/ entirely.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
581 lines
29 KiB
Markdown
# restic-manager — Specification

## 1. Overview

**restic-manager** is a self-hosted, browser-based, single-pane-of-glass for managing [restic](https://restic.net) backups across a fleet of Linux and Windows endpoints. It provides visibility, scheduling, ad-hoc operations, restore workflows, and alerting from one UI.

It is built for small-to-medium fleets (initial target: ~12 endpoints) and is intentionally simple to deploy: one Docker Compose file on the control-plane host, one small agent binary on each endpoint.

**License:** PolyForm Noncommercial 1.0.0

## 2. Goals & Non-Goals

### Goals
- Central visibility into backup state for every endpoint
- Trigger any restic operation remotely (`backup`, `forget`, `prune`, `check`, `unlock`, `snapshots`, `stats`, `diff`, `restore`)
- Manage per-host backup schedules from the UI
- Live job progress streamed back to the UI
- Restore wizard (browse snapshots, pick paths, restore to original or alternate host)
- Repo health surfacing (size, dedup ratio, last check, lock state)
- Alerting on failure or staleness
- Cross-platform agent (Linux + Windows)
- Ransomware-resistant repo access via append-only credentials

### Non-Goals (initial release)
- Replacing restic itself or providing custom repo formats
- Managing non-restic backup tools
- Multi-tenancy / SaaS deployment
- High availability of the control plane (SQLite, single-instance)
- Mobile-native apps (responsive web only)

## 3. Architecture

### 3.1 Components

```
┌──────────────────────────────────────────────────────────────────┐
│  Proxmox cluster                                                 │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │  docker compose: restic-manager                            │  │
│  │   - server (Go binary, REST + WS API, embedded HTMX UI)    │  │
│  │   - SQLite volume                                          │  │
│  └────────────────────────────────────────────────────────────┘  │
└────────────────────────▲─────────────────────────────────────────┘
                         │ HTTPS (control plane)
                         │  - agent → server: status, telemetry
                         │  - server → agent: commands, schedules
                         │
┌────────────────────────┴─────────────────────────────────────────┐
│  Endpoints (Linux + Windows)                                     │
│  ┌──────────────────────┐    ┌────────────────────────────────┐  │
│  │  restic-manager-     │    │  restic CLI                    │  │
│  │  agent (Go binary)   │───▶│  invoked by agent              │  │
│  │   - systemd / svc    │    └─────────────┬──────────────────┘  │
│  │   - WS to server     │                  │ HTTPS               │
│  └──────────────────────┘                  │ (data plane)        │
└─────────────────────────────────────────────┼────────────────────┘
                                              │
                                              ▼
┌──────────────────────────────────────────────────────────────────┐
│  Unraid                                                          │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │  Docker: restic/rest-server                                │  │
│  │   - per-host append-only credentials                       │  │
│  │   - one repo per host                                      │  │
│  │   - storage: Unraid share                                  │  │
│  └────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────┘
```

### 3.2 Data flow

- **Backup data:** endpoint → restic CLI → restic REST server on Unraid → Unraid share. The control plane *never* touches backup bytes.
- **Control plane:** agent maintains an outbound WebSocket to the server. Server pushes commands and schedule changes; agent pushes status, logs, live job progress, host metadata.
- **UI:** browser → server (HTTPS, session cookies). Server fans out commands to agents, streams progress back to browser.

### 3.3 Why agent (not SSH)

- Push model works through NAT/firewalls without inbound rules
- Native Windows support without OpenSSH service quirks
- Local scheduling survives controller restarts
- Self-contained `restic --json` parsing, no remote shell quoting hazards

### 3.4 Why per-host repos

- Isolates corruption / lock contention
- Append-only credentials per host = compromised endpoint can't delete other hosts' backups
- Simpler `prune` orchestration (no global lock coordination)
- Trivially easy to retire a host (delete its repo + credential)

## 4. Components in detail

### 4.1 Server

- **Language:** Go 1.22+
- **Storage:** SQLite (via `modernc.org/sqlite`, no CGo)
- **HTTP:** `net/http` + `chi` router
- **WebSocket:** `github.com/coder/websocket` (the maintained fork of the
  unmaintained `nhooyr.io/websocket`; same API)
- **UI:** HTMX + Tailwind, server-rendered Go templates, no Node build step
- **Distribution:** single static binary, packaged in a Docker image; published `docker-compose.yml`
- **Config:** YAML or env vars:
  - `RM_LISTEN` — bind address, e.g. `:8080` (source of truth for the port;
    the `8080` in the reference compose is just a default mapping). Bind to
    `127.0.0.1:8080` when running behind a same-host proxy.
  - `RM_DATA_DIR`, `RM_BASE_URL`, `RM_SECRET_KEY_FILE`
  - `RM_TRUSTED_PROXY` — comma-separated CIDR list of reverse proxies
    whose `X-Forwarded-For` / `X-Forwarded-Proto` we honour. Empty (the
    default) = trust no one. Set this when fronted by Caddy/Traefik.
  - `RM_COOKIE_SECURE` — `true` (default) marks session cookies `Secure`.
    Only set to `false` for local HTTP-only testing.
- **TLS:** the server speaks plain HTTP and is **always** expected to sit
  behind a TLS-terminating reverse proxy (Caddy / Traefik / nginx). This
  keeps cert renewal, ACME, and SNI in the proxy where operators already
  manage it. Agents must reach the server over HTTPS; the cert pin
  (`cert_pin_sha256`) pins whatever cert the proxy serves.

### 4.2 Agent

- **Language:** Go (cross-compiled for `linux/amd64`, `linux/arm64`, `windows/amd64`).
  Phase 1 ships Linux only; Windows binaries continue to build in CI to keep
  the codebase portable, but Windows service integration, the signed
  installer, and `install.ps1` land in Phase 2.
- **Service integration:** systemd unit (Linux). Windows service via
  `golang.org/x/sys/windows/svc` — Phase 2.
- **Footprint goal:** ≤ 15 MB binary, ≤ 50 MB RSS idle
- **Privilege model:** the agent runs as root, sandboxed via systemd. A
  fleet-backup tool needs to read every file on the system regardless
  of DAC permissions; running as a dedicated unprivileged user means
  either silent skips on `/home`, `/root`, `/var/lib/<other-daemons>`,
  or operators having to add the service user to every group whose
  files they want backed up. Both are worse failure modes than running
  as root, given what the threat model already implies: the agent holds
  long-lived repo credentials, executes arbitrary `restic` commands, and
  runs operator-defined hooks, so its blast radius is already large. This
  matches how comparable tools ship (UrBackup client, Veeam
  Agent, Bareos FD, BackupPC client, borgmatic via systemd). The
  mitigation is aggressive systemd sandboxing of the root process:
  drop the capability set to `CAP_DAC_READ_SEARCH` (read any file)
  plus `CAP_DAC_OVERRIDE`/`CAP_FOWNER`/`CAP_CHOWN` (restore ownership);
  `NoNewPrivileges=true` blocks escalation; `ProtectSystem=strict`
  plus a tight `ReadWritePaths=` confines writes to `/etc/restic-manager`
  and `/var/lib/restic-manager`; `ProtectHome=read-only` keeps `/home`
  readable but immutable; standard `Protect*` / `Restrict*` toggles
  cover the rest. Hooks (P2) run as root by default with a per-hook
  override knob.
- **Persistence:** `agent.yaml` (server URL, host ID, bearer, secrets
  key) + an AEAD-encrypted secrets blob (`secrets.enc`) holding the
  restic repo URL + password. Both files are mode 0600 owned by root.
  Phase 1 ships the encrypted-file form on Linux; Phase 2 swaps that
  for OS-keyring storage (DPAPI on Windows, Secret Service / `pass`
  on Linux where a session bus is available — see §7.3). A small
  state DB (BoltDB or JSON) for queued reports lands when offline-resilience
  work does.
- **Restic invocation:** spawns `restic` with `--json`, parses streamed output, forwards to server in real time
- **Updates:** distributed via OS package manager — apt repo (Linux) and
  Chocolatey package (Windows), both pointing at Gitea releases. No
  bespoke signed-binary self-update; the `restic-manager-agent update`
  command is a thin wrapper over `apt-get install --only-upgrade` /
  `choco upgrade`. UI surfaces "agent N versions behind server" so an
  operator knows when to upgrade.

### 4.3 Restic REST server (Unraid)

- Run `restic/rest-server` Docker container
- `--append-only` enabled
- `--private-repos` enabled (each user only sees their own subpath)
- htpasswd file with one user per host
- Storage path mapped to Unraid share

## 5. Domain model

```
Host
  id, name, os, arch, agent_version, restic_version, protocol_version,
  enrolled_at, last_seen_at, status (online/offline/degraded),
  repo_id (FK), tags,
  current_job_id (FK nullable),
  last_backup_at, last_backup_status (succeeded|failed|cancelled|null),
  repo_size_bytes, snapshot_count, open_alert_count,
  applied_schedule_version
  # Bottom block (last_backup_*, repo_size_bytes, snapshot_count,
  # open_alert_count, applied_schedule_version) are denormalised
  # projections, refreshed on job.finished, snapshots.report,
  # repo.stats, and alert state changes.
  # applied_schedule_version is the schedule_version the agent most
  # recently acknowledged via `schedule.ack` — lets the UI surface
  # drift when an agent is offline.

Repo
  id, name, url, kind (rest|s3|local), credential_id (FK),
  password_secret_id (FK),
  size_bytes, snapshot_count, dedup_ratio,
  last_check_at, last_check_status, lock_state (locked|unlocked),
  append_only (bool), credential_rotated_at
  # Bottom block is a cached projection from `restic stats` +
  # Credential row, refreshed by repo.stats agent messages.

Credential
  id, kind, username, secret_ref (encrypted),
  rotated_at

Schedule
  id, host_id (FK), kind (backup|forget|prune|check),
  cron_expr, paths (json), excludes (json), tags (json),
  retention_policy (json), options (json), pre_hook, post_hook,
  enabled
  # retention_policy: {keep_last, keep_hourly, keep_daily, keep_weekly,
  #                    keep_monthly, keep_yearly, keep_tag: [...]}
  # options: {limit_upload_kbps, limit_download_kbps}
  # pre_hook/post_hook: see §14.3 (encrypted at rest)

Job
  id, host_id (FK), kind, status (queued|running|succeeded|failed|cancelled),
  scheduled_id (FK nullable),
  actor_kind (user|schedule|system), actor_id (nullable),
  started_at, finished_at,
  exit_code, stats (json), error

JobLog
  job_id (FK), seq, ts, stream (stdout|stderr|event), payload

Snapshot (cached projection from `restic snapshots --json`)
  id (restic id), host_id (FK), repo_id (FK),
  time, hostname, paths, tags, size_bytes, file_count

Alert
  id, host_id (FK nullable), kind, severity, message,
  created_at, acknowledged_at, resolved_at

User
  id, username, password_hash, role (admin|operator|viewer),
  created_at, last_login_at

Session
  id, user_id (FK), created_at, expires_at, ip, ua

AuditLog
  id, user_id (FK nullable), actor (user|agent|system),
  action, target_kind, target_id, ts, payload (json)
```

## 6. API surface (control plane)

### 6.1 UI/REST (browser → server)

```
POST   /api/auth/login
POST   /api/auth/logout

GET    /api/fleet/summary               (aggregate: host counts by status,
                                         total bytes, open alerts; reused by /metrics)

GET    /api/hosts                       ?tag=&status=&limit=&offset=
                                        (returns Host rows incl. denormalised
                                         last_backup_*, repo_size_bytes,
                                         snapshot_count, open_alert_count,
                                         current_job_id)
GET    /api/hosts/:id
DELETE /api/hosts/:id
POST   /api/hosts/:id/enrollment-token  (regenerate)
POST   /api/hosts/:id/agent/update      (trigger the agent's package-manager upgrade; see §4.2)

GET    /api/hosts/:id/snapshots         ?tag=&path=&since=&until=&limit=&offset=
GET    /api/hosts/:id/repo              (full Repo projection)
POST   /api/hosts/:id/jobs              (run-now: backup/forget/prune/check/unlock)
POST   /api/hosts/:id/restore           (restore wizard submit)

GET    /api/hosts/:id/schedules
POST   /api/hosts/:id/schedules
PUT    /api/schedules/:id
DELETE /api/schedules/:id

GET    /api/jobs                        ?host_id=&kind=&status=&since=&until=
                                        &limit=&offset=&order=desc
GET    /api/jobs/:id
GET    /api/jobs/:id/logs               (paginated: ?after_seq=&limit=)
WS     /api/jobs/:id/stream             (live progress; see §6.2 for shape)
POST   /api/jobs/:id/cancel

GET    /api/repos
GET    /api/repos/:id

GET    /api/alerts
POST   /api/alerts/:id/ack

GET    /api/audit
GET    /api/users                       (admin)
POST   /api/users                       (admin)
```

**Realtime strategy:** only `/api/jobs/:id/stream` uses WS. All other screens
(dashboard, hosts, snapshots) refresh via HTMX polling (~10s cadence). Revisit
if dashboard staleness becomes a problem in practice.

### 6.2 Agent ↔ Server

Single authenticated WebSocket per agent. Bidirectional JSON-RPC-ish messages.

**Agent → server:**
- `hello` (host metadata, agent_version, restic_version, OS,
  `protocol_version` — see "Protocol versioning" below)
- `heartbeat` (every 30s)
- `job.started` (job_id, kind, started_at)
- `job.progress` (job_id, percent_done, files_done, total_files,
  bytes_done, total_bytes, eta_seconds, throughput_bps)
- `job.finished` (job_id, status, exit_code, stats, error, finished_at)
- `snapshots.report` (full list after each successful backup)
- `repo.stats` (size_bytes, snapshot_count, dedup_ratio, last_check_at,
  last_check_status, lock_state)
- `log.stream` (live stdout/stderr lines while job running;
  {job_id, seq, ts, stream: stdout|stderr|event, payload})
- `schedule.ack` (schedule_version) — agent confirms it has applied a
  schedule push; lets the server surface "this host is N versions
  behind" without polling

**Server → agent:**
- `command.run` (kind, args)
- `command.cancel` (job_id)
- `schedule.set` (schedule_version, schedules: [...]) — full schedule
  list, agent reconciles local cron and replies with `schedule.ack`
- `config.update`
- `agent.update.available` (new version + package source URL —
  informational only; agent does not self-update, see §4.2)

The server fans `job.progress` and `log.stream` for a given job to all
browsers subscribed to `WS /api/jobs/:id/stream` (§6.1) without
transformation, so the schema is shared end-to-end.

**Protocol versioning.** Agents and the server each declare an integer
`protocol_version` in `hello`. The version is bumped **only** on breaking
wire-format changes; it is independent of the human-readable software
release version. The server
maintains a `MinAgentProtocolVersion` constant; agents below it are
disconnected with `error: protocol_too_old` and a URL pointing at the
upgrade instructions. Symmetrically, an agent talking to a server that
advertises a `protocol_version` it does not recognise refuses to
proceed and surfaces a clear log message. This avoids the failure mode
of "weird JSON parse errors when v0.3 agent meets v0.5 server."

**Schedule reconciliation when the server is unreachable.** Agents
keep firing the **last-known-good** schedule pushed by the server,
indefinitely. Rationale: a missed backup because the controller is
down is a worse outcome than firing a schedule the user has since
edited. On reconnect, the server's view is canonical: the next
`schedule.set` overrides whatever the agent was running, the agent
replies `schedule.ack` with the new `schedule_version`, and the server
updates `Host.applied_schedule_version`. The UI surfaces drift
("schedule v7 pushed, agent applied v5") when an agent has been
offline.

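The drift bookkeeping described above amounts to two counters per host. A minimal sketch follows, assuming the server ignores acks for versions it never pushed or has already recorded; that rule, and the names `scheduleState`, `push`, `ack`, and `drift`, are illustrative rather than taken from the codebase.

```go
package main

import "fmt"

// scheduleState tracks, per host, what the server pushed and what the
// agent has acknowledged. Illustrative type.
type scheduleState struct {
	Pushed  int // latest schedule_version sent via schedule.set
	Applied int // mirrors Host.applied_schedule_version
}

// push records a new schedule.set and returns the version that was sent.
func (s *scheduleState) push() int {
	s.Pushed++
	return s.Pushed
}

// ack handles a schedule.ack. Acks for versions never pushed, or not newer
// than the recorded one, are ignored (an assumption of this sketch).
func (s *scheduleState) ack(version int) bool {
	if version > s.Pushed || version <= s.Applied {
		return false
	}
	s.Applied = version
	return true
}

// drift returns the UI message, or "" when the agent is up to date.
func (s scheduleState) drift() string {
	if s.Applied == s.Pushed {
		return ""
	}
	return fmt.Sprintf("schedule v%d pushed, agent applied v%d", s.Pushed, s.Applied)
}

func main() {
	st := scheduleState{Pushed: 5, Applied: 5}
	st.push() // v6, agent offline
	st.push() // v7, agent still offline
	fmt.Println(st.drift())
	st.ack(7) // agent reconnects and acknowledges the canonical schedule
	fmt.Println(st.drift() == "")
}
```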
### 6.3 Enrollment

1. Operator clicks "Add host" → server generates one-time token (TTL 1h)
2. Operator runs install script on endpoint with token
3. Agent calls `POST /api/agents/enroll` with token + host metadata
4. Server issues persistent agent credential (bearer token + TLS pin) and host record
5. Agent stores credential, opens WS connection

## 7. Security

### 7.1 Authentication
- **Phase 1:** username + password (argon2id), HTTP-only secure session cookies, CSRF tokens on state-changing requests
- **Phase 2:** OIDC (Authelia, Keycloak, Authentik)
- **Agents:** bearer token over TLS; pin server cert fingerprint at enrollment time

### 7.2 Authorization (Phase 1: simple roles)
- **admin:** everything
- **operator:** trigger jobs, edit schedules, restore
- **viewer:** read-only

### 7.3 Secret handling
- Restic repo passwords and REST-server credentials encrypted at rest
  in SQLite using a server-side key (loaded from env or file at
  startup, AEAD via `internal/crypto`).
- Operator supplies repo URL + username + password when minting an
  enrollment token. The token row holds them as a single encrypted
  blob; on `ConsumeEnrollmentToken` the blob is moved to a
  `host_credentials` row keyed by `host_id` (same tx).
- Pushed to agents over the authenticated WS as a `config.update`
  message — sent immediately after the agent's `hello` on every
  connect, and again whenever the operator edits the credential.
  Agents that connect before credentials exist proceed normally
  but refuse to start backup jobs until the push arrives.
- Agent persistence:
  - **Phase 1, Linux:** AEAD-encrypted file at
    `/var/lib/restic-manager/secrets.enc`, key stored in
    `agent.yaml` alongside the bearer (same 0600 trust boundary).
    Atomic writes (tmp+fsync+rename).
  - **Phase 2:** OS keyring where available — Windows DPAPI; Linux
    Secret Service via `pass` / `gnome-keyring` / `kwallet` when a
    session bus is present. The encrypted-file path stays as the
    fallback for headless boxes.
- Plaintext repo passwords never appear in `agent.yaml`, server logs,
  audit-log payloads, or job-log streams. The audit log records
  *that* a credential was set/changed and by whom, never the value.

### 7.4 Repo protection
- Restic REST server runs with `--append-only` for routine backups
- A separate non-append-only credential exists for `forget`/`prune` operations, used only when explicitly invoked from the UI by an admin/operator and audited

### 7.5 Audit
- Every state-changing UI action and every server→agent command logged with user, target, timestamp, and payload

## 8. UI

Stack: HTMX + Tailwind + Go `html/template`. No SPA framework. Server-rendered, progressive enhancement.

**Pages:**
- **Login**
- **Dashboard:** fleet overview (host cards: status, last backup, repo size, alerts)
- **Host detail:** tabs for Snapshots / Schedules / Jobs / Repo / Settings
- **Job detail:** live log streaming via WS, cancel button
- **Restore wizard:** host → snapshot → paths → target → confirm
- **Repos:** aggregate view across hosts
- **Alerts:** list, acknowledge
- **Settings:** users (admin), notification channels, agent download
- **Audit log**

## 9. Alerting

- **Triggers:** backup failed, backup hasn't run in N hours past its schedule, repo `check` failed, agent offline > N minutes, repo size growth anomaly
- **Channels (Phase 1):** webhook, ntfy, email (SMTP)
- **Channels (Phase 2+):** Discord, Slack, Pushover

## 10. Deployment

### 10.1 Control plane (Proxmox host or LXC)

The server is HTTP-only by design — operators front it with their own
TLS-terminating reverse proxy (Caddy / Traefik / nginx). Bind the
container to localhost so the only public path is through the proxy.

`docker-compose.yml`:
```yaml
services:
  restic-manager:
    image: ghcr.io/<owner>/restic-manager:latest
    restart: unless-stopped
    ports:
      - "127.0.0.1:8080:8080"
    volumes:
      - ./data:/data
    environment:
      - RM_DATA_DIR=/data
      - RM_LISTEN=:8080
      - RM_BASE_URL=https://restic.lab.example
      - RM_SECRET_KEY_FILE=/data/secret.key
      - RM_TRUSTED_PROXY=172.16.0.0/12   # CIDR of your reverse proxy
```

Reference Caddy snippet (operator's own Caddyfile, outside this repo):
```
restic.lab.example {
    encode zstd gzip
    reverse_proxy 127.0.0.1:8080
}
```
Caddy provisions and renews the cert; the agent's `cert_pin_sha256`
pins **Caddy's** leaf cert (that's what the agent actually sees).

`RM_LISTEN` is the source of truth for the server's bind address. The
`8080:8080` mapping above is just the matching default; change both
sides together if you pick a different port.

> ⚠️ Never expose `RM_LISTEN` directly on a public interface — the
> server has no TLS, no rate limiting, and no DDoS protection. That
> all belongs in the proxy.

### 10.2 Restic REST server (Unraid)

Standard `restic/rest-server` container, `--append-only`, `--private-repos`, htpasswd mounted, data path on the share.

### 10.3 Agent install

- **Linux:** `curl -fsSL https://restic.lab.example/install/install.sh | sudo RM_TOKEN=xxx bash` (the script is served under `/install/` and relies on `set -o pipefail` and other bashisms, so pipe it to `bash`, not `sh`; on Debian/Ubuntu `sh` is dash)
- **Windows:** `iwr https://restic.lab.example/install.ps1 | iex` (with `$env:RM_TOKEN`)
- Installer drops binary + service unit, calls enroll endpoint, starts service

## 11. Testing strategy

- **Unit tests:** restic JSON parsing, schedule reconciliation, retention policy logic
- **Integration tests:** spin up real `restic` + `rest-server` in Docker, exercise full backup/snapshot/restore flows
- **End-to-end:** Playwright against a compose-up'd stack with one Linux agent in a sibling container
- **Cross-platform agent CI:** build matrix `linux/amd64`, `linux/arm64`, `windows/amd64`; smoke test on Windows runner

## 12. Repository layout

```
restic-manager/
├── cmd/
│   ├── server/
│   └── agent/
├── internal/
│   ├── api/              # shared API types
│   ├── server/
│   │   ├── http/
│   │   ├── ws/
│   │   └── ui/           # templates, handlers
│   ├── agent/
│   │   ├── service/      # systemd / windows service glue
│   │   ├── runner/       # restic invocation
│   │   └── scheduler/
│   ├── restic/           # restic CLI wrapper, --json parsing
│   ├── store/            # sqlite layer
│   ├── crypto/           # secret encryption
│   └── auth/
├── web/
│   ├── templates/
│   └── static/
├── deploy/
│   ├── docker-compose.yml
│   ├── Dockerfile.server
│   └── install/
│       ├── install.sh
│       └── install.ps1
├── docs/
├── LICENSE               # PolyForm Noncommercial 1.0.0
├── README.md
├── spec.md
└── tasks.md
```

## 13. Phased delivery

- **Phase 1 (MVP):** server skeleton, agent skeleton, enrollment, host list, snapshot list, on-demand backup, live job log
- **Phase 2:** schedules, retention, run-now for `forget`/`prune`/`check`/`unlock`, repo stats
- **Phase 3:** restore wizard, alerts (webhook/ntfy/email), audit log
- **Phase 4:** agent update flow (package-manager wrapper, see §4.2), OIDC, multi-user/RBAC polish, repo trends
- **Phase 5:** OSS readiness — docs site, contribution guide, screenshot tour

## 14. Confirmed extensions (in scope)

These were originally listed as open questions and have been confirmed for inclusion. Slotted into phases below.

### 14.1 Cross-host restore

Restore a snapshot taken on host A onto host B (e.g. recover a dead box onto a fresh one, clone a workload onto a sibling host, restore a developer's home dir onto a new laptop).

- **Credential model:** target host's agent receives a temporary, server-issued read credential for the source host's repo, scoped to a single restore job and revoked immediately after
- **Path remapping:** UI allows rewriting source paths to target paths (e.g. `/home/alice` → `/home/alice-new`)
- **Permissions:** restore runs as root (the agent's process; see §4.2). The agent retains `CAP_DAC_OVERRIDE`/`CAP_FOWNER`/`CAP_CHOWN` precisely so it can recreate ownership on the target. The "service user is non-root" warning that appeared in earlier drafts is moot.
- **Phase:** 3 (with the restore wizard)

### 14.2 Bandwidth limiting

Per-host upload/download caps for backup, restore, and prune jobs.

- Exposed on the schedule editor as optional `--limit-upload` / `--limit-download` (KB/s)
- Also overridable on run-now jobs via the UI
- Persisted in `Schedule.options` (JSON blob) so the schema stays stable
- **Phase:** 2 (with scheduling)

### 14.3 Pre/post backup hooks

Per-host shell commands run before and after a backup job. Use cases: `mysqldump`/`pg_dump` to a staging path, stop/start Docker containers, quiesce a service, post-backup notifications.

- **Schema:** `Schedule.pre_hook` and `Schedule.post_hook` (string, optional). For more complex cases, `Host.pre_hook_default` / `Host.post_hook_default` apply to all schedules on that host unless overridden
- **Applicability:** hooks are only meaningful for `kind = backup` schedules. The API rejects non-null `pre_hook` / `post_hook` on any other schedule kind (`forget`, `prune`, `check`) with a clear validation error rather than silently ignoring them. The same constraint applies to `Host.pre_hook_default` / `Host.post_hook_default`: they only fire for backup schedules on that host
- **Execution:** agent runs hooks via the host's default shell (`/bin/sh` Linux, `cmd.exe` or PowerShell Windows — host-configurable). Hooks inherit the agent's process — i.e. **root by default** (see §4.2). A per-hook `run_as` field lets the operator drop privileges for a specific hook (`run_as: postgres` for a `pg_dump` hook, etc.); the agent uses `setuid`/`setgid` before exec rather than shelling out to `sudo`. Hooks running as root is what makes `docker stop`, `mysqldump`, `systemctl reload` etc. work without per-host setup, which is what the user expects when typing them into the UI.
- **Failure semantics:** `pre_hook` non-zero exit aborts the backup and marks the job failed. `post_hook` runs on both success and failure (with `RM_JOB_STATUS` env var); its own exit code is recorded but does not change the backup job's final status
- **Stdout/stderr:** captured into `JobLog` like restic output, prefixed `pre_hook:` / `post_hook:`
- **Security:** hooks are stored encrypted; only admins can edit them; every edit audit-logged
- **Phase:** 2 (with scheduling)

### 14.4 Prometheus `/metrics` endpoint

Standard Prometheus exposition on `/metrics`, protected by either bearer token or IP allow-list.

- **Metrics (per host):**
  - `restic_manager_last_backup_timestamp_seconds{host=...}`
  - `restic_manager_last_backup_status{host=...}` (1=success, 0=failure)
  - `restic_manager_repo_size_bytes{host=...}`
  - `restic_manager_snapshot_count{host=...}`
  - `restic_manager_agent_online{host=...}` (1/0)
  - `restic_manager_job_duration_seconds_bucket{kind=...,host=...}` (histogram)
- **Server-level:** `restic_manager_jobs_total{kind=...,status=...}`, `restic_manager_alerts_active`, `restic_manager_build_info`
- **Phase:** 4 (alongside repo trend charts — both rely on the same time-series data)

## 15. Future considerations (not yet committed)

- Read-only share links for snapshot listings (auditor view) — out of scope for personal/lab use; revisit if multi-tenant or org use cases emerge