steve c8ead66f08 P1 polish: agent-as-root, init-repo flow, rest creds passthrough, UX fixes
Cohesive batch from a smoke-test session against a real rest-server.
Themed bullets:

* Agent runs as root, sandboxed via systemd. CapabilityBoundingSet
  drops to CAP_DAC_READ_SEARCH + restore caps; ProtectSystem=strict
  with ReadWritePaths confined to /etc + /var/lib/restic-manager;
  NoNewPrivileges blocks escalation. Install script no longer
  creates a service user. spec.md §4.2 / §14.1 / §14.3 explain the
  rationale (matches UrBackup / Veeam / Bareos defaults; trying to
  back up "everything" as an unprivileged user creates silent skips
  on /home, /root, /var/lib/* with no upside vs the threat model
  the agent already implies).

* Init-repo end-to-end. New JobKind="init" wired through agent
  runner, restic.Env.RunInit, server dispatcher, and a UI button
  (red "Initialise repo" in the run-now panel). hosts.repo_initialised_at
  flips on init success, on backup success, or on a non-empty
  snapshots.report. The "Run now" / "Init" / "Retry" branching now
  drives both the dashboard host row and the host-detail panel.
  Migrations 0004 (column), 0005 (jobs.kind CHECK widened — using
  the safe create-new-then-rename pattern; first version corrupted
  job_logs.job_id FK), 0006 (cleans up job_logs FK on already-
  affected DBs).

* rest-server creds embedded at exec time only. restic.Env gains
  RepoUsername; mergeRestCreds() builds the user:pass@-prefixed URL
  inside envSlice() and never assigns it back to the struct, so
  nothing slog-able ever sees the cleartext form. RedactURL helper
  for any future surface that needs to log a URL safely. Both
  helpers tested.

* Add-host UX. Repo password is now optional — server mints a
  24-byte URL-safe random one and surfaces it once, alongside an
  htpasswd snippet ("echo PASS | htpasswd -B -i ... USERNAME") so
  the operator pastes one command on the rest-server host and one
  on the endpoint. Result page also links the install snippet at
  /install/install.sh (was /install.sh — 404'd before) and pipes
  to bash (not sh — script uses set -o pipefail and other
  bashisms; on Debian/Ubuntu sh is dash).

* Late-subscriber race in JobHub. A fast-failing job could finish
  (DB write + Broadcast) before the browser's HX-Redirect → page
  load → WS-connect path completed, so the JS sat forever waiting
  on a job.finished that already passed. JobHub split into
  Register + Send + Run; handleJobStream now subscribes first,
  re-fetches the job, and sends a synthetic job.finished if the
  state is already terminal.

* HTMX error visibility. New toast partial listens to
  htmx:responseError and surfaces the response body as a
  bottom-right toast — every server-side validation error now
  becomes visible without per-handler JS wiring. Also handles
  custom rm:toast events for future server-pushed notifications
  via the HX-Trigger header. Themed via existing CSS vars.

* Dashboard rows are now whole-row clickable to host detail
  (CSS card-link pattern: absolute-positioned anchor + .row-action
  z-index restoration so the action button stays clickable).
  "View →" on a running job links to /jobs/<id> rather than
  /hosts/<id> since the row click already covers the host page.

* "Run first" / "Run first backup" → "Run now" everywhere for
  consistency.

* runbook (docs/e2e-smoke.md) updated — live-log streaming step
  now reflects P1-26; mentions the browser-driven Run-now flow.

* _diag/dump-creds — moved out of cmd/ so go build doesn't pick
  it up; .gitignore now excludes /_diag/ entirely.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 11:02:12 +01:00


# restic-manager — Specification
## 1. Overview
**restic-manager** is a self-hosted, browser-based, single-pane-of-glass for managing [restic](https://restic.net) backups across a fleet of Linux and Windows endpoints. It provides visibility, scheduling, ad-hoc operations, restore workflows, and alerting from one UI.
It is built for small-to-medium fleets (initial target: ~12 endpoints) and is intentionally simple to deploy: one Docker Compose file on the control-plane host, one small agent binary on each endpoint.
**License:** PolyForm Noncommercial 1.0.0
## 2. Goals & Non-Goals
### Goals
- Central visibility into backup state for every endpoint
- Trigger any restic operation remotely (`backup`, `forget`, `prune`, `check`, `unlock`, `snapshots`, `stats`, `diff`, `restore`)
- Manage per-host backup schedules from the UI
- Live job progress streamed back to the UI
- Restore wizard (browse snapshots, pick paths, restore to original or alternate host)
- Repo health surfacing (size, dedup ratio, last check, lock state)
- Alerting on failure or staleness
- Cross-platform agent (Linux + Windows)
- Ransomware-resistant repo access via append-only credentials
### Non-Goals (initial release)
- Replacing restic itself or providing custom repo formats
- Managing non-restic backup tools
- Multi-tenancy / SaaS deployment
- High availability of the control plane (SQLite, single-instance)
- Mobile-native apps (responsive web only)
## 3. Architecture
### 3.1 Components
```
┌──────────────────────────────────────────────────────────────────┐
│  Proxmox cluster                                                 │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │  docker compose: restic-manager                            │  │
│  │    - server (Go binary, REST + WS API, embedded HTMX UI)   │  │
│  │    - SQLite volume                                         │  │
│  └────────────────────────────────────────────────────────────┘  │
└────────────────────────▲─────────────────────────────────────────┘
                         │ HTTPS (control plane)
                         │   - agent → server: status, telemetry
                         │   - server → agent: commands, schedules
┌────────────────────────┴─────────────────────────────────────────┐
│  Endpoints (Linux + Windows)                                     │
│  ┌──────────────────────┐    ┌────────────────────────────────┐  │
│  │  restic-manager-     │    │  restic CLI                    │  │
│  │  agent (Go binary)   │───▶│  invoked by agent              │  │
│  │  - systemd / svc     │    └─────────────┬──────────────────┘  │
│  │  - WS to server      │                  │ HTTPS               │
│  └──────────────────────┘                  │ (data plane)        │
└────────────────────────────────────────────┼─────────────────────┘
┌──────────────────────────────────────────────────────────────────┐
│  Unraid                                                          │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │  Docker: restic/rest-server                                │  │
│  │    - per-host append-only credentials                      │  │
│  │    - one repo per host                                     │  │
│  │    - storage: Unraid share                                 │  │
│  └────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────┘
```
### 3.2 Data flow
- **Backup data:** endpoint → restic CLI → restic REST server on Unraid → Unraid share. The control plane *never* touches backup bytes.
- **Control plane:** agent maintains an outbound WebSocket to the server. Server pushes commands and schedule changes; agent pushes status, logs, live job progress, host metadata.
- **UI:** browser → server (HTTPS, session cookies). Server fans out commands to agents, streams progress back to browser.
### 3.3 Why agent (not SSH)
- Push model works through NAT/firewalls without inbound rules
- Native Windows support without OpenSSH service quirks
- Local scheduling survives controller restarts
- Self-contained `restic --json` parsing, no remote shell quoting hazards
### 3.4 Why per-host repos
- Isolates corruption / lock contention
- Append-only credentials per host = compromised endpoint can't delete other hosts' backups
- Simpler `prune` orchestration (no global lock coordination)
- Trivially easy to retire a host (delete its repo + credential)
## 4. Components in detail
### 4.1 Server
- **Language:** Go 1.22+
- **Storage:** SQLite (via `modernc.org/sqlite`, no CGo)
- **HTTP:** `net/http` + `chi` router
- **WebSocket:** `github.com/coder/websocket` (the maintained successor to
  `nhooyr.io/websocket`; same API)
- **UI:** HTMX + Tailwind, server-rendered Go templates, no Node build step
- **Distribution:** single static binary, packaged in a Docker image; published `docker-compose.yml`
- **Config:** YAML or env vars:
  - `RM_LISTEN`: bind address, e.g. `:8080` (source of truth for the port;
    the `8080` in the reference compose is just a default mapping). Bind to
    `127.0.0.1:8080` when running behind a same-host proxy.
  - `RM_DATA_DIR`, `RM_BASE_URL`, `RM_SECRET_KEY_FILE`
  - `RM_TRUSTED_PROXY`: comma-separated CIDR list of reverse proxies
    whose `X-Forwarded-For` / `X-Forwarded-Proto` we honour. Empty (the
    default) = trust no one. Set this when fronted by Caddy/Traefik.
  - `RM_COOKIE_SECURE`: `true` (default) marks session cookies `Secure`.
    Only set to `false` for local HTTP-only testing.
- **TLS:** the server speaks plain HTTP and is **always** expected to sit
behind a TLS-terminating reverse proxy (Caddy / Traefik / nginx). This
keeps cert renewal, ACME, and SNI in the proxy where operators already
manage it. Agents must reach the server over HTTPS; the cert pin
(`cert_pin_sha256`) pins whatever cert the proxy serves.
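
The `RM_TRUSTED_PROXY` rule above (empty list means no forwarded header is honoured) could be sketched as follows. This is a minimal illustration, not the server's actual code: `parseTrustedProxies` and `isTrusted` are hypothetical names, and the sample addresses are made up.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// parseTrustedProxies turns a comma-separated CIDR list (the RM_TRUSTED_PROXY
// format) into prefixes. An empty string yields an empty list: trust no one.
// Hypothetical helper, not taken from the restic-manager codebase.
func parseTrustedProxies(raw string) ([]netip.Prefix, error) {
	var out []netip.Prefix
	for _, part := range strings.Split(raw, ",") {
		part = strings.TrimSpace(part)
		if part == "" {
			continue
		}
		p, err := netip.ParsePrefix(part)
		if err != nil {
			return nil, fmt.Errorf("bad CIDR %q: %w", part, err)
		}
		out = append(out, p.Masked())
	}
	return out, nil
}

// isTrusted reports whether the TCP peer is a trusted proxy, i.e. whether
// its X-Forwarded-For / X-Forwarded-Proto headers may be honoured.
func isTrusted(trusted []netip.Prefix, peer string) bool {
	addr, err := netip.ParseAddr(peer)
	if err != nil {
		return false
	}
	addr = addr.Unmap() // treat ::ffff:a.b.c.d as plain IPv4
	for _, p := range trusted {
		if p.Contains(addr) {
			return true
		}
	}
	return false
}

func main() {
	trusted, err := parseTrustedProxies("172.16.0.0/12, 10.0.0.5/32")
	if err != nil {
		panic(err)
	}
	fmt.Println(isTrusted(trusted, "172.18.0.3"))  // inside 172.16.0.0/12
	fmt.Println(isTrusted(trusted, "203.0.113.9")) // not a listed proxy
}
```

Checking the TCP peer address rather than anything in the request headers is the point of the design: a client that is not a listed proxy cannot spoof its own `X-Forwarded-For`.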
### 4.2 Agent
- **Language:** Go (cross-compiled for `linux/amd64`, `linux/arm64`, `windows/amd64`).
Phase 1 ships Linux only; Windows binaries continue to build in CI to keep
the codebase portable, but Windows service integration + signed installer
+ install.ps1 land in Phase 2.
- **Service integration:** systemd unit (Linux). Windows service via
`golang.org/x/sys/windows/svc` — Phase 2.
- **Footprint goal:** ≤ 15 MB binary, ≤ 50 MB RSS idle
- **Privilege model:** the agent runs as root, sandboxed via systemd. A
fleet-backup tool needs to read every file on the system regardless
of DAC permissions; running as a dedicated unprivileged user means
either silent skips on `/home`, `/root`, `/var/lib/<other-daemons>`,
or operators having to add the service user to every group whose
files they want backed up. Both are worse failure modes than the
threat model already implies — the agent holds long-lived repo
credentials, executes arbitrary `restic` commands, and runs
operator-defined hooks; its blast radius is already large. This
matches how every comparable tool ships (UrBackup client, Veeam
Agent, Bareos FD, BackupPC client, borgmatic via systemd). The
mitigation is aggressive systemd sandboxing of the root process:
drop the capability set to `CAP_DAC_READ_SEARCH` (read any file)
+ `CAP_DAC_OVERRIDE`/`CAP_FOWNER`/`CAP_CHOWN` (restore ownership);
`NoNewPrivileges=true` blocks escalation; `ProtectSystem=strict`
+ a tight `ReadWritePaths=` confines writes to `/etc/restic-manager`
and `/var/lib/restic-manager`; `ProtectHome=read-only` keeps `/home`
readable but immutable; standard `Protect*` / `Restrict*` toggles
cover the rest. Hooks (P2) run as root by default with a per-hook
override knob.
- **Persistence:** `agent.yaml` (server URL, host ID, bearer, secrets
key) + an AEAD-encrypted secrets blob (`secrets.enc`) holding the
restic repo URL + password. Both files are mode 0600 owned by root.
Phase 1 ships the encrypted-file form on Linux; Phase 2 swaps that
for OS-keyring storage (DPAPI on Windows, Secret Service / `pass`
on Linux where a session bus is available — see §7.3). A small
state DB (BoltDB or JSON) for queued reports lands when offline-
resilience work does.
- **Restic invocation:** spawns `restic` with `--json`, parses streamed output, forwards to server in real time
- **Updates:** distributed via OS package manager — apt repo (Linux) and
Chocolatey package (Windows), both pointing at Gitea releases. No
bespoke signed-binary self-update; the `restic-manager-agent update`
command is a thin wrapper over `apt-get install --only-upgrade` /
`choco upgrade`. UI surfaces "agent N versions behind server" so an
operator knows when to upgrade.
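
As a rough illustration of the "Restic invocation" bullet, the sketch below filters line-delimited `restic backup --json` output down to its `status` messages, which carry the progress counters. The struct and function names are hypothetical, the sample input is hand-written rather than captured from a real run, and a real agent would read from the child process's stdout pipe instead of a string.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// resticStatus holds the subset of a restic "status" line used for progress.
// Field names follow restic's JSON output; the type itself is an assumption.
type resticStatus struct {
	MessageType      string  `json:"message_type"`
	PercentDone      float64 `json:"percent_done"`
	TotalFiles       uint64  `json:"total_files"`
	FilesDone        uint64  `json:"files_done"`
	TotalBytes       uint64  `json:"total_bytes"`
	BytesDone        uint64  `json:"bytes_done"`
	SecondsRemaining uint64  `json:"seconds_remaining"`
}

// parseProgress scans line-delimited JSON and keeps only status messages.
// Lines that are not valid JSON are skipped rather than treated as fatal.
func parseProgress(output string) []resticStatus {
	var out []resticStatus
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		var st resticStatus
		if err := json.Unmarshal(sc.Bytes(), &st); err != nil {
			continue
		}
		if st.MessageType == "status" {
			out = append(out, st)
		}
	}
	return out
}

func main() {
	// Hand-written sample, trimmed to the fields used here.
	sample := `{"message_type":"status","percent_done":0.25,"total_files":4,"files_done":1,"total_bytes":400,"bytes_done":100}
{"message_type":"status","percent_done":0.5,"total_files":4,"files_done":2,"total_bytes":400,"bytes_done":200,"seconds_remaining":3}
not json
{"message_type":"summary","snapshot_id":"abc123"}`
	for _, st := range parseProgress(sample) {
		fmt.Printf("%.0f%% files=%d/%d bytes=%d/%d eta=%ds\n",
			st.PercentDone*100, st.FilesDone, st.TotalFiles,
			st.BytesDone, st.TotalBytes, st.SecondsRemaining)
	}
}
```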
### 4.3 Restic REST server (Unraid)
- Run `restic/rest-server` Docker container
- `--append-only` enabled
- `--private-repos` enabled (each user only sees their own subpath)
- htpasswd file with one user per host
- Storage path mapped to Unraid share
## 5. Domain model
```
Host
  id, name, os, arch, agent_version, restic_version, protocol_version,
  enrolled_at, last_seen_at, status (online/offline/degraded),
  repo_id (FK), tags,
  current_job_id (FK nullable),
  last_backup_at, last_backup_status (succeeded|failed|cancelled|null),
  repo_size_bytes, snapshot_count, open_alert_count,
  applied_schedule_version
  # Bottom block (last_backup_*, repo_size_bytes, snapshot_count,
  # open_alert_count, applied_schedule_version) are denormalised
  # projections, refreshed on job.finished, snapshots.report,
  # repo.stats, and alert state changes.
  # applied_schedule_version is the schedule_version the agent most
  # recently acknowledged via `schedule.ack`; it lets the UI surface
  # drift when an agent is offline.

Repo
  id, name, url, kind (rest|s3|local), credential_id (FK),
  password_secret_id (FK),
  size_bytes, snapshot_count, dedup_ratio,
  last_check_at, last_check_status, lock_state (locked|unlocked),
  append_only (bool), credential_rotated_at
  # Bottom block is a cached projection from `restic stats` +
  # Credential row, refreshed by repo.stats agent messages.

Credential
  id, kind, username, secret_ref (encrypted),
  rotated_at

Schedule
  id, host_id (FK), kind (backup|forget|prune|check),
  cron_expr, paths (json), excludes (json), tags (json),
  retention_policy (json), options (json), pre_hook, post_hook,
  enabled
  # retention_policy: {keep_last, keep_hourly, keep_daily, keep_weekly,
  #                    keep_monthly, keep_yearly, keep_tag: [...]}
  # options: {limit_upload_kbps, limit_download_kbps}
  # pre_hook/post_hook: see §14.3 (encrypted at rest)

Job
  id, host_id (FK), kind, status (queued|running|succeeded|failed|cancelled),
  scheduled_id (FK nullable),
  actor_kind (user|schedule|system), actor_id (nullable),
  started_at, finished_at,
  exit_code, stats (json), error

JobLog
  job_id (FK), seq, ts, stream (stdout|stderr|event), payload

Snapshot (cached projection from `restic snapshots --json`)
  id (restic id), host_id (FK), repo_id (FK),
  time, hostname, paths, tags, size_bytes, file_count

Alert
  id, host_id (FK nullable), kind, severity, message,
  created_at, acknowledged_at, resolved_at

User
  id, username, password_hash, role (admin|operator|viewer),
  created_at, last_login_at

Session
  id, user_id (FK), created_at, expires_at, ip, ua

AuditLog
  id, user_id (FK nullable), actor (user|agent|system),
  action, target_kind, target_id, ts, payload (json)
```
## 6. API surface (control plane)
### 6.1 UI/REST (browser → server)
```
POST   /api/auth/login
POST   /api/auth/logout
GET    /api/fleet/summary               (aggregate: host counts by status,
                                         total bytes, open alerts; reused by /metrics)
GET    /api/hosts                       ?tag=&status=&limit=&offset=
                                        (returns Host rows incl. denormalised
                                         last_backup_*, repo_size_bytes,
                                         snapshot_count, open_alert_count,
                                         current_job_id)
GET    /api/hosts/:id
DELETE /api/hosts/:id
POST   /api/hosts/:id/enrollment-token  (regenerate)
POST   /api/hosts/:id/agent/update      (trigger agent package upgrade; see §4.2)
GET    /api/hosts/:id/snapshots         ?tag=&path=&since=&until=&limit=&offset=
GET    /api/hosts/:id/repo              (full Repo projection)
POST   /api/hosts/:id/jobs              (run-now: backup/forget/prune/check/unlock)
POST   /api/hosts/:id/restore           (restore wizard submit)
GET    /api/hosts/:id/schedules
POST   /api/hosts/:id/schedules
PUT    /api/schedules/:id
DELETE /api/schedules/:id
GET    /api/jobs                        ?host_id=&kind=&status=&since=&until=
                                        &limit=&offset=&order=desc
GET    /api/jobs/:id
GET    /api/jobs/:id/logs               (paginated: ?after_seq=&limit=)
WS     /api/jobs/:id/stream             (live progress; see §6.2 for shape)
POST   /api/jobs/:id/cancel
GET    /api/repos
GET    /api/repos/:id
GET    /api/alerts
POST   /api/alerts/:id/ack
GET    /api/audit
GET    /api/users                       (admin)
POST   /api/users                       (admin)
```
**Realtime strategy:** only `/api/jobs/:id/stream` uses WS. All other screens
(dashboard, hosts, snapshots) refresh via HTMX polling (~10s cadence). Revisit
if dashboard staleness becomes a problem in practice.
### 6.2 Agent ↔ Server
Single authenticated WebSocket per agent. Bidirectional JSON-RPC-ish messages.
**Agent → server:**
- `hello` (host metadata, agent_version, restic_version, OS,
`protocol_version` — see "Protocol versioning" below)
- `heartbeat` (every 30s)
- `job.started` (job_id, kind, started_at)
- `job.progress` (job_id, percent_done, files_done, total_files,
bytes_done, total_bytes, eta_seconds, throughput_bps)
- `job.finished` (job_id, status, exit_code, stats, error, finished_at)
- `snapshots.report` (full list after each successful backup)
- `repo.stats` (size_bytes, snapshot_count, dedup_ratio, last_check_at,
last_check_status, lock_state)
- `log.stream` (live stdout/stderr lines while job running;
{job_id, seq, ts, stream: stdout|stderr|event, payload})
- `schedule.ack` (schedule_version) — agent confirms it has applied a
schedule push; lets the server surface "this host is N versions
behind" without polling
**Server → agent:**
- `command.run` (kind, args)
- `command.cancel` (job_id)
- `schedule.set` (schedule_version, schedules: [...]) — full schedule
list, agent reconciles local cron and replies with `schedule.ack`
- `config.update`
- `agent.update.available` (new version + package source URL —
informational only; agent does not self-update, see §4.2)
The server fans `job.progress` and `log.stream` for a given job to all
browsers subscribed to `WS /api/jobs/:id/stream` (§6.1) without
transformation, so the schema is shared end-to-end.
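
The spec names the messages and their fields but not the framing, so the sketch below assumes a simple `{"type": ..., "payload": ...}` envelope purely for illustration. `Envelope`, `JobProgress`, and `decodeProgress` are hypothetical names, and `job_id` is assumed to be a string.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Envelope is an assumed wire wrapper; the spec does not define the framing.
type Envelope struct {
	Type    string          `json:"type"`
	Payload json.RawMessage `json:"payload"`
}

// JobProgress carries the job.progress fields listed in the spec.
type JobProgress struct {
	JobID         string  `json:"job_id"` // string is an assumption
	PercentDone   float64 `json:"percent_done"`
	FilesDone     uint64  `json:"files_done"`
	TotalFiles    uint64  `json:"total_files"`
	BytesDone     uint64  `json:"bytes_done"`
	TotalBytes    uint64  `json:"total_bytes"`
	ETASeconds    uint64  `json:"eta_seconds"`
	ThroughputBps uint64  `json:"throughput_bps"`
}

// decodeProgress unwraps an envelope and decodes a job.progress payload.
// Any other message type is reported as an error.
func decodeProgress(raw []byte) (JobProgress, error) {
	var env Envelope
	if err := json.Unmarshal(raw, &env); err != nil {
		return JobProgress{}, err
	}
	if env.Type != "job.progress" {
		return JobProgress{}, fmt.Errorf("unexpected message type %q", env.Type)
	}
	var p JobProgress
	if err := json.Unmarshal(env.Payload, &p); err != nil {
		return JobProgress{}, err
	}
	return p, nil
}

func main() {
	msg := []byte(`{"type":"job.progress","payload":{"job_id":"j-1","percent_done":0.4,"files_done":40,"total_files":100}}`)
	p, err := decodeProgress(msg)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s: %.0f%% (%d/%d files)\n", p.JobID, p.PercentDone*100, p.FilesDone, p.TotalFiles)
}
```

Keeping the payload as `json.RawMessage` lets the server forward it to browser subscribers untouched, which matches the "without transformation" fan-out described above.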
**Protocol versioning.** Agents and the server each declare an integer
`protocol_version` in `hello`. The version bumps **only** on breaking
wire-format changes (not human-readable software releases). The server
maintains a `MinAgentProtocolVersion` constant; agents below it are
disconnected with `error: protocol_too_old` and a URL pointing at the
upgrade instructions. Symmetrically, an agent talking to a server that
advertises a `protocol_version` it does not recognise refuses to
proceed and surfaces a clear log message. This avoids the failure mode
of "weird JSON parse errors when v0.3 agent meets v0.5 server."
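
A minimal sketch of the two checks described in the protocol-versioning paragraph. The version numbers and every identifier other than `MinAgentProtocolVersion` and the `protocol_too_old` error string are placeholders chosen for the example.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	ServerProtocolVersion   = 3 // illustrative value
	MinAgentProtocolVersion = 2 // illustrative value
)

// ErrProtocolTooOld mirrors the `error: protocol_too_old` disconnect reason.
var ErrProtocolTooOld = errors.New("protocol_too_old")

// checkAgent is the server-side gate applied to the version in `hello`.
func checkAgent(agentVersion int) error {
	if agentVersion < MinAgentProtocolVersion {
		return fmt.Errorf("%w: agent v%d < minimum v%d",
			ErrProtocolTooOld, agentVersion, MinAgentProtocolVersion)
	}
	return nil
}

// agentAccepts is the agent-side gate: refuse any server version that is
// not in the list this agent build knows how to speak.
func agentAccepts(known []int, serverVersion int) bool {
	for _, v := range known {
		if v == serverVersion {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(checkAgent(1))                                    // too old
	fmt.Println(checkAgent(ServerProtocolVersion))                // accepted
	fmt.Println(agentAccepts([]int{2, 3}, ServerProtocolVersion)) // known
	fmt.Println(agentAccepts([]int{2, 3}, 5))                     // unknown
}
```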
**Schedule reconciliation when the server is unreachable.** Agents
keep firing the **last-known-good** schedule pushed by the server,
indefinitely. Rationale: a missed backup because the controller is
down is a worse outcome than firing a schedule the user has since
edited. On reconnect, the server's view is canonical: the next
`schedule.set` overrides whatever the agent was running, the agent
replies `schedule.ack` with the new `schedule_version`, and the server
updates `Host.applied_schedule_version`. The UI surfaces drift
("schedule v7 pushed, agent applied v5") when an agent has been
offline.
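
The reconciliation rule can be sketched as a small state holder: the agent keeps whatever set it last applied and replaces it wholesale on the next `schedule.set`. All type and function names here are hypothetical, and persistence to disk is left out.

```go
package main

import "fmt"

// Schedule and ScheduleSet are trimmed stand-ins for the real wire types.
type Schedule struct {
	ID       string
	Kind     string
	CronExpr string
}

type ScheduleSet struct {
	Version   int
	Schedules []Schedule
}

// Agent holds the last-known-good schedule set pushed by the server.
type Agent struct {
	lastGood ScheduleSet
}

// Apply handles a schedule.set push: the server's list replaces the local
// one entirely. The returned version is what goes into schedule.ack.
func (a *Agent) Apply(set ScheduleSet) int {
	a.lastGood = set
	return set.Version
}

// Active returns the schedules to keep firing, whether or not the server
// is currently reachable.
func (a *Agent) Active() []Schedule {
	return a.lastGood.Schedules
}

// drift reports how many versions the agent is behind the server's push.
func drift(pushed, applied int) int {
	if pushed > applied {
		return pushed - applied
	}
	return 0
}

func main() {
	agent := &Agent{}
	applied := agent.Apply(ScheduleSet{Version: 5, Schedules: []Schedule{
		{ID: "s1", Kind: "backup", CronExpr: "0 2 * * *"},
	}})
	// Server has moved on to v7 while the agent was offline.
	fmt.Printf("schedule v7 pushed, agent applied v%d, behind by %d\n", applied, drift(7, applied))
	fmt.Println("still firing:", len(agent.Active()), "schedule(s)")
}
```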
### 6.3 Enrollment
1. Operator clicks "Add host" → server generates one-time token (TTL 1h)
2. Operator runs install script on endpoint with token
3. Agent calls `POST /api/agents/enroll` with token + host metadata
4. Server issues persistent agent credential (bearer token + TLS pin) and host record
5. Agent stores credential, opens WS connection
## 7. Security
### 7.1 Authentication
- **Phase 1:** username + password (argon2id), HTTP-only secure session cookies, CSRF tokens on state-changing requests
- **Phase 2:** OIDC (Authelia, Keycloak, Authentik)
- **Agents:** bearer token over TLS; pin server cert fingerprint at enrollment time
### 7.2 Authorization (Phase 1: simple roles)
- **admin:** everything
- **operator:** trigger jobs, edit schedules, restore
- **viewer:** read-only
### 7.3 Secret handling
- Restic repo passwords and REST-server credentials encrypted at rest
in SQLite using a server-side key (loaded from env or file at
startup, AEAD via `internal/crypto`).
- Operator supplies repo URL + username + password when minting an
enrollment token. The token row holds them as a single encrypted
blob; on `ConsumeEnrollmentToken` the blob is moved to a
`host_credentials` row keyed by `host_id` (same tx).
- Pushed to agents over the authenticated WS as a `config.update`
message — sent immediately after the agent's `hello` on every
connect, and again whenever the operator edits the credential.
Agents that connect before credentials exist proceed normally
but refuse to start backup jobs until the push arrives.
- Agent persistence:
  - **Phase 1, Linux:** AEAD-encrypted file at
    `/var/lib/restic-manager/secrets.enc`, key stored in
    `agent.yaml` alongside the bearer (same 0600 trust boundary).
    Atomic writes (tmp+fsync+rename).
  - **Phase 2:** OS keyring where available: Windows DPAPI; on Linux,
    Secret Service (`gnome-keyring` / `kwallet`) or `pass` when a
    session bus is present. The encrypted-file path stays as the
    fallback for headless boxes.
- Plaintext repo passwords never appear in `agent.yaml`, server logs,
audit-log payloads, or job-log streams. The audit log records
*that* a credential was set/changed and by whom, never the value.
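
The spec only says "AEAD via `internal/crypto`" and does not name a cipher. The sketch below assumes AES-256-GCM from the Go standard library, with the random nonce prepended to the ciphertext; `seal` and `open` are hypothetical names, not the package's real API.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// newAEAD builds an AES-GCM instance; a 32-byte key selects AES-256.
// The choice of AES-GCM is an assumption, not stated in the spec.
func newAEAD(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// seal encrypts plaintext and returns nonce || ciphertext || tag.
func seal(key, plaintext []byte) ([]byte, error) {
	aead, err := newAEAD(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, aead.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return aead.Seal(nonce, nonce, plaintext, nil), nil
}

// open reverses seal and fails if the blob was truncated or tampered with.
func open(key, blob []byte) ([]byte, error) {
	aead, err := newAEAD(key)
	if err != nil {
		return nil, err
	}
	n := aead.NonceSize()
	if len(blob) < n {
		return nil, errors.New("blob shorter than nonce")
	}
	return aead.Open(nil, blob[:n], blob[n:], nil)
}

func main() {
	key := make([]byte, 32)
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	blob, err := seal(key, []byte("repo-password"))
	if err != nil {
		panic(err)
	}
	plain, err := open(key, blob)
	fmt.Println(string(plain), err)

	blob[len(blob)-1] ^= 0xff // flip a bit in the auth tag
	_, err = open(key, blob)
	fmt.Println("tamper detected:", err != nil)
}
```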
### 7.4 Repo protection
- Restic REST server runs with `--append-only` for routine backups
- A separate non-append-only credential exists for `forget`/`prune` operations, used only when explicitly invoked from the UI by an admin/operator and audited
### 7.5 Audit
- Every state-changing UI action and every server→agent command logged with user, target, timestamp, and payload
## 8. UI
Stack: HTMX + Tailwind + Go `html/template`. No SPA framework. Server-rendered, progressive enhancement.
**Pages:**
- **Login**
- **Dashboard:** fleet overview (host cards: status, last backup, repo size, alerts)
- **Host detail:** tabs for Snapshots / Schedules / Jobs / Repo / Settings
- **Job detail:** live log streaming via WS, cancel button
- **Restore wizard:** host → snapshot → paths → target → confirm
- **Repos:** aggregate view across hosts
- **Alerts:** list, acknowledge
- **Settings:** users (admin), notification channels, agent download
- **Audit log**
## 9. Alerting
- **Triggers:** backup failed, backup hasn't run in N hours past its schedule, repo `check` failed, agent offline > N minutes, repo size growth anomaly
- **Channels (Phase 1):** webhook, ntfy, email (SMTP)
- **Channels (Phase 2+):** Discord, Slack, Pushover
## 10. Deployment
### 10.1 Control plane (Proxmox host or LXC)
The server is HTTP-only by design — operators front it with their own
TLS-terminating reverse proxy (Caddy / Traefik / nginx). Bind the
container to localhost so the only public path is through the proxy.
`docker-compose.yml`:
```yaml
services:
  restic-manager:
    image: ghcr.io/<owner>/restic-manager:latest
    restart: unless-stopped
    ports:
      - "127.0.0.1:8080:8080"
    volumes:
      - ./data:/data
    environment:
      - RM_DATA_DIR=/data
      - RM_LISTEN=:8080
      - RM_BASE_URL=https://restic.lab.example
      - RM_SECRET_KEY_FILE=/data/secret.key
      - RM_TRUSTED_PROXY=172.16.0.0/12  # CIDR of your reverse proxy
```
Reference Caddy snippet (operator's own Caddyfile, outside this repo):
```
restic.lab.example {
    encode zstd gzip
    reverse_proxy 127.0.0.1:8080
}
```
Caddy provisions and renews the cert; the agent's `cert_pin_sha256`
pins **Caddy's** leaf cert (that's what the agent actually sees).
`RM_LISTEN` is the source of truth for the server's bind address. The
`8080:8080` mapping above is just the matching default; change both
sides together if you pick a different port.
> ⚠️ Never expose `RM_LISTEN` directly on a public interface — the
> server has no TLS, no rate limiting, and no DDoS protection. That
> all belongs in the proxy.
### 10.2 Restic REST server (Unraid)
Standard `restic/rest-server` container, `--append-only`, `--private-repos`, htpasswd mounted, data path on the share.
### 10.3 Agent install
- **Linux:** `curl -fsSL https://restic.lab.example/install/install.sh | sudo RM_TOKEN=xxx bash` (pipe to `bash`, not `sh`: the script relies on `set -o pipefail` and other bashisms)
- **Windows:** `iwr https://restic.lab.example/install.ps1 | iex` (with `$env:RM_TOKEN`)
- Installer drops binary + service unit, calls enroll endpoint, starts service
## 11. Testing strategy
- **Unit tests:** restic JSON parsing, schedule reconciliation, retention policy logic
- **Integration tests:** spin up real `restic` + `rest-server` in Docker, exercise full backup/snapshot/restore flows
- **End-to-end:** Playwright against a compose-up'd stack with one Linux agent in a sibling container
- **Cross-platform agent CI:** build matrix `linux/amd64`, `linux/arm64`, `windows/amd64`; smoke test on Windows runner
## 12. Repository layout
```
restic-manager/
├── cmd/
│   ├── server/
│   └── agent/
├── internal/
│   ├── api/            # shared API types
│   ├── server/
│   │   ├── http/
│   │   ├── ws/
│   │   └── ui/         # templates, handlers
│   ├── agent/
│   │   ├── service/    # systemd / windows service glue
│   │   ├── runner/     # restic invocation
│   │   └── scheduler/
│   ├── restic/         # restic CLI wrapper, --json parsing
│   ├── store/          # sqlite layer
│   ├── crypto/         # secret encryption
│   └── auth/
├── web/
│   ├── templates/
│   └── static/
├── deploy/
│   ├── docker-compose.yml
│   ├── Dockerfile.server
│   └── install/
│       ├── install.sh
│       └── install.ps1
├── docs/
├── LICENSE             # PolyForm Noncommercial 1.0.0
├── README.md
├── spec.md
└── tasks.md
```
## 13. Phased delivery
- **Phase 1 (MVP):** server skeleton, agent skeleton, enrollment, host list, snapshot list, on-demand backup, live job log
- **Phase 2:** schedules, retention, run-now for `forget`/`prune`/`check`/`unlock`, repo stats
- **Phase 3:** restore wizard, alerts (webhook/ntfy/email), audit log
- **Phase 4:** agent update flow (package-manager based, see §4.2), OIDC, multi-user/RBAC polish, repo trends
- **Phase 5:** OSS readiness — docs site, contribution guide, screenshot tour
## 14. Confirmed extensions (in scope)
These were originally listed as open questions and have been confirmed for inclusion. Slotted into phases below.
### 14.1 Cross-host restore
Restore a snapshot taken on host A onto host B (e.g. recover a dead box onto a fresh one, clone a workload onto a sibling host, restore a developer's home dir onto a new laptop).
- **Credential model:** target host's agent receives a temporary, server-issued read credential for the source host's repo, scoped to a single restore job and revoked immediately after
- **Path remapping:** UI allows rewriting source paths to target paths (e.g. `/home/alice` → `/home/alice-new`)
- **Permissions:** restore runs as root (the agent's process; see §4.2). The agent retains `CAP_DAC_OVERRIDE`/`CAP_FOWNER`/`CAP_CHOWN` precisely so it can recreate ownership on the target. The "service user is non-root" warning that appeared in earlier drafts is moot.
- **Phase:** 3 (with the restore wizard)
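
The path-remapping bullet above can be illustrated with a prefix rewrite that respects path boundaries, so that a rule for `/home/alice` does not also capture `/home/alicia`. `remapPath` is a hypothetical helper and handles forward-slash (Linux) paths only.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// remapPath rewrites p from the `from` tree into the `to` tree. It returns
// the (possibly unchanged) path and whether the rule applied. Matching is
// done on whole path components, not raw string prefixes.
func remapPath(p, from, to string) (string, bool) {
	p, from = path.Clean(p), path.Clean(from)
	if p == from {
		return path.Clean(to), true
	}
	prefix := from
	if !strings.HasSuffix(prefix, "/") {
		prefix += "/"
	}
	if strings.HasPrefix(p, prefix) {
		return path.Join(to, p[len(prefix):]), true
	}
	return p, false
}

func main() {
	for _, p := range []string{
		"/home/alice/docs/a.txt",
		"/home/alice",
		"/home/alicia/notes.txt",
	} {
		out, ok := remapPath(p, "/home/alice", "/home/alice-new")
		fmt.Printf("%-24s -> %-28s remapped=%v\n", p, out, ok)
	}
}
```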
### 14.2 Bandwidth limiting
Per-host upload/download caps for backup, restore, and prune jobs.
- Exposed on the schedule editor as optional `--limit-upload` / `--limit-download` (KB/s)
- Also overridable on run-now jobs via the UI
- Persisted in `Schedule.options` (JSON blob) so the schema stays stable
- **Phase:** 2 (with scheduling)
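
A small sketch of how `Schedule.options` might be turned into restic's global `--limit-upload` / `--limit-download` flags. The `Options` struct and `limitArgs` function are hypothetical; zero is treated as "no cap", which is an assumption about how an unset option would be stored.

```go
package main

import (
	"fmt"
	"strconv"
)

// Options mirrors the two keys documented for Schedule.options.
type Options struct {
	LimitUploadKbps   int `json:"limit_upload_kbps"`
	LimitDownloadKbps int `json:"limit_download_kbps"`
}

// limitArgs returns the restic flags for the configured caps.
// A value of zero (or less) means no limit and emits no flag.
func limitArgs(o Options) []string {
	var args []string
	if o.LimitUploadKbps > 0 {
		args = append(args, "--limit-upload", strconv.Itoa(o.LimitUploadKbps))
	}
	if o.LimitDownloadKbps > 0 {
		args = append(args, "--limit-download", strconv.Itoa(o.LimitDownloadKbps))
	}
	return args
}

func main() {
	args := append([]string{"backup", "--json"}, limitArgs(Options{LimitUploadKbps: 2048})...)
	fmt.Println(args)
}
```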
### 14.3 Pre/post backup hooks
Per-host shell commands run before and after a backup job. Use cases: `mysqldump`/`pg_dump` to a staging path, stop/start Docker containers, quiesce a service, post-backup notifications.
- **Schema:** `Schedule.pre_hook` and `Schedule.post_hook` (string, optional). For more complex cases, `Host.pre_hook_default` / `Host.post_hook_default` apply to all schedules on that host unless overridden
- **Applicability:** hooks are only meaningful for `kind = backup` schedules. The API rejects non-null `pre_hook` / `post_hook` on any other schedule kind (`forget`, `prune`, `check`) with a clear validation error rather than silently ignoring them. The same constraint applies to `Host.pre_hook_default` / `Host.post_hook_default`: they only fire for backup schedules on that host
- **Execution:** agent runs hooks via the host's default shell (`/bin/sh` Linux, `cmd.exe` or PowerShell Windows — host-configurable). Hooks inherit the agent's process — i.e. **root by default** (see §4.2). A per-hook `run_as` field lets the operator drop privileges for a specific hook (`run_as: postgres` for a `pg_dump` hook, etc.); the agent uses `setuid`/`setgid` before exec rather than shelling out to `sudo`. Hooks running as root is what makes `docker stop`, `mysqldump`, `systemctl reload` etc. work without per-host setup, which is what the user expects when typing them into the UI.
- **Failure semantics:** `pre_hook` non-zero exit aborts the backup and marks the job failed. `post_hook` runs on both success and failure (with `RM_JOB_STATUS` env var); its own exit code is recorded but does not change the backup job's final status
- **Stdout/stderr:** captured into `JobLog` like restic output, prefixed `pre_hook:` / `post_hook:`
- **Security:** hooks are stored encrypted; only admins can edit them; every edit audit-logged
- **Phase:** 2 (with scheduling)
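
The failure semantics above can be sketched with plain function values standing in for the hook and restic processes. Names are hypothetical. One point the spec leaves open is whether `post_hook` still runs when `pre_hook` aborts the job; this sketch assumes it does, so the hook can observe `RM_JOB_STATUS=failed`.

```go
package main

import "fmt"

// step stands in for running a process; it receives extra environment
// variables and returns the exit code.
type step func(env map[string]string) int

type result struct {
	Status    string // final job status: "succeeded" or "failed"
	BackupRan bool
	PostExit  int // recorded, but never changes Status
}

// runWithHooks applies the documented ordering: a failing pre_hook aborts
// the backup, and post_hook sees the final status via RM_JOB_STATUS.
func runWithHooks(pre, backup, post step) result {
	res := result{Status: "succeeded"}
	if pre != nil && pre(nil) != 0 {
		res.Status = "failed" // backup is skipped entirely
	} else {
		res.BackupRan = true
		if backup(nil) != 0 {
			res.Status = "failed"
		}
	}
	if post != nil {
		// Assumption: post_hook also runs after a pre_hook failure.
		res.PostExit = post(map[string]string{"RM_JOB_STATUS": res.Status})
	}
	return res
}

func main() {
	ok := func(map[string]string) int { return 0 }
	fail := func(map[string]string) int { return 1 }
	report := func(env map[string]string) int {
		fmt.Println("post_hook sees RM_JOB_STATUS =", env["RM_JOB_STATUS"])
		return 3
	}
	fmt.Printf("%+v\n", runWithHooks(ok, ok, report))
	fmt.Printf("%+v\n", runWithHooks(fail, ok, report))
}
```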
### 14.4 Prometheus `/metrics` endpoint
Standard Prometheus exposition on `/metrics`, protected by either bearer token or IP allow-list.
- **Metrics (per host):**
  - `restic_manager_last_backup_timestamp_seconds{host=...}`
  - `restic_manager_last_backup_status{host=...}` (1=success, 0=failure)
  - `restic_manager_repo_size_bytes{host=...}`
  - `restic_manager_snapshot_count{host=...}`
  - `restic_manager_agent_online{host=...}` (1/0)
  - `restic_manager_job_duration_seconds_bucket{kind=...,host=...}` (histogram)
- **Server-level:** `restic_manager_jobs_total{kind=...,status=...}`, `restic_manager_alerts_active`, `restic_manager_build_info`
- **Phase:** 4 (alongside repo trend charts — both rely on the same time-series data)
## 15. Future considerations (not yet committed)
- Read-only share links for snapshot listings (auditor view) — out of scope for personal/lab use; revisit if multi-tenant or org use cases emerge