P5: OSS readiness — docs site, contributor onboarding, e2e harness
P5-01 — Documentation site under docs/book/ rendered with mdBook
(downloaded via Makefile, same static-binary pattern as Tailwind).
Structured chapters: getting started, concepts, operations,
security, reference. `make docs` / `make docs-watch`. Generated
output gitignored.
P5-02 — CONTRIBUTING.md rewritten from placeholder to a full
guide. CODE_OF_CONDUCT.md adapted from Contributor Covenant for a
single-maintainer project. .gitea/issue_template/{bug,feature}.md
and PULL_REQUEST_TEMPLATE.md.
P5-04 — Six README screenshots captured live from a fresh server
bootstrap (login, empty dashboard, add-host, alerts, settings,
audit log). README rewritten to centre the screenshot grid and
link out to the docs site.
P5-05 — SECURITY.md with disclosure policy (3-day ack, 30-day
default window), scope in/out, threat-model summary, operator
hardening checklist. Mirrored as a docs-site chapter.
P5-06 — End-to-end test harness. e2e/compose.e2e.yml brings up
server + sibling Linux agent (alpine + restic) + restic/rest-server.
Agent uses announce-and-approve so Playwright can drive the full
operator flow: bootstrap → login → accept pending → backup →
verify terminal status. Second spec scrapes /metrics to assert
the P6-04 endpoint surface. .gitea/workflows/e2e.yml runs on every
PR; local how-to in docs/e2e.md.
@@ -0,0 +1,73 @@
# Alerts and notifications

restic-manager raises alerts on conditions that need human
attention. The alert engine evaluates rules on a 60s tick and
on every job-finished / host-online event.

## Built-in alert kinds

| Kind                  | Trigger | Severity |
|-----------------------|---------|----------|
| `backup_failed`       | A backup job ends in `failed` or `cancelled` | warning |
| `forget_failed`       | A forget job ends in `failed` | warning |
| `prune_failed`        | A prune job ends in `failed` | critical |
| `check_failed`        | A check job ends in `failed` | critical |
| `agent_offline`       | A host has been offline more than 90s past its heartbeat cadence | warning |
| `stale_schedule`      | A schedule's "last run" is more than 1.5 × its interval ago | warning |
| `update_failed`       | An agent self-update returned a failure or didn't reconnect within 90s | warning |
| `fleet_update_halted` | The rolling fleet-update worker stopped on a failure | critical |

Each alert has a `dedup_key`, so re-firing the same condition
just bumps `last_seen_at`: the operator gets one row per
condition, not a thousand.

## Lifecycle

```
raised ──acknowledge──▶ acknowledged ──resolve──▶ resolved
  │                          │
  └────────auto-resolve──────┘
       (e.g. agent_offline auto-resolves on agent_online)
```

- **Acknowledge** says "I've seen this, stop notifying about it".
- **Resolve** says "the underlying condition is gone".
- Some alerts auto-resolve when the condition clears
  (`agent_offline` is the canonical example).
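
One way to read the lifecycle is as a small state machine. The Go sketch below encodes only the transitions named above; the type names are hypothetical, and it assumes auto-resolve is allowed from both `raised` and `acknowledged`. Whether a manual resolve is permitted straight from `raised` is not stated here, so the sketch rejects it.

```go
package main

import (
	"errors"
	"fmt"
)

type State string

const (
	Raised       State = "raised"
	Acknowledged State = "acknowledged"
	Resolved     State = "resolved"
)

type Action string

const (
	Acknowledge Action = "acknowledge"
	Resolve     Action = "resolve"
	AutoResolve Action = "auto-resolve"
)

var ErrInvalidTransition = errors.New("invalid alert transition")

// Next returns the state reached by applying action a in state s.
func Next(s State, a Action) (State, error) {
	switch {
	case s == Raised && a == Acknowledge:
		return Acknowledged, nil
	case s == Acknowledged && a == Resolve:
		return Resolved, nil
	case (s == Raised || s == Acknowledged) && a == AutoResolve:
		// Assumption: the condition clearing resolves an open alert
		// whether or not an operator has acknowledged it.
		return Resolved, nil
	}
	return s, ErrInvalidTransition
}

func main() {
	s := Raised
	for _, a := range []Action{Acknowledge, Resolve} {
		next, err := Next(s, a)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s --%s--> %s\n", s, a, next)
		s = next
	}
	_, err := Next(Resolved, Acknowledge)
	fmt.Println(err)
}
```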

## Notification channels

Configure under **Settings → Notifications**. Each channel can
subscribe to all alerts or filter by severity.

### Webhook

Posts a JSON envelope to a URL of your choice. Useful for
piping into Slack via an Incoming Webhook URL or into your own
alerting tooling.

### ntfy

Pushes a plain-text alert to an [ntfy.sh](https://ntfy.sh/)
topic. Configure the topic URL; optional bearer token if you
self-host with auth.
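
For orientation, an ntfy publish is an HTTP POST of the message body to the topic URL, with an optional `Authorization: Bearer` header. The Go sketch below builds such a request without sending it. The function name, the example URL, the token, and the message text are placeholders, and the exact text restic-manager sends is not specified here.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// newNtfyRequest builds a plain-text publish request for an ntfy topic.
// An empty token means the topic does not require auth.
func newNtfyRequest(topicURL, token, title, message string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodPost, topicURL, strings.NewReader(message))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Title", title) // ntfy renders this as the notification title
	if token != "" {
		req.Header.Set("Authorization", "Bearer "+token)
	}
	return req, nil
}

func main() {
	req, err := newNtfyRequest(
		"https://ntfy.example.com/backups", // placeholder topic URL
		"tk_example",                       // placeholder token
		"backup_failed on host-1",
		"Backup job ended in failed",
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String())
	fmt.Println(req.Header.Get("Authorization"))
	// Sending is left out of the sketch: http.DefaultClient.Do(req)
}
```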

### SMTP

Plain SMTP (with optional TLS). Configure host, port,
username, password, and the recipient list.

## Test fire

Each channel exposes a **Test fire** button that dispatches a
single synthetic alert through the channel without touching the
alert engine. Use this when you've added a channel and want to
verify connectivity before the next real failure happens.

## What gets logged

Every alert raise / acknowledge / resolve writes an audit log
entry. The audit log UI at **Settings → Audit log** filters by
user, action, target, and time range, which is useful for the
post-incident "who clicked acknowledge on the prune-failure
alert" question.
@@ -0,0 +1,73 @@
# Backups and restores

## Running a backup

Three ways to trigger one:

1. **Scheduled**: the agent's local cron fires at the time set
   on the schedule.
2. **Run-now**: the operator clicks **Run now** in the right rail
   of the host detail page. This posts to `/hosts/{id}/run-backup`
   (defaults to all source groups) or to a per-group form for
   finer control.
3. **API**: `POST /api/hosts/{id}/jobs` with the appropriate
   payload. Same audit + dispatch path.

In every case the server creates a `jobs` row, broadcasts a
`command.run` to the host, and lands the operator on the live
job log page (HTMX `HX-Redirect`).

## Cancelling a job

Any running job (backup, forget, prune, restore, anything)
exposes a **Cancel** button on its detail page. The server
broadcasts `command.cancel`, and the agent kills the running
restic subprocess via context cancel: SIGTERM first, SIGKILL
after a 5s grace (`cmd.Cancel` + `cmd.WaitDelay`). On Windows the
SIGTERM step is replaced with `os.Kill` because Windows can't
deliver SIGTERM. Result: a cancelled job lands as `cancelled`
within a couple of hundred milliseconds.

## Restore wizard

Restoring a file or path goes through a four-step wizard at
`/hosts/{id}/restore`:

1. **Pick a snapshot.** Search by id or by date; the page is
   pre-populated when you launched the wizard from a snapshot row.
2. **Browse the snapshot tree.** Lazy-loaded children via the
   `MsgTreeList` synchronous WS RPC; results are cached
   per-wizard-session for 30 minutes. Pick the absolute paths
   you want.
3. **Choose a target.** Either **In place** (overwrites the
   live filesystem; requires you to type the hostname to
   confirm) or **New directory** (default
   `$HOME/rm-restore/<job-id>/`; agent expands `$HOME` /
   `${HOME}` / `~/` and creates the directory chain).
4. **Review and submit.** Server mints a job, dispatches
   `command.run` with a `RestorePayload`, and `HX-Redirect`s to
   the live job log.
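
The home-directory expansion in step 3 could look roughly like the Go sketch below. The function name `expandHome` is hypothetical and the agent's actual rules may differ; the sketch handles only the three prefixes listed above and passes any other path through unchanged.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// expandHome replaces a leading ~/, $HOME or ${HOME} with home.
// Paths without one of those prefixes are returned unchanged.
func expandHome(path, home string) string {
	switch {
	case strings.HasPrefix(path, "~/"):
		return filepath.Join(home, path[2:])
	case path == "$HOME" || strings.HasPrefix(path, "$HOME/"):
		return filepath.Join(home, strings.TrimPrefix(path, "$HOME"))
	case path == "${HOME}" || strings.HasPrefix(path, "${HOME}/"):
		return filepath.Join(home, strings.TrimPrefix(path, "${HOME}"))
	}
	return path
}

func main() {
	home := "/home/alice" // example value; an agent would ask the OS
	for _, p := range []string{
		"$HOME/rm-restore/42/",
		"${HOME}/rm-restore/42/",
		"~/rm-restore/42/",
		"/srv/restore",
	} {
		fmt.Println(expandHome(p, home))
	}
}
```

Creating the directory chain afterwards is a single `os.MkdirAll` call on the expanded path.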

`--no-ownership` is gated on restic ≥ 0.17 (the flag was added
in that release). Hosts running 0.16 don't get the flag and
restore as the running user instead.
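
A version gate of this kind might be written as below. This is a sketch under assumptions: the helper names are invented, the version string is assumed to look like `0.17.3`, and the argument list shows only the `restore <snapshot> --target <dir>` form plus the gated flag, not the agent's full command line.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// atLeast reports whether version (for example "0.17.3") is at least
// major.minor. Unparseable versions are treated as too old.
func atLeast(version string, major, minor int) bool {
	parts := strings.SplitN(strings.TrimPrefix(version, "v"), ".", 3)
	if len(parts) < 2 {
		return false
	}
	maj, err1 := strconv.Atoi(parts[0])
	mnr, err2 := strconv.Atoi(parts[1])
	if err1 != nil || err2 != nil {
		return false
	}
	return maj > major || (maj == major && mnr >= minor)
}

// restoreArgs builds restic restore arguments, adding --no-ownership
// only when the host's restic is new enough to understand it.
func restoreArgs(resticVersion, snapshot, target string) []string {
	args := []string{"restore", snapshot, "--target", target}
	if atLeast(resticVersion, 0, 17) {
		args = append(args, "--no-ownership")
	}
	return args
}

func main() {
	fmt.Println(restoreArgs("0.17.3", "abc123", "/tmp/restore"))
	fmt.Println(restoreArgs("0.16.4", "abc123", "/tmp/restore"))
}
```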

## Snapshot diff

Enter two snapshot ids in the **Diff** form on the host detail
page to start a `JobDiff` job that runs `restic diff <a> <b>`.
Output streams to the standard live job log. Useful when
investigating a suspiciously-sized backup.

## Job log artefacts

Every job's log is persisted in `job_logs` (one row per line),
not just streamed in-memory. That gives you:

- A live view at `/jobs/{id}` while the job runs.
- Two download formats from the same page header dropdown:
  - **txt**: one line per row, `HH:MM:SS.mmm TAG payload`.
  - **ndjson**: one self-contained JSON object per line
    (`{seq, ts, stream, payload}`), perfect for `jq`.

Downloads work whether the job is running or finished, because
the source is the DB, not the live socket.
@@ -0,0 +1,61 @@
# Observability with Prometheus

restic-manager can expose a Prometheus scrape endpoint at
`GET /metrics`. The endpoint is **opt-in**: without an explicit
auth gate it isn't even mounted, so a forgotten config can't
accidentally publish fleet state.

The full reference lives at
[`docs/prometheus.md`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/docs/prometheus.md);
the short version follows.

## Enable the endpoint

Set at least one of:

- `RM_METRICS_TOKEN`: `Authorization: Bearer <token>` required.
- `RM_METRICS_TRUSTED_CIDR`: restricts source IPs
  (comma-separated CIDR list).

When both are set, a request must pass both checks. The token
comparison is constant-time, and the CIDR check honours
`X-Forwarded-For` only when the immediate hop matches
`RM_TRUSTED_PROXY`.

## Metrics emitted

- **Server gauges**: `rm_hosts_total`, `rm_hosts_online`,
  `rm_active_alerts{severity}`, `rm_build_info{...}`.
- **Per-host gauges**: `rm_host_agent_online`,
  `rm_host_last_backup_timestamp_seconds`,
  `rm_host_last_backup_success`, `rm_host_repo_size_bytes`,
  `rm_host_snapshot_count`, `rm_host_open_alerts`,
  `rm_host_repo_status`.
- **Histogram**:
  `rm_job_duration_seconds{kind,status,le=…}` (buckets
  `1, 5, 30, 60, 300, 1800, 3600, 21600, 86400, +Inf`).

The histogram is held in memory only. Prometheus persists the
scrapes, so if you need durable history at hourly resolution,
that's Prometheus's job.

## Sample Grafana dashboard

[`deploy/grafana/restic-manager-dashboard.json`](https://gitea.dcglab.co.uk/steve/restic-manager/src/branch/main/deploy/grafana/restic-manager-dashboard.json)
imports through Grafana's **+ → Import → Upload JSON file**.
Six panels:

1. Fleet status (online / total).
2. Open alerts by severity.
3. Backups failing on most-recent run.
4. Hosts table: last backup, repo size, snapshots, open alerts.
5. Repo size over time, one line per host.
6. Job-duration p95 over a 1h window per kind.

## Alerting

restic-manager already has a built-in alert engine
([Alerts](./alerts.md)). The dashboard intentionally doesn't
duplicate it as Prometheus alert rules. If you want
Prometheus-side alerts on top, write your own based on the
metrics above, for example `rm_host_last_backup_success == 0`,
`time() - rm_host_last_backup_timestamp_seconds > <max age>`,
or whatever suits your environment.
@@ -0,0 +1,50 @@
# Updating agents

Server updates are a
`docker compose pull && docker compose up -d` away.
Agents update via the control plane.

## Single-host update

Each host's detail page shows an **Update agent** button when
the agent's reported version is older than the server's.
Clicking it starts this sequence:

1. The server dispatches a `command.update` to that host.
2. The agent fetches the appropriate binary from
   `$RM_SERVER/agent/binary?os=…&arch=…` to
   `<binary-path>.new`.
3. The agent copies the running binary to `<binary-path>.old`
   (one revision back, in case rollback is needed).
4. The agent atomically renames `.new` over the running binary.
5. The agent exits cleanly. systemd's `Restart=always` (or
   Windows SCM) brings the process back on the new binary.

A 90-second timer on the server side waits for a hello at the
target version and marks the update succeeded. If the agent
doesn't reconnect at the expected version in time, the server
marks the update **failed** and raises an `update_failed` alert.

## Fleet update

The admin-only **Settings → Fleet update** page drives a rolling
update across every host in the fleet:

- One host at a time.
- Wait for hello-with-target-version (max 90s).
- On any host failing, **halt** the rollout, raise a
  `fleet_update_halted` alert, leave the rest of the fleet on
  the old version. No surprise mass-failures.

You can cancel an in-progress fleet update; the worker stops
after the current host finishes.

## TLS and corruption

Updates rely on the reverse proxy's TLS to detect corruption in
transit. There's no separate sha256 verification step; we
chose the simpler model on the basis that the same TLS already
gates every other byte the server hands to the agent.

If you'd like a separate signature step before applying updates,
that's a future-phase enhancement (see `tasks.md` Phase 6
candidates).