P5: OSS readiness — docs site, contributor onboarding, e2e harness

P5-01 — Documentation site under docs/book/ rendered with mdBook (downloaded via Makefile, same static-binary pattern as Tailwind). Structured chapters: getting started, concepts, operations, security, reference. `make docs` / `make docs-watch`. Generated output gitignored. P5-02 — CONTRIBUTING.md rewritten from placeholder to a full guide. CODE_OF_CONDUCT.md adapted from Contributor Covenant for a single-maintainer project. .gitea/issue_template/{bug,feature}.md and PULL_REQUEST_TEMPLATE.md. P5-04 — Six README screenshots captured live from a fresh server bootstrap (login, empty dashboard, add-host, alerts, settings, audit log). README rewritten to centre the screenshot grid and link out to the docs site. P5-05 — SECURITY.md with disclosure policy (3-day ack, 30-day default window), scope in/out, threat-model summary, operator hardening checklist. Mirrored as a docs-site chapter. P5-06 — End-to-end test harness. e2e/compose.e2e.yml brings up server + sibling Linux agent (alpine + restic) + restic/rest-server. Agent uses announce-and-approve so Playwright can drive the full operator flow: bootstrap → login → accept pending → backup → verify terminal status. Second spec scrapes /metrics to assert the P6-04 endpoint surface. .gitea/workflows/e2e.yml runs on every PR; local how-to in docs/e2e.md.
2026-05-07 23:56:02 +01:00
parent ff8a5dbead
commit bb4ed3502d
47 changed files with 2818 additions and 61 deletions
@@ -0,0 +1,73 @@
+# Alerts and notifications
+
+restic-manager raises alerts on conditions that need human
+attention. The alert engine evaluates rules on a 60s tick and
+on every job-finished / host-online event.
+
+## Built-in alert kinds
+
+| Kind                | Trigger | Severity |
+|---------------------|---------|----------|
+| `backup_failed`     | A backup job ends in `failed` or `cancelled` | warning |
+| `forget_failed`     | A forget job ends in `failed` | warning |
+| `prune_failed`      | A prune job ends in `failed` | critical |
+| `check_failed`      | A check job ends in `failed` | critical |
+| `agent_offline`     | A host has been offline more than 90s past its heartbeat cadence | warning |
+| `stale_schedule`    | A schedule's "last run" is more than 1.5 × its interval ago | warning |
+| `update_failed`     | An agent self-update returned a fail or didn't reconnect within 90s | warning |
+| `fleet_update_halted`| The rolling fleet-update worker stopped on a failure | critical |
+
+Each alert has a `dedup_key` so re-firing the same condition
+just bumps `last_seen_at` — the operator gets one row per
+condition, not a thousand.
+
+## Lifecycle
+
+```
+raised  ──acknowledge──▶  acknowledged  ──resolve──▶  resolved
+   │                          │
+   └────────auto-resolve──────┘
+   (e.g. agent_offline auto-resolves on agent_online)
+```
+
+- **Acknowledge** says "I've seen this, stop notifying about it".
+- **Resolve** says "the underlying condition is gone".
+- Some alerts auto-resolve when the condition clears
+  (`agent_offline` is the canonical example).
+
+## Notification channels
+
+Configure under **Settings → Notifications**. Each channel can
+subscribe to all alerts or filter by severity.
+
+### Webhook
+
+Posts a JSON envelope to a URL of your choice. Useful for
+piping into Slack via an Incoming Webhook URL or into your own
+alerting tooling.
+
+### ntfy
+
+Pushes a plain-text alert to an [ntfy.sh](https://ntfy.sh/)
+topic. Configure the topic URL; optional bearer token if you
+self-host with auth.
+
+### SMTP
+
+Plain SMTP (with optional TLS). Configure host, port,
+username, password, and the recipient list.
+
+## Test fire
+
+Each channel exposes a **Test fire** button that dispatches a
+single synthetic alert through the channel without touching the
+alert engine. Use this when you've added a channel and want to
+verify connectivity before the next real failure happens.
+
+## What gets logged
+
+Every alert raise / acknowledge / resolve writes an audit log
+entry. The audit log UI at **Settings → Audit log** filters by
+user, action, target, and time range — useful for the
+post-incident "who clicked acknowledge on the prune-failure
+alert" question.