P5: OSS readiness — docs site, contributor onboarding, e2e harness
P5-01 — Documentation site under docs/book/ rendered with mdBook
(downloaded via Makefile, same static-binary pattern as Tailwind).
Structured chapters: getting started, concepts, operations,
security, reference. `make docs` / `make docs-watch`. Generated
output gitignored.
P5-02 — CONTRIBUTING.md rewritten from placeholder to a full
guide. CODE_OF_CONDUCT.md adapted from Contributor Covenant for a
single-maintainer project. .gitea/issue_template/{bug,feature}.md
and PULL_REQUEST_TEMPLATE.md.
P5-04 — Six README screenshots captured live from a fresh server
bootstrap (login, empty dashboard, add-host, alerts, settings,
audit log). README rewritten to centre the screenshot grid and
link out to the docs site.
P5-05 — SECURITY.md with disclosure policy (3-day ack, 30-day
default window), scope in/out, threat-model summary, operator
hardening checklist. Mirrored as a docs-site chapter.
P5-06 — End-to-end test harness. e2e/compose.e2e.yml brings up
server + sibling Linux agent (alpine + restic) + restic/rest-server.
Agent uses announce-and-approve so Playwright can drive the full
operator flow: bootstrap → login → accept pending → backup →
verify terminal status. Second spec scrapes /metrics to assert
the P6-04 endpoint surface. .gitea/workflows/e2e.yml runs on every
PR; local how-to in docs/e2e.md.
@@ -0,0 +1,73 @@
# Alerts and notifications

restic-manager raises alerts on conditions that need human
attention. The alert engine evaluates rules on a 60s tick and
on every job-finished / host-online event.

## Built-in alert kinds

| Kind                  | Trigger | Severity |
|-----------------------|---------|----------|
| `backup_failed`       | A backup job ends in `failed` or `cancelled` | warning |
| `forget_failed`       | A forget job ends in `failed` | warning |
| `prune_failed`        | A prune job ends in `failed` | critical |
| `check_failed`        | A check job ends in `failed` | critical |
| `agent_offline`       | A host has been offline more than 90s past its heartbeat cadence | warning |
| `stale_schedule`      | A schedule's "last run" is more than 1.5 × its interval ago | warning |
| `update_failed`       | An agent self-update reported a failure or the agent didn't reconnect within 90s | warning |
| `fleet_update_halted` | The rolling fleet-update worker stopped on a failure | critical |

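The two time-based triggers in the table reduce to simple comparisons. The following is a minimal Python sketch of one reading of those rules, not the restic-manager implementation: the function names and the `heartbeat_cadence` parameter are hypothetical, and "offline more than 90s past its heartbeat cadence" is assumed to mean "silent for longer than cadence + 90s".

```python
from datetime import datetime, timedelta, timezone

OFFLINE_GRACE = timedelta(seconds=90)  # grace period from the agent_offline row
STALE_FACTOR = 1.5                     # multiplier from the stale_schedule row


def is_agent_offline(last_heartbeat, heartbeat_cadence, now):
    """True once the host has been silent for longer than cadence + 90s (assumed reading)."""
    return now - last_heartbeat > heartbeat_cadence + OFFLINE_GRACE


def is_schedule_stale(last_run, interval, now):
    """True once the last run lies more than 1.5 x interval in the past."""
    return now - last_run > interval * STALE_FACTOR
```

Under these assumptions, a daily schedule becomes stale once its last run is more than 36 hours old, and a host with a 30-second cadence is flagged after more than 120 seconds of silence.
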
Each alert has a `dedup_key`, so re-firing the same condition
just bumps `last_seen_at`: the operator gets one row per
condition, not a thousand.

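The deduplication behaviour described above amounts to an upsert keyed on `dedup_key`. Here is a minimal in-memory sketch in Python; only `dedup_key` and `last_seen_at` come from the text, while the `Alert` class, the `first_seen_at` field, the `raise_alert` helper and the key format are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Alert:
    dedup_key: str
    kind: str
    first_seen_at: datetime  # hypothetical field, not named in the docs
    last_seen_at: datetime


def raise_alert(open_alerts, dedup_key, kind, now):
    """Create a new alert, or bump last_seen_at if the key is already open.

    Returns (alert, created) so the caller can notify only on creation.
    """
    existing = open_alerts.get(dedup_key)
    if existing is not None:
        existing.last_seen_at = now
        return existing, False
    alert = Alert(dedup_key, kind, first_seen_at=now, last_seen_at=now)
    open_alerts[dedup_key] = alert
    return alert, True
```

Calling `raise_alert` twice with the same key (say, a hypothetical `"agent_offline:host-7"`) leaves a single entry in `open_alerts` whose `last_seen_at` reflects the second call.
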
## Lifecycle

```
raised ──acknowledge──▶ acknowledged ──resolve──▶ resolved
   │                          │
   └────────auto-resolve──────┘
   (e.g. agent_offline auto-resolves on agent_online)
```

- **Acknowledge** says "I've seen this, stop notifying about it".
- **Resolve** says "the underlying condition is gone".
- Some alerts auto-resolve when the condition clears
  (`agent_offline` is the canonical example).

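The lifecycle can be read as a small state machine. The sketch below encodes it as a transition table in Python. It follows the diagram literally, so manual resolve is only allowed from `acknowledged` and auto-resolve from either open state; whether the server also permits resolving directly from `raised` is not stated here, and the `State`, `TRANSITIONS` and `apply_action` names are hypothetical.

```python
from enum import Enum


class State(Enum):
    RAISED = "raised"
    ACKNOWLEDGED = "acknowledged"
    RESOLVED = "resolved"


# (current state, action) -> next state, as drawn in the lifecycle diagram.
TRANSITIONS = {
    (State.RAISED, "acknowledge"): State.ACKNOWLEDGED,
    (State.ACKNOWLEDGED, "resolve"): State.RESOLVED,
    (State.RAISED, "auto-resolve"): State.RESOLVED,
    (State.ACKNOWLEDGED, "auto-resolve"): State.RESOLVED,
}


def apply_action(state, action):
    """Return the next state, or raise ValueError for a transition not in the table."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"cannot {action} an alert in state {state.value}") from None
```

Keeping the transitions in one table makes it easy to reject invalid actions, such as acknowledging an alert that is already resolved.
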
## Notification channels

Configure under **Settings → Notifications**. Each channel can
subscribe to all alerts or filter by severity.

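As a rough illustration of severity filtering, the Python sketch below routes an alert to the channels that accept it. It assumes a channel either has no filter (all alerts) or carries a set of accepted severities; the actual filter model in restic-manager may differ (for example, a minimum-severity threshold), and the channel names and the `channel_accepts` and `route` helpers are made up.

```python
def channel_accepts(severities, alert_severity):
    """severities=None means the channel subscribes to all alerts."""
    return severities is None or alert_severity in severities


def route(channels, alert_severity):
    """Return the names of the channels that should receive this alert."""
    return [
        name
        for name, severities in channels.items()
        if channel_accepts(severities, alert_severity)
    ]


# Hypothetical configuration: one catch-all channel, one critical-only channel.
channels = {
    "ops-webhook": None,
    "pager-ntfy": {"critical"},
}
```

With this configuration a `warning` alert reaches only `ops-webhook`, while a `critical` alert reaches both channels.
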
### Webhook

Posts a JSON envelope to a URL of your choice. Useful for
piping into Slack via an Incoming Webhook URL or into your own
alerting tooling.

### ntfy

Pushes a plain-text alert to an [ntfy.sh](https://ntfy.sh/)
topic. Configure the topic URL; optional bearer token if you
self-host with auth.

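To check a topic URL and token by hand, you can publish a message yourself. ntfy accepts an HTTP POST to the topic URL with the message as the request body, and a bearer token in the `Authorization` header when auth is enabled. The Python sketch below builds such a request with the standard library. It shows the ntfy publishing convention, not restic-manager's own client code, and the topic URL, token and message text are placeholders.

```python
import urllib.request


def build_ntfy_request(topic_url, message, title=None, token=None):
    """Build (but do not send) a POST request that publishes to an ntfy topic."""
    headers = {}
    if title:
        headers["Title"] = title  # optional notification title
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        topic_url,
        data=message.encode("utf-8"),
        headers=headers,
        method="POST",
    )


# Placeholder values for illustration only.
req = build_ntfy_request(
    "https://ntfy.example.com/backups",
    "prune_failed on host-7",
    title="restic-manager alert",
    token="tk_example",
)
```

Sending it is a single call, `urllib.request.urlopen(req, timeout=10)`, which raises `urllib.error.HTTPError` if the server rejects the token.
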
### SMTP

Plain SMTP (with optional TLS). Configure host, port,
username, password, and the recipient list.

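The same settings can be exercised outside the server to rule out mail-relay problems. The Python sketch below uses the standard `smtplib` and `email` modules. It assumes "optional TLS" means STARTTLS on a plain connection, and the sender address, subject and function names are hypothetical, since the chapter only lists host, port, username, password and recipients.

```python
import smtplib
from email.message import EmailMessage


def build_alert_email(sender, recipients, subject, body):
    """Assemble a plain-text message addressed to every recipient."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def send_alert_email(msg, host, port, username=None, password=None, use_tls=True):
    """Deliver the message; STARTTLS and login are both optional."""
    with smtplib.SMTP(host, port, timeout=10) as smtp:
        if use_tls:
            smtp.starttls()
        if username:
            smtp.login(username, password)
        smtp.send_message(msg)
```

Splitting message construction from delivery keeps the first half testable without a mail server.
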
## Test fire

Each channel exposes a **Test fire** button that dispatches a
single synthetic alert through the channel without touching the
alert engine. Use this when you've added a channel and want to
verify connectivity before the next real failure happens.

## What gets logged

Every alert raise / acknowledge / resolve writes an audit log
entry. The audit log UI at **Settings → Audit log** filters by
user, action, target, and time range, which is useful for the
post-incident "who clicked acknowledge on the prune-failure
alert" question.