P5-01 — Documentation site under docs/book/ rendered with mdBook
(downloaded via Makefile, same static-binary pattern as Tailwind).
Structured chapters: getting started, concepts, operations,
security, reference. `make docs` / `make docs-watch`. Generated
output gitignored.
P5-02 — CONTRIBUTING.md rewritten from placeholder to a full
guide. CODE_OF_CONDUCT.md adapted from Contributor Covenant for a
single-maintainer project. .gitea/issue_template/{bug,feature}.md
and PULL_REQUEST_TEMPLATE.md.
P5-04 — Six README screenshots captured live from a fresh server
bootstrap (login, empty dashboard, add-host, alerts, settings,
audit log). README rewritten to centre the screenshot grid and
link out to the docs site.
P5-05 — SECURITY.md with disclosure policy (3-day ack, 30-day
default window), scope in/out, threat-model summary, operator
hardening checklist. Mirrored as a docs-site chapter.
P5-06 — End-to-end test harness. e2e/compose.e2e.yml brings up
server + sibling Linux agent (alpine + restic) + restic/rest-server.
Agent uses announce-and-approve so Playwright can drive the full
operator flow: bootstrap → login → accept pending → backup →
verify terminal status. Second spec scrapes /metrics to assert
the P6-04 endpoint surface. .gitea/workflows/e2e.yml runs on every
PR; local how-to in docs/e2e.md.
Alerts and notifications
restic-manager raises alerts on conditions that need human attention. The alert engine evaluates rules on a 60s tick and on every job-finished / host-online event.
Built-in alert kinds
| Kind | Trigger | Severity |
|---|---|---|
| backup_failed | A backup job ends in failed or cancelled | warning |
| forget_failed | A forget job ends in failed | warning |
| prune_failed | A prune job ends in failed | critical |
| check_failed | A check job ends in failed | critical |
| agent_offline | A host has been offline more than 90s past its heartbeat cadence | warning |
| stale_schedule | A schedule's "last run" is more than 1.5 × its interval ago | warning |
| update_failed | An agent self-update returned a fail or didn't reconnect within 90s | warning |
| fleet_update_halted | The rolling fleet-update worker stopped on a failure | critical |
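The two time-based triggers (agent_offline and stale_schedule) come down to simple threshold checks. The sketch below is illustrative only: the function names are hypothetical, and it assumes "90s past its heartbeat cadence" means the time since the last heartbeat exceeds the cadence plus 90 seconds. The real engine may compute this differently.

```python
from datetime import datetime, timedelta, timezone

OFFLINE_GRACE = timedelta(seconds=90)  # from the table: 90s past the heartbeat cadence
STALE_FACTOR = 1.5                     # from the table: 1.5 x the schedule interval


def is_agent_offline(last_heartbeat, cadence, now):
    """Assumed reading: silent for longer than cadence + 90s."""
    return now - last_heartbeat > cadence + OFFLINE_GRACE


def is_schedule_stale(last_run, interval, now):
    """Last run is more than 1.5 x the schedule interval ago."""
    return now - last_run > interval * STALE_FACTOR


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)

# A daily schedule becomes stale once its last run is more than 36 hours old.
daily = timedelta(hours=24)
stale_after_30h = is_schedule_stale(now - timedelta(hours=30), daily, now)  # False
stale_after_37h = is_schedule_stale(now - timedelta(hours=37), daily, now)  # True
```

With a 30s heartbeat cadence, the same reading puts the agent_offline threshold at 120 seconds of silence.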
Each alert has a dedup_key so re-firing the same condition
just bumps last_seen_at — the operator gets one row per
condition, not a thousand.
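The dedup behaviour can be pictured as an upsert keyed on dedup_key. This is a minimal in-memory sketch, not the project's storage code: the `AlertStore` class, the `kind:target` key format, and the `first_seen_at` field are all hypothetical. Only dedup_key and last_seen_at come from the description above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Alert:
    dedup_key: str
    kind: str
    first_seen_at: datetime  # hypothetical field, for illustration
    last_seen_at: datetime


class AlertStore:
    """Keeps one open alert per dedup_key."""

    def __init__(self):
        self._open = {}

    def raise_alert(self, kind, target, now):
        key = f"{kind}:{target}"  # hypothetical key format
        existing = self._open.get(key)
        if existing is not None:
            # Same condition fired again: only bump last_seen_at.
            existing.last_seen_at = now
            return existing, False
        alert = Alert(key, kind, now, now)
        self._open[key] = alert
        return alert, True

    def open_count(self):
        return len(self._open)


store = AlertStore()
t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
_, created_first = store.raise_alert("backup_failed", "host-a", t0)
alert, created_again = store.raise_alert("backup_failed", "host-a", t0 + timedelta(minutes=5))
```

After both calls the store still holds a single row for `backup_failed:host-a`, with last_seen_at moved forward by five minutes.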
Lifecycle
raised ──acknowledge──▶ acknowledged ──resolve──▶ resolved
│ │
└────────auto-resolve──────┘
(e.g. agent_offline auto-resolves on agent_online)
- Acknowledge says "I've seen this, stop notifying about it".
- Resolve says "the underlying condition is gone".
- Some alerts auto-resolve when the condition clears (agent_offline is the canonical example).
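The lifecycle above can be sketched as a small transition table. This reflects one reading of the diagram, namely that auto-resolve can fire from either raised or acknowledged. That reading, and the `next_state` helper itself, are assumptions rather than the project's actual code.

```python
# (current state, action) -> next state
TRANSITIONS = {
    ("raised", "acknowledge"): "acknowledged",
    ("acknowledged", "resolve"): "resolved",
    # Assumed: auto-resolve is allowed from any open state.
    ("raised", "auto_resolve"): "resolved",
    ("acknowledged", "auto_resolve"): "resolved",
}


def next_state(state, action):
    """Return the state that follows `action`, or raise if it is not allowed."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"cannot {action} an alert that is {state}") from None


# An agent_offline alert that clears on its own skips acknowledgement entirely.
state_after_auto = next_state("raised", "auto_resolve")
```

Keeping the transitions in a table makes invalid moves, such as acknowledging a resolved alert, fail loudly instead of silently changing state.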
Notification channels
Configure under Settings → Notifications. Each channel can subscribe to all alerts or filter by severity.
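Severity filtering amounts to a membership test per channel. The sketch below assumes a channel stores either `None` (subscribe to everything) or a set of severities. That representation and the channel names are hypothetical; the two severity values are the ones used in the table of built-in alert kinds.

```python
def channel_wants(subscribed_severities, alert_severity):
    """None means the channel subscribes to all alerts."""
    return subscribed_severities is None or alert_severity in subscribed_severities


# Hypothetical channel configuration for illustration.
channels = {
    "ops-webhook": None,        # all alerts
    "pager-ntfy": {"critical"}, # critical only
}


def recipients_for(alert_severity):
    """Names of the channels that should receive an alert of this severity."""
    return sorted(
        name for name, wanted in channels.items()
        if channel_wants(wanted, alert_severity)
    )


warning_targets = recipients_for("warning")
critical_targets = recipients_for("critical")
```

Under this configuration a warning reaches only the webhook, while a critical alert reaches both channels.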
Webhook
Posts a JSON envelope to a URL of your choice. Useful for piping into Slack via an Incoming Webhook URL or into your own alerting tooling.
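If you route the webhook through your own tooling, a small adapter can turn the envelope into whatever the downstream service expects. The envelope schema is not documented here, so the field names `kind`, `severity`, and `message` below are assumptions; check a real payload (for example via Test fire) before relying on them. The output shape, a JSON object with a `text` field, is the basic message format accepted by Slack Incoming Webhooks.

```python
import json


def to_slack_payload(raw_body: bytes) -> dict:
    """Convert a webhook envelope into a minimal Slack message payload.

    The envelope keys read here are hypothetical; missing keys fall back
    to placeholders instead of raising.
    """
    envelope = json.loads(raw_body)
    kind = envelope.get("kind", "unknown")
    severity = envelope.get("severity", "unknown")
    message = envelope.get("message", "")
    return {"text": f"[{severity}] {kind}: {message}".rstrip()}


example_body = json.dumps(
    {"kind": "prune_failed", "severity": "critical", "message": "repo locked"}
).encode("utf-8")
slack_payload = to_slack_payload(example_body)
```

The adapter is a pure function, so it can be unit-tested without a running server and then called from any HTTP handler.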
ntfy
Pushes a plain-text alert to an ntfy.sh topic. Configure the topic URL; optional bearer token if you self-host with auth.
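To check a topic URL and token independently of restic-manager, you can publish to ntfy by hand. ntfy accepts a plain-text POST body on the topic URL, an optional `Title` header, and `Authorization: Bearer <token>` for authenticated servers. The helper name, the example URL, and the token below are placeholders, and this is not necessarily how restic-manager formats its own messages.

```python
import urllib.request


def build_ntfy_request(topic_url, message, token=None, title=None):
    """Build (but do not send) a plain-text ntfy publish request."""
    headers = {}
    if title:
        headers["Title"] = title
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        topic_url,
        data=message.encode("utf-8"),
        headers=headers,
        method="POST",
    )


req = build_ntfy_request(
    "https://ntfy.example.com/restic-alerts",  # placeholder topic URL
    "backup_failed on host-a",
    token="tk_example",                        # placeholder token
    title="restic-manager alert",
)

# To actually publish, uncomment:
# urllib.request.urlopen(req, timeout=10)
```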
SMTP
Plain SMTP (with optional TLS). Configure host, port, username, password, and the recipient list.
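Before entering SMTP settings in the UI, it can help to confirm them with a short standalone script. The sketch below uses Python's standard `smtplib` and `email.message` and assumes TLS means STARTTLS on the configured port. The addresses, host, and helper names are placeholders, and the message layout is illustrative rather than what restic-manager sends.

```python
import smtplib
from email.message import EmailMessage


def build_alert_email(sender, recipients, subject, body):
    """Assemble a plain-text alert email for a list of recipients."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def send_alert_email(msg, host, port, username=None, password=None, use_tls=True):
    """Send over plain SMTP, upgrading with STARTTLS when use_tls is set."""
    with smtplib.SMTP(host, port, timeout=10) as smtp:
        if use_tls:
            smtp.starttls()
        if username:
            smtp.login(username, password)
        smtp.send_message(msg)


msg = build_alert_email(
    "alerts@example.com",
    ["ops@example.com", "oncall@example.com"],
    "[critical] prune_failed",
    "A prune job ended in failed.",
)

# To actually send, uncomment and fill in real settings:
# send_alert_email(msg, "smtp.example.com", 587, "user", "password")
```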
Test fire
Each channel exposes a Test fire button that dispatches a single synthetic alert through the channel without touching the alert engine. Use this when you've added a channel and want to verify connectivity before the next real failure happens.
What gets logged
Every alert raise / acknowledge / resolve writes an audit log entry. The audit log UI at Settings → Audit log filters by user, action, target, and time range — useful for the post-incident "who clicked acknowledge on the prune-failure alert" question.