restic-manager/docs/book/src/operations/alerts.md
steve bb4ed3502d P5: OSS readiness — docs site, contributor onboarding, e2e harness
2026-05-08 20:08:23 +01:00

Alerts and notifications

restic-manager raises alerts on conditions that need human attention. The alert engine evaluates rules on a 60s tick and on every job-finished / host-online event.

Built-in alert kinds

| Kind | Trigger | Severity |
| --- | --- | --- |
| backup_failed | A backup job ends in failed or cancelled | warning |
| forget_failed | A forget job ends in failed | warning |
| prune_failed | A prune job ends in failed | critical |
| check_failed | A check job ends in failed | critical |
| agent_offline | A host has been offline more than 90s past its heartbeat cadence | warning |
| stale_schedule | A schedule's "last run" is more than 1.5 × its interval ago | warning |
| update_failed | An agent self-update reported a failure, or the agent didn't reconnect within 90s | warning |
| fleet_update_halted | The rolling fleet-update worker stopped on a failure | critical |
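
The two time-based triggers in the table (agent_offline and stale_schedule) come down to simple threshold comparisons. The sketch below shows one way to read them; the function names are hypothetical, and treating "90s past its heartbeat cadence" as "silent for longer than cadence + 90s" is an interpretation, not something taken from restic-manager's source.

```python
from datetime import datetime, timedelta, timezone

OFFLINE_GRACE = timedelta(seconds=90)  # table: 90s past the heartbeat cadence
STALE_FACTOR = 1.5                     # table: 1.5 x the schedule interval


def is_agent_offline(last_heartbeat, cadence, now):
    """True once the host has been silent for longer than cadence + 90s."""
    return now - last_heartbeat > cadence + OFFLINE_GRACE


def is_schedule_stale(last_run, interval, now):
    """True once the last run is more than 1.5 x the interval in the past."""
    return now - last_run > interval * STALE_FACTOR


if __name__ == "__main__":
    now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
    # A daily schedule that last ran 37 hours ago is past the 36-hour threshold.
    print(is_schedule_stale(now - timedelta(hours=37), timedelta(hours=24), now))
```

Under this reading, a daily schedule tolerates up to 36 hours between runs, and a host with a 30-second heartbeat is flagged after 120 seconds of silence.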

Each alert has a dedup_key so re-firing the same condition just bumps last_seen_at — the operator gets one row per condition, not a thousand.
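
The dedup behaviour can be pictured as an upsert keyed on dedup_key. This is a minimal in-memory sketch, not the actual storage layer: the `Alert` and `AlertStore` names, the `first_seen_at` field, and the `kind:host` key format are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Alert:
    dedup_key: str
    kind: str
    first_seen_at: datetime  # hypothetical field, added for illustration
    last_seen_at: datetime


class AlertStore:
    """Keeps at most one open alert per dedup_key."""

    def __init__(self):
        self._open = {}

    def raise_alert(self, kind, dedup_key, now):
        existing = self._open.get(dedup_key)
        if existing is not None:
            # Same condition fired again: bump the timestamp, add no new row.
            existing.last_seen_at = now
            return existing
        alert = Alert(dedup_key, kind, now, now)
        self._open[dedup_key] = alert
        return alert

    def __len__(self):
        return len(self._open)


if __name__ == "__main__":
    store = AlertStore()
    t0 = datetime(2026, 1, 1, tzinfo=timezone.utc)
    for tick in range(3):  # three consecutive 60s ticks see the same host offline
        store.raise_alert("agent_offline", "agent_offline:nas-01",
                          t0 + timedelta(seconds=60 * tick))
    print(len(store))
```

Three evaluations of the same condition leave a single row whose last_seen_at reflects the most recent tick.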

Lifecycle

raised  ──acknowledge──▶  acknowledged  ──resolve──▶  resolved
   │                          │
   └────────auto-resolve──────┘
   (e.g. agent_offline auto-resolves on agent_online)
  • Acknowledge says "I've seen this, stop notifying about it".
  • Resolve says "the underlying condition is gone".
  • Some alerts auto-resolve when the condition clears (agent_offline is the canonical example).
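
One reading of the lifecycle diagram is a small state machine with three states and three actions. The sketch below encodes only the transitions the diagram shows; the action names and the `next_state` helper are hypothetical, and whether restic-manager permits other transitions (for example, a manual resolve straight from raised) is not stated here.

```python
# (current state, action) -> next state, as drawn in the lifecycle diagram.
TRANSITIONS = {
    ("raised", "acknowledge"): "acknowledged",
    ("acknowledged", "resolve"): "resolved",
    ("raised", "auto_resolve"): "resolved",
    ("acknowledged", "auto_resolve"): "resolved",
}


def next_state(state, action):
    """Return the state after applying action, or raise ValueError if not allowed."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"cannot {action} an alert in state {state!r}") from None


if __name__ == "__main__":
    state = "raised"
    for action in ("acknowledge", "resolve"):
        state = next_state(state, action)
    print(state)
```

Keeping the transitions in a single table makes it easy to see that resolved is terminal: no entry starts from it.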

Notification channels

Configure under Settings → Notifications. Each channel can subscribe to all alerts or filter by severity.
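
Conceptually, the subscription setting decides which channels receive a given alert. The sketch below models it as an optional set of severities per channel; the `Channel` type and `recipients` helper are hypothetical, and the real filter may be expressed differently (for instance, as a minimum severity).

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Channel:
    name: str
    # None means "subscribe to all alerts"; otherwise only the listed severities.
    severities: Optional[FrozenSet[str]] = None

    def wants(self, severity):
        return self.severities is None or severity in self.severities


def recipients(channels, severity):
    """Names of the channels that should be notified for this severity."""
    return [c.name for c in channels if c.wants(severity)]


if __name__ == "__main__":
    channels = [
        Channel("ops-webhook"),                            # all alerts
        Channel("oncall-ntfy", frozenset({"critical"})),   # critical only
    ]
    print(recipients(channels, "warning"))
    print(recipients(channels, "critical"))
```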

Webhook

Posts a JSON envelope to a URL of your choice. Useful for piping into Slack via an Incoming Webhook URL or into your own alerting tooling.
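
If you route the webhook through your own tooling before it reaches Slack, the translation step might look like the sketch below. The envelope field names (`kind`, `severity`, `host`, `message`) are assumptions, since the envelope schema is not documented on this page; the only firm detail is that Slack Incoming Webhooks accept a JSON body with a `text` field.

```python
import json
import urllib.request


def to_slack_payload(envelope):
    """Flatten an alert envelope (field names assumed) into Slack's {"text": ...} shape."""
    kind = envelope.get("kind", "unknown")
    severity = envelope.get("severity", "unknown")
    host = envelope.get("host", "unknown host")
    message = envelope.get("message", "")
    text = f"[{severity}] {kind} on {host}"
    if message:
        text += f": {message}"
    return {"text": text}


def build_slack_request(slack_url, envelope):
    """Build (but do not send) the POST request for a Slack Incoming Webhook URL."""
    body = json.dumps(to_slack_payload(envelope)).encode("utf-8")
    return urllib.request.Request(
        slack_url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )
```

Pass the returned request to `urllib.request.urlopen` to deliver it; building and sending are kept separate so the payload can be inspected without any network access.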

ntfy

Pushes a plain-text alert to an ntfy.sh topic. Configure the topic URL and, if you self-host ntfy with authentication enabled, an optional bearer token.
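
For reference, publishing to ntfy is a plain HTTP POST to the topic URL, with the message as the request body and optional `Title`, `Priority`, and `Authorization: Bearer` headers. The sketch below builds such a request by hand; the helper name and the severity-to-priority mapping are assumptions and may not match what restic-manager sends.

```python
import urllib.request

# Assumed mapping from alert severity to ntfy priority names.
PRIORITY = {"warning": "high", "critical": "urgent"}


def build_ntfy_request(topic_url, title, body, severity, token=None):
    """Build (but do not send) a plain-text ntfy publish request."""
    headers = {
        "Title": title,
        "Priority": PRIORITY.get(severity, "default"),
    }
    if token:
        # Only needed when the ntfy server enforces access control.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        topic_url,
        data=body.encode("utf-8"),
        method="POST",
        headers=headers,
    )
```

Sending the same request with curl against your topic URL is a quick way to confirm that the topic and token work before wiring them into a channel.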

SMTP

Sends email over plain SMTP, with optional TLS. Configure host, port, username, password, and the recipient list.
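
To check the same settings outside restic-manager, a short script using Python's standard library can send a comparable message. This is a sketch under assumptions: the subject format is invented, and "optional TLS" is interpreted here as STARTTLS on a plain connection, which may differ from what the server actually does.

```python
import smtplib
from email.message import EmailMessage


def build_alert_email(sender, recipients, kind, severity, detail):
    """Compose a plain-text alert email (subject format is an assumption)."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg["Subject"] = f"[restic-manager] {severity}: {kind}"
    msg.set_content(detail)
    return msg


def send_alert_email(msg, host, port, username=None, password=None, use_tls=True):
    """Deliver msg over SMTP, upgrading to TLS via STARTTLS when use_tls is set."""
    with smtplib.SMTP(host, port, timeout=10) as smtp:
        if use_tls:
            smtp.starttls()
        if username:
            smtp.login(username, password)
        smtp.send_message(msg)
```

If this script fails with the host, port, and credentials you entered in the channel, the problem is likely on the mail-server side rather than in restic-manager.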

Test fire

Each channel exposes a Test fire button that dispatches a single synthetic alert through the channel without touching the alert engine. Use this when you've added a channel and want to verify connectivity before the next real failure happens.

What gets logged

Every alert raise / acknowledge / resolve writes an audit log entry. The audit log UI at Settings → Audit log filters by user, action, target, and time range — useful for the post-incident "who clicked acknowledge on the prune-failure alert" question.