# Alerts and notifications

restic-manager raises alerts on conditions that need human
attention. The alert engine evaluates its rules every 60 seconds
and again whenever a job finishes or a host comes online.

## Built-in alert kinds

| Kind                | Trigger | Severity |
|---------------------|---------|----------|
| `backup_failed`     | A backup job ends in `failed` or `cancelled` | warning |
| `forget_failed`     | A forget job ends in `failed` | warning |
| `prune_failed`      | A prune job ends in `failed` | critical |
| `check_failed`      | A check job ends in `failed` | critical |
| `agent_offline`     | A host has been offline more than 90s past its heartbeat cadence | warning |
| `stale_schedule`    | A schedule's "last run" is more than 1.5 × its interval ago | warning |
| `update_failed`     | An agent self-update reported a failure, or the agent didn't reconnect within 90s | warning |
| `fleet_update_halted` | The rolling fleet-update worker stopped on a failure | critical |
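
The two time-based triggers in the table reduce to simple comparisons. The sketch below is illustrative only: the function and constant names are hypothetical, and it assumes "90s past its heartbeat cadence" means the last heartbeat is older than one cadence plus 90 seconds.

```python
from datetime import datetime, timedelta, timezone

OFFLINE_GRACE = timedelta(seconds=90)  # grace past the heartbeat cadence (from the table)
STALE_FACTOR = 1.5                     # multiple of the schedule interval (from the table)


def is_agent_offline(last_heartbeat: datetime, cadence: timedelta, now: datetime) -> bool:
    """True once the last heartbeat is older than cadence + 90s (assumed reading)."""
    return now - last_heartbeat > cadence + OFFLINE_GRACE


def is_schedule_stale(last_run: datetime, interval: timedelta, now: datetime) -> bool:
    """True once the last run is more than 1.5 x the interval in the past."""
    return now - last_run > interval * STALE_FACTOR
```

Under this reading, a daily schedule becomes stale 36 hours after its last run, and a host with a 30-second heartbeat is flagged once 120 seconds pass without one.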

Each alert has a `dedup_key` so re-firing the same condition
just bumps `last_seen_at` — the operator gets one row per
condition, not a thousand.
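
The deduplication behaviour can be pictured as an upsert keyed on `dedup_key`. This is a minimal in-memory sketch, not the actual storage code: the `Alert` and `AlertStore` classes, the `first_seen_at` field, and the `kind:target` key format are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Alert:
    dedup_key: str
    kind: str
    first_seen_at: datetime
    last_seen_at: datetime


class AlertStore:
    """Keeps at most one open alert per dedup_key."""

    def __init__(self) -> None:
        self._open: dict[str, Alert] = {}

    def raise_alert(self, kind: str, target: str, now: datetime) -> Alert:
        # Hypothetical key format: one open alert per (kind, target) pair.
        key = f"{kind}:{target}"
        alert = self._open.get(key)
        if alert is None:
            # First occurrence of this condition: create the row.
            alert = Alert(key, kind, first_seen_at=now, last_seen_at=now)
            self._open[key] = alert
        else:
            # Re-fire of the same condition: bump the timestamp, add no row.
            alert.last_seen_at = now
        return alert

    def __len__(self) -> int:
        return len(self._open)
```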

## Lifecycle

```
raised  ──acknowledge──▶  acknowledged  ──resolve──▶  resolved
   │                          │
   └────────auto-resolve──────┘
   (e.g. agent_offline auto-resolves on agent_online)
```

- **Acknowledge** says "I've seen this, stop notifying about it".
- **Resolve** says "the underlying condition is gone".
- Some alerts auto-resolve when the condition clears
  (`agent_offline` is the canonical example).
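
The lifecycle can be written down as a small transition table. The sketch below transcribes only the arrows drawn in the diagram; the `State` enum, the action strings, and the `apply` function are illustrative names, and whether a manual resolve is allowed directly from `raised` is not stated here, so it is left out.

```python
from enum import Enum


class State(Enum):
    RAISED = "raised"
    ACKNOWLEDGED = "acknowledged"
    RESOLVED = "resolved"


# (current state, action) -> next state, as drawn in the lifecycle diagram.
TRANSITIONS = {
    (State.RAISED, "acknowledge"): State.ACKNOWLEDGED,
    (State.ACKNOWLEDGED, "resolve"): State.RESOLVED,
    (State.RAISED, "auto-resolve"): State.RESOLVED,
    (State.ACKNOWLEDGED, "auto-resolve"): State.RESOLVED,
}


def apply(state: State, action: str) -> State:
    """Return the next state, or raise ValueError for a transition not in the table."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"cannot {action} an alert in state {state.value}") from None
```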

## Notification channels

Configure channels under **Settings → Notifications**. Each
channel can subscribe to all alerts or filter by severity.
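
Conceptually, routing an alert is a per-channel membership check. The sketch below assumes the filter is a set of accepted severities, with `None` meaning "all alerts"; the `Channel` class and `route` function are hypothetical, and the real filter may instead work as a minimum-severity threshold.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Channel:
    name: str
    severities: Optional[frozenset] = None  # None = subscribed to all alerts

    def wants(self, severity: str) -> bool:
        return self.severities is None or severity in self.severities


def route(channels, severity):
    """Return the names of the channels that should receive this severity."""
    return [c.name for c in channels if c.wants(severity)]
```

With one unfiltered channel and one limited to `critical`, a `warning` alert reaches only the first, while a `critical` alert reaches both.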

### Webhook

Posts a JSON envelope to a URL of your choice. Useful for
piping into Slack via an Incoming Webhook URL or into your own
alerting tooling.
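
To make the shape of a webhook delivery concrete, here is a sketch of building such a request with Python's standard library. The envelope fields (`event`, `alert`) are placeholders chosen for illustration; the actual field names are not documented on this page, so inspect a real delivery (for example via a test fire) before parsing it.

```python
import json
import urllib.request


def build_webhook_request(url: str, alert: dict) -> urllib.request.Request:
    # Hypothetical envelope: the real field names may differ.
    body = json.dumps({"event": "alert", "alert": alert}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen(req, timeout=10)` would perform the POST.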

### ntfy

Pushes a plain-text alert to an [ntfy.sh](https://ntfy.sh/)
topic. Configure the topic URL and, if you self-host with
auth, a bearer token.
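
For reference, publishing to ntfy is an HTTP POST of the message text to the topic URL, with optional `Title` and `Priority` headers and an `Authorization: Bearer` header when a token is required. The sketch below builds such a request; the severity-to-priority mapping and the function name are assumptions, not necessarily what restic-manager sends.

```python
import urllib.request
from typing import Optional

# Assumed mapping from alert severity to ntfy priority names.
PRIORITY = {"warning": "default", "critical": "urgent"}


def build_ntfy_request(
    topic_url: str,
    title: str,
    message: str,
    severity: str,
    token: Optional[str] = None,
) -> urllib.request.Request:
    headers = {"Title": title, "Priority": PRIORITY.get(severity, "default")}
    if token:
        # Only needed when the ntfy server enforces access control.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        topic_url,
        data=message.encode("utf-8"),  # the request body is the plain-text message
        headers=headers,
        method="POST",
    )
```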

### SMTP

Sends email over plain SMTP, optionally with TLS. Configure
host, port, username, password, and the recipient list.
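
The settings above map directly onto a standard SMTP client. The following sketch uses Python's `smtplib` and assumes "optional TLS" means STARTTLS on the configured port; the subject format and function names are illustrative.

```python
import smtplib
from email.message import EmailMessage


def build_alert_email(sender, recipients, kind, severity, detail) -> EmailMessage:
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg["Subject"] = f"[{severity}] {kind}"  # illustrative subject format
    msg.set_content(detail)
    return msg


def send(msg, host, port, username=None, password=None, use_tls=True):
    with smtplib.SMTP(host, port, timeout=10) as smtp:
        if use_tls:
            smtp.starttls()  # upgrade the plain connection (assumed STARTTLS)
        if username:
            smtp.login(username, password)
        smtp.send_message(msg)
```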

## Test fire

Each channel exposes a **Test fire** button that dispatches a
single synthetic alert through the channel without touching the
alert engine. Use this when you've added a channel and want to
verify connectivity before the next real failure happens.

## What gets logged

Every alert raise, acknowledge, and resolve writes an audit log
entry. The audit log UI at **Settings → Audit log** filters by
user, action, target, and time range, which helps answer
post-incident questions such as "who clicked acknowledge on the
prune-failure alert?"