3 Commits

Author SHA1 Message Date
steve 350be3f19d feat(alerts): per-source-group dedup so two failing backups produce two alerts
Until now the open-alert key was (host_id, kind, resolved_at IS NULL).
A host with two source groups both failing collapsed onto one
backup_failed row: the second failure bumped last_seen_at and
overwrote the message but never fanned out again. Operators saw one
alert that appeared to flap, not two distinct broken things.

Schema changes (column-level ALTER, no rebuild):

- 0015 jobs.source_group_id (FK → source_groups, ON DELETE SET NULL,
  index). Populated for backup jobs in CreateJob.
- 0016 alerts.dedup_key (NOT NULL DEFAULT ''). The old alerts_open
  partial index gets dropped and replaced with a UNIQUE partial
  index on (host_id, kind, dedup_key) WHERE resolved_at IS NULL —
  the index is now the actual dedup primitive.

Plumbing:

- RaiseOrTouch / AutoResolve / Alert struct gain dedup_key.
- engine.JobFinishedEvent gains SourceGroupID; handleJobFinished
  passes it through for backup_failed only (forget/prune/check stay
  repo-scoped with key='').
- ws.handler reads SourceGroupID off the freshly-loaded job row.
- dispatchJobWithPayload gains a *string sourceGroupID arg; the
  per-group Run-now path and schedule.fire path pass &g.ID.
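
The key-selection rule in the plumbing list above can be sketched as a small helper. This is a hypothetical illustration, not the code from the commit: the name dedupKeyFor and the "prune_failed" kind string are assumptions, and only backup_failed and the empty key for repo-scoped jobs come from the message itself.

```go
package main

// kindBackupFailed is the only alert kind the commit message says is keyed
// per source group.
const kindBackupFailed = "backup_failed"

// dedupKeyFor is a hypothetical helper: it returns the source group ID for
// backup_failed alerts and the empty string for everything else, so
// repo-scoped kinds keep a single open alert per host.
func dedupKeyFor(kind string, sourceGroupID *string) string {
	if kind == kindBackupFailed && sourceGroupID != nil {
		return *sourceGroupID
	}
	// A nil pointer (no source group recorded on the job) and every
	// repo-scoped kind fall back to the pre-0016 behaviour: key = ''.
	return ""
}
```

Passing a *string rather than a string keeps "no source group" (nil) distinct from a group whose ID happens to be empty, which matches the nullable jobs.source_group_id column.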

Test coverage: TestRaiseOrTouchDedupsPerSourceGroup proves two
distinct groups produce two distinct open alerts and that resolving
one does not auto-resolve the other.
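
The behaviour that test asserts can be modelled in memory. The sketch below is an assumption-laden stand-in for the real store: memStore, openKey, and the Touches counter are invented for illustration, with a map playing the role of the UNIQUE partial index on (host_id, kind, dedup_key) WHERE resolved_at IS NULL.

```go
package main

// openKey mirrors the columns of the unique partial index on open alerts.
type openKey struct {
	HostID   int64
	Kind     string
	DedupKey string
}

// alert is a hypothetical, trimmed-down alert row.
type alert struct {
	ID      int
	Message string
	Touches int // stands in for bumping last_seen_at
}

// memStore holds open alerts only; resolving removes the entry, which is
// the in-memory analogue of the WHERE resolved_at IS NULL predicate.
type memStore struct {
	nextID int
	open   map[openKey]*alert
}

func newMemStore() *memStore {
	return &memStore{open: make(map[openKey]*alert)}
}

// RaiseOrTouch inserts a new open alert for k, or touches the existing one.
// created is true only for a fresh insert, which is when a caller would
// fan out a notification.
func (s *memStore) RaiseOrTouch(k openKey, msg string) (a *alert, created bool) {
	if existing, ok := s.open[k]; ok {
		existing.Message = msg
		existing.Touches++
		return existing, false
	}
	s.nextID++
	a = &alert{ID: s.nextID, Message: msg}
	s.open[k] = a
	return a, true
}

// Resolve closes the alert matching k exactly and leaves its siblings open.
func (s *memStore) Resolve(k openKey) bool {
	if _, ok := s.open[k]; !ok {
		return false
	}
	delete(s.open, k)
	return true
}

// OpenCount reports how many alerts are currently open.
func (s *memStore) OpenCount() int { return len(s.open) }
```

With the old key, both groups would map to the same entry and the second raise would only touch it. Including DedupKey in the key is what makes the second failure a distinct insert.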

Dev tool: cmd/_fake_alert gains -dedup-key flag.
2026-05-04 22:59:48 +01:00
steve 04dde93acd fix: dispatch alert.acknowledged + alert.resolved on UI ack/resolve
Spotted during the live Playwright sweep: clicking Acknowledge or
Resolve updated the alert row but never fanned out a notification.
The handlers went straight to Store.Acknowledge/Resolve, bypassing
the hub.

Add Engine.Acknowledge and Engine.Resolve that wrap the store call
and dispatch the matching event to every enabled channel. The UI
handlers prefer the engine path when wired, and fall back to the
direct store call so unit tests that construct a Server without an
engine still work.
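
A minimal sketch of that wrap-and-fall-back shape, under stated assumptions: the Store and Hub interfaces, the Server fields, and HandleAcknowledge are hypothetical names, the dispatch is synchronous here, and context arguments are omitted for brevity. Only Engine.Acknowledge and the event name alert.acknowledged come from the message.

```go
package main

// Store and Hub are hypothetical, narrowed views of the real dependencies.
type Store interface {
	Acknowledge(alertID int64) error
}

type Hub interface {
	Dispatch(event string, alertID int64)
}

// Engine wraps the store call and dispatches the matching event.
type Engine struct {
	Store Store
	Hub   Hub
}

// Acknowledge updates the row first and notifies only if that succeeded.
func (e *Engine) Acknowledge(alertID int64) error {
	if err := e.Store.Acknowledge(alertID); err != nil {
		return err
	}
	e.Hub.Dispatch("alert.acknowledged", alertID)
	return nil
}

// Server may be constructed without an engine, as in unit tests.
type Server struct {
	Store  Store
	Engine *Engine
}

// HandleAcknowledge prefers the engine path and falls back to the store.
func (s *Server) HandleAcknowledge(alertID int64) error {
	if s.Engine != nil {
		return s.Engine.Acknowledge(alertID)
	}
	return s.Store.Acknowledge(alertID)
}

// fakeStore and fakeHub record calls so both paths can be exercised.
type fakeStore struct{ acked []int64 }

func (f *fakeStore) Acknowledge(alertID int64) error {
	f.acked = append(f.acked, alertID)
	return nil
}

type fakeHub struct{ events []string }

func (f *fakeHub) Dispatch(event string, _ int64) {
	f.events = append(f.events, event)
}
```

The nil check keeps the engine optional, so a Server built with only a store still acknowledges alerts; it just does so silently.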

Use context.WithoutCancel for the goroutine dispatch — the request
context is cancelled the instant the handler returns 204, so the
naive 'go e.hub.Dispatch(ctx, ...)' was racing the response and
losing the channel-list query with 'context canceled'.
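
The race can be reproduced in isolation. In this sketch, dispatch and simulateHandler are hypothetical stand-ins: dispatch fails the way a query would on a cancelled context, and simulateHandler cancels the request context before either goroutine runs so the outcome is deterministic. context.WithoutCancel requires Go 1.21 or later.

```go
package main

import "context"

// dispatch stands in for the hub dispatch: like a database query, it fails
// immediately if its context has already been cancelled.
func dispatch(ctx context.Context) error {
	return ctx.Err()
}

// simulateHandler mimics a handler whose request context is cancelled as
// soon as it returns. It reports the error seen by a dispatch that reuses
// the request context and by one that uses a detached context.
func simulateHandler() (naive, detached error) {
	reqCtx, cancel := context.WithCancel(context.Background())

	// WithoutCancel keeps the parent's values but drops its cancellation
	// and deadline, so the background work outlives the request.
	detachedCtx := context.WithoutCancel(reqCtx)

	cancel() // the handler has returned and the server cancels reqCtx

	naiveCh := make(chan error, 1)
	detachedCh := make(chan error, 1)
	go func() { naiveCh <- dispatch(reqCtx) }()
	go func() { detachedCh <- dispatch(detachedCtx) }()
	return <-naiveCh, <-detachedCh
}
```

Because the detached context never expires on its own, a real dispatch would likely want its own timeout layered on top with context.WithTimeout.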
2026-05-04 21:00:44 +01:00
steve 5e655d756d alert: rule logic for the six v1 rules
2026-05-04 19:50:33 +01:00