Phase 3 — Alerts: per-source-group dedup #8

Merged
steve merged 1 commit from p3-alerts-dedup into main 2026-05-04 23:11:09 +01:00

Summary

Stacks on top of #7 (p3-alerts).

Until now alert dedup keyed on (host_id, kind), so two source groups
failing on one host collapsed onto a single open backup_failed row: the
second failure updated last_seen_at and overwrote the message but
triggered no fan-out. Operators saw one apparently flapping alert
instead of two distinct failures.

This PR widens the open-alert key to (host_id, kind, dedup_key),
where dedup_key is the source-group id for backup_failed and '' for
every other kind (forget/prune/check stay repo-scoped;
agent_offline/stale_schedule are already one-per-host).
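
The effect of the wider key can be sketched with an in-memory stand-in for the alerts table. This is only an illustration, not the PR's code: the `store`, `alertKey`, and `raiseOrTouch` names, the string-typed ids, and the boolean return are all assumptions, and the map key plays the role that the UNIQUE partial index plays in the database.

```go
package main

import "fmt"

// alertKey mirrors the widened open-alert key (host_id, kind, dedup_key).
// String ids are an assumption made for this sketch.
type alertKey struct {
	HostID   string
	Kind     string
	DedupKey string // source-group id for backup_failed, "" otherwise
}

// openAlert is a hypothetical, trimmed-down open alert row.
type openAlert struct {
	Message string
	Touches int
}

// store is a hypothetical in-memory stand-in for the open rows of the
// alerts table; map-key uniqueness models the UNIQUE partial index.
type store struct {
	open map[alertKey]*openAlert
}

func newStore() *store {
	return &store{open: make(map[alertKey]*openAlert)}
}

// raiseOrTouch returns true when a new open alert was created (the case
// where fan-out should fire) and false when an existing one was touched.
func (s *store) raiseOrTouch(k alertKey, msg string) bool {
	if a, ok := s.open[k]; ok {
		a.Message = msg
		a.Touches++
		return false
	}
	s.open[k] = &openAlert{Message: msg}
	return true
}

func main() {
	s := newStore()
	// Two source groups failing on one host now yield two open alerts.
	fmt.Println(s.raiseOrTouch(alertKey{"host-1", "backup_failed", "group-a"}, "exit 1")) // true
	fmt.Println(s.raiseOrTouch(alertKey{"host-1", "backup_failed", "group-b"}, "exit 3")) // true
	// A repeat failure in the same group only touches the existing alert.
	fmt.Println(s.raiseOrTouch(alertKey{"host-1", "backup_failed", "group-a"}, "exit 1")) // false
	fmt.Println(len(s.open))                                                             // 2
}
```

With the old two-column key, the second call would have hit the existing row and returned false, which is the collapsed behaviour described above.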

Changes

  • Migrations (column-level ALTER, no rebuild):
    • 0015_jobs_source_group_id.sql — FK to source_groups, indexed.
    • 0016_alerts_dedup_key.sql — drops the old alerts_open
      partial index and replaces it with a UNIQUE partial index on
      (host_id, kind, dedup_key) WHERE resolved_at IS NULL. The
      index is the dedup primitive now.
  • RaiseOrTouch / AutoResolve / Alert struct gain dedup_key.
  • engine.JobFinishedEvent gains SourceGroupID;
    handleJobFinished threads it through for backup only.
  • ws.handler.go reads it off the freshly-loaded job row.
  • dispatchJobWithPayload gains a *string sourceGroupID arg;
    per-group Run-now (run_group.go) and schedule.fire pass &g.ID.
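
How the key might be derived from a finished job can be sketched as below. The helper name `dedupKeyFor` is hypothetical, and falling back to '' when a backup job carries no source group is an assumption of this sketch, not something the PR states.

```go
package main

import "fmt"

// dedupKeyFor is a hypothetical helper: only backup_failed alerts are keyed
// by source group; every other kind keeps the empty key.
func dedupKeyFor(kind string, sourceGroupID *string) string {
	if kind == "backup_failed" && sourceGroupID != nil {
		return *sourceGroupID
	}
	// Assumption: a nil source group falls back to the empty key.
	return ""
}

func main() {
	g := "group-a"
	fmt.Printf("%q\n", dedupKeyFor("backup_failed", &g))  // "group-a"
	fmt.Printf("%q\n", dedupKeyFor("agent_offline", &g))  // ""
	fmt.Printf("%q\n", dedupKeyFor("backup_failed", nil)) // ""
}
```

Passing the id as a `*string` matches the `dispatchJobWithPayload` signature change listed above, where nil marks a job with no source group.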

Test plan

  • [x] go test ./...
  • [x] New TestRaiseOrTouchDedupsPerSourceGroup covers the
    multi-group case: two groups yield two open alerts, and
    resolving one leaves the other open
  • [ ] Smoke env: trigger fake failures for two distinct groups on
    one host via cmd/_fake_alert -dedup-key, verify two open rows
    on /alerts and two fan-outs

Notes

Base is p3-alerts; merge that one first or merge both as a pair.

steve changed target branch from p3-alerts to main 2026-05-04 23:00:09 +01:00
steve added 1 commit 2026-05-04 23:00:09 +01:00
Until now the open-alert key was (host_id, kind, resolved_at IS NULL).
A host with two source groups both failing collapsed onto one
backup_failed row: the second failure bumped last_seen_at and
overwrote the message but never fanned out again. Operators saw one
alert that appeared to flap, not two distinct broken things.

Schema changes (column-level ALTER, no rebuild):

- 0015 jobs.source_group_id (FK → source_groups, ON DELETE SET NULL,
  index). Populated for backup jobs in CreateJob.
- 0016 alerts.dedup_key (NOT NULL DEFAULT ''). The old alerts_open
  partial index gets dropped and replaced with a UNIQUE partial
  index on (host_id, kind, dedup_key) WHERE resolved_at IS NULL —
  the index is now the actual dedup primitive.

Plumbing:

- RaiseOrTouch / AutoResolve / Alert struct gain dedup_key.
- engine.JobFinishedEvent gains SourceGroupID; handleJobFinished
  passes it through for backup_failed only (forget/prune/check stay
  repo-scoped with key='').
- ws.handler reads SourceGroupID off the freshly-loaded job row.
- dispatchJobWithPayload gains a *string sourceGroupID arg; the
  per-group Run-now path and schedule.fire path pass &g.ID.

Test coverage: TestRaiseOrTouchDedupsPerSourceGroup proves two
distinct groups produce two distinct open alerts and that resolving
one does not auto-resolve the other.
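
The resolution half of that property can be illustrated with a minimal sketch. It assumes auto-resolve matches on the full three-part key; the `autoResolve` function and the map-based model of open alerts are hypothetical and are not the test's actual code.

```go
package main

import "fmt"

// alertKey mirrors (host_id, kind, dedup_key); string ids are an assumption.
type alertKey struct{ HostID, Kind, DedupKey string }

// autoResolve closes the open alert matching the full key and reports
// whether anything was open. Alerts with a different dedup key are untouched.
func autoResolve(open map[alertKey]bool, k alertKey) bool {
	if !open[k] {
		return false
	}
	delete(open, k)
	return true
}

func main() {
	a := alertKey{"host-1", "backup_failed", "group-a"}
	b := alertKey{"host-1", "backup_failed", "group-b"}
	open := map[alertKey]bool{a: true, b: true}

	fmt.Println(autoResolve(open, a)) // true: group-a recovers
	fmt.Println(open[b])              // true: group-b is still open
	fmt.Println(autoResolve(open, a)) // false: nothing left to resolve for group-a
}
```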

Dev tool: cmd/_fake_alert gains -dedup-key flag.
steve merged commit 84814b1386 into main 2026-05-04 23:11:09 +01:00

Reference: steve/restic-manager#8