feat(alerts): per-source-group dedup so two failing backups produce two alerts

Until now the open-alert key was (host_id, kind, resolved_at IS NULL). A host with two source groups both failing collapsed onto one backup_failed row: the second failure bumped last_seen_at and overwrote the message but never fanned out again. Operators saw one alert that appeared to flap, not two distinct broken things.

Schema changes (column-level ALTER, no rebuild):
- 0015 jobs.source_group_id (FK → source_groups, ON DELETE SET NULL, index). Populated for backup jobs in CreateJob.
- 0016 alerts.dedup_key (NOT NULL DEFAULT ''). The old alerts_open partial index gets dropped and replaced with a UNIQUE partial index on (host_id, kind, dedup_key) WHERE resolved_at IS NULL; the index is now the actual dedup primitive.

Plumbing:
- RaiseOrTouch / AutoResolve / Alert struct gain dedup_key.
- engine.JobFinishedEvent gains SourceGroupID; handleJobFinished passes it through for backup_failed only (forget/prune/check stay repo-scoped with key='').
- ws.handler reads SourceGroupID off the freshly-loaded job row.
- dispatchJobWithPayload gains a *string sourceGroupID arg; the per-group Run-now path and schedule.fire path pass &g.ID.

Test coverage: TestRaiseOrTouchDedupsPerSourceGroup proves two distinct groups produce two distinct open alerts and that resolving one does not auto-resolve the other.

Dev tool: cmd/_fake_alert gains -dedup-key flag.
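The raise/touch/resolve behaviour described in the message can be sketched with an in-memory map keyed the same way as the new index. This is a minimal illustration, not the project's code: the Store type, the method signatures, and the "fan-out" boolean are assumptions made for the example; only the names RaiseOrTouch, AutoResolve, and backup_failed come from the commit.

```go
package main

import "fmt"

// alertKey mirrors the columns of the new UNIQUE partial index:
// (host_id, kind, dedup_key). Only open alerts are tracked here.
type alertKey struct {
	HostID   string
	Kind     string
	DedupKey string
}

// Alert is a hypothetical, trimmed-down alert row.
type Alert struct {
	Message string
	Touches int // how many times a repeat failure touched this alert
}

// Store holds open alerts only; deleting an entry stands in for
// setting resolved_at on the real row.
type Store struct {
	open map[alertKey]*Alert
}

func NewStore() *Store { return &Store{open: make(map[alertKey]*Alert)} }

// RaiseOrTouch returns true when a new alert was opened (the caller would
// fan out notifications) and false when an open alert was only touched.
func (s *Store) RaiseOrTouch(hostID, kind, dedupKey, msg string) bool {
	k := alertKey{HostID: hostID, Kind: kind, DedupKey: dedupKey}
	if a, ok := s.open[k]; ok {
		a.Message = msg
		a.Touches++
		return false
	}
	s.open[k] = &Alert{Message: msg}
	return true
}

// AutoResolve closes the open alert for exactly one key and reports
// whether there was one to close.
func (s *Store) AutoResolve(hostID, kind, dedupKey string) bool {
	k := alertKey{HostID: hostID, Kind: kind, DedupKey: dedupKey}
	if _, ok := s.open[k]; !ok {
		return false
	}
	delete(s.open, k)
	return true
}

// OpenCount returns the number of currently open alerts.
func (s *Store) OpenCount() int { return len(s.open) }

func main() {
	s := NewStore()
	fmt.Println(s.RaiseOrTouch("host-1", "backup_failed", "group-a", "exit 1"))       // true
	fmt.Println(s.RaiseOrTouch("host-1", "backup_failed", "group-b", "exit 3"))       // true
	fmt.Println(s.RaiseOrTouch("host-1", "backup_failed", "group-a", "exit 1 again")) // false
	s.AutoResolve("host-1", "backup_failed", "group-a")
	fmt.Println(s.OpenCount()) // 1
}
```

With the old two-part key, the second call would have returned false and overwritten group-a's message, which is the flapping single alert the commit message describes.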
@@ -0,0 +1,16 @@
-- 0015_jobs_source_group_id.sql
--
-- Add source_group_id to jobs so the alert engine can dedup
-- backup failures per source group rather than
-- collapsing every failed thing on a host onto one open alert per
-- kind. Backup jobs always have one set (each group is its own
-- restic invocation); forget/prune/check are repo-scoped and leave
-- it NULL.
--
-- Column-level ALTER is safe under foreign_keys=ON (CLAUDE.md). The
-- existing rebuild pattern in 0012 should not be repeated here.

ALTER TABLE jobs ADD COLUMN source_group_id TEXT
    REFERENCES source_groups(id) ON DELETE SET NULL;

CREATE INDEX jobs_source_group_id ON jobs(source_group_id);
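Because the column is nullable (repo-scoped jobs leave it NULL, and ON DELETE SET NULL can clear it later), whatever turns a finished job into an alert has to map a missing group to the empty key. A minimal sketch of that mapping follows; dedupKeyFor is a hypothetical helper, and "prune_failed" is an assumed kind name used only for illustration.

```go
package main

import "fmt"

// dedupKeyFor is a hypothetical helper: backup_failed alerts are scoped to
// the job's source group, every other kind keeps the empty key.
// A nil pointer models a NULL source_group_id.
func dedupKeyFor(kind string, sourceGroupID *string) string {
	if kind == "backup_failed" && sourceGroupID != nil {
		return *sourceGroupID
	}
	return ""
}

func main() {
	g := "group-a"
	fmt.Printf("%q\n", dedupKeyFor("backup_failed", &g))  // "group-a"
	fmt.Printf("%q\n", dedupKeyFor("backup_failed", nil)) // ""
	fmt.Printf("%q\n", dedupKeyFor("prune_failed", &g))   // "" (assumed kind name)
}
```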
@@ -0,0 +1,23 @@
-- 0016_alerts_dedup_key.sql
--
-- Widen the open-alert uniqueness key from (host_id, kind) to
-- (host_id, kind, dedup_key) so two distinct failing source groups
-- on the same host produce two open alerts instead of collapsing
-- onto one. dedup_key is the source_group_id for backup failures
-- and the empty string for forget/prune/check (repo-scoped) and
-- agent_offline / stale_schedule (one-per-host alerts).
--
-- The original alerts_open partial index keyed on host_id only.
-- That was a coarse "is this host happy?" lookup; we replace it
-- with a proper partial unique index that the dedup logic relies
-- on. NOT NULL DEFAULT '' so existing rows backfill cleanly.
--
-- Column-level ALTER is safe under foreign_keys=ON.

ALTER TABLE alerts ADD COLUMN dedup_key TEXT NOT NULL DEFAULT '';

DROP INDEX IF EXISTS alerts_open;

CREATE UNIQUE INDEX alerts_open_unique
    ON alerts(host_id, kind, dedup_key)
    WHERE resolved_at IS NULL;
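The WHERE resolved_at IS NULL clause means uniqueness is enforced only among open rows, so resolved history can pile up under the same key. The following is a rough in-memory model of that rule, not a database call: the row type, insertOpen, and the error value are all hypothetical names for this sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// row is a hypothetical stand-in for one record in the alerts table.
type row struct {
	HostID, Kind, DedupKey string
	Resolved               bool // true stands in for resolved_at IS NOT NULL
}

var errOpenDuplicate = errors.New("unique violation: open alert already exists for key")

// insertOpen appends a new row, checking uniqueness only against rows that
// are still open, which is what the partial index's WHERE clause does.
func insertOpen(table []row, r row) ([]row, error) {
	for _, existing := range table {
		if !existing.Resolved &&
			existing.HostID == r.HostID &&
			existing.Kind == r.Kind &&
			existing.DedupKey == r.DedupKey {
			return table, errOpenDuplicate
		}
	}
	return append(table, r), nil
}

func main() {
	var table []row
	var err error
	a := row{HostID: "host-1", Kind: "backup_failed", DedupKey: "group-a"}

	table, err = insertOpen(table, a)
	fmt.Println(len(table), err == nil) // 1 true

	table, err = insertOpen(table, a)
	fmt.Println(len(table), err == errOpenDuplicate) // 1 true

	// Once the first row is resolved, the same key may be opened again.
	table[0].Resolved = true
	table, err = insertOpen(table, a)
	fmt.Println(len(table), err == nil) // 2 true
}
```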