feat(alerts): per-source-group dedup so two failing backups produce two alerts

Until now the open-alert key was (host_id, kind, resolved_at IS NULL). A host with two failing source groups collapsed onto a single backup_failed row: the second failure bumped last_seen_at and overwrote the message but never fanned out again. Operators saw one alert that appeared to flap, not two distinct broken things.

Schema changes (column-level ALTER, no rebuild):
- 0015 jobs.source_group_id (FK → source_groups, ON DELETE SET NULL, index). Populated for backup jobs in CreateJob.
- 0016 alerts.dedup_key (NOT NULL DEFAULT ''). The old alerts_open partial index is dropped and replaced with a UNIQUE partial index on (host_id, kind, dedup_key) WHERE resolved_at IS NULL; the index is now the actual dedup primitive.

Plumbing:
- RaiseOrTouch / AutoResolve / Alert struct gain dedup_key.
- engine.JobFinishedEvent gains SourceGroupID; handleJobFinished passes it through for backup_failed only (forget/prune/check stay repo-scoped with key='').
- ws.handler reads SourceGroupID off the freshly loaded job row.
- dispatchJobWithPayload gains a *string sourceGroupID arg; the per-group Run-now path and the schedule.fire path pass &g.ID.

Test coverage: TestRaiseOrTouchDedupsPerSourceGroup proves that two distinct groups produce two distinct open alerts and that resolving one does not auto-resolve the other.

Dev tool: cmd/_fake_alert gains a -dedup-key flag.
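The dedup behaviour described above can be modelled in memory as a rough sketch. This is not the repository's code: the `store`, `alertKey`, and `openAlert` types and the lowercase `raiseOrTouch` / `autoResolve` methods are hypothetical stand-ins, and a Go map plays the role that the UNIQUE partial index plays in the database.

```go
package main

import "fmt"

// alertKey mirrors the columns of the UNIQUE partial index:
// (host_id, kind, dedup_key) over rows WHERE resolved_at IS NULL.
type alertKey struct {
	HostID   string
	Kind     string
	DedupKey string // source group ID for backup_failed, "" for repo-scoped kinds
}

// openAlert is a hypothetical, trimmed-down stand-in for the real Alert struct.
type openAlert struct {
	Message string
	Touches int // how many times the alert was raised or touched
}

// store models the set of open alerts; the map key enforces uniqueness
// the way the partial index does in the real schema.
type store struct {
	open map[alertKey]*openAlert
}

func newStore() *store { return &store{open: make(map[alertKey]*openAlert)} }

// raiseOrTouch returns true when a new open alert is created (the case that
// should fan out) and false when an existing open alert is only touched.
func (s *store) raiseOrTouch(k alertKey, msg string) bool {
	if a, ok := s.open[k]; ok {
		a.Message = msg
		a.Touches++
		return false
	}
	s.open[k] = &openAlert{Message: msg, Touches: 1}
	return true
}

// autoResolve closes only the alert matching the full key and reports
// whether such an alert was open.
func (s *store) autoResolve(k alertKey) bool {
	if _, ok := s.open[k]; !ok {
		return false
	}
	delete(s.open, k)
	return true
}

func main() {
	s := newStore()
	a := alertKey{HostID: "host-1", Kind: "backup_failed", DedupKey: "group-a"}
	b := alertKey{HostID: "host-1", Kind: "backup_failed", DedupKey: "group-b"}

	fmt.Println(s.raiseOrTouch(a, "group-a failed"))       // true: new alert
	fmt.Println(s.raiseOrTouch(b, "group-b failed"))       // true: second, distinct alert
	fmt.Println(s.raiseOrTouch(a, "group-a failed again")) // false: touched, no new alert
	s.autoResolve(a)
	fmt.Println(len(s.open)) // 1: group-b is still open
}
```

With the old key, which had no dedup_key component, `a` and `b` would have been the same map entry, which is the collapse this commit removes.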
@@ -43,7 +43,7 @@ func (s *Server) DispatchMaintenance(ctx context.Context, decisions []maintenanc
 			"host_id", d.HostID)
 		continue
 	}
-	_, _, code, msg := s.dispatchJobWithPayload(ctx, nil, d.HostID, api.JobForget, payload)
+	_, _, code, msg := s.dispatchJobWithPayload(ctx, nil, d.HostID, api.JobForget, nil, payload)
 	if code != "" {
 		slog.Warn("maintenance: forget dispatch failed",
 			"host_id", d.HostID, "code", code, "msg", msg)
@@ -65,14 +65,14 @@ func (s *Server) DispatchMaintenance(ctx context.Context, decisions []maintenanc
 		continue
 	}
 	payload := api.CommandRunPayload{RequiresAdminCreds: true}
-	_, _, code, msg := s.dispatchJobWithPayload(ctx, nil, d.HostID, api.JobPrune, payload)
+	_, _, code, msg := s.dispatchJobWithPayload(ctx, nil, d.HostID, api.JobPrune, nil, payload)
 	if code != "" {
 		slog.Warn("maintenance: prune dispatch failed",
 			"host_id", d.HostID, "code", code, "msg", msg)
 	}
 case "check":
 	payload := api.CommandRunPayload{Args: []string{strconv.Itoa(d.SubsetPct)}}
-	_, _, code, msg := s.dispatchJobWithPayload(ctx, nil, d.HostID, api.JobCheck, payload)
+	_, _, code, msg := s.dispatchJobWithPayload(ctx, nil, d.HostID, api.JobCheck, nil, payload)
 	if code != "" {
 		slog.Warn("maintenance: check dispatch failed",
 			"host_id", d.HostID, "code", code, "msg", msg)
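The nil added to each maintenance call is the new *string sourceGroupID argument: forget, prune, and check have no source group, so they stay repo-scoped. One way the pointer could map onto the dedup key is sketched below. `dedupKeyFor` is a hypothetical helper and `"prune_failed"` is an assumed kind name used only for illustration; the commit message confirms only that backup_failed carries the group ID and that other kinds use key=''.

```go
package main

import "fmt"

// dedupKeyFor is a hypothetical helper, not a function from this commit.
// It returns the source group ID for backup_failed alerts and "" for
// everything else, so repo-scoped kinds keep a single open alert per host.
func dedupKeyFor(kind string, sourceGroupID *string) string {
	if kind == "backup_failed" && sourceGroupID != nil {
		return *sourceGroupID
	}
	return ""
}

func main() {
	groupID := "group-a"

	// Per-group backup job: the pointer is set, so the key is the group ID.
	fmt.Printf("%q\n", dedupKeyFor("backup_failed", &groupID)) // "group-a"

	// Backup job dispatched without a group (nil pointer): falls back to "".
	fmt.Printf("%q\n", dedupKeyFor("backup_failed", nil)) // ""

	// Assumed repo-scoped kind: the group ID is ignored even if present.
	fmt.Printf("%q\n", dedupKeyFor("prune_failed", &groupID)) // ""
}
```

A pointer rather than a plain string lets callers distinguish "no source group" from an empty ID, which lines up with the nullable jobs.source_group_id column (ON DELETE SET NULL).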