feat(alerts): per-source-group dedup so two failing backups produce two alerts

Until now the open-alert key was (host_id, kind, resolved_at IS NULL).
A host with two failing source groups collapsed onto one
backup_failed row: the second failure bumped last_seen_at and
overwrote the message but never fanned out again. Operators saw one
alert that appeared to flap, not two distinct broken things.

Schema changes (column-level ALTER, no rebuild):

- 0015 jobs.source_group_id (FK → source_groups, ON DELETE SET NULL,
  index). Populated for backup jobs in CreateJob.
- 0016 alerts.dedup_key (NOT NULL DEFAULT ''). The old alerts_open
  partial index gets dropped and replaced with a UNIQUE partial
  index on (host_id, kind, dedup_key) WHERE resolved_at IS NULL —
  the index is now the actual dedup primitive.
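
A rough in-memory illustration of why the extra column matters. This
is not the store code: the oldKey/newKey/failure types and
countOpenAlerts are hypothetical names, and Go maps stand in for the
partial index over open rows.

```go
package main

import "fmt"

// oldKey models the previous open-alert identity: one open row per host and kind.
type oldKey struct {
	HostID string
	Kind   string
}

// newKey models the columns of the UNIQUE partial index added in 0016.
type newKey struct {
	HostID   string
	Kind     string
	DedupKey string // source group ID for backup_failed, "" for repo-scoped kinds
}

// failure is a hypothetical stand-in for one failed backup job.
type failure struct {
	HostID        string
	SourceGroupID string
}

// countOpenAlerts reports how many open backup_failed rows the given
// failures would occupy under the old key and under the new key.
func countOpenAlerts(failures []failure) (oldRows, newRows int) {
	oldOpen := map[oldKey]struct{}{}
	newOpen := map[newKey]struct{}{}
	for _, f := range failures {
		oldOpen[oldKey{f.HostID, "backup_failed"}] = struct{}{}
		newOpen[newKey{f.HostID, "backup_failed", f.SourceGroupID}] = struct{}{}
	}
	return len(oldOpen), len(newOpen)
}

func main() {
	oldRows, newRows := countOpenAlerts([]failure{
		{HostID: "host-1", SourceGroupID: "grp-a"},
		{HostID: "host-1", SourceGroupID: "grp-b"},
	})
	fmt.Println(oldRows, newRows) // prints "1 2"
}
```

Under the old key both failures land on the same row; with dedup_key
in the key they occupy two.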

Plumbing:

- RaiseOrTouch / AutoResolve / Alert struct gain dedup_key.
- engine.JobFinishedEvent gains SourceGroupID; handleJobFinished
  passes it through for backup_failed only (forget/prune/check stay
  repo-scoped with key='').
- ws.handler reads SourceGroupID off the freshly-loaded job row.
- dispatchJobWithPayload gains a *string sourceGroupID arg; the
  per-group Run-now path and schedule.fire path pass &g.ID.
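
The pass-through rule above could be sketched as follows. dedupKeyFor
is a hypothetical helper, not a function in this commit, and
"prune_failed" is an assumed alert kind name used only to show the
repo-scoped case.

```go
package main

import "fmt"

// dedupKeyFor is a hypothetical helper: it returns the source group ID
// for backup_failed alerts and "" for everything else, so repo-scoped
// kinds keep a single open alert per host and kind.
func dedupKeyFor(alertKind string, sourceGroupID *string) string {
	if alertKind == "backup_failed" && sourceGroupID != nil {
		return *sourceGroupID
	}
	return ""
}

func main() {
	g := "grp-a"
	fmt.Printf("%q\n", dedupKeyFor("backup_failed", &g))  // prints "grp-a"
	fmt.Printf("%q\n", dedupKeyFor("backup_failed", nil)) // prints ""
	fmt.Printf("%q\n", dedupKeyFor("prune_failed", &g))   // prints "" (assumed kind name)
}
```

A nil pointer falls back to '', which matches the column default, so
a backup job without a source group would still dedup per host and
kind under this sketch.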

Test coverage: TestRaiseOrTouchDedupsPerSourceGroup proves two
distinct groups produce two distinct open alerts and that resolving
one does not auto-resolve the other.
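
The behaviour the test asserts, modelled in memory. This is a sketch
only: memStore, its fields, and the method signatures are
assumptions, and the real RaiseOrTouch / AutoResolve operate on the
alerts table rather than a map.

```go
package main

import "fmt"

// key mirrors the columns of the unique partial index on open alerts.
type key struct{ HostID, Kind, DedupKey string }

// alert holds the mutable part of an open alert in this model.
type alert struct {
	Message string
	Touches int
}

// memStore is a hypothetical in-memory stand-in for the alert store.
// Only open alerts live in the map, like rows WHERE resolved_at IS NULL.
type memStore struct {
	open map[key]*alert
}

func newMemStore() *memStore { return &memStore{open: map[key]*alert{}} }

// RaiseOrTouch creates an open alert for the key, or touches the
// existing one. It returns true only when a new alert was created.
func (s *memStore) RaiseOrTouch(hostID, kind, dedupKey, msg string) bool {
	k := key{hostID, kind, dedupKey}
	if a, ok := s.open[k]; ok {
		a.Message = msg
		a.Touches++
		return false
	}
	s.open[k] = &alert{Message: msg, Touches: 1}
	return true
}

// AutoResolve closes the open alert with exactly this key, if any,
// and reports whether one was closed. Other keys are left alone.
func (s *memStore) AutoResolve(hostID, kind, dedupKey string) bool {
	k := key{hostID, kind, dedupKey}
	if _, ok := s.open[k]; !ok {
		return false
	}
	delete(s.open, k)
	return true
}

func main() {
	s := newMemStore()
	s.RaiseOrTouch("host-1", "backup_failed", "grp-a", "grp-a failed")
	s.RaiseOrTouch("host-1", "backup_failed", "grp-b", "grp-b failed")
	fmt.Println(len(s.open)) // prints 2

	s.AutoResolve("host-1", "backup_failed", "grp-a")
	fmt.Println(len(s.open)) // prints 1; grp-b is still open
}
```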

Dev tool: cmd/_fake_alert gains -dedup-key flag.
2026-05-04 22:58:29 +01:00
parent 9d7a714102
commit 350be3f19d
15 changed files with 214 additions and 95 deletions
+13 -8
@@ -65,7 +65,7 @@ func (s *Server) handleRunNow(w stdhttp.ResponseWriter, r *stdhttp.Request) {
func (s *Server) dispatchJob(ctx context.Context, user *store.User,
hostID string, kind api.JobKind, args []string,
) (res runNowResponse, status int, code, msg string) {
-return s.dispatchJobWithPayload(ctx, user, hostID, kind, api.CommandRunPayload{
+return s.dispatchJobWithPayload(ctx, user, hostID, kind, nil, api.CommandRunPayload{
Kind: kind,
Args: args,
})
@@ -75,8 +75,12 @@ func (s *Server) dispatchJob(ctx context.Context, user *store.User,
// fill in structured fields (Includes/Excludes/Tag/ForgetGroups/RequiresAdminCreds)
// — used by the per-source-group Run-now path. JobID is filled in
// here; callers leave it zero on the input payload.
+//
+// sourceGroupID is the dedup key the alert engine will key on for
+// backup_failed. Pass non-nil for backups; nil for prune/check/unlock
+// (those are repo-scoped and dedup at host_id only).
func (s *Server) dispatchJobWithPayload(ctx context.Context, user *store.User,
-hostID string, kind api.JobKind, payload api.CommandRunPayload,
+hostID string, kind api.JobKind, sourceGroupID *string, payload api.CommandRunPayload,
) (res runNowResponse, status int, code, msg string) {
if !validJobKind(kind) {
return res, stdhttp.StatusBadRequest, "invalid_kind",
@@ -100,12 +104,13 @@ func (s *Server) dispatchJobWithPayload(ctx context.Context, user *store.User,
actorID = &user.ID
}
if err := s.deps.Store.CreateJob(ctx, store.Job{
-ID: jobID,
-HostID: host.ID,
-Kind: string(kind),
-ActorKind: actor,
-ActorID: actorID,
-CreatedAt: now,
+ID: jobID,
+HostID: host.ID,
+Kind: string(kind),
+SourceGroupID: sourceGroupID,
+ActorKind: actor,
+ActorID: actorID,
+CreatedAt: now,
}); err != nil {
return res, stdhttp.StatusInternalServerError, "internal", ""
}