De-flake TestDrainPendingSerializesPerHost (CI stability) #33

2026-06-16T13:30:15+01:00

steve commented

2026-06-16 13:30:15 +01:00

Fixes the intermittent server-http CI failures that forced merging-past-red on recent PRs.

Root cause

-race reports no data race — the per-host drain mutex is correct. Under parallel CI load the test's WS conn could be dropped/unregistered, so DrainPending correctly no-ops when Hub.Conn(hostID) == nil. The test client stopped reading the socket during its concurrent burst (a real agent is always in a read loop), so under load the server-side conn keepalive lapsed → unregistered → drains no-op'd. The test then saw a partial/empty drain (observed both 1 job / 4 pending and 0 jobs / 5 pending) and failed its assertion of immediate, complete drainage — which production only guarantees eventually, via repeated drains (30s tick / on reconnect).

Fix

Keep the test client actively reading in a background goroutine → conn stays alive/registered (faithful to a real agent).
Drain to completion via condition polling instead of asserting one-shot completeness (mirrors how production drains repeatedly).
The exactly-5-jobs assertion (which actually proves the mutex prevents double-dispatch) is unchanged — the test isn't weakened.

No production code changed; test-only.

Verification

0 failures over 25 full-package runs under parallel load (previously flaked ~1-in-5 to 1-in-8).
0 failures + no data race over 4 -race iterations.

Note: the also-suspected TestRunPruneRefusesWithoutAdminCreds did not recur across any of those 29 runs; leaving it untouched rather than changing it without reproducible evidence.

Fixes the intermittent `server-http` CI failures that forced merging-past-red on recent PRs. ## Root cause `-race` reports **no data race** — the per-host drain mutex is correct. Under parallel CI load the test's WS conn could be dropped/unregistered, so `DrainPending` correctly no-ops when `Hub.Conn(hostID) == nil`. The test client stopped reading the socket during its concurrent burst (a real agent is *always* in a read loop), so under load the server-side conn keepalive lapsed → unregistered → drains no-op'd. The test then saw a partial/empty drain (observed both `1 job / 4 pending` and `0 jobs / 5 pending`) and failed its assertion of *immediate, complete* drainage — which production only guarantees *eventually*, via repeated drains (30s tick / on reconnect). ## Fix - Keep the test client actively reading in a background goroutine → conn stays alive/registered (faithful to a real agent). - Drain to completion via condition polling instead of asserting one-shot completeness (mirrors how production drains repeatedly). - The exactly-`5`-jobs assertion (which actually proves the mutex prevents double-dispatch) is unchanged — the test isn't weakened. No production code changed; test-only. ## Verification - 0 failures over **25** full-package runs under parallel load (previously flaked ~1-in-5 to 1-in-8). - 0 failures + no data race over **4 `-race` iterations**. > Note: the also-suspected `TestRunPruneRefusesWithoutAdminCreds` did not recur across any of those 29 runs; leaving it untouched rather than changing it without reproducible evidence.

steve added 1 commit 2026-06-16 13:30:16 +01:00

test(pending-drain): de-flake TestDrainPendingSerializesPerHost

CI / Test (store) (pull_request) Successful in 8s

Details

CI / Test (rest) (pull_request) Successful in 12s

Details

CI / Build (windows/amd64) (pull_request) Successful in 15s

Details

CI / Lint (pull_request) Successful in 19s

Details

CI / Build (linux/amd64) (pull_request) Successful in 12s

Details

CI / Build (linux/arm64) (pull_request) Successful in 44s

Details

CI / Test (server-http) (pull_request) Successful in 2m55s

Details

e2e / Playwright vs docker-compose (pull_request) Successful in 2m45s

Details

e64075d5d7

Keep the test WS client actively reading (a real agent always is) so
the server-side conn stays registered under parallel load, and drain to
completion via condition polling instead of asserting one-shot
completeness. The conn could be dropped/unregistered under CI load,
making DrainPending correctly no-op (conn==nil) and the test observe a
partial/empty drain. -race confirms no production data race; the
exactly-5-jobs assertion (proving the per-host mutex blocks
double-dispatch) is unchanged. Verified: 0 failures over 25 loaded runs
+ 4 -race iterations.

steve merged commit 6c6b962e24 into main

2026-06-16 15:44:47 +01:00

steve deleted branch fix-flaky-server-http-tests

2026-06-16 15:44:48 +01:00

Sign in to join this conversation.

1 Participants

Notifications

Due Date

No due date set.

Dependencies

No dependencies set.

Reference: steve/restic-manager#33