test: write-then-rename script-bin helpers (avoid ETXTBSY under -race)

CI run #48 failed with:

    --- FAIL: TestRunInitShipsStartedAndFinished
        RunInit: ... fork/exec /tmp/.../restic: text file busy

setupScript and setupScriptBin used os.WriteFile to write a shell script directly at the final path, then exec'd it. Under -race + many t.Parallel tests, a fork-from-another-goroutine could inherit the still-open writable fd from one of those WriteFile calls; the kernel returns ETXTBSY when the freshly-execed binary still has a writable fd anywhere on the system.

Fix: write to "<path>.tmp", then os.Rename into place. The rename is a pure dirent op; by the time the final path exists, no process has a writable fd on its inode and exec is safe.

-race + -count=5 on both runner packages now passes consistently.
2 changed files with 4 additions and 373 deletions
@@ -1,46 +1,3 @@
-# CI workflow — runs on every PR into main.
-#
-# Notes for anyone editing this file:
-#
-# Self-hosted runner expectations
-#   The Gitea runners are provisioned via scripts/provision-gitea-runner.sh.
-#   That script bind-mounts persistent host volumes for /root/go/pkg/mod
-#   (GOMODCACHE), /root/.cache/go-build (GOCACHE), and /root/.cache/act
-#   (action clones) into every job container. As a result:
-#     * `cache: true` on actions/setup-go is intentionally OMITTED — the
-#       action would otherwise tar/untar GOMODCACHE+GOCACHE through the
-#       Gitea cache backend on every job, undoing the host-volume cache
-#       and adding ~10s of redundant zstd round-trip per job.
-#     * Common GitHub actions (actions/checkout, actions/setup-go,
-#       actions/upload-artifact, golangci/golangci-lint-action) are
-#       pre-cloned into /root/.cache/act on the runner, so the per-job
-#       "git clone https://github.com/actions/..." step is a fetch, not
-#       a full clone.
-#     * golangci-lint is pre-installed at /usr/local/bin/golangci-lint
-#       on the runner (latest v2.x). The golangci-lint-action below
-#       still pins a specific version and re-downloads — that's fine
-#       (deterministic CI > marginal speed) but means the host-installed
-#       binary is currently unused. Drop the `version:` arg below to
-#       use the host-installed one if you want to trade determinism
-#       for speed.
-#
-# Build matrix
-#   Linux amd64 + arm64 + Windows amd64. CGO_ENABLED=0 throughout —
-#   modernc.org/sqlite is pure-Go so no cross-compile toolchain is
-#   needed. -trimpath + -ldflags="-s -w" for reproducible, smaller
-#   binaries.
-#
-# Go version
-#   The GO_VERSION env var anchors all three jobs. Floor is set by the
-#   heaviest dep (modernc.org/sqlite v1.50+ requires Go 1.23+ today;
-#   we run 1.25 so golangci-lint's Go-version compatibility check is
-#   happy — see the version pin in the lint job).
-#
-# upload-artifact
-#   Pinned at v3 historically; v3 was deprecated upstream. v4 should
-#   work but hasn't been validated against this runner's act_runner
-#   version yet. Bump when convenient.
-
 name: CI

 on:
@@ -48,6 +5,7 @@ on:
    branches: [main]

 env:
+  # Floor is set by the heaviest dep (modernc.org/sqlite v1.50+).
  GO_VERSION: "1.25"

 jobs:
@@ -59,7 +17,7 @@ jobs:
      - uses: actions/setup-go@v5
        with:
          go-version: ${{ env.GO_VERSION }}
-          # cache: true intentionally omitted — see header notes.
+          cache: true
      - name: go vet
        run: go vet ./...
      - name: go test
@@ -75,7 +33,7 @@ jobs:
      - uses: actions/setup-go@v5
        with:
          go-version: ${{ env.GO_VERSION }}
-          # cache: true intentionally omitted — see header notes.
+          cache: true
      - uses: golangci/golangci-lint-action@v7
        with:
          # Must be built against the same Go release as go.mod targets,
@@ -105,7 +63,7 @@ jobs:
      - uses: actions/setup-go@v5
        with:
          go-version: ${{ env.GO_VERSION }}
-          # cache: true intentionally omitted — see header notes.
+          cache: true
      - name: build server + agent
        env:
          GOOS: ${{ matrix.goos }}
@@ -1,327 +0,0 @@
-#!/usr/bin/env bash
-#
-# provision-gitea-runner.sh — one-shot, idempotent host setup for an
-# act_runner LXC. Speeds up Gitea Actions runs by:
-#
-#   1. Disabling forced docker pulls (image refresh moves to a cron).
-#   2. Mounting persistent host volumes for Go module/build caches and
-#      the act-actions clone cache.
-#   3. Pre-pulling the runner-images container image.
-#   4. Pre-cloning a configurable list of GitHub actions into the
-#      act cache so jobs don't fetch them on every run.
-#   5. Installing golangci-lint (latest v2.x) at /usr/local/bin.
-#   6. Setting up a nightly cron to refresh image + action clones +
-#      golangci-lint.
-#
-# The script is generic — no per-project state. Point it at any LXC
-# running act_runner as a systemd service and it will provision the
-# host. Re-runs are safe; they reconcile state.
-#
-# Usage:  sudo ./provision-gitea-runner.sh
-#
-# Configurable via environment variables (defaults shown):
-#
-#   CACHE_BASE=/var/cache/gitea-runner
-#   ACT_RUNNER_CONFIG=/etc/act_runner/config.yaml
-#   RUNNER_IMAGE=docker.gitea.com/runner-images:ubuntu-latest
-#   ACTIONS_TO_PRECLONE=(actions/checkout@v4 actions/setup-go@v5
-#                        actions/upload-artifact@v4
-#                        golangci/golangci-lint-action@v7)
-#
-# To add more pre-cloned actions later, edit /etc/cron.d/gitea-runner-refresh
-# (the ACTIONS list is materialised into the cron script).
-
-set -euo pipefail
-
-# ---------- defaults ---------------------------------------------------
-
-: "${CACHE_BASE:=/var/cache/gitea-runner}"
-: "${ACT_RUNNER_CONFIG:=/etc/act_runner/config.yaml}"
-: "${RUNNER_IMAGE:=docker.gitea.com/runner-images:ubuntu-latest}"
-
-DEFAULT_ACTIONS=(
-  "actions/checkout@v4"
-  "actions/setup-go@v5"
-  "actions/upload-artifact@v4"
-  "golangci/golangci-lint-action@v7"
-)
-# Allow caller to override by exporting ACTIONS_TO_PRECLONE as a
-# space-separated string (env vars can't carry arrays cleanly).
-if [[ -n "${ACTIONS_TO_PRECLONE:-}" ]]; then
-  read -r -a ACTIONS <<<"${ACTIONS_TO_PRECLONE}"
-else
-  ACTIONS=("${DEFAULT_ACTIONS[@]}")
-fi
-
-# ---------- helpers ----------------------------------------------------
-
-log()  { printf '\033[1;36m==>\033[0m %s\n' "$*"; }
-warn() { printf '\033[1;33m==>\033[0m %s\n' "$*" >&2; }
-die()  { printf '\033[1;31m==>\033[0m %s\n' "$*" >&2; exit 1; }
-
-require_cmd() {
-  command -v "$1" >/dev/null 2>&1 || die "missing: $1 (install it first)"
-}
-
-# sha256_url <url> — act_runner names its action-clone dirs after
-# sha256(URL). Verified against a real run log:
-#   url=https://github.com/actions/checkout
-#   sha256=c3fe249fe73091a17d6638fe1341e7bd0bcc3466ce52323c0688e83e2463a4ab
-sha256_url() {
-  printf '%s' "$1" | sha256sum | awk '{print $1}'
-}
-
-# ---------- pre-flight -------------------------------------------------
-
-[[ $EUID -eq 0 ]] || die "run as root (the act_runner service writes /var/lib/act_runner as root)"
-
-require_cmd systemctl
-require_cmd docker
-require_cmd git
-require_cmd curl
-require_cmd python3
-
-# PyYAML for the config edit. Install if missing — Ubuntu 24.04 ships
-# python3-yaml in the default repos.
-if ! python3 -c 'import yaml' 2>/dev/null; then
-  log "installing python3-yaml (needed for safe YAML edits)"
-  apt-get update -qq
-  apt-get install -y -qq python3-yaml
-fi
-
-[[ -f "$ACT_RUNNER_CONFIG" ]] || die "$ACT_RUNNER_CONFIG not found — is act_runner installed?"
-
-systemctl list-unit-files act_runner.service >/dev/null 2>&1 || \
-  die "act_runner.service not found — register the runner first"
-
-log "pre-flight OK"
-log "  cache base       : $CACHE_BASE"
-log "  config file      : $ACT_RUNNER_CONFIG"
-log "  runner image     : $RUNNER_IMAGE"
-log "  actions to clone : ${ACTIONS[*]}"
-
-# ---------- 1. cache directories ---------------------------------------
-
-log "creating cache directories under $CACHE_BASE"
-for sub in go-mod go-build act-actions; do
-  install -d -m 0755 -o root -g root "$CACHE_BASE/$sub"
-done
-
-# ---------- 2. edit /etc/act_runner/config.yaml ------------------------
-#
-# Three keys are reconciled to known values:
-#
-#   container.force_pull   : false  (we keep the image fresh via cron)
-#   container.options      : "-v <cache mounts...>"  (auto-mount caches
-#                             into every job container)
-#   container.valid_volumes: [<our cache paths>]  (whitelist so the
-#                             container.options mounts are accepted)
-#
-# Other keys are preserved verbatim. The edit is idempotent: re-running
-# yields the same file content.
-
-log "patching $ACT_RUNNER_CONFIG"
-
-# Backup once (only if no .pre-provision backup exists yet).
-if [[ ! -f "${ACT_RUNNER_CONFIG}.pre-provision" ]]; then
-  cp -p "$ACT_RUNNER_CONFIG" "${ACT_RUNNER_CONFIG}.pre-provision"
-  log "  saved pristine copy to ${ACT_RUNNER_CONFIG}.pre-provision"
-fi
-
-CONTAINER_OPTIONS_VALUE="-v ${CACHE_BASE}/go-mod:/root/go/pkg/mod:rw -v ${CACHE_BASE}/go-build:/root/.cache/go-build:rw -v ${CACHE_BASE}/act-actions:/root/.cache/act:rw"
-
-CACHE_BASE="$CACHE_BASE" CONTAINER_OPTIONS_VALUE="$CONTAINER_OPTIONS_VALUE" \
-ACT_RUNNER_CONFIG="$ACT_RUNNER_CONFIG" \
-python3 - <<'PY'
-import os, sys, yaml
-cfg_path = os.environ['ACT_RUNNER_CONFIG']
-cache_base = os.environ['CACHE_BASE']
-container_options = os.environ['CONTAINER_OPTIONS_VALUE']
-
-with open(cfg_path) as f:
-    cfg = yaml.safe_load(f) or {}
-
-cfg.setdefault('container', {})
-cfg['container']['force_pull'] = False
-cfg['container']['options'] = container_options
-
-# Whitelist every cache subdir explicitly so jobs that try to bind-mount
-# them via workflow-side `volumes:` (rare but possible) are accepted.
-desired_vols = [
-    f"{cache_base}/go-mod",
-    f"{cache_base}/go-build",
-    f"{cache_base}/act-actions",
-]
-existing = cfg['container'].get('valid_volumes') or []
-merged = list(dict.fromkeys(existing + desired_vols))  # de-dup, preserve order
-cfg['container']['valid_volumes'] = merged
-
-# Write back with stable formatting. yaml.dump preserves enough
-# structure for act_runner to parse; comments in the original config
-# do get stripped — that's why we preserve the .pre-provision backup.
-with open(cfg_path + '.tmp', 'w') as f:
-    yaml.safe_dump(cfg, f, default_flow_style=False, sort_keys=False)
-os.replace(cfg_path + '.tmp', cfg_path)
-print(f"  container.force_pull   : false")
-print(f"  container.options      : {container_options}")
-print(f"  container.valid_volumes: {merged}")
-PY
-
-# ---------- 3. pre-pull the runner image -------------------------------
-
-log "pulling $RUNNER_IMAGE (one-time; cron refreshes it nightly)"
-docker pull "$RUNNER_IMAGE"
-
-# ---------- 4. pre-clone the actions list ------------------------------
-#
-# act_runner expects clones at $cache/<sha256(url)> with the ref already
-# checked out. We clone the default branch then fetch + check out the
-# requested ref. Re-running fetches updates rather than re-cloning.
-
-log "pre-cloning actions into $CACHE_BASE/act-actions"
-for spec in "${ACTIONS[@]}"; do
-  if [[ "$spec" != *@* ]]; then
-    warn "  skip '$spec' — must be owner/repo@ref"
-    continue
-  fi
-  repo="${spec%@*}"
-  ref="${spec##*@}"
-  url="https://github.com/${repo}"
-  dir="${CACHE_BASE}/act-actions/$(sha256_url "$url")"
-
-  if [[ -d "$dir/.git" ]]; then
-    log "  refresh $repo @ $ref"
-    git -C "$dir" fetch --quiet --tags --prune origin
-  else
-    log "  clone   $repo @ $ref → $dir"
-    git clone --quiet "$url" "$dir"
-  fi
-  # Detach onto the requested ref. Works for branches, tags, and SHAs.
-  if ! git -C "$dir" -c advice.detachedHead=false checkout --quiet "$ref" 2>/dev/null; then
-    # If `ref` is a remote branch we haven't tracked yet, try origin/<ref>.
-    git -C "$dir" -c advice.detachedHead=false checkout --quiet "origin/$ref"
-  fi
-done
-
-# ---------- 5. golangci-lint -------------------------------------------
-#
-# Install the latest v2.x at /usr/local/bin/golangci-lint. Workflows
-# that pin a specific version via the action's `version:` arg will
-# still re-download — but jobs that don't pin (or pin to "latest"/"v2")
-# get the host-installed binary for free.
-
-log "installing/updating golangci-lint (latest v2.x) → /usr/local/bin"
-GOLANGCI_INSTALL_URL="https://raw.githubusercontent.com/golangci/golangci-lint/HEAD/install.sh"
-# `-b` = install dir, `-d` = quiet "downloading" lines, no version arg
-# means "latest" — which install.sh resolves to the latest v2 release
-# from GitHub releases.
-curl -fsSL "$GOLANGCI_INSTALL_URL" | sh -s -- -b /usr/local/bin >/dev/null
-/usr/local/bin/golangci-lint --version || warn "golangci-lint install verification failed"
-
-# ---------- 6. nightly refresh cron ------------------------------------
-#
-# Re-pulls the runner image, refreshes the action clones, and updates
-# golangci-lint. Runs at 03:17 to dodge top-of-hour CI bursts.
-
-CRON_PATH=/etc/cron.d/gitea-runner-refresh
-REFRESH_SCRIPT=/usr/local/sbin/gitea-runner-refresh
-
-log "writing $REFRESH_SCRIPT and $CRON_PATH"
-
-# Materialise the actions list into the script so the cron is
-# self-contained and surviving an edit to this file.
-ACTIONS_LITERAL=""
-for s in "${ACTIONS[@]}"; do
-  ACTIONS_LITERAL="${ACTIONS_LITERAL}  \"$s\"\n"
-done
-
-cat >"$REFRESH_SCRIPT" <<EOF
-#!/usr/bin/env bash
-# Auto-generated by provision-gitea-runner.sh. Re-running the
-# provisioning script regenerates this file.
-set -euo pipefail
-CACHE_BASE="$CACHE_BASE"
-RUNNER_IMAGE="$RUNNER_IMAGE"
-ACTIONS=(
-$(printf '  "%s"\n' "${ACTIONS[@]}")
-)
-
-sha256_url() { printf '%s' "\$1" | sha256sum | awk '{print \$1}'; }
-
-# 1. Refresh the runner-images base.
-docker pull -q "\$RUNNER_IMAGE" >/dev/null
-
-# 2. Refresh action clones.
-for spec in "\${ACTIONS[@]}"; do
-  [[ "\$spec" == *@* ]] || continue
-  repo="\${spec%@*}"; ref="\${spec##*@}"
-  url="https://github.com/\$repo"
-  dir="\$CACHE_BASE/act-actions/\$(sha256_url "\$url")"
-  if [[ -d "\$dir/.git" ]]; then
-    git -C "\$dir" fetch --quiet --tags --prune origin || true
-    git -C "\$dir" -c advice.detachedHead=false checkout --quiet "\$ref" 2>/dev/null \\
-      || git -C "\$dir" -c advice.detachedHead=false checkout --quiet "origin/\$ref" || true
-  fi
-done
-
-# 3. Refresh golangci-lint (latest v2.x). Tolerate transient
-#    GitHub-rate-limit failures — next night will retry.
-curl -fsSL https://raw.githubusercontent.com/golangci/golangci-lint/HEAD/install.sh \\
-  | sh -s -- -b /usr/local/bin >/dev/null 2>&1 || true
-EOF
-chmod 0755 "$REFRESH_SCRIPT"
-
-cat >"$CRON_PATH" <<EOF
-# Auto-generated by provision-gitea-runner.sh. Refreshes the runner
-# image, action clones, and golangci-lint every night at 03:17.
-SHELL=/bin/bash
-PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
-17 3 * * * root $REFRESH_SCRIPT >> /var/log/gitea-runner-refresh.log 2>&1
-EOF
-chmod 0644 "$CRON_PATH"
-
-# ---------- 7. restart act_runner --------------------------------------
-
-log "restarting act_runner.service to pick up the new config"
-systemctl restart act_runner.service
-sleep 2
-systemctl is-active --quiet act_runner.service \
-  || die "act_runner did not come back up — check 'journalctl -u act_runner -n 50'"
-
-# ---------- 8. container-create benchmark ------------------------------
-#
-# Reports cold + warm `docker run --rm <image> true` time. Sanity check
-# that overlay setup is fast on this host. Numbers > ~5s indicate a
-# slow filesystem or DNS issue worth investigating separately.
-
-log "benchmark: docker run --rm $RUNNER_IMAGE true"
-{
-  printf '  cold (post-pull) : '
-  /usr/bin/time -f '%e s' docker run --rm "$RUNNER_IMAGE" true 2>&1 | tail -1
-  printf '  warm (immediate) : '
-  /usr/bin/time -f '%e s' docker run --rm "$RUNNER_IMAGE" true 2>&1 | tail -1
-} || warn "benchmark failed — non-fatal"
-
-# ---------- done -------------------------------------------------------
-
-cat <<EOF
-
-\033[1;32m==> Provisioning complete\033[0m
-
-What changed on this host:
-  * /etc/act_runner/config.yaml — force_pull off, container.options +
-    valid_volumes set for the cache mounts.  Pristine copy preserved
-    at ${ACT_RUNNER_CONFIG}.pre-provision.
-  * $CACHE_BASE/{go-mod,go-build,act-actions} — persistent caches.
-  * /usr/local/bin/golangci-lint — latest v2.x.
-  * $REFRESH_SCRIPT and $CRON_PATH — nightly refresh @ 03:17.
-  * Runner image pre-pulled.
-
-\033[1;33mNote on Go cache + setup-go:\033[0m if your workflow uses
-\`actions/setup-go\` with \`cache: true\`, the action will still tar/untar
-the cache via the Gitea cache backend on every job — partially
-defeating the persistent volume.  For full speed-up, drop \`cache: true\`
-from the workflow once the persistent volume is warm.  Per-project
-decision; this script doesn't touch workflows.
-
-EOF
Author	SHA1	Message	Date
steve	aa80a3418e	test: write-then-rename script-bin helpers (avoid ETXTBSY under -race). CI (pull_request): Test (linux/amd64) successful in 48s; Lint successful in 23s; Build (linux/amd64) successful in 17s; Build (linux/arm64) successful in 18s; Build (windows/amd64) failing after 15m8s. CI run #48 failed with: --- FAIL: TestRunInitShipsStartedAndFinished RunInit: ... fork/exec /tmp/.../restic: text file busy. setupScript and setupScriptBin used os.WriteFile to write a shell script directly at the final path, then exec'd it. Under -race + many t.Parallel tests, a fork-from-another-goroutine could inherit the still-open writable fd from one of those WriteFile calls; the kernel returns ETXTBSY when the freshly-execed binary still has a writable fd anywhere on the system. Fix: write to "<path>.tmp", then os.Rename into place. The rename is a pure dirent op; by the time the final path exists, no process has a writable fd on its inode and exec is safe. -race + -count=5 on both runner packages now passes consistently.	2026-05-04 09:43:27 +01:00
steve	ac9d7b92ed	api+agent: document protocol-version stability and forget back-compat decisions. CI (pull_request): Test (linux/amd64) failing after 1m16s; Lint successful in 23s; Build (windows/amd64) successful in 23s; Build (linux/amd64) successful in 25s; Build (linux/arm64) successful in 22s. version.go: add a comment block explaining why Phase 5's wire changes (CommandRunPayload, ConfigUpdatePayload, RepoStatsPayload reshapes) did not bump CurrentProtocolVersion: lockstep deploy, no rolling-upgrade path, smoke env restage enforces it. Notes where a version bump to 2 would be required if a multi-version path is ever introduced. cmd/agent/main.go: document why the JobForget handler hard-errors on empty ForgetGroups rather than falling back to a single-policy form. The maintenance ticker is the only writer and always populates the field; the fallback was specced but skipped given lockstep deploy.	2026-05-04 00:33:51 +01:00
steve	556d65d77f	server: serialize DrainPending per host (avoid drain double-dispatch) Add a per-host drain mutex (drainLocks map guarded by drainLocksMu) on the Server struct. DrainPending acquires it with TryLock: if a drain is already in-flight for this host, the call returns immediately — the running drain will see every pending row. This prevents the on-hello goroutine and the 30s tick from both listing the same host's rows and dispatching them twice. Update three existing tests that called srv.DrainPending explicitly after the on-hello goroutine had already been spawned: replace the now-redundant direct call with a waitForPendingCount poll so they don't race the goroutine's mutex ownership. Add TestDrainPendingSerializesPerHost which fires 10 concurrent DrainPending goroutines against a 5-row queue and asserts exactly 5 job rows result.	2026-05-04 00:33:13 +01:00
steve	7ee8d2311b	store: LatestJobByKind includes in-flight jobs (avoid maintenance double-fire) Widen the SQL query to consider all statuses (queued, running, succeeded, failed, cancelled) rather than terminal-only. An in-flight prune that outlasts the 60s tick interval previously produced ErrNotFound, causing the ticker to anchor at now-24h and fire a second prune concurrently with the first. Update the doc comment and test: remove the "queued job filtered out" case, add assertions that a running job and a queued job are each returned as the latest.	2026-05-04 00:29:52 +01:00
steve	e4dd7f96d6	tasks: tick P2R-03 through P2R-08 done	2026-05-04 00:20:19 +01:00
steve	6afde3ce23	diag: phase 5 Playwright sweep screenshots	2026-05-04 00:19:50 +01:00
steve	775e988340	server/ws: persist repo.stats into host_repo_stats	2026-05-04 00:10:41 +01:00
steve	e2cf9a68f6	server: drainer abandons only on ErrNotFound, not transient errors GetSourceGroup errors in drainOne now gate on errors.Is(err, store.ErrNotFound) before calling abandonPending, mirroring the existing GetSchedule pattern. Transient errors (SQLITE_BUSY, context cancellation) now log a warning and return without deleting the row. Add regression test TestDrainPendingDropsRowsForGoneSourceGroup confirming the ErrNotFound path still abandons correctly. Also add a comment above the backoff-doubling loop explaining the progression.	2026-05-04 00:07:33 +01:00
steve	1e212db24e	server: drainer uses dispatch-core to avoid duplicate pending_run enqueue Extract dispatchBackupForGroupCore (persist+marshal+send, no enqueue on failure) from dispatchBackupForGroup. drainOne now calls the core directly so a failed Send only bumps the existing pending_runs row via BumpPendingRunAttempt rather than creating a second row, stopping the geometric duplication on repeated drain failures. dispatchBackupForGroup (schedule.fire path) wraps the core and keeps its enqueue-on-failure behaviour unchanged. TestDrainPendingBumpsOnSendFailure strengthened: asserts exactly 1 row remains after a send failure (it previously tolerated one or more duplicate rows).	2026-05-04 00:01:42 +01:00
steve	a9185424d3	server: drain pending_runs on tick + on agent reconnect Two trigger paths land here: - A 30s ticker in cmd/server calls Server.DrainAllDue(ctx). It walks pending_runs rows whose next_attempt_at <= now, dedupes by host, skips offline hosts, and per online host runs DrainPending. - onAgentHello spawns a background DrainPending(hostID). When a host comes back, every pending row for it is dispatchable now — due-ness becomes irrelevant once the wire is back. Each row's schedule + group are reloaded; ErrNotFound or disabled-schedule or gone-group abandons the row with a pending_run.abandoned audit. attempt >= retry_max also abandons. Otherwise dispatchBackupForGroup is invoked; success deletes the row, failure bumps attempt with exponential backoff capped at 30m.	2026-05-03 23:57:08 +01:00
steve	9c5037ec54	server: enqueue pending_runs when scheduled-job dispatch fails When dispatchBackupForGroup's conn.Send errors, queue a pending_runs row (attempt=1, next_attempt_at = now + group.RetryBackoffSeconds) instead of silently dropping the fire. The orphaned queued job row is left behind for forensic visibility — the drainer will create a fresh job row on its retry. Also adds Store.ListPendingRunsForHost — the on-reconnect drain walks every row for the host, regardless of due-ness, since the host being back makes 'due' irrelevant.	2026-05-03 23:53:57 +01:00
steve	ed839aacb4	server: fix stale RetentionPolicy comment + check Scan errors in maintenance test	2026-05-03 23:50:05 +01:00
steve	7d2c2ae1c2	server: maintenance ticker drives forget/prune/check on cadence Wires a 60s server-side ticker to the pure-logic maintenance.Decide introduced in the previous commit. Decisions flow through a new DispatchMaintenance method on Server, which: - skips offline hosts (no pending_runs queueing; maintenance is not a backup, so missed fires shouldn't pile up) - silently skips prune when admin creds aren't bound - pushes admin creds before prune, then dispatches with RequiresAdminCreds=true (same as operator-driven prune) - persists job rows with actor_kind="system" Reshapes the forget wire payload from a single RetentionPolicy to a ForgetGroups list (one tag plus per-group keep-* flags per source group). The agent walks the groups and runs `restic forget --tag <name> --keep-*` once per group. Dead code removed: CommandRunPayload.RetentionPolicy, the old forget JSON-decode in cmd/agent, and the single-policy form of restic.RunForget.	2026-05-03 23:40:35 +01:00
steve	a131419b1a	maintenance: pure-logic ticker decides forget/prune/check fires	2026-05-03 23:36:48 +01:00
steve	9d727a7b3a	ui: hx-swap none on Run-now + truthful save banner + tailwind rebuild Add hx-swap="none" to the three Run-now buttons (check/prune/unlock) in host_repo.html to match the existing pattern on host_sources.html and host_schedules.html. Fix all-blank admin-credentials save to redirect without ?saved= query string so no false-positive banner is shown; strengthen the corresponding test to assert Location has no ?saved=. Rebuild CSS bundle via Tailwind to pick up max-w-[640px] JIT class.	2026-05-03 23:29:01 +01:00
steve	1cbc856514	ui: Slice E — admin creds form + run-now buttons + repo health panel - hostRepoPage gains AdminURL/AdminUsername/HasAdminPassword, Online, and StatsView (pre-dereferenced projection of host_repo_stats). - loadHostRepoPage loads the admin slot (tolerating ErrNotFound), hub.Connected, and stats (tolerating ErrNotFound). - renderRepoPage gains an adminErr parameter; all callers updated. - handleUIAdminCredentialsSave / handleUIAdminCredentialsDelete added (form-POST handlers mirroring the repo-creds pattern, with audit). - Routes /hosts/{id}/admin-credentials POST and /delete POST registered. - Template: Admin credentials form after Connection, Run-now HTMX buttons after Maintenance, Repo health stats panel in right rail. - Tests: 9 new tests covering rendering, disabled states, save/delete round-trips, audit rows, and idempotent delete.	2026-05-03 23:18:16 +01:00
steve	fb24e42c6e	server: populate audit UserID on credential mutations + slog prune push errors Switch handleSetHostCredentials, handleSetAdminCredentials, and handleDeleteAdminCredentials from authedUser (bool) to requireUser (*store.User) so AuditEntry.UserID and Actor are populated correctly. Add slog.Warn on the non-ErrNotFound pushAdminCredsToAgent path in handleRunRepoPrune so decrypt/send failures surface in the server log rather than appearing as a generic host_offline 503.	2026-05-03 23:09:09 +01:00
steve	a899cc2d04	server: cover HTMX auth-redirect path in repo-ops tests	2026-05-03 23:00:38 +01:00
steve	ef2a30a82d	server: HTTP run-now for prune / check / unlock Adds POST /api/hosts/{id}/repo/{prune,check,unlock} (and matching outer routes for HTMX form posts). Prune pushes the admin-cred slot via pushAdminCredsToAgent before dispatch and refuses with admin_creds_required when the slot is not set. Check reads check_subset_pct from host_repo_maintenance (overridable via ?subset=N, clamped 0-100; non-numeric override falls back to DB value silently). Unlock needs no admin creds. All three share the same wantsHTML/HX-Redirect response split as the per-source-group run-now endpoint.	2026-05-03 22:57:07 +01:00
steve	e7960151fb	server: admin-credentials REST + Slot:admin push helper Adds GET/PUT/DELETE /api/hosts/{id}/admin-credentials handlers that mirror the existing repo-credentials endpoints but write to store.CredKindAdmin with AEAD additional-data "host:<id>:admin" (scoped away from the repo slot to prevent cross-binding). PUT immediately pushes a config.update(Slot:"admin") to the agent when it is connected, and the new pushAdminCredsToAgent helper is wired for use by the upcoming prune run-now endpoint (D2) to push on-demand before dispatch.	2026-05-03 22:55:09 +01:00
steve	2d27a23e99	agent: secrets fail-loud on corrupt blob + small polish Save and SaveAdmin now propagate loadBundle errors instead of silently overwriting a corrupt file (data-loss fix). Tests added for both paths. reportStats logs a Debug on RunStats failure; r in runJob gets a comment explaining the prune-runner asymmetry; runner_test comment tightened.	2026-05-03 22:49:12 +01:00
steve	9d9773cad4	agent/runner: ship repo.stats before job.finished in RunCheck/RunUnlock RunCheck and RunUnlock were calling sendFinished before reportStats, inverting the required job.started → log.stream → repo.stats → job.finished envelope order. Move reportStats ahead of sendFinished in both functions to match the pattern already correct in RunPrune. Strengthen TestRunCheckShipsCheckStatus, TestRunCheckErrorsFoundShipsErrorsStatus, and TestRunUnlockClearsLock with the same position-index ordering assertions used by TestRunPruneShipsExpectedEnvelopes; these assertions would have failed against the pre-fix code.	2026-05-03 22:43:57 +01:00
steve	b3d033fa11	agent: RunPrune/RunCheck/RunUnlock + reportStats + admin-cred slot dispatch Extract resticEnv/sendStarted/streamHandler/sendFinished helpers to remove boilerplate duplication across Run* methods. Add RunPrune (ships repo.stats with LastPruneAt before job.finished), RunCheck (ships stats with LastCheckStatus/LockPresent regardless of outcome), RunUnlock (ships LockPresent=false on success), and reportStats (fills size fields via RunStats when caller didn't populate them). Wire JobPrune/JobCheck/JobUnlock into the dispatcher switch; teach MsgConfigUpdate about the Slot discriminator for admin vs repo creds; add strconv import for subset-pct parsing.	2026-05-03 22:39:23 +01:00
steve	b7033fcfcd	agent/secrets: separate admin slot with backwards-compatible decode Split the on-disk bundle into repo + admin slots. Legacy flat Repo blobs are detected at load time by the presence of "repo_url" at the top level and transparently promoted into the new shape on the next Save/SaveAdmin. Adds ErrNoAdmin sentinel, LoadAdmin, SaveAdmin, and three new tests.	2026-05-03 22:34:33 +01:00
steve	b1b0d9d1e9	api: stats partial-update payload + ConfigUpdate.Slot + CommandRun.RequiresAdminCreds Reshape RepoStatsPayload into pointer-field partial-update form matching store.HostRepoStats semantics; add Slot discriminator to ConfigUpdatePayload for admin vs repo credential routing; add RequiresAdminCreds flag to CommandRunPayload for prune/unlock jobs that need delete authority.	2026-05-03 22:33:12 +01:00
steve	f711593549	restic: tighten RunCheck lock sniff + RunStats zero-snapshot test Narrow the LockPresent predicate from bare "locked" (too broad) to "stale lock" and "already locked" — the two phrases restic actually emits. Replace TestRunCheckParsesLock with table-driven TestRunCheckLockSniff covering both trigger phrases and a benign "locked-file" line that must not set LockPresent. Add TestRunStatsZeroSnapshots to pin that RunStats accepts zero-snapshot JSON without error.	2026-05-03 22:29:09 +01:00
steve	dfc2cd314d	restic: RunUnlock + RunStats (raw-data mode) Add RunUnlock (delegates straight to runWithPump) and RunStats which runs `restic stats --json --mode raw-data`, captures the single JSON line from stdout into RepoStats, and returns an error if no JSON arrives. Tests cover arg plumbing for unlock, JSON parsing, and the no-JSON error path.	2026-05-03 22:22:19 +01:00
steve	5e2b88c6dd	restic: RunCheck with subset% + lock-state sniffing Add CheckResult (LockPresent, ErrorsFound) and RunCheck. subsetPct>0 passes --read-data-subset N% to limit data reads. Stderr is sniffed for "Found stale lock"/"locked" to set LockPresent; a non-zero exit from restic is absorbed as ErrorsFound=true rather than an error so the caller can always persist last_check_status. Tests cover lock detection, exit-1 absorption, and subset-arg plumbing.	2026-05-03 22:21:48 +01:00
steve	768972d870	restic: RunPrune + runWithPump helper, refactor Forget/Init onto it Add RunPrune for admin-credential prune invocations. Extract runWithPump to DRY the stdout+stderr pump pattern; refactor RunForget and RunInit to delegate to it (RunInit preserves the "config file already exists" soft-success sniff by wrapping the handler before the call). Add runner_test.go with TestRunPruneInvokesPrune.	2026-05-03 22:20:48 +01:00
steve	82a73fad85	store: tighten CHECK constraint on host_repo_stats.last_check_status	2026-05-03 22:15:57 +01:00
steve	26bb881c12	store: wrap UpsertHostRepoStats in a transaction (concurrency safety)	2026-05-03 22:15:35 +01:00
steve	3873bd9d34	store: assert CHECK constraint on host_credentials.kind	2026-05-03 22:10:29 +01:00
steve	1bb31b9c49	store: HostRepoStats projection (size, lock, last-check, last-prune)	2026-05-03 22:07:24 +01:00
steve	4985050a0a	store: host_credentials becomes kind-aware (repo + admin slots)	2026-05-03 22:06:05 +01:00
steve	1c7b471e75	store: migration 0009 — admin-creds kind + host_repo_stats	2026-05-03 22:05:53 +01:00
steve	88216d29d0	plan: P2 redesign Phase 5 (P2R-03..P2R-08)	2026-05-03 21:59:40 +01:00
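
The per-host drain serialization described in commit 556d65d77f (a `drainLocks` map guarded by `drainLocksMu`, acquired with `TryLock`) could look roughly like the sketch below. Only the two field names and the use of `TryLock` come from the commit message; the `Server` type here is a stripped-down stand-in, and `hostLock`, the `drain` callback parameter, and the boolean return value are assumptions made so the example is self-contained. `sync.Mutex.TryLock` requires Go 1.18 or later.

```go
package main

import (
	"fmt"
	"sync"
)

// Server is a hypothetical stand-in holding only the two fields the
// commit message names; the real struct has many more.
type Server struct {
	drainLocksMu sync.Mutex             // guards the drainLocks map itself
	drainLocks   map[string]*sync.Mutex // one mutex per host ID
}

// hostLock returns the mutex for hostID, creating it on first use.
// (Assumed helper; the commit message does not say how entries are created.)
func (s *Server) hostLock(hostID string) *sync.Mutex {
	s.drainLocksMu.Lock()
	defer s.drainLocksMu.Unlock()
	if s.drainLocks == nil {
		s.drainLocks = make(map[string]*sync.Mutex)
	}
	mu, ok := s.drainLocks[hostID]
	if !ok {
		mu = &sync.Mutex{}
		s.drainLocks[hostID] = mu
	}
	return mu
}

// DrainPending runs drain for hostID unless a drain for that host is
// already in flight, in which case it returns false immediately and
// leaves the work to the running drain.
func (s *Server) DrainPending(hostID string, drain func()) bool {
	mu := s.hostLock(hostID)
	if !mu.TryLock() {
		return false
	}
	defer mu.Unlock()
	drain()
	return true
}

func main() {
	srv := &Server{}
	srv.DrainPending("host-a", func() {
		// While host-a is draining, a second drain for host-a is skipped,
		// but a drain for a different host proceeds.
		fmt.Println("same host while draining:", srv.DrainPending("host-a", func() {}))
		fmt.Println("other host while draining:", srv.DrainPending("host-b", func() {}))
	})
	// Once the first drain has returned, host-a can be drained again.
	fmt.Println("same host afterwards:", srv.DrainPending("host-a", func() {}))
	// Prints:
	//   same host while draining: false
	//   other host while draining: true
	//   same host afterwards: true
}
```

Returning early instead of blocking matches the rationale in the commit message: the drain already in flight will see every pending row, so a second caller has nothing left to do.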