test: write-then-rename script-bin helpers (avoid ETXTBSY under -race)

CI run #48 failed with:

    --- FAIL: TestRunInitShipsStartedAndFinished
        RunInit: ... fork/exec /tmp/.../restic: text file busy

setupScript and setupScriptBin used os.WriteFile to write a shell script
directly at the final path, then exec'd it. Under -race with many
t.Parallel tests, a fork from another goroutine could inherit the
still-open writable fd from one of those WriteFile calls; execve fails
with ETXTBSY while any process still holds the target file open for
writing.

Fix: write to "<path>.tmp", then os.Rename into place. The rename is a
pure dirent op; by the time the final path exists, no process has a
writable fd on its inode and exec is safe.

-race + -count=5 on both runner packages now passes consistently.
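The write-then-rename pattern is not specific to Go. The following is a minimal sketch of the same idea in portable shell, not the Go helper from this commit; `write_script_atomically` and the `fake-restic` file name are hypothetical and exist only for illustration.

```shell
#!/bin/sh
set -eu

# write_script_atomically <final-path>
# Reads the script body from stdin, writes it to "<final-path>.tmp",
# marks it executable, then renames it onto the final path. The final
# path therefore never refers to a file this function is still writing.
write_script_atomically() {
  final_path="$1"
  tmp_path="${final_path}.tmp"
  cat >"$tmp_path"                 # the write goes to the temp path only
  chmod 0755 "$tmp_path"
  mv -f "$tmp_path" "$final_path"  # same directory, so this is a rename
}

# Usage: create a throwaway script and execute it straight away.
workdir="$(mktemp -d)"
write_script_atomically "$workdir/fake-restic" <<'EOF'
#!/bin/sh
echo "fake restic $*"
EOF
out="$("$workdir/fake-restic" version)"
echo "$out"
rm -rf "$workdir"
```

The temp file sits next to the final path on purpose: a rename within one directory stays on one filesystem, so `mv` does not fall back to copy-and-delete.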
api+agent: document protocol-version stability and forget back-compat decisions
2026-05-04 10:15:18 +01:00
2 changed files with 331 additions and 5 deletions
@@ -3,11 +3,10 @@
 # Notes for anyone editing this file:
 #
 # Self-hosted runner expectations
-#   The Gitea runners are provisioned out-of-band (the infra team owns
+#   The Gitea runners are provisioned via scripts/provision-gitea-runner.sh.
-#   the script). Each runner host bind-mounts persistent volumes for
+#   That script bind-mounts persistent host volumes for /root/go/pkg/mod
-#   /root/go/pkg/mod (GOMODCACHE), /root/.cache/go-build (GOCACHE), and
+#   (GOMODCACHE), /root/.cache/go-build (GOCACHE), and /root/.cache/act
-#   /root/.cache/act (action clones) into every job container. As a
+#   (action clones) into every job container. As a result:
-#   result:
 #     * `cache: true` on actions/setup-go is intentionally OMITTED — the
 #       action would otherwise tar/untar GOMODCACHE+GOCACHE through the
 #       Gitea cache backend on every job, undoing the host-volume cache
@@ -0,0 +1,327 @@
 #!/usr/bin/env bash
 #
 # provision-gitea-runner.sh — one-shot, idempotent host setup for an
 # act_runner LXC. Speeds up Gitea Actions runs by:
 #
 #   1. Disabling forced docker pulls (image refresh moves to a cron).
 #   2. Mounting persistent host volumes for Go module/build caches and
 #      the act-actions clone cache.
 #   3. Pre-pulling the runner-images container image.
 #   4. Pre-cloning a configurable list of GitHub actions into the
 #      act cache so jobs don't fetch them on every run.
 #   5. Installing golangci-lint (latest v2.x) at /usr/local/bin.
 #   6. Setting up a nightly cron to refresh image + action clones +
 #      golangci-lint.
 #
 # The script is generic — no per-project state. Point it at any LXC
 # running act_runner as a systemd service and it will provision the
 # host. Re-runs are safe; they reconcile state.
 #
 # Usage:  sudo ./provision-gitea-runner.sh
 #
 # Configurable via environment variables (defaults shown):
 #
 #   CACHE_BASE=/var/cache/gitea-runner
 #   ACT_RUNNER_CONFIG=/etc/act_runner/config.yaml
 #   RUNNER_IMAGE=docker.gitea.com/runner-images:ubuntu-latest
 #   ACTIONS_TO_PRECLONE="actions/checkout@v4 actions/setup-go@v5 actions/upload-artifact@v4 golangci/golangci-lint-action@v7"
 #     (one space-separated line of owner/repo@ref specs, not a bash array)
 #
 # To change the pre-cloned actions later, re-run this script with
 # ACTIONS_TO_PRECLONE set; that regenerates the refresh script at
 # /usr/local/sbin/gitea-runner-refresh, which holds the ACTIONS list.
 set -euo pipefail
 # ---------- defaults ---------------------------------------------------
 : "${CACHE_BASE:=/var/cache/gitea-runner}"
 : "${ACT_RUNNER_CONFIG:=/etc/act_runner/config.yaml}"
 : "${RUNNER_IMAGE:=docker.gitea.com/runner-images:ubuntu-latest}"
 DEFAULT_ACTIONS=(
  "actions/checkout@v4"
  "actions/setup-go@v5"
  "actions/upload-artifact@v4"
  "golangci/golangci-lint-action@v7"
 )
 # Allow caller to override by exporting ACTIONS_TO_PRECLONE as a
 # space-separated string (env vars can't carry arrays cleanly).
 if [[ -n "${ACTIONS_TO_PRECLONE:-}" ]]; then
  read -r -a ACTIONS <<<"${ACTIONS_TO_PRECLONE}"
 else
  ACTIONS=("${DEFAULT_ACTIONS[@]}")
 fi
 # ---------- helpers ----------------------------------------------------
 log()  { printf '\033[1;36m==>\033[0m %s\n' "$*"; }
 warn() { printf '\033[1;33m==>\033[0m %s\n' "$*" >&2; }
 die()  { printf '\033[1;31m==>\033[0m %s\n' "$*" >&2; exit 1; }
 require_cmd() {
  command -v "$1" >/dev/null 2>&1 || die "missing: $1 (install it first)"
 }
 # sha256_url <url> — act_runner names its action-clone dirs after
 # sha256(URL). Verified against a real run log:
 #   url=https://github.com/actions/checkout
 #   sha256=c3fe249fe73091a17d6638fe1341e7bd0bcc3466ce52323c0688e83e2463a4ab
 sha256_url() {
  printf '%s' "$1" | sha256sum | awk '{print $1}'
 }
 # ---------- pre-flight -------------------------------------------------
 [[ $EUID -eq 0 ]] || die "run as root (the act_runner service writes /var/lib/act_runner as root)"
 require_cmd systemctl
 require_cmd docker
 require_cmd git
 require_cmd curl
 require_cmd python3
 # PyYAML for the config edit. Install if missing — Ubuntu 24.04 ships
 # python3-yaml in the default repos.
 if ! python3 -c 'import yaml' 2>/dev/null; then
  log "installing python3-yaml (needed for safe YAML edits)"
  apt-get update -qq
  apt-get install -y -qq python3-yaml
 fi
 [[ -f "$ACT_RUNNER_CONFIG" ]] || die "$ACT_RUNNER_CONFIG not found — is act_runner installed?"
 systemctl list-unit-files act_runner.service >/dev/null 2>&1 || \
  die "act_runner.service not found — register the runner first"
 log "pre-flight OK"
 log "  cache base       : $CACHE_BASE"
 log "  config file      : $ACT_RUNNER_CONFIG"
 log "  runner image     : $RUNNER_IMAGE"
 log "  actions to clone : ${ACTIONS[*]}"
 # ---------- 1. cache directories ---------------------------------------
 log "creating cache directories under $CACHE_BASE"
 for sub in go-mod go-build act-actions; do
  install -d -m 0755 -o root -g root "$CACHE_BASE/$sub"
 done
 # ---------- 2. edit /etc/act_runner/config.yaml ------------------------
 #
 # Three keys are reconciled to known values:
 #
 #   container.force_pull   : false  (we keep the image fresh via cron)
 #   container.options      : "-v <cache mounts...>"  (auto-mount caches
 #                             into every job container)
 #   container.valid_volumes: [<our cache paths>]  (whitelist so the
 #                             container.options mounts are accepted)
 #
 # Other keys are preserved verbatim. The edit is idempotent: re-running
 # yields the same file content.
 log "patching $ACT_RUNNER_CONFIG"
 # Backup once (only if no .pre-provision backup exists yet).
 if [[ ! -f "${ACT_RUNNER_CONFIG}.pre-provision" ]]; then
  cp -p "$ACT_RUNNER_CONFIG" "${ACT_RUNNER_CONFIG}.pre-provision"
  log "  saved pristine copy to ${ACT_RUNNER_CONFIG}.pre-provision"
 fi
 CONTAINER_OPTIONS_VALUE="-v ${CACHE_BASE}/go-mod:/root/go/pkg/mod:rw -v ${CACHE_BASE}/go-build:/root/.cache/go-build:rw -v ${CACHE_BASE}/act-actions:/root/.cache/act:rw"
 CACHE_BASE="$CACHE_BASE" CONTAINER_OPTIONS_VALUE="$CONTAINER_OPTIONS_VALUE" \
 ACT_RUNNER_CONFIG="$ACT_RUNNER_CONFIG" \
 python3 - <<'PY'
 import os, sys, yaml
 cfg_path = os.environ['ACT_RUNNER_CONFIG']
 cache_base = os.environ['CACHE_BASE']
 container_options = os.environ['CONTAINER_OPTIONS_VALUE']
 with open(cfg_path) as f:
    cfg = yaml.safe_load(f) or {}
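 # Treat a missing or null `container:` key as an empty mapping.
 if not isinstance(cfg.get('container'), dict):
    cfg['container'] = {}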
 cfg['container']['force_pull'] = False
 cfg['container']['options'] = container_options
 # Whitelist every cache subdir explicitly so jobs that try to bind-mount
 # them via workflow-side `volumes:` (rare but possible) are accepted.
 desired_vols = [
    f"{cache_base}/go-mod",
    f"{cache_base}/go-build",
    f"{cache_base}/act-actions",
 ]
 existing = cfg['container'].get('valid_volumes') or []
 merged = list(dict.fromkeys(existing + desired_vols))  # de-dup, preserve order
 cfg['container']['valid_volumes'] = merged
 # Write back with stable formatting. yaml.dump preserves enough
 # structure for act_runner to parse; comments in the original config
 # do get stripped — that's why we preserve the .pre-provision backup.
 with open(cfg_path + '.tmp', 'w') as f:
    yaml.safe_dump(cfg, f, default_flow_style=False, sort_keys=False)
 os.replace(cfg_path + '.tmp', cfg_path)
 print("  container.force_pull   : false")
 print(f"  container.options      : {container_options}")
 print(f"  container.valid_volumes: {merged}")
 PY
 # ---------- 3. pre-pull the runner image -------------------------------
 log "pulling $RUNNER_IMAGE (one-time; cron refreshes it nightly)"
 docker pull "$RUNNER_IMAGE"
 # ---------- 4. pre-clone the actions list ------------------------------
 #
 # act_runner expects clones at $cache/<sha256(url)> with the ref already
 # checked out. We clone the default branch then fetch + check out the
 # requested ref. Re-running fetches updates rather than re-cloning.
 log "pre-cloning actions into $CACHE_BASE/act-actions"
 for spec in "${ACTIONS[@]}"; do
  if [[ "$spec" != *@* ]]; then
    warn "  skip '$spec' — must be owner/repo@ref"
    continue
  fi
  repo="${spec%@*}"
  ref="${spec##*@}"
  url="https://github.com/${repo}"
  dir="${CACHE_BASE}/act-actions/$(sha256_url "$url")"
  if [[ -d "$dir/.git" ]]; then
    log "  refresh $repo @ $ref"
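    # --force lets moving tags such as v4 be updated; without it, Git
    # 2.20 and later reject the changed tag and the fetch exits non-zero.
    git -C "$dir" fetch --quiet --tags --force --prune origin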
  else
    log "  clone   $repo @ $ref → $dir"
    git clone --quiet "$url" "$dir"
  fi
  # Detach onto the requested ref. Works for branches, tags, and SHAs.
  if ! git -C "$dir" -c advice.detachedHead=false checkout --quiet "$ref" 2>/dev/null; then
    # If `ref` is a remote branch we haven't tracked yet, try origin/<ref>.
    git -C "$dir" -c advice.detachedHead=false checkout --quiet "origin/$ref"
  fi
 done
 # ---------- 5. golangci-lint -------------------------------------------
 #
 # Install the latest v2.x at /usr/local/bin/golangci-lint. Workflows
 # that pin a specific version via the action's `version:` arg will
 # still re-download — but jobs that don't pin (or pin to "latest"/"v2")
 # get the host-installed binary for free.
 log "installing/updating golangci-lint (latest v2.x) → /usr/local/bin"
 GOLANGCI_INSTALL_URL="https://raw.githubusercontent.com/golangci/golangci-lint/HEAD/install.sh"
 # `-b` sets the install dir. With no version argument, install.sh
 # installs the latest GitHub release (a v2.x build at the time of
 # writing).
 curl -fsSL "$GOLANGCI_INSTALL_URL" | sh -s -- -b /usr/local/bin >/dev/null
 /usr/local/bin/golangci-lint --version || warn "golangci-lint install verification failed"
 # ---------- 6. nightly refresh cron ------------------------------------
 #
 # Re-pulls the runner image, refreshes the action clones, and updates
 # golangci-lint. Runs at 03:17 to dodge top-of-hour CI bursts.
 CRON_PATH=/etc/cron.d/gitea-runner-refresh
 REFRESH_SCRIPT=/usr/local/sbin/gitea-runner-refresh
 log "writing $REFRESH_SCRIPT and $CRON_PATH"
 # Materialise the actions list into the script so the cron job is
 # self-contained and survives later edits to this file.
 cat >"$REFRESH_SCRIPT" <<EOF
 #!/usr/bin/env bash
 # Auto-generated by provision-gitea-runner.sh. Re-running the
 # provisioning script regenerates this file.
 set -euo pipefail
 CACHE_BASE="$CACHE_BASE"
 RUNNER_IMAGE="$RUNNER_IMAGE"
 ACTIONS=(
 $(printf '  "%s"\n' "${ACTIONS[@]}")
 )
 sha256_url() { printf '%s' "\$1" | sha256sum | awk '{print \$1}'; }
 # 1. Refresh the runner-images base.
 docker pull -q "\$RUNNER_IMAGE" >/dev/null
 # 2. Refresh action clones.
 for spec in "\${ACTIONS[@]}"; do
  [[ "\$spec" == *@* ]] || continue
  repo="\${spec%@*}"; ref="\${spec##*@}"
  url="https://github.com/\$repo"
  dir="\$CACHE_BASE/act-actions/\$(sha256_url "\$url")"
  if [[ -d "\$dir/.git" ]]; then
    git -C "\$dir" fetch --quiet --tags --force --prune origin || true
    git -C "\$dir" -c advice.detachedHead=false checkout --quiet "\$ref" 2>/dev/null \\
      || git -C "\$dir" -c advice.detachedHead=false checkout --quiet "origin/\$ref" || true
  fi
 done
 # 3. Refresh golangci-lint (latest v2.x). Tolerate transient
 #    GitHub-rate-limit failures — next night will retry.
 curl -fsSL https://raw.githubusercontent.com/golangci/golangci-lint/HEAD/install.sh \\
  | sh -s -- -b /usr/local/bin >/dev/null 2>&1 || true
 EOF
 chmod 0755 "$REFRESH_SCRIPT"
 cat >"$CRON_PATH" <<EOF
 # Auto-generated by provision-gitea-runner.sh. Refreshes the runner
 # image, action clones, and golangci-lint every night at 03:17.
 SHELL=/bin/bash
 PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
 17 3 * * * root $REFRESH_SCRIPT >> /var/log/gitea-runner-refresh.log 2>&1
 EOF
 chmod 0644 "$CRON_PATH"
 # ---------- 7. restart act_runner --------------------------------------
 log "restarting act_runner.service to pick up the new config"
 systemctl restart act_runner.service
 sleep 2
 systemctl is-active --quiet act_runner.service \
  || die "act_runner did not come back up — check 'journalctl -u act_runner -n 50'"
 # ---------- 8. container-create benchmark ------------------------------
 #
 # Reports cold + warm `docker run --rm <image> true` time. Sanity check
 # that overlay setup is fast on this host. Numbers > ~5s indicate a
 # slow filesystem or DNS issue worth investigating separately.
 log "benchmark: docker run --rm $RUNNER_IMAGE true"
 {
  printf '  cold (post-pull) : '
  /usr/bin/time -f '%e s' docker run --rm "$RUNNER_IMAGE" true 2>&1 | tail -1
  printf '  warm (immediate) : '
  /usr/bin/time -f '%e s' docker run --rm "$RUNNER_IMAGE" true 2>&1 | tail -1
 } || warn "benchmark failed — non-fatal"
 # ---------- done -------------------------------------------------------
 # cat does not interpret \033 escapes, so expand them via ANSI-C quoting.
 c_green=$'\033[1;32m'; c_yellow=$'\033[1;33m'; c_reset=$'\033[0m'
 cat <<EOF
 ${c_green}==> Provisioning complete${c_reset}
 What changed on this host:
  * ${ACT_RUNNER_CONFIG}: force_pull off, container.options +
    valid_volumes set for the cache mounts.  Pristine copy preserved
    at ${ACT_RUNNER_CONFIG}.pre-provision.
  * $CACHE_BASE/{go-mod,go-build,act-actions}: persistent caches.
  * /usr/local/bin/golangci-lint: latest v2.x.
  * $REFRESH_SCRIPT and $CRON_PATH: nightly refresh @ 03:17.
  * Runner image pre-pulled.
 ${c_yellow}Note on Go cache + setup-go:${c_reset} if your workflow uses
 \`actions/setup-go\` with \`cache: true\`, the action will still tar/untar
 the cache via the Gitea cache backend on every job, partially
 defeating the persistent volume.  For full speed-up, drop \`cache: true\`
 from the workflow once the persistent volume is warm.  Per-project
 decision; this script doesn't touch workflows.
 EOF
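The spec parsing and clone-directory naming from step 4 can be tried in isolation. This is a minimal sketch in portable shell, assuming the sha256(URL) directory layout described in the script's comments; `action_cache_dir` is a hypothetical helper that only computes names and does not clone anything.

```shell
#!/bin/sh
set -eu

# action_cache_dir <cache-root> <owner/repo@ref>
# Prints "<repo> <ref> <clone-dir>", where the clone dir is named after
# the SHA-256 of the repository URL (same derivation as sha256_url in
# the provisioning script).
action_cache_dir() {
  cache_root="$1"
  spec="$2"
  case "$spec" in
    *@*) ;;
    *) echo "spec must be owner/repo@ref: $spec" >&2; return 1 ;;
  esac
  repo="${spec%@*}"     # everything before the last '@'
  ref="${spec##*@}"     # everything after the last '@'
  url="https://github.com/${repo}"
  digest="$(printf '%s' "$url" | sha256sum | awk '{print $1}')"
  printf '%s %s %s\n' "$repo" "$ref" "${cache_root}/${digest}"
}

# Usage: show where actions/checkout@v4 would be cloned.
action_cache_dir /var/cache/gitea-runner/act-actions actions/checkout@v4
```

Note that the ref is not part of the directory name: two refs of the same action share one clone, and the checkout step decides which ref is currently on disk.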
Author	SHA1	Message	Date
steve	e1b60f02ae	test: write-then-rename script-bin helpers (avoid ETXTBSY under -race) [CI (pull_request): Build (windows/amd64) successful in 18s; Build (linux/amd64) successful in 18s; Build (linux/arm64) successful in 18s; Test (linux/amd64) failing after 2m40s; Lint successful in 3m27s] CI run #48 failed with: --- FAIL: TestRunInitShipsStartedAndFinished RunInit: ... fork/exec /tmp/.../restic: text file busy setupScript and setupScriptBin used os.WriteFile to write a shell script directly at the final path, then exec'd it. Under -race + many t.Parallel tests, a fork-from-another-goroutine could inherit the still-open writable fd from one of those WriteFile calls; the kernel returns ETXTBSY when the freshly-execed binary still has a writable fd anywhere on the system. Fix: write to "<path>.tmp", then os.Rename into place. The rename is a pure dirent op; by the time the final path exists, no process has a writable fd on its inode and exec is safe. -race + -count=5 on both runner packages now passes consistently.	2026-05-04 10:15:18 +01:00
steve	e4c0256543	api+agent: document protocol-version stability and forget back-compat decisions version.go: add a comment block explaining why Phase 5's wire changes (CommandRunPayload, ConfigUpdatePayload, RepoStatsPayload reshapes) did not bump CurrentProtocolVersion — lockstep deploy, no rolling-upgrade path, smoke env restage enforces it. Notes where a version bump to 2 would be required if a multi-version path is ever introduced. cmd/agent/main.go: document why the JobForget handler hard-errors on empty ForgetGroups rather than falling back to a single-policy form. The maintenance ticker is the only writer and always populates the field; the fallback was specced but skipped given lockstep deploy.	2026-05-04 10:15:18 +01:00
steve	7236f8dc14	server: serialize DrainPending per host (avoid drain double-dispatch) Add a per-host drain mutex (drainLocks map guarded by drainLocksMu) on the Server struct. DrainPending acquires it with TryLock: if a drain is already in-flight for this host, the call returns immediately — the running drain will see every pending row. This prevents the on-hello goroutine and the 30s tick from both listing the same host's rows and dispatching them twice. Update three existing tests that called srv.DrainPending explicitly after the on-hello goroutine had already been spawned: replace the now-redundant direct call with a waitForPendingCount poll so they don't race the goroutine's mutex ownership. Add TestDrainPendingSerializesPerHost which fires 10 concurrent DrainPending goroutines against a 5-row queue and asserts exactly 5 job rows result.	2026-05-04 10:15:18 +01:00
steve	440ac5cc18	store: LatestJobByKind includes in-flight jobs (avoid maintenance double-fire) Widen the SQL query to consider all statuses (queued, running, succeeded, failed, cancelled) rather than terminal-only. An in-flight prune that outlasts the 60s tick interval previously produced ErrNotFound, causing the ticker to anchor at now-24h and fire a second prune concurrently with the first. Update the doc comment and test: remove the "queued job filtered out" case, add assertions that a running job and a queued job are each returned as the latest.	2026-05-04 10:15:18 +01:00
steve	beecc32851	tasks: tick P2R-03 through P2R-08 done	2026-05-04 10:15:18 +01:00
steve	d85c8ab61a	diag: phase 5 Playwright sweep screenshots	2026-05-04 10:15:18 +01:00
steve	6635e82255	server/ws: persist repo.stats into host_repo_stats	2026-05-04 10:15:18 +01:00
steve	91fb7cf69b	server: drainer abandons only on ErrNotFound, not transient errors GetSourceGroup errors in drainOne now gate on errors.Is(err, store.ErrNotFound) before calling abandonPending, mirroring the existing GetSchedule pattern. Transient errors (SQLITE_BUSY, context cancellation) now log a warning and return without deleting the row. Add regression test TestDrainPendingDropsRowsForGoneSourceGroup confirming the ErrNotFound path still abandons correctly. Also add a comment above the backoff-doubling loop explaining the progression.	2026-05-04 10:15:18 +01:00
steve	02dbe59d68	server: drainer uses dispatch-core to avoid duplicate pending_run enqueue Extract dispatchBackupForGroupCore (persist+marshal+send, no enqueue on failure) from dispatchBackupForGroup. drainOne now calls the core directly, so a failed Send only bumps the existing pending_runs row via BumpPendingRunAttempt instead of creating a second row, which stops the geometric duplication on repeated drain failures. dispatchBackupForGroup (schedule.fire path) wraps the core and keeps its enqueue-on-failure behaviour unchanged. TestDrainPendingBumpsOnSendFailure strengthened: asserts exactly 1 row remains after a send failure (was tolerating >=1 duplicate rows).	2026-05-04 10:15:18 +01:00
steve	81c611264d	server: drain pending_runs on tick + on agent reconnect Two trigger paths land here: - A 30s ticker in cmd/server calls Server.DrainAllDue(ctx). It walks pending_runs rows whose next_attempt_at <= now, dedupes by host, skips offline hosts, and per online host runs DrainPending. - onAgentHello spawns a background DrainPending(hostID). When a host comes back, every pending row for it is dispatchable now — due-ness becomes irrelevant once the wire is back. Each row's schedule + group are reloaded; ErrNotFound or disabled-schedule or gone-group abandons the row with a pending_run.abandoned audit. attempt >= retry_max also abandons. Otherwise dispatchBackupForGroup is invoked; success deletes the row, failure bumps attempt with exponential backoff capped at 30m.	2026-05-04 10:15:18 +01:00
steve	194e6c9719	server: enqueue pending_runs when scheduled-job dispatch fails When dispatchBackupForGroup's conn.Send errors, queue a pending_runs row (attempt=1, next_attempt_at = now + group.RetryBackoffSeconds) instead of silently dropping the fire. The orphaned queued job row is left behind for forensic visibility — the drainer will create a fresh job row on its retry. Also adds Store.ListPendingRunsForHost — the on-reconnect drain walks every row for the host, regardless of due-ness, since the host being back makes 'due' irrelevant.	2026-05-04 10:15:18 +01:00
steve	f29a9e49d3	server: fix stale RetentionPolicy comment + check Scan errors in maintenance test	2026-05-04 10:15:18 +01:00
steve	e283d70c27	server: maintenance ticker drives forget/prune/check on cadence Wires a 60s server-side ticker to the pure-logic maintenance.Decide introduced in the previous commit. Decisions flow through a new DispatchMaintenance method on Server, which: - skips offline hosts (no pending_runs queueing: maintenance is not a backup, missed fires shouldn't pile up) - silently skips prune when admin creds aren't bound - pushes admin creds before prune, then dispatches with RequiresAdminCreds=true (same as operator-driven prune) - persists job rows with actor_kind="system" Reshapes the forget wire payload from a single RetentionPolicy to a ForgetGroups list (one tag + per-group keep-* per source group). The agent walks the groups and runs `restic forget --tag <name> --keep-*` once per group. Dead-code removed: CommandRunPayload.RetentionPolicy, the old forget JSON-decode in cmd/agent, and the single-policy form of restic.RunForget.	2026-05-04 10:15:18 +01:00
steve	dd35133459	maintenance: pure-logic ticker decides forget/prune/check fires	2026-05-04 10:15:18 +01:00
steve	edce90d196	ui: hx-swap none on Run-now + truthful save banner + tailwind rebuild Add hx-swap="none" to the three Run-now buttons (check/prune/unlock) in host_repo.html to match the existing pattern on host_sources.html and host_schedules.html. Fix all-blank admin-credentials save to redirect without ?saved= query string so no false-positive banner is shown; strengthen the corresponding test to assert Location has no ?saved=. Rebuild CSS bundle via Tailwind to pick up max-w-[640px] JIT class.	2026-05-04 10:15:18 +01:00
steve	ccccc6aa33	ui: Slice E — admin creds form + run-now buttons + repo health panel - hostRepoPage gains AdminURL/AdminUsername/HasAdminPassword, Online, and StatsView (pre-dereferenced projection of host_repo_stats). - loadHostRepoPage loads the admin slot (tolerating ErrNotFound), hub.Connected, and stats (tolerating ErrNotFound). - renderRepoPage gains an adminErr parameter; all callers updated. - handleUIAdminCredentialsSave / handleUIAdminCredentialsDelete added (form-POST handlers mirroring the repo-creds pattern, with audit). - Routes /hosts/{id}/admin-credentials POST and /delete POST registered. - Template: Admin credentials form after Connection, Run-now HTMX buttons after Maintenance, Repo health stats panel in right rail. - Tests: 9 new tests covering rendering, disabled states, save/delete round-trips, audit rows, and idempotent delete.	2026-05-04 10:15:18 +01:00
steve	b07cb14320	server: populate audit UserID on credential mutations + slog prune push errors Switch handleSetHostCredentials, handleSetAdminCredentials, and handleDeleteAdminCredentials from authedUser (bool) to requireUser (*store.User) so AuditEntry.UserID and Actor are populated correctly. Add slog.Warn on the non-ErrNotFound pushAdminCredsToAgent path in handleRunRepoPrune so decrypt/send failures surface in the server log rather than appearing as a generic host_offline 503.	2026-05-04 10:15:18 +01:00
steve	56dd7ab411	server: cover HTMX auth-redirect path in repo-ops tests	2026-05-04 10:15:18 +01:00
steve	0095e80fe9	server: HTTP run-now for prune / check / unlock Adds POST /api/hosts/{id}/repo/{prune,check,unlock} (and matching outer routes for HTMX form posts). Prune pushes the admin-cred slot via pushAdminCredsToAgent before dispatch and refuses with admin_creds_required when the slot is not set. Check reads check_subset_pct from host_repo_maintenance (overridable via ?subset=N, clamped 0-100; non-numeric override falls back to DB value silently). Unlock needs no admin creds. All three share the same wantsHTML/HX-Redirect response split as the per-source-group run-now endpoint.	2026-05-04 10:15:18 +01:00
steve	b66eb10524	server: admin-credentials REST + Slot:admin push helper Adds GET/PUT/DELETE /api/hosts/{id}/admin-credentials handlers that mirror the existing repo-credentials endpoints but write to store.CredKindAdmin with AEAD additional-data "host:<id>:admin" (scoped away from the repo slot to prevent cross-binding). PUT immediately pushes a config.update(Slot:"admin") to the agent when it is connected, and the new pushAdminCredsToAgent helper is wired for use by the upcoming prune run-now endpoint (D2) to push on-demand before dispatch.	2026-05-04 10:15:18 +01:00
steve	ee8538c928	agent: secrets fail-loud on corrupt blob + small polish Save and SaveAdmin now propagate loadBundle errors instead of silently overwriting a corrupt file (data-loss fix). Tests added for both paths. reportStats logs a Debug on RunStats failure; r in runJob gets a comment explaining the prune-runner asymmetry; runner_test comment tightened.	2026-05-04 10:15:18 +01:00
steve	6a86856cee	agent/runner: ship repo.stats before job.finished in RunCheck/RunUnlock RunCheck and RunUnlock were calling sendFinished before reportStats, inverting the required job.started → log.stream → repo.stats → job.finished envelope order. Move reportStats ahead of sendFinished in both functions to match the pattern already correct in RunPrune. Strengthen TestRunCheckShipsCheckStatus, TestRunCheckErrorsFoundShipsErrorsStatus, and TestRunUnlockClearsLock with the same position-index ordering assertions used by TestRunPruneShipsExpectedEnvelopes; these assertions would have failed against the pre-fix code.	2026-05-04 10:15:18 +01:00
steve	bc6a91b064	agent: RunPrune/RunCheck/RunUnlock + reportStats + admin-cred slot dispatch Extract resticEnv/sendStarted/streamHandler/sendFinished helpers to remove boilerplate duplication across Run* methods. Add RunPrune (ships repo.stats with LastPruneAt before job.finished), RunCheck (ships stats with LastCheckStatus/LockPresent regardless of outcome), RunUnlock (ships LockPresent=false on success), and reportStats (fills size fields via RunStats when caller didn't populate them). Wire JobPrune/JobCheck/JobUnlock into the dispatcher switch; teach MsgConfigUpdate about the Slot discriminator for admin vs repo creds; add strconv import for subset-pct parsing.	2026-05-04 10:15:18 +01:00
steve	1cc375389a	agent/secrets: separate admin slot with backwards-compatible decode Split the on-disk bundle into repo + admin slots. Legacy flat Repo blobs are detected at load time by the presence of "repo_url" at the top level and transparently promoted into the new shape on the next Save/SaveAdmin. Adds ErrNoAdmin sentinel, LoadAdmin, SaveAdmin, and three new tests.	2026-05-04 10:15:18 +01:00
steve	7f33d5dad6	api: stats partial-update payload + ConfigUpdate.Slot + CommandRun.RequiresAdminCreds Reshape RepoStatsPayload into pointer-field partial-update form matching store.HostRepoStats semantics; add Slot discriminator to ConfigUpdatePayload for admin vs repo credential routing; add RequiresAdminCreds flag to CommandRunPayload for prune/unlock jobs that need delete authority.	2026-05-04 10:15:18 +01:00
steve	237a86dee5	restic: tighten RunCheck lock sniff + RunStats zero-snapshot test Narrow the LockPresent predicate from bare "locked" (too broad) to "stale lock" and "already locked" — the two phrases restic actually emits. Replace TestRunCheckParsesLock with table-driven TestRunCheckLockSniff covering both trigger phrases and a benign "locked-file" line that must not set LockPresent. Add TestRunStatsZeroSnapshots to pin that RunStats accepts zero-snapshot JSON without error.	2026-05-04 10:15:18 +01:00
steve	3336485d02	restic: RunUnlock + RunStats (raw-data mode) Add RunUnlock (delegates straight to runWithPump) and RunStats which runs `restic stats --json --mode raw-data`, captures the single JSON line from stdout into RepoStats, and returns an error if no JSON arrives. Tests cover arg plumbing for unlock, JSON parsing, and the no-JSON error path.	2026-05-04 10:15:18 +01:00
steve	4cc6962ec6	restic: RunCheck with subset% + lock-state sniffing Add CheckResult (LockPresent, ErrorsFound) and RunCheck. subsetPct>0 passes --read-data-subset N% to limit data reads. Stderr is sniffed for "Found stale lock"/"locked" to set LockPresent; a non-zero exit from restic is absorbed as ErrorsFound=true rather than an error so the caller can always persist last_check_status. Tests cover lock detection, exit-1 absorption, and subset-arg plumbing.	2026-05-04 10:15:18 +01:00
steve	fe04bba0fa	restic: RunPrune + runWithPump helper, refactor Forget/Init onto it Add RunPrune for admin-credential prune invocations. Extract runWithPump to DRY the stdout+stderr pump pattern; refactor RunForget and RunInit to delegate to it (RunInit preserves the "config file already exists" soft-success sniff by wrapping the handler before the call). Add runner_test.go with TestRunPruneInvokesPrune.	2026-05-04 10:15:18 +01:00
steve	ea2590e27b	store: tighten CHECK constraint on host_repo_stats.last_check_status	2026-05-04 10:15:18 +01:00
steve	779f5aac47	store: wrap UpsertHostRepoStats in a transaction (concurrency safety)	2026-05-04 10:15:18 +01:00
steve	d4821714a5	store: assert CHECK constraint on host_credentials.kind	2026-05-04 10:15:18 +01:00
steve	d92aa6d65c	store: HostRepoStats projection (size, lock, last-check, last-prune)	2026-05-04 10:15:18 +01:00
steve	2055ce360b	store: host_credentials becomes kind-aware (repo + admin slots)	2026-05-04 10:15:18 +01:00
steve	65f65f87aa	store: migration 0009 — admin-creds kind + host_repo_stats	2026-05-04 10:15:18 +01:00
steve	a30c8d61d5	plan: P2 redesign Phase 5 (P2R-03..P2R-08)	2026-05-04 10:15:18 +01:00
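The drainer commits in the list above describe a retry delay that doubles on each failure and is capped at 30 minutes. The arithmetic can be sketched as follows, assuming the delay starts at the group's retry backoff and doubles once per further attempt; the server's actual formula is not shown on this page, and `backoff_seconds` is a hypothetical helper.

```shell
#!/bin/sh
set -eu

# backoff_seconds <attempt> <base-seconds> [cap-seconds]
# Prints the delay for the given attempt: the base for attempt 1,
# doubled for each later attempt, never above the cap
# (default 1800 s = 30 min).
backoff_seconds() {
  attempt="$1"
  delay="$2"
  cap="${3:-1800}"
  i=1
  # Stop doubling once the cap is reached so large attempt numbers
  # cannot overflow the arithmetic.
  while [ "$i" -lt "$attempt" ] && [ "$delay" -lt "$cap" ]; do
    delay=$((delay * 2))
    i=$((i + 1))
  done
  if [ "$delay" -gt "$cap" ]; then
    delay="$cap"
  fi
  printf '%s\n' "$delay"
}

# Usage: delays for a 60 s base backoff.
for n in 1 2 3 6 10; do
  printf 'attempt %2d -> %ss\n' "$n" "$(backoff_seconds "$n" 60)"
done
```

With a 60 s base this yields 60, 120 and 240 seconds for attempts 1 to 3, and 1800 seconds for attempts 6 and 10, since 60 × 2^5 = 1920 already exceeds the cap.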