36 Commits

Author SHA1 Message Date
steve e1b60f02ae test: write-then-rename script-bin helpers (avoid ETXTBSY under -race)
CI / Build (windows/amd64) (pull_request) Successful in 18s
CI / Build (linux/amd64) (pull_request) Successful in 18s
CI / Build (linux/arm64) (pull_request) Successful in 18s
CI / Test (linux/amd64) (pull_request) Failing after 2m40s
CI / Lint (pull_request) Successful in 3m27s
CI run #48 failed with:

  --- FAIL: TestRunInitShipsStartedAndFinished
      RunInit: ... fork/exec /tmp/.../restic: text file busy

setupScript and setupScriptBin used os.WriteFile to write a shell
script directly at the final path, then exec'd it. Under -race +
many t.Parallel tests, a fork-from-another-goroutine could inherit
the still-open writable fd from one of those WriteFile calls; the
kernel returns ETXTBSY when the file being exec'd still has a
writable fd open anywhere on the system.

Fix: write to "<path>.tmp", then os.Rename into place. The rename
is a pure dirent op; by the time the final path exists, no process
has a writable fd on its inode and exec is safe. -race + -count=5
on both runner packages now passes consistently.
2026-05-04 10:15:18 +01:00
steve e4c0256543 api+agent: document protocol-version stability and forget back-compat decisions
version.go: add a comment block explaining why Phase 5's wire changes
(CommandRunPayload, ConfigUpdatePayload, RepoStatsPayload reshapes) did
not bump CurrentProtocolVersion — lockstep deploy, no rolling-upgrade
path, smoke env restage enforces it. Notes where a version bump to 2
would be required if a multi-version path is ever introduced.

cmd/agent/main.go: document why the JobForget handler hard-errors on
empty ForgetGroups rather than falling back to a single-policy form.
The maintenance ticker is the only writer and always populates the
field; the fallback was specced but skipped given lockstep deploy.
2026-05-04 10:15:18 +01:00
steve 7236f8dc14 server: serialize DrainPending per host (avoid drain double-dispatch)
Add a per-host drain mutex (drainLocks map guarded by drainLocksMu) on
the Server struct. DrainPending acquires it with TryLock: if a drain is
already in-flight for this host, the call returns immediately — the
running drain will see every pending row. This prevents the on-hello
goroutine and the 30s tick from both listing the same host's rows and
dispatching them twice.

Update three existing tests that called srv.DrainPending explicitly
after the on-hello goroutine had already been spawned: replace the
now-redundant direct call with a waitForPendingCount poll so they don't
race the goroutine's mutex ownership. Add TestDrainPendingSerializesPerHost
which fires 10 concurrent DrainPending goroutines against a 5-row queue
and asserts exactly 5 job rows result.
2026-05-04 10:15:18 +01:00
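A rough sketch of the per-host TryLock pattern from 7236f8dc14. Only `drainLocks`, `drainLocksMu`, and `DrainPending` come from the commit message; the `hostDrainLock` helper, the string host ID, and the callback signature are assumptions made to keep the example self-contained.

```go
package main

import (
	"fmt"
	"sync"
)

// Server holds only the two fields named in the commit message.
type Server struct {
	drainLocksMu sync.Mutex
	drainLocks   map[string]*sync.Mutex
}

// hostDrainLock returns the mutex for hostID, creating it on first use.
// The helper name and the string host ID are assumptions.
func (s *Server) hostDrainLock(hostID string) *sync.Mutex {
	s.drainLocksMu.Lock()
	defer s.drainLocksMu.Unlock()
	if s.drainLocks == nil {
		s.drainLocks = make(map[string]*sync.Mutex)
	}
	mu, ok := s.drainLocks[hostID]
	if !ok {
		mu = &sync.Mutex{}
		s.drainLocks[hostID] = mu
	}
	return mu
}

// DrainPending runs drain for hostID unless a drain for the same host is
// already in flight. It reports whether this call did the work.
func (s *Server) DrainPending(hostID string, drain func()) bool {
	mu := s.hostDrainLock(hostID)
	if !mu.TryLock() { // sync.Mutex.TryLock needs Go 1.18 or newer
		return false
	}
	defer mu.Unlock()
	drain()
	return true
}

func main() {
	var s Server
	ran := s.DrainPending("host-a", func() {
		// While the outer drain holds host-a, a second call is skipped.
		fmt.Println("nested same host:", s.DrainPending("host-a", func() {}))
		fmt.Println("nested other host:", s.DrainPending("host-b", func() {}))
	})
	fmt.Println("outer:", ran)
}
```

Skipping instead of blocking is what keeps the 30s tick from queuing behind the on-hello drain and then re-listing rows the first drain already dispatched.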
steve 440ac5cc18 store: LatestJobByKind includes in-flight jobs (avoid maintenance double-fire)
Widen the SQL query to consider all statuses (queued, running,
succeeded, failed, cancelled) rather than terminal-only. An in-flight
prune that outlasts the 60s tick interval previously produced
ErrNotFound, causing the ticker to anchor at now-24h and fire a second
prune concurrently with the first.

Update the doc comment and test: remove the "queued job filtered out"
case, add assertions that a running job and a queued job are each
returned as the latest.
2026-05-04 10:15:18 +01:00
steve beecc32851 tasks: tick P2R-03 through P2R-08 done 2026-05-04 10:15:18 +01:00
steve d85c8ab61a diag: phase 5 Playwright sweep screenshots 2026-05-04 10:15:18 +01:00
steve 6635e82255 server/ws: persist repo.stats into host_repo_stats 2026-05-04 10:15:18 +01:00
steve 91fb7cf69b server: drainer abandons only on ErrNotFound, not transient errors
GetSourceGroup errors in drainOne now gate on errors.Is(err,
store.ErrNotFound) before calling abandonPending, mirroring the
existing GetSchedule pattern. Transient errors (SQLITE_BUSY, context
cancellation) now log a warning and return without deleting the row.

Add regression test TestDrainPendingDropsRowsForGoneSourceGroup
confirming the ErrNotFound path still abandons correctly. Also add
a comment above the backoff-doubling loop explaining the progression.
2026-05-04 10:15:18 +01:00
steve 02dbe59d68 server: drainer uses dispatch-core to avoid duplicate pending_run enqueue
Extract dispatchBackupForGroupCore (persist+marshal+send, no enqueue on
failure) from dispatchBackupForGroup. drainOne now calls the core
directly, so a failed Send only bumps the existing pending_runs row via
BumpPendingRunAttempt instead of creating a second row. This stops the
geometric duplication on repeated drain failures.

dispatchBackupForGroup (schedule.fire path) wraps the core and keeps
its enqueue-on-failure behaviour unchanged.

TestDrainPendingBumpsOnSendFailure strengthened: asserts exactly 1 row
remains after a send failure (was tolerating >=1 duplicate rows).
2026-05-04 10:15:18 +01:00
steve 81c611264d server: drain pending_runs on tick + on agent reconnect
Two trigger paths land here:

- A 30s ticker in cmd/server calls Server.DrainAllDue(ctx). It
  walks pending_runs rows whose next_attempt_at <= now, dedupes by
  host, skips offline hosts, and per online host runs DrainPending.

- onAgentHello spawns a background DrainPending(hostID). When a
  host comes back, every pending row for it is dispatchable now —
  due-ness becomes irrelevant once the wire is back.

Each row's schedule + group are reloaded; ErrNotFound or
disabled-schedule or gone-group abandons the row with a
pending_run.abandoned audit. attempt >= retry_max also abandons.
Otherwise dispatchBackupForGroup is invoked; success deletes the
row, failure bumps attempt with exponential backoff capped at
30m.
2026-05-04 10:15:18 +01:00
steve 194e6c9719 server: enqueue pending_runs when scheduled-job dispatch fails
When dispatchBackupForGroup's conn.Send errors, queue a pending_runs
row (attempt=1, next_attempt_at = now + group.RetryBackoffSeconds)
instead of silently dropping the fire. The orphaned queued job row
is left behind for forensic visibility — the drainer will create a
fresh job row on its retry.

Also adds Store.ListPendingRunsForHost — the on-reconnect drain
walks every row for the host, regardless of due-ness, since the
host being back makes 'due' irrelevant.
2026-05-04 10:15:18 +01:00
steve f29a9e49d3 server: fix stale RetentionPolicy comment + check Scan errors in maintenance test 2026-05-04 10:15:18 +01:00
steve e283d70c27 server: maintenance ticker drives forget/prune/check on cadence
Wires a 60s server-side ticker to the pure-logic maintenance.Decide
introduced in the previous commit. Decisions flow through a new
DispatchMaintenance method on *Server, which:

  - skips offline hosts (no pending_runs queueing — maintenance is
    not a backup, missed fires shouldn't pile up)
  - silently skips prune when admin creds aren't bound
  - pushes admin creds before prune, then dispatches with
    RequiresAdminCreds=true (same as operator-driven prune)
  - persists job rows with actor_kind="system"

Reshapes the forget wire payload from a single RetentionPolicy to a
ForgetGroups list (one tag + per-group keep-* per source group). The
agent walks the groups and runs `restic forget --tag <name> --keep-*`
once per group. Dead code removed: CommandRunPayload.RetentionPolicy,
the old forget JSON-decode in cmd/agent, and the single-policy form of
restic.RunForget.
2026-05-04 10:15:18 +01:00
steve dd35133459 maintenance: pure-logic ticker decides forget/prune/check fires 2026-05-04 10:15:18 +01:00
steve edce90d196 ui: hx-swap none on Run-now + truthful save banner + tailwind rebuild
Add hx-swap="none" to the three Run-now buttons (check/prune/unlock) in
host_repo.html to match the existing pattern on host_sources.html and
host_schedules.html. Fix all-blank admin-credentials save to redirect
without ?saved= query string so no false-positive banner is shown;
strengthen the corresponding test to assert Location has no ?saved=.
Rebuild CSS bundle via Tailwind to pick up max-w-[640px] JIT class.
2026-05-04 10:15:18 +01:00
steve ccccc6aa33 ui: Slice E — admin creds form + run-now buttons + repo health panel
- hostRepoPage gains AdminURL/AdminUsername/HasAdminPassword, Online,
  and StatsView (pre-dereferenced projection of host_repo_stats).
- loadHostRepoPage loads the admin slot (tolerating ErrNotFound),
  hub.Connected, and stats (tolerating ErrNotFound).
- renderRepoPage gains an adminErr parameter; all callers updated.
- handleUIAdminCredentialsSave / handleUIAdminCredentialsDelete added
  (form-POST handlers mirroring the repo-creds pattern, with audit).
- Routes /hosts/{id}/admin-credentials POST and /delete POST registered.
- Template: Admin credentials form after Connection, Run-now HTMX
  buttons after Maintenance, Repo health stats panel in right rail.
- Tests: 9 new tests covering rendering, disabled states, save/delete
  round-trips, audit rows, and idempotent delete.
2026-05-04 10:15:18 +01:00
steve b07cb14320 server: populate audit UserID on credential mutations + slog prune push errors
Switch handleSetHostCredentials, handleSetAdminCredentials, and
handleDeleteAdminCredentials from authedUser (bool) to requireUser
(*store.User) so AuditEntry.UserID and Actor are populated correctly.
Add slog.Warn on the non-ErrNotFound pushAdminCredsToAgent path in
handleRunRepoPrune so decrypt/send failures surface in the server log
rather than appearing as a generic host_offline 503.
2026-05-04 10:15:18 +01:00
steve 56dd7ab411 server: cover HTMX auth-redirect path in repo-ops tests 2026-05-04 10:15:18 +01:00
steve 0095e80fe9 server: HTTP run-now for prune / check / unlock
Adds POST /api/hosts/{id}/repo/{prune,check,unlock} (and matching outer
routes for HTMX form posts). Prune pushes the admin-cred slot via
pushAdminCredsToAgent before dispatch and refuses with
admin_creds_required when the slot is not set. Check reads
check_subset_pct from host_repo_maintenance (overridable via ?subset=N,
clamped 0-100; non-numeric override falls back to DB value silently).
Unlock needs no admin creds. All three share the same wantsHTML/HX-Redirect
response split as the per-source-group run-now endpoint.
2026-05-04 10:15:18 +01:00
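The `?subset=N` handling described in 0095e80fe9 (clamp to 0-100, silent fallback on a non-numeric override) could be sketched as follows. `subsetPct` is a hypothetical helper name, and treating an empty override as "use the DB value" is an assumption.

```go
package main

import (
	"fmt"
	"strconv"
)

// subsetPct resolves the effective check subset percentage.
// override is the raw ?subset= query value; dbValue is check_subset_pct.
func subsetPct(override string, dbValue int) int {
	if override == "" {
		return dbValue
	}
	n, err := strconv.Atoi(override)
	if err != nil {
		return dbValue // non-numeric override falls back silently
	}
	if n < 0 {
		return 0
	}
	if n > 100 {
		return 100
	}
	return n
}

func main() {
	for _, q := range []string{"", "abc", "-3", "40", "150"} {
		fmt.Printf("subset=%q -> %d\n", q, subsetPct(q, 25))
	}
}
```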
steve b66eb10524 server: admin-credentials REST + Slot:admin push helper
Adds GET/PUT/DELETE /api/hosts/{id}/admin-credentials handlers that
mirror the existing repo-credentials endpoints but write to
store.CredKindAdmin with AEAD additional-data "host:<id>:admin" (scoped
away from the repo slot to prevent cross-binding). PUT immediately pushes
a config.update(Slot:"admin") to the agent when it is connected, and the
new pushAdminCredsToAgent helper is wired for use by the upcoming prune
run-now endpoint (D2) to push on-demand before dispatch.
2026-05-04 10:15:18 +01:00
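To show why the additional-data string in b66eb10524 prevents cross-binding, here is a sketch using AES-256-GCM from the Go standard library. The commit does not name the AEAD construction, so GCM, the nonce-prefixed layout, the string host ID, and all function names below are assumptions.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// additionalData builds the "host:<id>:<kind>" binding string.
func additionalData(hostID, kind string) []byte {
	return []byte("host:" + hostID + ":" + kind)
}

// seal encrypts plaintext and prepends the random nonce to the result.
func seal(key, plaintext, ad []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, ad), nil
}

// open reverses seal; it fails if ad differs from the value used to seal.
func open(key, sealed, ad []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	n := gcm.NonceSize()
	if len(sealed) < n {
		return nil, errors.New("sealed blob shorter than nonce")
	}
	return gcm.Open(nil, sealed[:n], sealed[n:], ad)
}

func main() {
	key := make([]byte, 32)
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	sealed, err := seal(key, []byte("admin-password"), additionalData("42", "admin"))
	if err != nil {
		panic(err)
	}
	_, err = open(key, sealed, additionalData("42", "repo"))
	fmt.Println("opening the admin blob as a repo blob fails:", err != nil)
}
```

Because the slot name is part of the authenticated data, a ciphertext copied from the admin row into the repo row (or onto another host) fails authentication instead of decrypting.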
steve ee8538c928 agent: secrets fail-loud on corrupt blob + small polish
Save and SaveAdmin now propagate loadBundle errors instead of silently
overwriting a corrupt file (data-loss fix). Tests added for both paths.
reportStats logs a Debug on RunStats failure; r in runJob gets a comment
explaining the prune-runner asymmetry; runner_test comment tightened.
2026-05-04 10:15:18 +01:00
steve 6a86856cee agent/runner: ship repo.stats before job.finished in RunCheck/RunUnlock
RunCheck and RunUnlock were calling sendFinished before reportStats,
inverting the required job.started → log.stream → repo.stats →
job.finished envelope order. Move reportStats ahead of sendFinished in
both functions to match the pattern already correct in RunPrune.

Strengthen TestRunCheckShipsCheckStatus, TestRunCheckErrorsFoundShipsErrorsStatus,
and TestRunUnlockClearsLock with the same position-index ordering
assertions used by TestRunPruneShipsExpectedEnvelopes; these assertions
would have failed against the pre-fix code.
2026-05-04 10:15:18 +01:00
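The position-index ordering assertion mentioned in 6a86856cee can be approximated like this. The helper names are hypothetical and the envelopes are reduced to their type strings; the real tests inspect full envelopes.

```go
package main

import "fmt"

// indexOf returns the first position of want in kinds, or -1.
func indexOf(kinds []string, want string) int {
	for i, k := range kinds {
		if k == want {
			return i
		}
	}
	return -1
}

// orderedEnvelopes reports whether every kind in want is present in got
// and the first occurrences appear at strictly increasing positions.
func orderedEnvelopes(got []string, want ...string) bool {
	prev := -1
	for _, w := range want {
		i := indexOf(got, w)
		if i <= prev {
			return false
		}
		prev = i
	}
	return true
}

func main() {
	fixed := []string{"job.started", "log.stream", "log.stream", "repo.stats", "job.finished"}
	preFix := []string{"job.started", "log.stream", "job.finished", "repo.stats"}
	want := []string{"job.started", "log.stream", "repo.stats", "job.finished"}
	fmt.Println(orderedEnvelopes(fixed, want...), orderedEnvelopes(preFix, want...))
}
```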
steve bc6a91b064 agent: RunPrune/RunCheck/RunUnlock + reportStats + admin-cred slot dispatch
Extract resticEnv/sendStarted/streamHandler/sendFinished helpers to remove
boilerplate duplication across Run* methods. Add RunPrune (ships repo.stats
with LastPruneAt before job.finished), RunCheck (ships stats with
LastCheckStatus/LockPresent regardless of outcome), RunUnlock (ships
LockPresent=false on success), and reportStats (fills size fields via
RunStats when caller didn't populate them).

Wire JobPrune/JobCheck/JobUnlock into the dispatcher switch; teach
MsgConfigUpdate about the Slot discriminator for admin vs repo creds;
add strconv import for subset-pct parsing.
2026-05-04 10:15:18 +01:00
steve 1cc375389a agent/secrets: separate admin slot with backwards-compatible decode
Split the on-disk bundle into repo + admin slots. Legacy flat Repo blobs
are detected at load time by the presence of "repo_url" at the top level
and transparently promoted into the new shape on the next Save/SaveAdmin.
Adds ErrNoAdmin sentinel, LoadAdmin, SaveAdmin, and three new tests.
2026-05-04 10:15:18 +01:00
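A sketch of the legacy-blob detection from 1cc375389a, operating on already-decrypted JSON. Only the top-level "repo_url" probe comes from the commit message; the `Repo` and `Bundle` types, the "repo"/"admin"/"password" keys, and `decodeBundle` are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Repo is an assumed credential shape; only "repo_url" is from the commit.
type Repo struct {
	RepoURL  string `json:"repo_url"`
	Password string `json:"password"`
}

// Bundle is the assumed two-slot shape.
type Bundle struct {
	Repo  *Repo `json:"repo,omitempty"`
	Admin *Repo `json:"admin,omitempty"`
}

// decodeBundle accepts both shapes. A top-level "repo_url" key marks a
// legacy flat blob, which is promoted into the repo slot in memory.
func decodeBundle(data []byte) (Bundle, error) {
	var probe map[string]json.RawMessage
	if err := json.Unmarshal(data, &probe); err != nil {
		return Bundle{}, err
	}
	if _, legacy := probe["repo_url"]; legacy {
		var r Repo
		if err := json.Unmarshal(data, &r); err != nil {
			return Bundle{}, err
		}
		return Bundle{Repo: &r}, nil
	}
	var b Bundle
	if err := json.Unmarshal(data, &b); err != nil {
		return Bundle{}, err
	}
	return b, nil
}

func main() {
	legacy := []byte(`{"repo_url":"rest:https://backup.example.invalid/r","password":"x"}`)
	b, err := decodeBundle(legacy)
	fmt.Println(b.Repo != nil, b.Admin == nil, err)
}
```

Promotion happens in memory only; per the commit, the new shape reaches disk on the next Save or SaveAdmin.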
steve 7f33d5dad6 api: stats partial-update payload + ConfigUpdate.Slot + CommandRun.RequiresAdminCreds
Reshape RepoStatsPayload into pointer-field partial-update form matching
store.HostRepoStats semantics; add Slot discriminator to ConfigUpdatePayload
for admin vs repo credential routing; add RequiresAdminCreds flag to
CommandRunPayload for prune/unlock jobs that need delete authority.
2026-05-04 10:15:18 +01:00
steve 237a86dee5 restic: tighten RunCheck lock sniff + RunStats zero-snapshot test
Narrow the LockPresent predicate from bare "locked" (too broad) to
"stale lock" and "already locked" — the two phrases restic actually
emits. Replace TestRunCheckParsesLock with table-driven
TestRunCheckLockSniff covering both trigger phrases and a benign
"locked-file" line that must not set LockPresent. Add
TestRunStatsZeroSnapshots to pin that RunStats accepts zero-snapshot
JSON without error.
2026-05-04 10:15:18 +01:00
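The narrowed predicate from 237a86dee5 might reduce to something like the function below. The two trigger phrases are from the commit message; the function name, the case-insensitive match, and the sample lines in `main` are illustrative assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// lockPresent reports whether a restic output line indicates a repository
// lock. Matching on bare "locked" is avoided because it is too broad.
func lockPresent(line string) bool {
	l := strings.ToLower(line)
	return strings.Contains(l, "stale lock") || strings.Contains(l, "already locked")
}

func main() {
	// Sample lines are illustrative, not captured restic output.
	for _, line := range []string{
		"Found stale lock",
		"repository is already locked",
		"skipping locked-file",
	} {
		fmt.Printf("%-32q -> %v\n", line, lockPresent(line))
	}
}
```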
steve 3336485d02 restic: RunUnlock + RunStats (raw-data mode)
Add RunUnlock (delegates straight to runWithPump) and RunStats which
runs `restic stats --json --mode raw-data`, captures the single JSON
line from stdout into RepoStats, and returns an error if no JSON
arrives.  Tests cover arg plumbing for unlock, JSON parsing, and the
no-JSON error path.
2026-05-04 10:15:18 +01:00
steve 4cc6962ec6 restic: RunCheck with subset% + lock-state sniffing
Add CheckResult (LockPresent, ErrorsFound) and RunCheck.  subsetPct>0
passes --read-data-subset N% to limit data reads.  Stderr is sniffed
for "Found stale lock"/"locked" to set LockPresent; a non-zero exit
from restic is absorbed as ErrorsFound=true rather than an error so
the caller can always persist last_check_status.  Tests cover lock
detection, exit-1 absorption, and subset-arg plumbing.
2026-05-04 10:15:18 +01:00
steve fe04bba0fa restic: RunPrune + runWithPump helper, refactor Forget/Init onto it
Add RunPrune for admin-credential prune invocations.  Extract
runWithPump to DRY the stdout+stderr pump pattern; refactor RunForget
and RunInit to delegate to it (RunInit preserves the "config file
already exists" soft-success sniff by wrapping the handler before the
call).  Add runner_test.go with TestRunPruneInvokesPrune.
2026-05-04 10:15:18 +01:00
steve ea2590e27b store: tighten CHECK constraint on host_repo_stats.last_check_status 2026-05-04 10:15:18 +01:00
steve 779f5aac47 store: wrap UpsertHostRepoStats in a transaction (concurrency safety) 2026-05-04 10:15:18 +01:00
steve d4821714a5 store: assert CHECK constraint on host_credentials.kind 2026-05-04 10:15:18 +01:00
steve d92aa6d65c store: HostRepoStats projection (size, lock, last-check, last-prune) 2026-05-04 10:15:18 +01:00
steve 2055ce360b store: host_credentials becomes kind-aware (repo + admin slots) 2026-05-04 10:15:18 +01:00
steve 65f65f87aa store: migration 0009 — admin-creds kind + host_repo_stats 2026-05-04 10:15:18 +01:00
steve a30c8d61d5 plan: P2 redesign Phase 5 (P2R-03..P2R-08) 2026-05-04 10:15:18 +01:00
2 changed files with 331 additions and 5 deletions
+4 -5
@@ -3,11 +3,10 @@
 # Notes for anyone editing this file:
 #
 # Self-hosted runner expectations
-# The Gitea runners are provisioned out-of-band (the infra team owns
-# the script). Each runner host bind-mounts persistent volumes for
-# /root/go/pkg/mod (GOMODCACHE), /root/.cache/go-build (GOCACHE), and
-# /root/.cache/act (action clones) into every job container. As a
-# result:
+# The Gitea runners are provisioned via scripts/provision-gitea-runner.sh.
+# That script bind-mounts persistent host volumes for /root/go/pkg/mod
+# (GOMODCACHE), /root/.cache/go-build (GOCACHE), and /root/.cache/act
+# (action clones) into every job container. As a result:
 # * `cache: true` on actions/setup-go is intentionally OMITTED — the
 # action would otherwise tar/untar GOMODCACHE+GOCACHE through the
 # Gitea cache backend on every job, undoing the host-volume cache
+327
@@ -0,0 +1,327 @@
#!/usr/bin/env bash
#
# provision-gitea-runner.sh — one-shot, idempotent host setup for an
# act_runner LXC. Speeds up Gitea Actions runs by:
#
# 1. Disabling forced docker pulls (image refresh moves to a cron).
# 2. Mounting persistent host volumes for Go module/build caches and
# the act-actions clone cache.
# 3. Pre-pulling the runner-images container image.
# 4. Pre-cloning a configurable list of GitHub actions into the
# act cache so jobs don't fetch them on every run.
# 5. Installing golangci-lint (latest v2.x) at /usr/local/bin.
# 6. Setting up a nightly cron to refresh image + action clones +
# golangci-lint.
#
# The script is generic — no per-project state. Point it at any LXC
# running act_runner as a systemd service and it will provision the
# host. Re-runs are safe; they reconcile state.
#
# Usage: sudo ./provision-gitea-runner.sh
#
# Configurable via environment variables (defaults shown):
#
# CACHE_BASE=/var/cache/gitea-runner
# ACT_RUNNER_CONFIG=/etc/act_runner/config.yaml
# RUNNER_IMAGE=docker.gitea.com/runner-images:ubuntu-latest
# ACTIONS_TO_PRECLONE (one line, space-separated string; not a bash array):
#   actions/checkout@v4 actions/setup-go@v5
#   actions/upload-artifact@v4 golangci/golangci-lint-action@v7
#
# To add more pre-cloned actions later, re-run this script with
# ACTIONS_TO_PRECLONE set. The list is materialised into
# /usr/local/sbin/gitea-runner-refresh, which only refreshes existing clones.
set -euo pipefail
# ---------- defaults ---------------------------------------------------
: "${CACHE_BASE:=/var/cache/gitea-runner}"
: "${ACT_RUNNER_CONFIG:=/etc/act_runner/config.yaml}"
: "${RUNNER_IMAGE:=docker.gitea.com/runner-images:ubuntu-latest}"
DEFAULT_ACTIONS=(
"actions/checkout@v4"
"actions/setup-go@v5"
"actions/upload-artifact@v4"
"golangci/golangci-lint-action@v7"
)
# Allow caller to override by exporting ACTIONS_TO_PRECLONE as a
# space-separated string (env vars can't carry arrays cleanly).
if [[ -n "${ACTIONS_TO_PRECLONE:-}" ]]; then
read -r -a ACTIONS <<<"${ACTIONS_TO_PRECLONE}"
else
ACTIONS=("${DEFAULT_ACTIONS[@]}")
fi
# ---------- helpers ----------------------------------------------------
log() { printf '\033[1;36m==>\033[0m %s\n' "$*"; }
warn() { printf '\033[1;33m==>\033[0m %s\n' "$*" >&2; }
die() { printf '\033[1;31m==>\033[0m %s\n' "$*" >&2; exit 1; }
require_cmd() {
command -v "$1" >/dev/null 2>&1 || die "missing: $1 (install it first)"
}
# sha256_url <url> — act_runner names its action-clone dirs after
# sha256(URL). Verified against a real run log:
# url=https://github.com/actions/checkout
# sha256=c3fe249fe73091a17d6638fe1341e7bd0bcc3466ce52323c0688e83e2463a4ab
sha256_url() {
printf '%s' "$1" | sha256sum | awk '{print $1}'
}
# ---------- pre-flight -------------------------------------------------
[[ $EUID -eq 0 ]] || die "run as root (the act_runner service writes /var/lib/act_runner as root)"
require_cmd systemctl
require_cmd docker
require_cmd git
require_cmd curl
require_cmd python3
# PyYAML for the config edit. Install if missing — Ubuntu 24.04 ships
# python3-yaml in the default repos.
if ! python3 -c 'import yaml' 2>/dev/null; then
log "installing python3-yaml (needed for safe YAML edits)"
apt-get update -qq
apt-get install -y -qq python3-yaml
fi
[[ -f "$ACT_RUNNER_CONFIG" ]] || die "$ACT_RUNNER_CONFIG not found — is act_runner installed?"
systemctl list-unit-files act_runner.service >/dev/null 2>&1 || \
die "act_runner.service not found — register the runner first"
log "pre-flight OK"
log " cache base : $CACHE_BASE"
log " config file : $ACT_RUNNER_CONFIG"
log " runner image : $RUNNER_IMAGE"
log " actions to clone : ${ACTIONS[*]}"
# ---------- 1. cache directories ---------------------------------------
log "creating cache directories under $CACHE_BASE"
for sub in go-mod go-build act-actions; do
install -d -m 0755 -o root -g root "$CACHE_BASE/$sub"
done
# ---------- 2. edit /etc/act_runner/config.yaml ------------------------
#
# Three keys are reconciled to known values:
#
# container.force_pull : false (we keep the image fresh via cron)
# container.options : "-v <cache mounts...>" (auto-mount caches
# into every job container)
# container.valid_volumes: [<our cache paths>] (whitelist so the
# container.options mounts are accepted)
#
# Other keys are preserved verbatim. The edit is idempotent: re-running
# yields the same file content.
log "patching $ACT_RUNNER_CONFIG"
# Backup once (only if no .pre-provision backup exists yet).
if [[ ! -f "${ACT_RUNNER_CONFIG}.pre-provision" ]]; then
cp -p "$ACT_RUNNER_CONFIG" "${ACT_RUNNER_CONFIG}.pre-provision"
log " saved pristine copy to ${ACT_RUNNER_CONFIG}.pre-provision"
fi
CONTAINER_OPTIONS_VALUE="-v ${CACHE_BASE}/go-mod:/root/go/pkg/mod:rw -v ${CACHE_BASE}/go-build:/root/.cache/go-build:rw -v ${CACHE_BASE}/act-actions:/root/.cache/act:rw"
CACHE_BASE="$CACHE_BASE" CONTAINER_OPTIONS_VALUE="$CONTAINER_OPTIONS_VALUE" \
ACT_RUNNER_CONFIG="$ACT_RUNNER_CONFIG" \
python3 - <<'PY'
import os, sys, yaml
cfg_path = os.environ['ACT_RUNNER_CONFIG']
cache_base = os.environ['CACHE_BASE']
container_options = os.environ['CONTAINER_OPTIONS_VALUE']
with open(cfg_path) as f:
    cfg = yaml.safe_load(f) or {}
cfg.setdefault('container', {})
cfg['container']['force_pull'] = False
cfg['container']['options'] = container_options
# Whitelist every cache subdir explicitly so jobs that try to bind-mount
# them via workflow-side `volumes:` (rare but possible) are accepted.
desired_vols = [
f"{cache_base}/go-mod",
f"{cache_base}/go-build",
f"{cache_base}/act-actions",
]
existing = cfg['container'].get('valid_volumes') or []
merged = list(dict.fromkeys(existing + desired_vols)) # de-dup, preserve order
cfg['container']['valid_volumes'] = merged
# Write back with stable formatting. yaml.dump preserves enough
# structure for act_runner to parse; comments in the original config
# do get stripped — that's why we preserve the .pre-provision backup.
with open(cfg_path + '.tmp', 'w') as f:
    yaml.safe_dump(cfg, f, default_flow_style=False, sort_keys=False)
os.replace(cfg_path + '.tmp', cfg_path)
print(f" container.force_pull : false")
print(f" container.options : {container_options}")
print(f" container.valid_volumes: {merged}")
PY
# ---------- 3. pre-pull the runner image -------------------------------
log "pulling $RUNNER_IMAGE (one-time; cron refreshes it nightly)"
docker pull "$RUNNER_IMAGE"
# ---------- 4. pre-clone the actions list ------------------------------
#
# act_runner expects clones at $cache/<sha256(url)> with the ref already
# checked out. We clone the default branch then fetch + check out the
# requested ref. Re-running fetches updates rather than re-cloning.
log "pre-cloning actions into $CACHE_BASE/act-actions"
for spec in "${ACTIONS[@]}"; do
if [[ "$spec" != *@* ]]; then
warn " skip '$spec' — must be owner/repo@ref"
continue
fi
repo="${spec%@*}"
ref="${spec##*@}"
url="https://github.com/${repo}"
dir="${CACHE_BASE}/act-actions/$(sha256_url "$url")"
if [[ -d "$dir/.git" ]]; then
log " refresh $repo @ $ref"
git -C "$dir" fetch --quiet --tags --prune origin
else
log " clone $repo @ $ref → $dir"
git clone --quiet "$url" "$dir"
fi
# Detach onto the requested ref. Works for branches, tags, and SHAs.
if ! git -C "$dir" -c advice.detachedHead=false checkout --quiet "$ref" 2>/dev/null; then
# If `ref` is a remote branch we haven't tracked yet, try origin/<ref>.
git -C "$dir" -c advice.detachedHead=false checkout --quiet "origin/$ref"
fi
done
# ---------- 5. golangci-lint -------------------------------------------
#
# Install the latest v2.x at /usr/local/bin/golangci-lint. Workflows
# that pin a specific version via the action's `version:` arg will
# still re-download — but jobs that don't pin (or pin to "latest"/"v2")
# get the host-installed binary for free.
log "installing/updating golangci-lint (latest v2.x) → /usr/local/bin"
GOLANGCI_INSTALL_URL="https://raw.githubusercontent.com/golangci/golangci-lint/HEAD/install.sh"
# `-b` = install dir. No version arg means "latest", which install.sh
# resolves from GitHub releases (currently the v2 line). `-d` is not
# passed: in install.sh it turns on debug logging, not quiet output.
curl -fsSL "$GOLANGCI_INSTALL_URL" | sh -s -- -b /usr/local/bin >/dev/null
/usr/local/bin/golangci-lint --version || warn "golangci-lint install verification failed"
# ---------- 6. nightly refresh cron ------------------------------------
#
# Re-pulls the runner image, refreshes the action clones, and updates
# golangci-lint. Runs at 03:17 to dodge top-of-hour CI bursts.
CRON_PATH=/etc/cron.d/gitea-runner-refresh
REFRESH_SCRIPT=/usr/local/sbin/gitea-runner-refresh
log "writing $REFRESH_SCRIPT and $CRON_PATH"
# Materialise the actions list into the script so the cron job is
# self-contained and survives an edit to this file.
cat >"$REFRESH_SCRIPT" <<EOF
#!/usr/bin/env bash
# Auto-generated by provision-gitea-runner.sh. Re-running the
# provisioning script regenerates this file.
set -euo pipefail
CACHE_BASE="$CACHE_BASE"
RUNNER_IMAGE="$RUNNER_IMAGE"
ACTIONS=(
$(printf ' "%s"\n' "${ACTIONS[@]}")
)
sha256_url() { printf '%s' "\$1" | sha256sum | awk '{print \$1}'; }
# 1. Refresh the runner-images base.
docker pull -q "\$RUNNER_IMAGE" >/dev/null
# 2. Refresh action clones.
for spec in "\${ACTIONS[@]}"; do
[[ "\$spec" == *@* ]] || continue
repo="\${spec%@*}"; ref="\${spec##*@}"
url="https://github.com/\$repo"
dir="\$CACHE_BASE/act-actions/\$(sha256_url "\$url")"
if [[ -d "\$dir/.git" ]]; then
git -C "\$dir" fetch --quiet --tags --prune origin || true
git -C "\$dir" -c advice.detachedHead=false checkout --quiet "\$ref" 2>/dev/null \\
|| git -C "\$dir" -c advice.detachedHead=false checkout --quiet "origin/\$ref" || true
fi
done
# 3. Refresh golangci-lint (latest v2.x). Tolerate transient
# GitHub-rate-limit failures — next night will retry.
curl -fsSL https://raw.githubusercontent.com/golangci/golangci-lint/HEAD/install.sh \\
| sh -s -- -b /usr/local/bin >/dev/null 2>&1 || true
EOF
chmod 0755 "$REFRESH_SCRIPT"
cat >"$CRON_PATH" <<EOF
# Auto-generated by provision-gitea-runner.sh. Refreshes the runner
# image, action clones, and golangci-lint every night at 03:17.
SHELL=/bin/bash
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
17 3 * * * root $REFRESH_SCRIPT >> /var/log/gitea-runner-refresh.log 2>&1
EOF
chmod 0644 "$CRON_PATH"
# ---------- 7. restart act_runner --------------------------------------
log "restarting act_runner.service to pick up the new config"
systemctl restart act_runner.service
sleep 2
systemctl is-active --quiet act_runner.service \
|| die "act_runner did not come back up — check 'journalctl -u act_runner -n 50'"
# ---------- 8. container-create benchmark ------------------------------
#
# Reports cold + warm `docker run --rm <image> true` time. Sanity check
# that overlay setup is fast on this host. Numbers > ~5s indicate a
# slow filesystem or DNS issue worth investigating separately.
log "benchmark: docker run --rm $RUNNER_IMAGE true"
{
printf ' cold (post-pull) : '
/usr/bin/time -f '%e s' docker run --rm "$RUNNER_IMAGE" true 2>&1 | tail -1
printf ' warm (immediate) : '
/usr/bin/time -f '%e s' docker run --rm "$RUNNER_IMAGE" true 2>&1 | tail -1
} || warn "benchmark failed — non-fatal"
# ---------- done -------------------------------------------------------
# cat does not interpret \033 escapes, so print the coloured banner with printf.
printf '\033[1;32m==> Provisioning complete\033[0m\n'
cat <<EOF
What changed on this host:
* /etc/act_runner/config.yaml — force_pull off, container.options +
valid_volumes set for the cache mounts. Pristine copy preserved
at ${ACT_RUNNER_CONFIG}.pre-provision.
* $CACHE_BASE/{go-mod,go-build,act-actions} — persistent caches.
* /usr/local/bin/golangci-lint — latest v2.x.
* $REFRESH_SCRIPT and $CRON_PATH — nightly refresh @ 03:17.
* Runner image pre-pulled.
$(printf '\033[1;33m')Note on Go cache + setup-go:$(printf '\033[0m') if your workflow uses
\`actions/setup-go\` with \`cache: true\`, the action will still tar/untar
the cache via the Gitea cache backend on every job — partially
defeating the persistent volume. For full speed-up, drop \`cache: true\`
from the workflow once the persistent volume is warm. Per-project
decision; this script doesn't touch workflows.
EOF