36 Commits

Author SHA1 Message Date
steve aa80a3418e test: write-then-rename script-bin helpers (avoid ETXTBSY under -race)
CI / Test (linux/amd64) (pull_request) Successful in 48s
CI / Lint (pull_request) Successful in 23s
CI / Build (linux/amd64) (pull_request) Successful in 17s
CI / Build (linux/arm64) (pull_request) Successful in 18s
CI / Build (windows/amd64) (pull_request) Failing after 15m8s
CI run #48 failed with:

  --- FAIL: TestRunInitShipsStartedAndFinished
      RunInit: ... fork/exec /tmp/.../restic: text file busy

setupScript and setupScriptBin used os.WriteFile to write a shell
script directly at the final path, then exec'd it. Under -race +
many t.Parallel tests, a fork in another goroutine could inherit
the still-open writable fd from one of those WriteFile calls; the
kernel fails exec with ETXTBSY while any process on the system
still holds the target file open for writing.

Fix: write to "<path>.tmp", then os.Rename into place. The rename
is a pure dirent op; by the time the final path exists, no process
has a writable fd on its inode and exec is safe. -race + -count=5
on both runner packages now passes consistently.
2026-05-04 09:43:27 +01:00
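A minimal sketch of the write-then-rename pattern described in aa80a3418e. The helper name writeScript and the demo function are hypothetical; the real setupScript/setupScriptBin helpers are not shown on this page.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeScript is a hypothetical helper: it writes body to "<path>.tmp"
// and renames it into place, so the final path never names a file that
// this function still holds open for writing.
func writeScript(path string, body []byte) error {
	tmp := path + ".tmp"
	if err := os.WriteFile(tmp, body, 0o755); err != nil {
		return err
	}
	if err := os.Rename(tmp, path); err != nil {
		os.Remove(tmp) // best-effort cleanup of the staging file
		return err
	}
	return nil
}

// demo writes a script into a fresh temp dir and checks the result.
func demo() error {
	dir, err := os.MkdirTemp("", "scriptbin")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir)

	final := filepath.Join(dir, "restic")
	if err := writeScript(final, []byte("#!/bin/sh\nexit 0\n")); err != nil {
		return err
	}
	if _, err := os.Stat(final); err != nil {
		return fmt.Errorf("final path missing: %w", err)
	}
	if _, err := os.Stat(final + ".tmp"); !os.IsNotExist(err) {
		return fmt.Errorf("staging file still present")
	}
	return nil
}

func main() {
	if err := demo(); err != nil {
		fmt.Println("error:", err)
		os.Exit(1)
	}
	fmt.Println("ok")
}
```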
steve ac9d7b92ed api+agent: document protocol-version stability and forget back-compat decisions
CI / Test (linux/amd64) (pull_request) Failing after 1m16s
CI / Lint (pull_request) Successful in 23s
CI / Build (windows/amd64) (pull_request) Successful in 23s
CI / Build (linux/amd64) (pull_request) Successful in 25s
CI / Build (linux/arm64) (pull_request) Successful in 22s
version.go: add a comment block explaining why Phase 5's wire changes
(CommandRunPayload, ConfigUpdatePayload, RepoStatsPayload reshapes) did
not bump CurrentProtocolVersion — lockstep deploy, no rolling-upgrade
path, smoke env restage enforces it. Notes where a version bump to 2
would be required if a multi-version path is ever introduced.

cmd/agent/main.go: document why the JobForget handler hard-errors on
empty ForgetGroups rather than falling back to a single-policy form.
The maintenance ticker is the only writer and always populates the
field; the fallback was specced but skipped given lockstep deploy.
2026-05-04 00:33:51 +01:00
steve 556d65d77f server: serialize DrainPending per host (avoid drain double-dispatch)
Add a per-host drain mutex (drainLocks map guarded by drainLocksMu) on
the Server struct. DrainPending acquires it with TryLock: if a drain is
already in-flight for this host, the call returns immediately — the
running drain will see every pending row. This prevents the on-hello
goroutine and the 30s tick from both listing the same host's rows and
dispatching them twice.

Update three existing tests that called srv.DrainPending explicitly
after the on-hello goroutine had already been spawned: replace the
now-redundant direct call with a waitForPendingCount poll so they don't
race the goroutine's mutex ownership. Add TestDrainPendingSerializesPerHost
which fires 10 concurrent DrainPending goroutines against a 5-row queue
and asserts exactly 5 job rows result.
2026-05-04 00:33:13 +01:00
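The per-host TryLock guard from 556d65d77f could look roughly like the sketch below. The hostLocks type and its method names are illustrative only, loosely modelled on the drainLocks / drainLocksMu fields named in the message; sync.Mutex.TryLock needs Go 1.18 or later.

```go
package main

import (
	"fmt"
	"sync"
)

// hostLocks hands out one mutex per host ID (hypothetical type).
type hostLocks struct {
	mu    sync.Mutex             // guards the map itself
	locks map[string]*sync.Mutex // one drain lock per host
}

func (h *hostLocks) lockFor(hostID string) *sync.Mutex {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.locks == nil {
		h.locks = make(map[string]*sync.Mutex)
	}
	l, ok := h.locks[hostID]
	if !ok {
		l = &sync.Mutex{}
		h.locks[hostID] = l
	}
	return l
}

// drain runs fn unless a drain for hostID is already in flight, in
// which case it returns false immediately without running fn.
func (h *hostLocks) drain(hostID string, fn func()) bool {
	l := h.lockFor(hostID)
	if !l.TryLock() {
		return false
	}
	defer l.Unlock()
	fn()
	return true
}

func main() {
	var h hostLocks
	sameHost, otherHost := true, false
	ran := h.drain("host-a", func() {
		// While host-a is draining, a second drain for host-a is
		// skipped, but a drain for another host still runs.
		sameHost = h.drain("host-a", func() {})
		otherHost = h.drain("host-b", func() {})
	})
	fmt.Println(ran, sameHost, otherHost)
}
```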
steve 7ee8d2311b store: LatestJobByKind includes in-flight jobs (avoid maintenance double-fire)
Widen the SQL query to consider all statuses (queued, running,
succeeded, failed, cancelled) rather than terminal-only. An in-flight
prune that outlasts the 60s tick interval previously produced
ErrNotFound, causing the ticker to anchor at now-24h and fire a second
prune concurrently with the first.

Update the doc comment and test: remove the "queued job filtered out"
case, add assertions that a running job and a queued job are each
returned as the latest.
2026-05-04 00:29:52 +01:00
steve e4dd7f96d6 tasks: tick P2R-03 through P2R-08 done 2026-05-04 00:20:19 +01:00
steve 6afde3ce23 diag: phase 5 Playwright sweep screenshots 2026-05-04 00:19:50 +01:00
steve 775e988340 server/ws: persist repo.stats into host_repo_stats 2026-05-04 00:10:41 +01:00
steve e2cf9a68f6 server: drainer abandons only on ErrNotFound, not transient errors
GetSourceGroup errors in drainOne now gate on errors.Is(err,
store.ErrNotFound) before calling abandonPending, mirroring the
existing GetSchedule pattern. Transient errors (SQLITE_BUSY, context
cancellation) now log a warning and return without deleting the row.

Add regression test TestDrainPendingDropsRowsForGoneSourceGroup
confirming the ErrNotFound path still abandons correctly. Also add
a comment above the backoff-doubling loop explaining the progression.
2026-05-04 00:07:33 +01:00
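The errors.Is gate in e2cf9a68f6 can be illustrated with a small sketch. ErrNotFound here is a local stand-in for store.ErrNotFound, and shouldAbandon is a hypothetical name, not a function from the repository.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotFound is a local stand-in for store.ErrNotFound.
var ErrNotFound = errors.New("not found")

// Example errors: one wraps the sentinel, one is unrelated.
var (
	errWrappedNotFound = fmt.Errorf("get source group: %w", ErrNotFound)
	errBusy            = errors.New("database is locked")
)

// shouldAbandon reports whether a lookup error means the row can never
// be dispatched. Only a (possibly wrapped) ErrNotFound qualifies; any
// other error is treated as transient and the row is kept.
func shouldAbandon(err error) bool {
	return errors.Is(err, ErrNotFound)
}

func main() {
	fmt.Println(shouldAbandon(errWrappedNotFound), shouldAbandon(errBusy))
}
```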
steve 1e212db24e server: drainer uses dispatch-core to avoid duplicate pending_run enqueue
Extract dispatchBackupForGroupCore (persist+marshal+send, no enqueue on
failure) from dispatchBackupForGroup. drainOne now calls the core
directly so a failed Send only bumps the existing pending_runs row via
BumpPendingRunAttempt instead of creating a second row, which stops the
geometric duplication on repeated drain failures.

dispatchBackupForGroup (schedule.fire path) wraps the core and keeps
its enqueue-on-failure behaviour unchanged.

TestDrainPendingBumpsOnSendFailure strengthened: asserts exactly 1 row
remains after a send failure (was tolerating >=1 duplicate rows).
2026-05-04 00:01:42 +01:00
steve a9185424d3 server: drain pending_runs on tick + on agent reconnect
Two trigger paths land here:

- A 30s ticker in cmd/server calls Server.DrainAllDue(ctx). It
  walks pending_runs rows whose next_attempt_at <= now, dedupes by
  host, skips offline hosts, and per online host runs DrainPending.

- onAgentHello spawns a background DrainPending(hostID). When a
  host comes back, every pending row for it is dispatchable now —
  due-ness becomes irrelevant once the wire is back.

Each row's schedule + group are reloaded; ErrNotFound or
disabled-schedule or gone-group abandons the row with a
pending_run.abandoned audit. attempt >= retry_max also abandons.
Otherwise dispatchBackupForGroup is invoked; success deletes the
row, failure bumps attempt with exponential backoff capped at
30m.
2026-05-03 23:57:08 +01:00
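One way to express the "exponential backoff capped at 30m" rule from a9185424d3 is sketched below. The function name, the attempt numbering (1 = first retry), and the one-minute base used in main are assumptions; only the doubling and the 30-minute cap come from the commit messages.

```go
package main

import (
	"fmt"
	"time"
)

const maxBackoff = 30 * time.Minute // cap named in the commit message

// nextBackoff doubles base once per prior attempt and clamps the
// result at maxBackoff. attempt is 1 for the first retry.
func nextBackoff(base time.Duration, attempt int) time.Duration {
	d := base
	for i := 1; i < attempt; i++ {
		d *= 2
		if d >= maxBackoff {
			return maxBackoff
		}
	}
	if d > maxBackoff {
		return maxBackoff
	}
	return d
}

func main() {
	for attempt := 1; attempt <= 7; attempt++ {
		fmt.Println(attempt, nextBackoff(time.Minute, attempt))
	}
}
```

Returning early from the loop once the cap is reached also keeps the doubling from overflowing for large attempt counts.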
steve 9c5037ec54 server: enqueue pending_runs when scheduled-job dispatch fails
When dispatchBackupForGroup's conn.Send errors, queue a pending_runs
row (attempt=1, next_attempt_at = now + group.RetryBackoffSeconds)
instead of silently dropping the fire. The orphaned queued job row
is left behind for forensic visibility — the drainer will create a
fresh job row on its retry.

Also adds Store.ListPendingRunsForHost — the on-reconnect drain
walks every row for the host, regardless of due-ness, since the
host being back makes 'due' irrelevant.
2026-05-03 23:53:57 +01:00
steve ed839aacb4 server: fix stale RetentionPolicy comment + check Scan errors in maintenance test 2026-05-03 23:50:05 +01:00
steve 7d2c2ae1c2 server: maintenance ticker drives forget/prune/check on cadence
Wires a 60s server-side ticker to the pure-logic maintenance.Decide
introduced in the previous commit. Decisions flow through a new
DispatchMaintenance method on *Server, which:

  - skips offline hosts (no pending_runs queueing — maintenance is
    not a backup, missed fires shouldn't pile up)
  - silently skips prune when admin creds aren't bound
  - pushes admin creds before prune, then dispatches with
    RequiresAdminCreds=true (same as operator-driven prune)
  - persists job rows with actor_kind="system"

Reshapes the forget wire payload from a single RetentionPolicy to a
ForgetGroups list (one tag + per-group keep-* per source group). The
agent walks the groups and runs `restic forget --tag <name> --keep-*`
once per group. Dead code removed: CommandRunPayload.RetentionPolicy,
the old forget JSON-decode in cmd/agent, and the single-policy form of
restic.RunForget.
2026-05-03 23:40:35 +01:00
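As an illustration of the per-group forget call described in 7d2c2ae1c2, the sketch below turns one group into a `restic forget` argument list. The ForgetGroup field names are assumptions (the real payload shape is not shown here); the --tag and --keep-* flags are standard restic forget options.

```go
package main

import (
	"fmt"
	"strconv"
)

// ForgetGroup is a hypothetical shape for one entry of the ForgetGroups
// list: a tag plus that group's keep-* counts.
type ForgetGroup struct {
	Tag         string
	KeepLast    int
	KeepDaily   int
	KeepWeekly  int
	KeepMonthly int
}

// forgetArgs builds the argument list for one `restic forget` call.
// Keep counts of zero or less are left out.
func forgetArgs(g ForgetGroup) []string {
	args := []string{"forget", "--tag", g.Tag}
	add := func(flag string, n int) {
		if n > 0 {
			args = append(args, flag, strconv.Itoa(n))
		}
	}
	add("--keep-last", g.KeepLast)
	add("--keep-daily", g.KeepDaily)
	add("--keep-weekly", g.KeepWeekly)
	add("--keep-monthly", g.KeepMonthly)
	return args
}

func main() {
	groups := []ForgetGroup{
		{Tag: "etc", KeepDaily: 7, KeepWeekly: 4},
		{Tag: "home", KeepLast: 3},
	}
	// One forget invocation per source group.
	for _, g := range groups {
		fmt.Println(forgetArgs(g))
	}
}
```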
steve a131419b1a maintenance: pure-logic ticker decides forget/prune/check fires 2026-05-03 23:36:48 +01:00
steve 9d727a7b3a ui: hx-swap none on Run-now + truthful save banner + tailwind rebuild
Add hx-swap="none" to the three Run-now buttons (check/prune/unlock) in
host_repo.html to match the existing pattern on host_sources.html and
host_schedules.html. Fix all-blank admin-credentials save to redirect
without ?saved= query string so no false-positive banner is shown;
strengthen the corresponding test to assert Location has no ?saved=.
Rebuild CSS bundle via Tailwind to pick up max-w-[640px] JIT class.
2026-05-03 23:29:01 +01:00
steve 1cbc856514 ui: Slice E — admin creds form + run-now buttons + repo health panel
- hostRepoPage gains AdminURL/AdminUsername/HasAdminPassword, Online,
  and StatsView (pre-dereferenced projection of host_repo_stats).
- loadHostRepoPage loads the admin slot (tolerating ErrNotFound),
  hub.Connected, and stats (tolerating ErrNotFound).
- renderRepoPage gains an adminErr parameter; all callers updated.
- handleUIAdminCredentialsSave / handleUIAdminCredentialsDelete added
  (form-POST handlers mirroring the repo-creds pattern, with audit).
- Routes /hosts/{id}/admin-credentials POST and /delete POST registered.
- Template: Admin credentials form after Connection, Run-now HTMX
  buttons after Maintenance, Repo health stats panel in right rail.
- Tests: 9 new tests covering rendering, disabled states, save/delete
  round-trips, audit rows, and idempotent delete.
2026-05-03 23:18:16 +01:00
steve fb24e42c6e server: populate audit UserID on credential mutations + slog prune push errors
Switch handleSetHostCredentials, handleSetAdminCredentials, and
handleDeleteAdminCredentials from authedUser (bool) to requireUser
(*store.User) so AuditEntry.UserID and Actor are populated correctly.
Add slog.Warn on the non-ErrNotFound pushAdminCredsToAgent path in
handleRunRepoPrune so decrypt/send failures surface in the server log
rather than appearing as a generic host_offline 503.
2026-05-03 23:09:09 +01:00
steve a899cc2d04 server: cover HTMX auth-redirect path in repo-ops tests 2026-05-03 23:00:38 +01:00
steve ef2a30a82d server: HTTP run-now for prune / check / unlock
Adds POST /api/hosts/{id}/repo/{prune,check,unlock} (and matching outer
routes for HTMX form posts). Prune pushes the admin-cred slot via
pushAdminCredsToAgent before dispatch and refuses with
admin_creds_required when the slot is not set. Check reads
check_subset_pct from host_repo_maintenance (overridable via ?subset=N,
clamped 0-100; non-numeric override falls back to DB value silently).
Unlock needs no admin creds. All three share the same wantsHTML/HX-Redirect
response split as the per-source-group run-now endpoint.
2026-05-03 22:57:07 +01:00
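The ?subset=N handling described in ef2a30a82d (clamp to 0-100, silent fallback to the DB value on a non-numeric override) might be sketched as follows. The function name subsetPct and treating an empty override as "not supplied" are assumptions.

```go
package main

import (
	"fmt"
	"strconv"
)

// subsetPct resolves the check subset percentage. override is the raw
// ?subset= query value; dbValue is the stored check_subset_pct.
func subsetPct(override string, dbValue int) int {
	if override == "" {
		return dbValue // no override supplied
	}
	n, err := strconv.Atoi(override)
	if err != nil {
		return dbValue // non-numeric override: fall back silently
	}
	if n < 0 {
		n = 0
	}
	if n > 100 {
		n = 100
	}
	return n
}

func main() {
	for _, q := range []string{"", "25", "250", "-3", "abc"} {
		fmt.Printf("%q -> %d\n", q, subsetPct(q, 10))
	}
}
```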
steve e7960151fb server: admin-credentials REST + Slot:admin push helper
Adds GET/PUT/DELETE /api/hosts/{id}/admin-credentials handlers that
mirror the existing repo-credentials endpoints but write to
store.CredKindAdmin with AEAD additional-data "host:<id>:admin" (scoped
away from the repo slot to prevent cross-binding). PUT immediately pushes
a config.update(Slot:"admin") to the agent when it is connected, and the
new pushAdminCredsToAgent helper is wired for use by the upcoming prune
run-now endpoint (D2) to push on-demand before dispatch.
2026-05-03 22:55:09 +01:00
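To show why scoping the AEAD additional data per slot prevents cross-binding (e7960151fb), here is a sketch using AES-256-GCM from the standard library. The choice of AES-GCM, the nonce-prefixed blob layout, and the "host:42:repo" string for the repo slot are assumptions; only the "host:<id>:admin" form comes from the commit message.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
)

func newAEAD(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// seal encrypts plaintext and binds it to ad. The random nonce is
// prepended to the returned blob.
func seal(a cipher.AEAD, plaintext []byte, ad string) ([]byte, error) {
	nonce := make([]byte, a.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return a.Seal(nonce, nonce, plaintext, []byte(ad)), nil
}

// open decrypts a blob produced by seal; it fails if ad differs from
// the value used when sealing.
func open(a cipher.AEAD, blob []byte, ad string) ([]byte, error) {
	if len(blob) < a.NonceSize() {
		return nil, fmt.Errorf("blob too short")
	}
	nonce, ct := blob[:a.NonceSize()], blob[a.NonceSize():]
	return a.Open(nil, nonce, ct, []byte(ad))
}

// demo seals under the admin AD, then reports whether opening succeeds
// with the same AD and with a (hypothetical) repo-slot AD.
func demo() (sameOK, crossOK bool, err error) {
	key := make([]byte, 32) // throwaway AES-256 key for the demo
	if _, err := rand.Read(key); err != nil {
		return false, false, err
	}
	a, err := newAEAD(key)
	if err != nil {
		return false, false, err
	}
	blob, err := seal(a, []byte("s3cret"), "host:42:admin")
	if err != nil {
		return false, false, err
	}
	_, errSame := open(a, blob, "host:42:admin")
	_, errCross := open(a, blob, "host:42:repo")
	return errSame == nil, errCross == nil, nil
}

func main() {
	fmt.Println(demo())
}
```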
steve 2d27a23e99 agent: secrets fail-loud on corrupt blob + small polish
Save and SaveAdmin now propagate loadBundle errors instead of silently
overwriting a corrupt file (data-loss fix). Tests added for both paths.
reportStats logs a Debug on RunStats failure; r in runJob gets a comment
explaining the prune-runner asymmetry; runner_test comment tightened.
2026-05-03 22:49:12 +01:00
steve 9d9773cad4 agent/runner: ship repo.stats before job.finished in RunCheck/RunUnlock
RunCheck and RunUnlock were calling sendFinished before reportStats,
inverting the required job.started → log.stream → repo.stats →
job.finished envelope order. Move reportStats ahead of sendFinished in
both functions to match the pattern already correct in RunPrune.

Strengthen TestRunCheckShipsCheckStatus, TestRunCheckErrorsFoundShipsErrorsStatus,
and TestRunUnlockClearsLock with the same position-index ordering
assertions used by TestRunPruneShipsExpectedEnvelopes; these assertions
would have failed against the pre-fix code.
2026-05-03 22:43:57 +01:00
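A position-index ordering assertion of the kind mentioned in 9d9773cad4 could be sketched like this. The helper names are hypothetical and the envelopes are reduced to their type strings; the real tests presumably inspect richer envelope values.

```go
package main

import "fmt"

// indexOf returns the position of the first envelope of the given
// type, or -1 if none was sent.
func indexOf(sent []string, typ string) int {
	for i, s := range sent {
		if s == typ {
			return i
		}
	}
	return -1
}

// inOrder reports whether every listed type appears in sent and the
// first occurrences are at strictly increasing positions.
func inOrder(sent []string, types ...string) bool {
	prev := -1
	for _, t := range types {
		i := indexOf(sent, t)
		if i < 0 || i <= prev {
			return false
		}
		prev = i
	}
	return true
}

func main() {
	fixed := []string{"job.started", "log.stream", "repo.stats", "job.finished"}
	buggy := []string{"job.started", "log.stream", "job.finished", "repo.stats"}
	fmt.Println(inOrder(fixed, "job.started", "repo.stats", "job.finished"))
	fmt.Println(inOrder(buggy, "job.started", "repo.stats", "job.finished"))
}
```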
steve b3d033fa11 agent: RunPrune/RunCheck/RunUnlock + reportStats + admin-cred slot dispatch
Extract resticEnv/sendStarted/streamHandler/sendFinished helpers to remove
boilerplate duplication across Run* methods. Add RunPrune (ships repo.stats
with LastPruneAt before job.finished), RunCheck (ships stats with
LastCheckStatus/LockPresent regardless of outcome), RunUnlock (ships
LockPresent=false on success), and reportStats (fills size fields via
RunStats when caller didn't populate them).

Wire JobPrune/JobCheck/JobUnlock into the dispatcher switch; teach
MsgConfigUpdate about the Slot discriminator for admin vs repo creds;
add strconv import for subset-pct parsing.
2026-05-03 22:39:23 +01:00
steve b7033fcfcd agent/secrets: separate admin slot with backwards-compatible decode
Split the on-disk bundle into repo + admin slots. Legacy flat Repo blobs
are detected at load time by the presence of "repo_url" at the top level
and transparently promoted into the new shape on the next Save/SaveAdmin.
Adds ErrNoAdmin sentinel, LoadAdmin, SaveAdmin, and three new tests.
2026-05-03 22:34:33 +01:00
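The legacy-blob detection from b7033fcfcd (a top-level "repo_url" key marks the old flat shape) can be sketched as below. The function name isLegacy and the nested "repo" key used in the second example are assumptions about the new bundle layout.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// isLegacy reports whether blob is a flat legacy bundle, detected by
// the presence of a top-level "repo_url" key.
func isLegacy(blob []byte) (bool, error) {
	var top map[string]json.RawMessage
	if err := json.Unmarshal(blob, &top); err != nil {
		return false, err // corrupt blob: surface the error
	}
	_, ok := top["repo_url"]
	return ok, nil
}

func main() {
	flat := []byte(`{"repo_url":"rest:https://example.invalid/r"}`)
	nested := []byte(`{"repo":{"repo_url":"rest:https://example.invalid/r"}}`)
	fmt.Println(isLegacy(flat))
	fmt.Println(isLegacy(nested))
}
```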
steve b1b0d9d1e9 api: stats partial-update payload + ConfigUpdate.Slot + CommandRun.RequiresAdminCreds
Reshape RepoStatsPayload into pointer-field partial-update form matching
store.HostRepoStats semantics; add Slot discriminator to ConfigUpdatePayload
for admin vs repo credential routing; add RequiresAdminCreds flag to
CommandRunPayload for prune/unlock jobs that need delete authority.
2026-05-03 22:33:12 +01:00
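The pointer-field partial-update form mentioned in b1b0d9d1e9 follows a common Go idiom: a nil pointer means "leave this field alone". The sketch below uses hypothetical type and field names, not the actual RepoStatsPayload definition.

```go
package main

import "fmt"

// StatsPatch is a hypothetical partial update: nil fields are not
// touched, non-nil fields overwrite the stored value.
type StatsPatch struct {
	TotalSizeBytes *int64
	LockPresent    *bool
}

// Stats is a hypothetical stored row.
type Stats struct {
	TotalSizeBytes int64
	LockPresent    bool
}

// apply returns cur with every non-nil field of p applied.
func apply(cur Stats, p StatsPatch) Stats {
	if p.TotalSizeBytes != nil {
		cur.TotalSizeBytes = *p.TotalSizeBytes
	}
	if p.LockPresent != nil {
		cur.LockPresent = *p.LockPresent
	}
	return cur
}

// ptr returns a pointer to v; handy for building patches inline.
func ptr[T any](v T) *T { return &v }

func main() {
	cur := Stats{TotalSizeBytes: 10, LockPresent: true}
	// An unlock-style update clears the lock flag and leaves size alone.
	fmt.Printf("%+v\n", apply(cur, StatsPatch{LockPresent: ptr(false)}))
}
```

Pointers let the patch distinguish "set LockPresent to false" from "LockPresent not reported", which a plain bool cannot.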
steve f711593549 restic: tighten RunCheck lock sniff + RunStats zero-snapshot test
Narrow the LockPresent predicate from bare "locked" (too broad) to
"stale lock" and "already locked" — the two phrases restic actually
emits. Replace TestRunCheckParsesLock with table-driven
TestRunCheckLockSniff covering both trigger phrases and a benign
"locked-file" line that must not set LockPresent. Add
TestRunStatsZeroSnapshots to pin that RunStats accepts zero-snapshot
JSON without error.
2026-05-03 22:29:09 +01:00
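The narrowed lock predicate from f711593549 amounts to matching two phrases instead of the bare word "locked". A sketch, with a hypothetical function name and made-up sample lines (case-sensitive matching is an assumption):

```go
package main

import (
	"fmt"
	"strings"
)

// lockPresent reports whether a stderr line indicates a repository
// lock. Only the two phrases named in the commit message match.
func lockPresent(line string) bool {
	return strings.Contains(line, "stale lock") ||
		strings.Contains(line, "already locked")
}

func main() {
	lines := []string{
		"Found stale lock, remove it with unlock", // sample text
		"repository is already locked by another process",
		"skipping locked-file /var/lib/app/db",
	}
	for _, l := range lines {
		fmt.Println(lockPresent(l), l)
	}
}
```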
steve dfc2cd314d restic: RunUnlock + RunStats (raw-data mode)
Add RunUnlock (delegates straight to runWithPump) and RunStats which
runs `restic stats --json --mode raw-data`, captures the single JSON
line from stdout into RepoStats, and returns an error if no JSON
arrives.  Tests cover arg plumbing for unlock, JSON parsing, and the
no-JSON error path.
2026-05-03 22:22:19 +01:00
steve 5e2b88c6dd restic: RunCheck with subset% + lock-state sniffing
Add CheckResult (LockPresent, ErrorsFound) and RunCheck.  subsetPct>0
passes --read-data-subset N% to limit data reads.  Stderr is sniffed
for "Found stale lock"/"locked" to set LockPresent; a non-zero exit
from restic is absorbed as ErrorsFound=true rather than an error so
the caller can always persist last_check_status.  Tests cover lock
detection, exit-1 absorption, and subset-arg plumbing.
2026-05-03 22:21:48 +01:00
steve 768972d870 restic: RunPrune + runWithPump helper, refactor Forget/Init onto it
Add RunPrune for admin-credential prune invocations.  Extract
runWithPump to DRY the stdout+stderr pump pattern; refactor RunForget
and RunInit to delegate to it (RunInit preserves the "config file
already exists" soft-success sniff by wrapping the handler before the
call).  Add runner_test.go with TestRunPruneInvokesPrune.
2026-05-03 22:20:48 +01:00
steve 82a73fad85 store: tighten CHECK constraint on host_repo_stats.last_check_status 2026-05-03 22:15:57 +01:00
steve 26bb881c12 store: wrap UpsertHostRepoStats in a transaction (concurrency safety) 2026-05-03 22:15:35 +01:00
steve 3873bd9d34 store: assert CHECK constraint on host_credentials.kind 2026-05-03 22:10:29 +01:00
steve 1bb31b9c49 store: HostRepoStats projection (size, lock, last-check, last-prune) 2026-05-03 22:07:24 +01:00
steve 4985050a0a store: host_credentials becomes kind-aware (repo + admin slots) 2026-05-03 22:06:05 +01:00
steve 1c7b471e75 store: migration 0009 — admin-creds kind + host_repo_stats 2026-05-03 22:05:53 +01:00
steve 88216d29d0 plan: P2 redesign Phase 5 (P2R-03..P2R-08) 2026-05-03 21:59:40 +01:00
2 changed files with 4 additions and 373 deletions
+4 -46
@@ -1,46 +1,3 @@
# CI workflow — runs on every PR into main.
#
# Notes for anyone editing this file:
#
# Self-hosted runner expectations
# The Gitea runners are provisioned via scripts/provision-gitea-runner.sh.
# That script bind-mounts persistent host volumes for /root/go/pkg/mod
# (GOMODCACHE), /root/.cache/go-build (GOCACHE), and /root/.cache/act
# (action clones) into every job container. As a result:
# * `cache: true` on actions/setup-go is intentionally OMITTED — the
# action would otherwise tar/untar GOMODCACHE+GOCACHE through the
# Gitea cache backend on every job, undoing the host-volume cache
# and adding ~10s of redundant zstd round-trip per job.
# * Common GitHub actions (actions/checkout, actions/setup-go,
# actions/upload-artifact, golangci/golangci-lint-action) are
# pre-cloned into /root/.cache/act on the runner, so the per-job
# "git clone https://github.com/actions/..." step is a fetch, not
# a full clone.
# * golangci-lint is pre-installed at /usr/local/bin/golangci-lint
# on the runner (latest v2.x). The golangci-lint-action below
# still pins a specific version and re-downloads — that's fine
# (deterministic CI > marginal speed) but means the host-installed
# binary is currently unused. Drop the `version:` arg below to
# use the host-installed one if you want to trade determinism
# for speed.
#
# Build matrix
# Linux amd64 + arm64 + Windows amd64. CGO_ENABLED=0 throughout —
# modernc.org/sqlite is pure-Go so no cross-compile toolchain is
# needed. -trimpath + -ldflags="-s -w" for reproducible, smaller
# binaries.
#
# Go version
# The GO_VERSION env var anchors all three jobs. Floor is set by the
# heaviest dep (modernc.org/sqlite v1.50+ requires Go 1.23+ today;
# we run 1.25 so golangci-lint's Go-version compatibility check is
# happy — see the version pin in the lint job).
#
# upload-artifact
# Pinned at v3 historically; v3 was deprecated upstream. v4 should
# work but hasn't been validated against this runner's act_runner
# version yet. Bump when convenient.
name: CI
on:
@@ -48,6 +5,7 @@ on:
branches: [main]
env:
# Floor is set by the heaviest dep (modernc.org/sqlite v1.50+).
GO_VERSION: "1.25"
jobs:
@@ -59,7 +17,7 @@ jobs:
- uses: actions/setup-go@v5
with:
go-version: ${{ env.GO_VERSION }}
# cache: true intentionally omitted — see header notes.
cache: true
- name: go vet
run: go vet ./...
- name: go test
@@ -75,7 +33,7 @@ jobs:
- uses: actions/setup-go@v5
with:
go-version: ${{ env.GO_VERSION }}
# cache: true intentionally omitted — see header notes.
cache: true
- uses: golangci/golangci-lint-action@v7
with:
# Must be built against the same Go release as go.mod targets,
@@ -105,7 +63,7 @@ jobs:
- uses: actions/setup-go@v5
with:
go-version: ${{ env.GO_VERSION }}
# cache: true intentionally omitted — see header notes.
cache: true
- name: build server + agent
env:
GOOS: ${{ matrix.goos }}
-327
@@ -1,327 +0,0 @@
#!/usr/bin/env bash
#
# provision-gitea-runner.sh — one-shot, idempotent host setup for an
# act_runner LXC. Speeds up Gitea Actions runs by:
#
# 1. Disabling forced docker pulls (image refresh moves to a cron).
# 2. Mounting persistent host volumes for Go module/build caches and
# the act-actions clone cache.
# 3. Pre-pulling the runner-images container image.
# 4. Pre-cloning a configurable list of GitHub actions into the
# act cache so jobs don't fetch them on every run.
# 5. Installing golangci-lint (latest v2.x) at /usr/local/bin.
# 6. Setting up a nightly cron to refresh image + action clones +
# golangci-lint.
#
# The script is generic — no per-project state. Point it at any LXC
# running act_runner as a systemd service and it will provision the
# host. Re-runs are safe; they reconcile state.
#
# Usage: sudo ./provision-gitea-runner.sh
#
# Configurable via environment variables (defaults shown):
#
# CACHE_BASE=/var/cache/gitea-runner
# ACT_RUNNER_CONFIG=/etc/act_runner/config.yaml
# RUNNER_IMAGE=docker.gitea.com/runner-images:ubuntu-latest
# ACTIONS_TO_PRECLONE=(actions/checkout@v4 actions/setup-go@v5
# actions/upload-artifact@v4
# golangci/golangci-lint-action@v7)
#
# To add more pre-cloned actions later, edit /etc/cron.d/gitea-runner-refresh
# (the ACTIONS list is materialised into the cron script).
set -euo pipefail
# ---------- defaults ---------------------------------------------------
: "${CACHE_BASE:=/var/cache/gitea-runner}"
: "${ACT_RUNNER_CONFIG:=/etc/act_runner/config.yaml}"
: "${RUNNER_IMAGE:=docker.gitea.com/runner-images:ubuntu-latest}"
DEFAULT_ACTIONS=(
"actions/checkout@v4"
"actions/setup-go@v5"
"actions/upload-artifact@v4"
"golangci/golangci-lint-action@v7"
)
# Allow caller to override by exporting ACTIONS_TO_PRECLONE as a
# space-separated string (env vars can't carry arrays cleanly).
if [[ -n "${ACTIONS_TO_PRECLONE:-}" ]]; then
read -r -a ACTIONS <<<"${ACTIONS_TO_PRECLONE}"
else
ACTIONS=("${DEFAULT_ACTIONS[@]}")
fi
# ---------- helpers ----------------------------------------------------
log() { printf '\033[1;36m==>\033[0m %s\n' "$*"; }
warn() { printf '\033[1;33m==>\033[0m %s\n' "$*" >&2; }
die() { printf '\033[1;31m==>\033[0m %s\n' "$*" >&2; exit 1; }
require_cmd() {
command -v "$1" >/dev/null 2>&1 || die "missing: $1 (install it first)"
}
# sha256_url <url> — act_runner names its action-clone dirs after
# sha256(URL). Verified against a real run log:
# url=https://github.com/actions/checkout
# sha256=c3fe249fe73091a17d6638fe1341e7bd0bcc3466ce52323c0688e83e2463a4ab
sha256_url() {
printf '%s' "$1" | sha256sum | awk '{print $1}'
}
# ---------- pre-flight -------------------------------------------------
[[ $EUID -eq 0 ]] || die "run as root (the act_runner service writes /var/lib/act_runner as root)"
require_cmd systemctl
require_cmd docker
require_cmd git
require_cmd curl
require_cmd python3
# PyYAML for the config edit. Install if missing — Ubuntu 24.04 ships
# python3-yaml in the default repos.
if ! python3 -c 'import yaml' 2>/dev/null; then
log "installing python3-yaml (needed for safe YAML edits)"
apt-get update -qq
apt-get install -y -qq python3-yaml
fi
[[ -f "$ACT_RUNNER_CONFIG" ]] || die "$ACT_RUNNER_CONFIG not found — is act_runner installed?"
systemctl list-unit-files act_runner.service >/dev/null 2>&1 || \
die "act_runner.service not found — register the runner first"
log "pre-flight OK"
log " cache base : $CACHE_BASE"
log " config file : $ACT_RUNNER_CONFIG"
log " runner image : $RUNNER_IMAGE"
log " actions to clone : ${ACTIONS[*]}"
# ---------- 1. cache directories ---------------------------------------
log "creating cache directories under $CACHE_BASE"
for sub in go-mod go-build act-actions; do
install -d -m 0755 -o root -g root "$CACHE_BASE/$sub"
done
# ---------- 2. edit /etc/act_runner/config.yaml ------------------------
#
# Three keys are reconciled to known values:
#
# container.force_pull : false (we keep the image fresh via cron)
# container.options : "-v <cache mounts...>" (auto-mount caches
# into every job container)
# container.valid_volumes: [<our cache paths>] (whitelist so the
# container.options mounts are accepted)
#
# Other keys are preserved verbatim. The edit is idempotent: re-running
# yields the same file content.
log "patching $ACT_RUNNER_CONFIG"
# Backup once (only if no .pre-provision backup exists yet).
if [[ ! -f "${ACT_RUNNER_CONFIG}.pre-provision" ]]; then
cp -p "$ACT_RUNNER_CONFIG" "${ACT_RUNNER_CONFIG}.pre-provision"
log " saved pristine copy to ${ACT_RUNNER_CONFIG}.pre-provision"
fi
CONTAINER_OPTIONS_VALUE="-v ${CACHE_BASE}/go-mod:/root/go/pkg/mod:rw -v ${CACHE_BASE}/go-build:/root/.cache/go-build:rw -v ${CACHE_BASE}/act-actions:/root/.cache/act:rw"
CACHE_BASE="$CACHE_BASE" CONTAINER_OPTIONS_VALUE="$CONTAINER_OPTIONS_VALUE" \
ACT_RUNNER_CONFIG="$ACT_RUNNER_CONFIG" \
python3 - <<'PY'
import os, sys, yaml
cfg_path = os.environ['ACT_RUNNER_CONFIG']
cache_base = os.environ['CACHE_BASE']
container_options = os.environ['CONTAINER_OPTIONS_VALUE']
with open(cfg_path) as f:
cfg = yaml.safe_load(f) or {}
cfg.setdefault('container', {})
cfg['container']['force_pull'] = False
cfg['container']['options'] = container_options
# Whitelist every cache subdir explicitly so jobs that try to bind-mount
# them via workflow-side `volumes:` (rare but possible) are accepted.
desired_vols = [
f"{cache_base}/go-mod",
f"{cache_base}/go-build",
f"{cache_base}/act-actions",
]
existing = cfg['container'].get('valid_volumes') or []
merged = list(dict.fromkeys(existing + desired_vols)) # de-dup, preserve order
cfg['container']['valid_volumes'] = merged
# Write back with stable formatting. yaml.dump preserves enough
# structure for act_runner to parse; comments in the original config
# do get stripped — that's why we preserve the .pre-provision backup.
with open(cfg_path + '.tmp', 'w') as f:
yaml.safe_dump(cfg, f, default_flow_style=False, sort_keys=False)
os.replace(cfg_path + '.tmp', cfg_path)
print(f" container.force_pull : false")
print(f" container.options : {container_options}")
print(f" container.valid_volumes: {merged}")
PY
# ---------- 3. pre-pull the runner image -------------------------------
log "pulling $RUNNER_IMAGE (one-time; cron refreshes it nightly)"
docker pull "$RUNNER_IMAGE"
# ---------- 4. pre-clone the actions list ------------------------------
#
# act_runner expects clones at $cache/<sha256(url)> with the ref already
# checked out. We clone the default branch then fetch + check out the
# requested ref. Re-running fetches updates rather than re-cloning.
log "pre-cloning actions into $CACHE_BASE/act-actions"
for spec in "${ACTIONS[@]}"; do
if [[ "$spec" != *@* ]]; then
warn " skip '$spec' — must be owner/repo@ref"
continue
fi
repo="${spec%@*}"
ref="${spec##*@}"
url="https://github.com/${repo}"
dir="${CACHE_BASE}/act-actions/$(sha256_url "$url")"
if [[ -d "$dir/.git" ]]; then
log " refresh $repo @ $ref"
git -C "$dir" fetch --quiet --tags --prune origin
else
log " clone $repo @ $ref -> $dir"
git clone --quiet "$url" "$dir"
fi
# Detach onto the requested ref. Works for branches, tags, and SHAs.
if ! git -C "$dir" -c advice.detachedHead=false checkout --quiet "$ref" 2>/dev/null; then
# If `ref` is a remote branch we haven't tracked yet, try origin/<ref>.
git -C "$dir" -c advice.detachedHead=false checkout --quiet "origin/$ref"
fi
done
# ---------- 5. golangci-lint -------------------------------------------
#
# Install the latest v2.x at /usr/local/bin/golangci-lint. Workflows
# that pin a specific version via the action's `version:` arg will
# still re-download — but jobs that don't pin (or pin to "latest"/"v2")
# get the host-installed binary for free.
log "installing/updating golangci-lint (latest v2.x) → /usr/local/bin"
GOLANGCI_INSTALL_URL="https://raw.githubusercontent.com/golangci/golangci-lint/HEAD/install.sh"
# `-b` = install dir. No version arg means "latest", which install.sh
# resolves to the latest v2 release from GitHub releases.
curl -fsSL "$GOLANGCI_INSTALL_URL" | sh -s -- -b /usr/local/bin >/dev/null
/usr/local/bin/golangci-lint --version || warn "golangci-lint install verification failed"
# ---------- 6. nightly refresh cron ------------------------------------
#
# Re-pulls the runner image, refreshes the action clones, and updates
# golangci-lint. Runs at 03:17 to dodge top-of-hour CI bursts.
CRON_PATH=/etc/cron.d/gitea-runner-refresh
REFRESH_SCRIPT=/usr/local/sbin/gitea-runner-refresh
log "writing $REFRESH_SCRIPT and $CRON_PATH"
# Materialise the actions list into the script so the cron is
# self-contained and survives an edit to this file.
ACTIONS_LITERAL=""
for s in "${ACTIONS[@]}"; do
ACTIONS_LITERAL="${ACTIONS_LITERAL} \"$s\"\n"
done
cat >"$REFRESH_SCRIPT" <<EOF
#!/usr/bin/env bash
# Auto-generated by provision-gitea-runner.sh. Re-running the
# provisioning script regenerates this file.
set -euo pipefail
CACHE_BASE="$CACHE_BASE"
RUNNER_IMAGE="$RUNNER_IMAGE"
ACTIONS=(
$(printf ' "%s"\n' "${ACTIONS[@]}")
)
sha256_url() { printf '%s' "\$1" | sha256sum | awk '{print \$1}'; }
# 1. Refresh the runner-images base.
docker pull -q "\$RUNNER_IMAGE" >/dev/null
# 2. Refresh action clones.
for spec in "\${ACTIONS[@]}"; do
[[ "\$spec" == *@* ]] || continue
repo="\${spec%@*}"; ref="\${spec##*@}"
url="https://github.com/\$repo"
dir="\$CACHE_BASE/act-actions/\$(sha256_url "\$url")"
if [[ -d "\$dir/.git" ]]; then
git -C "\$dir" fetch --quiet --tags --prune origin || true
git -C "\$dir" -c advice.detachedHead=false checkout --quiet "\$ref" 2>/dev/null \\
|| git -C "\$dir" -c advice.detachedHead=false checkout --quiet "origin/\$ref" || true
fi
done
# 3. Refresh golangci-lint (latest v2.x). Tolerate transient
# GitHub-rate-limit failures — next night will retry.
curl -fsSL https://raw.githubusercontent.com/golangci/golangci-lint/HEAD/install.sh \\
| sh -s -- -b /usr/local/bin >/dev/null 2>&1 || true
EOF
chmod 0755 "$REFRESH_SCRIPT"
cat >"$CRON_PATH" <<EOF
# Auto-generated by provision-gitea-runner.sh. Refreshes the runner
# image, action clones, and golangci-lint every night at 03:17.
SHELL=/bin/bash
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
17 3 * * * root $REFRESH_SCRIPT >> /var/log/gitea-runner-refresh.log 2>&1
EOF
chmod 0644 "$CRON_PATH"
# ---------- 7. restart act_runner --------------------------------------
log "restarting act_runner.service to pick up the new config"
systemctl restart act_runner.service
sleep 2
systemctl is-active --quiet act_runner.service \
|| die "act_runner did not come back up — check 'journalctl -u act_runner -n 50'"
# ---------- 8. container-create benchmark ------------------------------
#
# Reports cold + warm `docker run --rm <image> true` time. Sanity check
# that overlay setup is fast on this host. Numbers > ~5s indicate a
# slow filesystem or DNS issue worth investigating separately.
log "benchmark: docker run --rm $RUNNER_IMAGE true"
{
printf ' cold (post-pull) : '
/usr/bin/time -f '%e s' docker run --rm "$RUNNER_IMAGE" true 2>&1 | tail -1
printf ' warm (immediate) : '
/usr/bin/time -f '%e s' docker run --rm "$RUNNER_IMAGE" true 2>&1 | tail -1
} || warn "benchmark failed — non-fatal"
# ---------- done -------------------------------------------------------
cat <<EOF
\033[1;32m==> Provisioning complete\033[0m
What changed on this host:
* /etc/act_runner/config.yaml — force_pull off, container.options +
valid_volumes set for the cache mounts. Pristine copy preserved
at ${ACT_RUNNER_CONFIG}.pre-provision.
* $CACHE_BASE/{go-mod,go-build,act-actions} — persistent caches.
* /usr/local/bin/golangci-lint — latest v2.x.
* $REFRESH_SCRIPT and $CRON_PATH — nightly refresh @ 03:17.
* Runner image pre-pulled.
\033[1;33mNote on Go cache + setup-go:\033[0m if your workflow uses
\`actions/setup-go\` with \`cache: true\`, the action will still tar/untar
the cache via the Gitea cache backend on every job — partially
defeating the persistent volume. For full speed-up, drop \`cache: true\`
from the workflow once the persistent volume is warm. Per-project
decision; this script doesn't touch workflows.
EOF