Smoothes the rough edges that came up exercising a live deployment.
First-run bootstrap UI: /bootstrap renders a username + password form
that uses the in-memory token directly (operator no longer copies it
out of the log); /login redirects there while bootstrap is available.
Agent reliability: failJob synthetic envelopes so command.run early
returns no longer hang the server-side job; runtime probe of restic
restore --help drives --no-ownership instead of version sniffing
(0.18.x had it removed). Server unit re-shaped: ProtectSystem=full
plus ReadWritePaths=/etc/restic-manager, no ProtectHome — restore
can now write anywhere a user might want.
Restore wizard: default target is /root/rm-restore/<job-id>/ with
clearer help text. Re-init confirm input uses .field (was .input,
which doesn't exist — text was invisible).
NS-01 host delete: store DeleteHost, admin-band /hosts/{id}/delete
with hostname-confirm danger zone, audit, FK cascade, live WS close.
NS-02 enrollment-token recovery: outstanding-tokens panel on
/hosts/new, regenerate (preserves attachments) and revoke handlers
+ audit, store-level ListOutstandingEnrollmentTokens and
DeleteEnrollmentToken.
NS-03 repo init / probe surface: migration 0020 adds
hosts.repo_status + repo_status_error; WS handler projects every
init job's outcome onto the host row (idempotent already-initialised
collapses to ready); creds-save resets status and dispatches a fresh
probe; /hosts/{id}/repo/probe retry endpoint with banner.
NS-04 dashboard live + sort + filter: query-string filter
(q/status/repo_status/tag/sort/dir), 5s htmx live poll mirroring the
alerts pattern with a localStorage live toggle, sortable column
headers, filter row + clear.
Alerts page: ack'd-by line resolves user_id ULID to username.
Compose.yaml ignored — host-specific.
The CI runs go test with -race; the agent runner has two pump goroutines
(pumpStdout + pumpStderr) writing through the sender concurrently, and
the unprotected fakeSender slice append raced. The cancel_test had a
local 'safeSender' workaround for the same issue; promote that mutex
onto fakeSender itself so every test in the package is race-clean
without per-test variants.
- fakeSender grows mu sync.Mutex; Send takes/releases. New snapshot()
helper for tests that want a stable copy.
- cancel_test drops its local safeSender + sync import; uses fakeSender.
Verified: go test -race ./... passes across all packages.
1. Agent-side MkdirAll on the new-dir restore target. Restic creates
missing leaves but won't traverse multiple missing levels, and
under the systemd sandbox writes outside ReadWritePaths fail
anyway. Calling os.MkdirAll(target, 0700) before invoking restic
means the operator never has to pre-create the per-job subdir,
and a path the sandbox rejects surfaces as a clean
'restic restore: prepare target ...: read-only file system' error
in the job log instead of a cryptic restic-side stat failure.
2. tasks.md Phase 3 — Restore section refreshed:
- P3-X4 added (job log download dropdown — txt + ndjson)
- P3-X5 added (UK lint locale switch + 73-correction sweep)
- P3-X6 added (SIZE/FILES tooltip when host's restic < 0.17)
- P3-03 entry expanded to cover version-gated --no-ownership,
editable target, $HOME expansion, agent-side MkdirAll
- As-shipped sweep summary mentions custom-target restore +
download dropdown + tooltip in addition to the original walk
Test: TestRunRestoreNewDirAutoCreatesTarget seeds a multi-level
target the operator hasn't created and confirms RunRestore mkdir's
the chain before invoking restic.
Three small follow-ups from review:
1. Restore target is now operator-editable. Default value is the
literal '\$HOME/rm-restore/<job-id>/' (agent expands \$HOME at
run time using os.UserHomeDir(); also handles \${HOME} and ~/
prefixes). Operator can replace with any absolute path.
- ui_restore.go validates the input is either absolute or starts
with one of the recognised prefixes; other env-var refs (\$PATH
etc.) are deliberately rejected so operator paths can't pick up
arbitrary agent env values.
- host_restore.html replaces the read-only mono-text display with
a real <input>; help text spells out that \$HOME resolves
agent-side and <job-id> is substituted on dispatch.
- install.sh + the systemd unit prep /root/rm-restore so the
default works under the sandbox: ReadWritePaths gains a soft
'-/root/rm-restore' entry (the '-' makes the bind-mount soft-fail
if missing, but install.sh pre-creates it root-owned 0700).
2. --no-ownership flag now gated on restic version. The flag was
added in restic 0.17 and 0.16 rejects it. Previously dropped it
wholesale — that meant new-dir restores silently preserved
ownership against design intent on 0.17+. Now the agent threads
its detected restic version (sysinfo already collects it) through
runner.Config -> restic.Env, and RunRestore appends --no-ownership
only when AtLeastVersion(0, 17) returns true. 0.16 hosts still
restore with original uid/gid; help text in the wizard explicitly
notes this. The previous 'Original ownership is preserved' copy
was wrong for new-dir mode and is corrected.
3. golangci-lint misspell locale switched US -> UK and the codebase
swept (73 corrections, mostly behaviour/serialise/recognise/honour).
Wire-format ErrorCode 'unauthorized' -> 'unauthorised' is a tiny
contract change but the agent doesn't parse those codes today and
no external API consumers exist yet. Tests passed before + after.
Tests:
- internal/restic/version_test.go covers Env.AtLeastVersion across
edge cases (empty, exact match, patch above, minor below, non-
numeric) and expandHome on \$HOME / \${HOME} / ~/, plus
pass-through for absolute paths and refusal of other env vars.
- ui_restore_test updated: TargetDir now starts '\$HOME/rm-restore/'
with the job_id substituted into the placeholder.
Live verified on the smoke env: default target restored to
/root/rm-restore/<job-id>/ as the agent's expanded \$HOME (2 files,
14 bytes); custom override '/tmp/custom-restore/<job-id>/' restored
into the agent's PrivateTmp namespace (1 file, 6 bytes); both jobs
'succeeded', exit 0.
Bug fixes from the Playwright sweep against the live smoke server:
1. Snapshot-picker layout. The .snap-row class was used in the wireframe
but never landed in web/styles/input.css; rows rendered as vertical
blocks instead of a 6-column grid. Added the token (mirrors host-row
shape with restore-specific column widths).
2. Tree expansion. hx-target='closest .tree-row + .tree-children' isn't
a valid HTMX selector — modifiers don't chain. Replaced HTMX-driven
expansion with a small window.__rmTreeToggle helper that uses plain
fetch + .tree-pair wrapper structure for trivial sibling lookup.
Caches loaded state per node.
3. --no-ownership flag dropped. Restic 0.17 introduced --no-ownership;
0.16 rejects it ('unknown flag') before doing any work. Since the
agent runs as root in the systemd unit, restored files keep their
original uid/gid either way and the parent dir is root-owned, so
the 'cp without sudo' rationale doesn't hold. Drop the flag entirely.
4. Default target dir moved to /var/lib/restic-manager/restore. The
systemd unit pins ReadWritePaths to /etc/restic-manager +
/var/lib/restic-manager (with ProtectSystem=strict making the rest
of /var read-only); writes to /var/restic-restore failed with
'read-only file system'.
5. Confirm summary HTML escaping. defaultTarget JS literal evaluates
to a string with literal angle brackets; insertion into innerHTML
must escape them. Added an inline HTML-escape pass.
tasks.md ticked for the Restore sub-phase with a sweep summary
covering the live end-to-end test.
Wires JobRestore and JobDiff end-to-end at the agent layer (the wizard
backend that drives this lands in the next slice).
- internal/api: JobRestore + JobDiff JobKind constants. CommandRunPayload
grows nullable Restore + Diff sub-payloads. RestorePayload carries
snapshot_id, paths, in_place, target_dir; DiffPayload carries
snapshot_a + snapshot_b.
- internal/restic.RunRestore wraps 'restic restore <sid> --target ...
[--no-ownership] [--include p]...' with --json. New pumpRestoreStdout
parses the per-line status / summary objects (drops raw status from
log.stream — the throttled job.progress envelope covers it). New
RestoreStatus + RestoreSummary types mirror restic's wire shape.
- internal/restic.RunDiff wraps 'restic diff --json <a> <b>'.
- internal/agent/runner: RunRestore translates RestoreStatus into
job.progress (mapping FilesRestored → FilesDone etc) with a small
estimateETA helper since restic doesn't provide ETA for restore.
RunDiff is a thin streamHandler wrapper.
- cmd/agent dispatcher gains JobRestore + JobDiff cases. Both reuse
the spawn() helper from P3-X1 so cancel just works.
- Drive-by fix: lastProgress was initialised to time.Now() so the
very first status event was suppressed by the 1s throttle if the
agent reported quickly. Initialise to time.Time{} (zero) so the
first event always emits. Affects backup + restore.
Tests:
- restore_test covers restore happy path (started → progress →
finished, kind=restore on the started envelope), in-place argv
asserts no --no-ownership, new-dir argv asserts --no-ownership +
--target + --include, diff produces the expected log.stream lines.
Restage block (CLAUDE.md) is deferred to the end of the restore
sub-phase so we restage once with all changes.
Wires the existing job_detail Cancel button (which was a UI stub) into
real backend behaviour:
- internal/api already declared MsgCommandCancel + CommandCancelPayload;
promote those from forward-declarations to a working envelope. Agent
side: cmd/agent/main.go drops the TODO-stub and gains a per-job
ctx.CancelFunc map. runJob's switch is refactored around a small
spawn() helper so each kind's goroutine derives a per-job context,
registers the cancel, and removes itself on completion regardless of
outcome. command.cancel looks up the func and fires it.
- internal/agent/runner.sendFinished now takes ctx and rebadges
ctx.Canceled errors as JobCancelled (exit 130) rather than
JobFailed. All Run* call sites updated.
- internal/restic.resticCmd sets cmd.Cancel to send SIGTERM (via
build-tagged sigterm constant; os.Kill on Windows since SIGTERM
isn't deliverable there) and cmd.WaitDelay=5s for the SIGKILL
fallback. SIGTERM lets restic remove its lock file before exiting.
- New POST /api/jobs/{id}/cancel server endpoint validates the job
is non-terminal and the host is online, sends command.cancel via
the hub, writes a job.cancel audit row, returns 202. The agent's
resulting job.finished (status=cancelled) is what actually
transitions the row.
Tests:
- internal/server/http/cancel_test.go covers happy path (envelope
shape + audit row), 409 for terminal jobs, 404 for missing jobs,
503 for offline hosts.
- internal/agent/runner/cancel_test.go covers cancel mid-run: a fake
restic that exec'd into 'sleep 30' is canceled 150ms after start
and the resulting job.finished reports JobCancelled with exit 130
in well under the WaitDelay.
Foundational for P3 restore (operator needs to be able to cancel a
running backup if they need to restore urgently). Independently useful
for prune/check/backup that are stuck.
Agent: new runner.BackupHooks struct + runHook helper invoked via
/bin/sh -c (cmd.exe /C on Windows). pre_hook non-zero exit aborts
the backup; post_hook always runs with RM_JOB_STATUS=succeeded|failed
in env. Output streamed as 'hook(<phase>): …' log.stream lines.
Hooks only run for kind=backup (other kinds skip both phases).
Server: resolveBackupHooks resolves group → host default → empty,
decrypts via crypto.AEAD with per-slot ad bytes, plumbs plaintext
into CommandRunPayload for both schedule.fire and per-group
Run-now dispatch sites. Decrypt failures degrade silently to no
hook so a malformed blob can't poison every backup.
P2R-13a. restic.Env gains LimitUploadKBps/LimitDownloadKBps which are
emitted as global --limit-upload/--limit-download flags before the
subcommand on every invocation. Agent dispatcher tracks host-wide
caps received via config.update; server pushes them on hello and
after PUT /api/hosts/{id}/bandwidth.
Also extends api.CommandRunPayload with optional per-job overrides
(BandwidthUpKBps/Down + PreHook/PostHook); the override consumers
land in T2/T6.
CI run #48 failed with:
--- FAIL: TestRunInitShipsStartedAndFinished
RunInit: ... fork/exec /tmp/.../restic: text file busy
setupScript and setupScriptBin used os.WriteFile to write a shell
script directly at the final path, then exec'd it. Under -race +
many t.Parallel tests, a fork-from-another-goroutine could inherit
the still-open writable fd from one of those WriteFile calls; the
kernel returns ETXTBSY when the freshly-execed binary still has a
writable fd anywhere on the system.
Fix: write to "<path>.tmp", then os.Rename into place. The rename
is a pure dirent op; by the time the final path exists, no process
has a writable fd on its inode and exec is safe. -race + -count=5
on both runner packages now passes consistently.
Wires a 60s server-side ticker to the pure-logic maintenance.Decide
introduced in the previous commit. Decisions flow through a new
DispatchMaintenance method on *Server, which:
- skips offline hosts (no pending_runs queueing — maintenance is
not a backup, missed fires shouldn't pile up)
- silently skips prune when admin creds aren't bound
- pushes admin creds before prune, then dispatches with
RequiresAdminCreds=true (same as operator-driven prune)
- persists job rows with actor_kind="system"
Reshapes the forget wire payload from a single RetentionPolicy to a
ForgetGroups list (one tag + per-group keep-* per source group). The
agent walks the groups and runs `restic forget --tag <name> --keep-*`
once per group. Dead-code removed: CommandRunPayload.RetentionPolicy,
the old forget JSON-decode in cmd/agent, and the single-policy form of
restic.RunForget.
Save and SaveAdmin now propagate loadBundle errors instead of silently
overwriting a corrupt file (data-loss fix). Tests added for both paths.
reportStats logs a Debug on RunStats failure; r in runJob gets a comment
explaining the prune-runner asymmetry; runner_test comment tightened.
RunCheck and RunUnlock were calling sendFinished before reportStats,
inverting the required job.started → log.stream → repo.stats →
job.finished envelope order. Move reportStats ahead of sendFinished in
both functions to match the pattern already correct in RunPrune.
Strengthen TestRunCheckShipsCheckStatus, TestRunCheckErrorsFoundShipsErrorsStatus,
and TestRunUnlockClearsLock with the same position-index ordering
assertions used by TestRunPruneShipsExpectedEnvelopes; these assertions
would have failed against the pre-fix code.
Extract resticEnv/sendStarted/streamHandler/sendFinished helpers to remove
boilerplate duplication across Run* methods. Add RunPrune (ships repo.stats
with LastPruneAt before job.finished), RunCheck (ships stats with
LastCheckStatus/LockPresent regardless of outcome), RunUnlock (ships
LockPresent=false on success), and reportStats (fills size fields via
RunStats when caller didn't populate them).
Wire JobPrune/JobCheck/JobUnlock into the dispatcher switch; teach
MsgConfigUpdate about the Slot discriminator for admin vs repo creds;
add strconv import for subset-pct parsing.
Cleanup pass over the repo so CI can enforce lint going forward
without the only-new-issues escape hatch:
* gofumpt -w across the tree (31 hits, all formatting)
* misspell --fix (25 hits, US-locale spelling) — but reverted on
api.JobCancelled = "cancelled" since that literal is the wire +
DB CHECK constraint value, plus matched the case in store/fleet.go
back to "cancelled" and added //nolint:misspell on both for the
next time someone reaches for the auto-fix
* Wrap every `defer rows.Close()` / `defer stmt.Close()` /
`defer res.Body.Close()` in `defer func() { _ = .Close() }()`
to satisfy errcheck without losing the close itself
* websocket.Dial callers (1 prod, 4 tests) now capture + close the
upgrade response Body — coder/websocket can return res with a nil
Body on success, so the test deferred-closes guard against that
* Annotate the two genuine-by-design nilerr cases with //nolint
comments explaining why nil-on-error is the contract (cookie
missing = no session; ctx cancelled mid-backoff = clean shutdown)
* Add brief godoc on the 10 exported const groups + types that
revive flagged (api.HostOS/HostArch/JobKind/JobStatus/LogStream/
ErrorCode, restic.EventKind, store.Role, web.FS)
* Drop the unused (*Server).userByID method
* Inline the unparam baseView(active) — every UI page is under
the dashboard primary nav today
Result: `golangci-lint run ./...` reports 0 issues. CI lint job
no longer needs only-new-issues: true; X-06 follow-up entry in
tasks.md removed.
restic --json emits a status frame ~every 16ms during a backup.
The runner was forwarding every line to log.stream verbatim, which
flooded the live log pane with duplicate status JSON for any
short-running backup (visible immediately on a 1000-file, ~4MB
test set: ~14 identical 'percent_done: 1' lines in 220ms).
The progress widget already covers the same information at a sane
sample rate (one per second via job.progress), so the raw status
lines in log.stream are double-bookkeeping. Skip them and forward
only non-status lines (file names, errors, summary).
Throttling logic for job.progress is unchanged.
End-to-end forget plumbing — operator can create a forget schedule
with keep-* values, agent runs restic forget --keep-* … on the
schedule's cron (or via per-row Run-now), snapshot list shrinks,
UI updates.
* api.CommandRunPayload gains retention_policy json.RawMessage so
the agent doesn't need a typed copy of the server-side struct.
* restic.ForgetPolicy mirrors restic's --keep-* flags. Empty()
reports zero dimensions; restic wrapper RunForget refuses to
run an empty policy (would delete every snapshot). Does NOT
pass --prune — pruning lives behind a separate admin-only
credential (P2-06); forget just rewrites the snapshot index.
* runner.RunForget mirrors RunBackup's envelope shape so the
live log viewer works without special-casing. On success
triggers reportSnapshots (forget shrinks the index, the host's
snapshot count almost certainly changed).
* cmd/agent dispatcher handles MsgCommandRun with kind=forget,
decodes RetentionPolicy from the wire, builds restic.ForgetPolicy.
* Server dispatchScheduleNow marshals the schedule's
RetentionPolicy into the wire payload for kind=forget jobs.
Refuses to dispatch a forget schedule with empty retention.
* validateSchedule rejects kind=forget without at least one keep-*
dimension (new error code: missing_retention).
* UI schedule edit form gains a Kind dropdown (backup or forget;
immutable on edit). Paths block toggles by kind via inline
data-kind attributes. Form help-text explains the prune
separation.
Other kinds (prune, check, unlock) deferred to P2-06..08; the
Kind dropdown only offers backup and forget today.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cohesive batch from a smoke-test session against a real rest-server.
Themed bullets:
* Agent runs as root, sandboxed via systemd. CapabilityBoundingSet
drops to CAP_DAC_READ_SEARCH + restore caps; ProtectSystem=strict
with ReadWritePaths confined to /etc + /var/lib/restic-manager;
NoNewPrivileges blocks escalation. Install script no longer
creates a service user. spec.md §4.2 / §14.1 / §14.3 explain the
rationale (matches UrBackup / Veeam / Bareos defaults; trying to
back up "everything" as an unprivileged user creates silent skips
on /home, /root, /var/lib/* with no upside vs the threat model
the agent already implies).
* Init-repo end-to-end. New JobKind="init" wired through agent
runner, restic.Env.RunInit, server dispatcher, and a UI button
(red "Initialise repo" in the run-now panel). hosts.repo_initialised_at
flips on init success, on backup success, or on a non-empty
snapshots.report. The "Run now" / "Init" / "Retry" branching now
drives both the dashboard host row and the host-detail panel.
Migrations 0004 (column), 0005 (jobs.kind CHECK widened — using
the safe create-new-then-rename pattern; first version corrupted
job_logs.job_id FK), 0006 (cleans up job_logs FK on already-
affected DBs).
* rest-server creds embedded at exec time only. restic.Env gains
RepoUsername; mergeRestCreds() builds the user:pass@-prefixed URL
inside envSlice() and never assigns it back to the struct, so
nothing slog-able ever sees the cleartext form. RedactURL helper
for any future surface that needs to log a URL safely. Both
helpers tested.
* Add-host UX. Repo password is now optional — server mints a
24-byte URL-safe random one and surfaces it once, alongside an
htpasswd snippet ("echo PASS | htpasswd -B -i ... USERNAME") so
the operator pastes one command on the rest-server host and one
on the endpoint. Result page also links the install snippet at
/install/install.sh (was /install.sh — 404'd before) and pipes
to bash (not sh — script uses set -o pipefail and other
bashisms; on Debian/Ubuntu sh is dash).
* Late-subscriber race in JobHub. A fast-failing job could finish
(DB write + Broadcast) before the browser's HX-Redirect → page
load → WS-connect path completed, so the JS sat forever waiting
on a job.finished that already passed. JobHub split into
Register + Send + Run; handleJobStream now subscribes first,
re-fetches the job, and sends a synthetic job.finished if the
state is already terminal.
* HTMX error visibility. New toast partial listens to
htmx:responseError and surfaces the response body as a
bottom-right toast — every server-side validation error now
becomes visible without per-handler JS wiring. Also handles
custom rm:toast events for future server-pushed notifications
via the HX-Trigger header. Themed via existing CSS vars.
* Dashboard rows are now whole-row clickable to host detail
(CSS card-link pattern: absolute-positioned anchor + .row-action
z-index restoration so the action button stays clickable).
"View →" on a running job links to /jobs/<id> rather than
/hosts/<id> since the row click already covers the host page.
* "Run first" / "Run first backup" → "Run now" everywhere for
consistency.
* runbook (docs/e2e-smoke.md) updated — live-log streaming step
now reflects P1-26; mentions the browser-driven Run-now flow.
* _diag/dump-creds — moved out of cmd/ so go build doesn't pick
it up; .gitignore now excludes /_diag/ entirely.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Agent calls restic snapshots --json after each successful backup
(60s timeout, separate from the backup ctx) and ships the projection
over the existing snapshots.report WS envelope. Failure here is
logged but doesn't fail the job — the next successful backup catches
the projection up.
Server-side ReplaceHostSnapshots is delete-then-insert plus a
hosts.snapshot_count update in one transaction so the dashboard's
per-host count stays consistent with the projection. New read
endpoint GET /api/hosts/{id}/snapshots returns the cached list with
a refreshed_at marker so the UI can show staleness when an agent
has been offline.
Schema: dropped the unused snapshots.repo_id FK (repos as a
first-class entity is P2 work), added short_id and refreshed_at
columns, switched the time index to DESC for the most-recent-first
list query. api.Snapshot gains short_id; size_bytes/file_count come
from the embedded summary block on restic 0.16+ and stay zero on
older clients.
Tests cover round-trip, authoritative replacement after forget+prune
shrinkage, and empty-after-wipe.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lands the operator → server → agent → restic → server roundtrip for
on-demand backups. The flow:
POST /api/hosts/{id}/jobs {kind:"backup",args:["/path"]}
→ server creates a queued Job row
→ server emits command.run over WS to the host's agent
→ agent dispatcher spawns runner.RunBackup in a goroutine
→ runner spawns `restic backup --json`, parses each line
→ forwards: job.started, log.stream (every line), job.progress
(throttled to 1/sec), job.finished (with summary stats blob)
→ server WS handler persists those into jobs / job_logs
P1-16 internal/restic: thin Locate + Env wrapper that runs `restic
backup --json`, scans stdout/stderr, parses BackupStatus +
BackupSummary, calls back into a LineHandler so the agent can fan
out to log.stream + job.progress. Treats exit code 3 as
"succeeded with issues" (matches restic's contract).
P1-18 store: jobs accessors (CreateJob, MarkJobStarted,
MarkJobFinished, AppendJobLog, GetJob).
P1-19 server: POST /api/hosts/{id}/jobs creates the Job row,
validates kind, dispatches via Hub.Send, audit-logs the action.
P1-20 agent runner: wraps restic.RunBackup with throttled progress
emission. Sender abstraction was added to wsclient.Handler so
background goroutines can keep replying after dispatch returns.
P1-21 server WS: dispatchAgentMessage now persists job.started,
job.finished, log.stream into the database. Browser fan-out for
live tailing lands with the UI work.
Agent gets repo_url + repo_password from agent.yaml in plaintext
for now (mode 0600, owned by service user); spec.md §7.3's keyring
storage moves there in P2. config.update over WS overrides the
in-memory copy (does not persist).
Build clean; all tests pass. End-to-end with a real restic still
needs a host that has restic installed — wire shape verified by
the existing hello/heartbeat round-trip test.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>