restic-manager

Author	SHA1	Message	Date
steve	22a5bb7db5	v1 readiness: CHANGELOG + threat model + first-run onboarding polish - CHANGELOG.md: Keep-a-Changelog format, v1.0.0 entry summarising what each phase delivered. - docs/threat-model.md: structured walkthrough of assets, actors, attack surfaces and residual risks; reviewed against v1.0.0. - cmd/server/main.go: at first-run startup, print a clickable $RM_BASE_URL/bootstrap URL alongside the existing one-shot bootstrap token (or a fallback hint when RM_BASE_URL is unset). - web/templates/pages/bootstrap.html: visible "Minimum 12 characters" hint under the password field so the rule is communicated before the operator submits. - tasks.md: close X-01, X-04, X-05 with notes.	2026-05-09 12:29:00 +01:00
steve	8d1fbe4f07	Added new AI focused document for host onboarding	2026-05-09 12:18:42 +01:00
steve	89537d417a	P5: OSS readiness — docs site, contributor onboarding, e2e harness P5-01 — Documentation site under docs/book/ rendered with mdBook (downloaded via Makefile, same static-binary pattern as Tailwind). Structured chapters: getting started, concepts, operations, security, reference. `make docs` / `make docs-watch`. Generated output gitignored. P5-02 — CONTRIBUTING.md rewritten from placeholder to a full guide. CODE_OF_CONDUCT.md adapted from Contributor Covenant for a single-maintainer project. .gitea/issue_template/{bug,feature}.md and PULL_REQUEST_TEMPLATE.md. P5-04 — Six README screenshots captured live from a fresh server bootstrap (login, empty dashboard, add-host, alerts, settings, audit log). README rewritten to centre the screenshot grid and link out to the docs site. P5-05 — SECURITY.md with disclosure policy (3-day ack, 30-day default window), scope in/out, threat-model summary, operator hardening checklist. Mirrored as a docs-site chapter. P5-06 — End-to-end test harness. e2e/compose.e2e.yml brings up server + sibling Linux agent (alpine + restic) + restic/rest-server. Agent uses announce-and-approve so Playwright can drive the full operator flow: bootstrap → login → accept pending → backup → verify terminal status. Second spec scrapes /metrics to assert the P6-04 endpoint surface. .gitea/workflows/e2e.yml runs on every PR; local how-to in docs/e2e.md.	2026-05-08 20:08:23 +01:00
steve	73e733be61	P6-04+05: Prometheus /metrics endpoint + Grafana dashboard New internal/server/metrics package emits the legacy text/plain exposition format directly, so we don't pull in prometheus/client_golang. Endpoint is opt-in via RM_METRICS_TOKEN and/or RM_METRICS_TRUSTED_CIDR; route is not mounted at all if neither gate is set. Both gates ANDed when both configured. Per-host gauges (online, last_backup_*, repo_size_bytes, snapshot_count, open_alerts, repo_status), server gauges (hosts_total/online, active_alerts by severity, build_info), and an in-memory job-duration histogram observed from the existing MsgJobFinished branch in the WS handler. Docs in docs/prometheus.md (enable + scrape config + metric reference + dashboard import). Sample dashboard at deploy/grafana/restic-manager-dashboard.json - six panels, Grafana schema 39, single Prometheus datasource variable. Tests: golden render, concurrent observe, bucket boundaries in the metrics package; auth matrix (no auth -> 404, token gate, CIDR gate, both required) in the HTTP layer.	2026-05-07 23:17:15 +01:00
steve	70ff554402	spec+plan: P6-04/05 prometheus /metrics + Grafana dashboard	2026-05-07 23:07:30 +01:00
steve	363bdff85b	plan: P6-03 repo size trend implementation	2026-05-07 18:15:06 +01:00
steve	20425b3360	spec: P6-03 repo size trend (sparkline + chart) design	2026-05-07 18:09:25 +01:00
steve	d24856866e	plan: P6-01+02 implementation plan	2026-05-06 21:37:38 +01:00
steve	731f01a63e	spec: P6-01+02 agent self-update + fleet update design	2026-05-06 21:20:00 +01:00
steve	d6f6d19bff	p5-07: reference deployment (server-only compose + reverse-proxy docs) The reverse proxy is assumed to live outside this project (Caddy, nginx, Traefik, whatever the operator already runs). The reference compose stands up only the server: image-pinned via RM_VERSION, named volume for operator state, localhost-bound so the proxy reaches it on loopback. docs/reverse-proxy.md covers what the proxy must forward — the X-Forwarded-* headers, Host, and Connection: upgrade for the agent WebSocket and live-log streams — plus the RM_TRUSTED_PROXY CIDR rule that gates header trust. Worked examples for Caddy, nginx (with the websocket upgrade map + 1h proxy_read_timeout for live logs), and Traefik.	2026-05-05 17:15:00 +01:00
steve	7cc17813a9	p5-03: docker-only release path (drop goreleaser) Single public deliverable per tag: a multi-arch server image, with cross-compiled agent binaries + install scripts + the systemd unit baked under /opt/restic-manager/dist/. The /agent/binary and /install/* handlers fall back from <DataDir>/... to that read-only path so a fresh container Just Works without first-run staging; operators can still drop a custom build into <DataDir>/ to override per-host. Architecture rationale: agent distribution already routes through the running server, so the release surface mirrors that — there's no second source of truth to keep in sync. Workflow .gitea/workflows/release.yml triggers on v*.*.* tag-push (fan-out :vX.Y.Z / :X.Y / :X, plus :latest once MAJOR>=1) and workflow_dispatch (snapshot tag only). Pushes to the Gitea container registry on this instance. Both binaries grow main.commit + main.date ldflag targets. Makefile and Dockerfile fill them; release workflow forwards from gitea.sha plus a UTC timestamp. Spec : docs/superpowers/specs/2026-05-05-p5-03-docker-only-release.md Plan : docs/superpowers/plans/2026-05-05-p5-03-docker-only-release.md	2026-05-05 15:18:48 +01:00
steve	cdbd8eeb88	plan: P4-05 — OIDC login implementation plan Bite-sized TDD tasks across 7 slices (A schema, B config, C OIDC client core + stub IdP, D login + callback, E logout + local-login rejection, F UI, G wiring + Authelia sweep). Each task is one commit with concrete code blocks and test cases — no placeholders. Refs spec at docs/superpowers/specs/2026-05-05-p4-05-oidc-design.md. Authelia bundle for the sweep stashed at /tmp/rm-smoke/oidc.env.	2026-05-05 13:04:39 +01:00
steve	bc19ad8804	spec: P4-05 — Authelia-specific defaults Confirmed claim name from the lab IdP is 'groups' (not 'roles' as the original spec assumed). Default the role_claim config field to 'groups' which also matches Keycloak and Authentik out of the box. Add a 'display_name' field so the SSO button can read 'Sign in with Authelia' rather than the generic 'SSO'. Two new gotchas captured: - Authelia 4.39+ 'sub' is an opaque UUID, not username — the locked design already keys on sub + reads preferred_username for display, so this is just documentation. - end_session_endpoint isn't always published (Authelia config-dependent); the locked logout flow already degrades cleanly.	2026-05-05 12:56:16 +01:00
steve	814e49cb93	spec: P4-05 — OIDC login design Brainstormed shape locked: JIT-provision local rows on first OIDC sign-in (auth_source='oidc'), YAML-only config (no UI), 'roles' claim with deny-on-no-match default, preferred_username with email fallback, refuse on local-user collision, single provider, login page shows SSO above password (break-glass), front-channel logout only, role re-evaluation at login only. Migration 0019: users.auth_source + users.oidc_subject (partial unique index), sessions.id_token (for end_session id_token_hint), oidc_state table for the OAuth round-trip state, swept on the existing alert-engine tick. Composes with the user-management work from P4-03/04: admin can disable OIDC users like local; last-admin guard catches IdP role- mapping mistakes; audit trail covers JIT-provision via user.created with auth_source payload + new user.oidc_login / user.oidc_login_blocked actions. Out of scope (deferred): back-channel logout, multi-provider, UI-driven role mapping, refresh tokens / mid-session re-eval.	2026-05-05 12:04:09 +01:00
steve	c9f230ce1d	plan: P4-03/04 — RBAC + user management implementation plan Bite-sized TDD tasks across 7 slices (A schema, B middleware, C session re-validation, D setup-token flow, E user CRUD API, F UI, G wiring + sweep). Each task is one commit with concrete code blocks and test cases — no placeholders. Refs spec at docs/superpowers/specs/2026-05-05-p4-03-04-rbac-user-mgmt-design.md.	2026-05-05 10:57:24 +01:00
steve	282258e837	spec: P4-03/04 — RBAC + user management design Brainstormed shape locked: chi route-group middleware, fail-closed admin default; setup-token flow with 1h single-use tokens (sha256-hashed at rest, raw shown to admin once); disable-only user lifecycle with last-admin guard; self-service /settings/account password change for every role; email field on users (metadata v1); session re-validation on every authenticated request so disable / role change land immediately. Locked decisions captured in §Role taxonomy, §Schema changes, §Setup-token flow, §RBAC enforcement, §Last-admin self-protection. Deferred items in §Out of scope (OIDC, SMTP email-the-link, hard delete, lockout). Migrations 0017 (users extensions) + 0018 (user_setup_tokens) both column-level ALTERs per CLAUDE.md preference.	2026-05-05 10:57:24 +01:00
steve	4b70939ab5	docs: P3 alerts implementation plan	2026-05-04 19:00:18 +01:00
steve	518c29ddb3	docs: P3 alerts spec — add SMTP as first-class v1 channel Post-brainstorm change after operator review: overnight-digest / "don't ping me at 03:00, email me in the morning" use case is poorly served by ntfy (push) and clumsy via webhook → email-gateway. SMTP joins webhook + ntfy as the third v1 channel; Apprise stays deferred. Spec updates: - Decision 5 reworded: three channels in v1. - Channel iface gains smtpChannel using net/smtp + crypto/tls. 10s timeout vs 5s for HTTP — STARTTLS handshake + DATA over a slow link legitimately needs the headroom. - Migration 0014 CHECK now allows 'smtp'. New smtpConfig struct: host, port, encryption (starttls/tls/none), username, password (AEAD), from, to. One channel = one To-address; multi-recipient = multiple channels (keeps failure attribution per-recipient). - Body shape documented: hardcoded subject pattern '[restic-manager] [<sev>] <host>: <kind>', Message-ID includes the alert id so threading groups raised → ack → resolved cleanly. Plain text only in v1. - Encryption defaults to STARTTLS on 465/587; PLAIN auth over TLS, no XOAUTH2 yet (app passwords recommended for Gmail / M365). - Test plan adds MailHog step in the Playwright sweep. - Non-goals expanded: HTML emails, OAuth2/XOAUTH2, multi-recipient channels are explicitly out of v1. Wireframe updates (_diag/p3-alerts-wireframe/wireframe.html): - Kind picker grows from 2 cards to 3 (Webhook / Ntfy / SMTP @). SMTP gets the --ok green colour family so it visually separates from webhook (accent) and ntfy (warm). - New SMTP variant section (3c): host+port+encryption row, user+pass row, from+to row, test result, plus right-rail email shape preview showing the RFC 5322 layout. - Channel list grows a third row: 'overnight-digest · smtp://… → ops-overnight@example.com'.	2026-05-04 18:48:15 +01:00
steve	6165e34f6f	docs: P3 alerts design spec Phase 3 sub-spec covering the alerts engine, notification channels, and UI (P3-05/06/07). Brainstorm ran 2026-05-04; all ten design decisions locked before this spec was written. Key decisions captured: - Hardcoded rule set, no operator-tunable thresholds in v1. Six rules: backup_failed, forget_failed, prune_failed, check_failed, stale_schedule, agent_offline. - Hybrid engine cadence: event hooks at MarkJobFinished + offline-sweeper for immediate triggers; one 60s ticker for stale-schedule detection + auto-resolution sweeps. - Auto-resolve when underlying condition clears; manual Resolve any time; Acknowledge as a separate I-have-seen-it intermediate state that does NOT close the alert. - v1 channels: native ntfy + webhook. Apprise + SMTP deferred. Channel scope is global only — no per-host or per-severity routing. - Webhook payload is one stable JSON envelope shape across raised / acknowledged / resolved / test events; ntfy uses the standard publish format with severity → priority mapping. - Per-channel Send Test Notification button hits the real send path with a synthetic info-severity event; inline green-tick / red-cross result. - Dedup by (host_id, kind, resolved_at IS NULL); last_seen_at bumped on every confirming tick so the UI can render still happening · Ns ago without re-notifying. - Top-level /alerts page; Settings shell with Notifications sub-tab. Per-host vitals Open alerts cell deep-links into filtered list. - Best-effort fire-and-forget delivery with 5s timeout; failures logged to a new notification_log table but never retried. Alert row in the DB is the source of truth. Migrations: - 0013 adds alerts.last_seen_at (column-level ALTER per CLAUDE.md) - 0014 adds notification_channels + notification_log tables Wireframe: _diag/p3-alerts-wireframe/wireframe.html	2026-05-04 18:39:26 +01:00
steve	454a2415dc	docs: P3 restore design spec + scope-decompose Phase 3 Splits Phase 3 into three independently-shippable sub-phases (Restore, Alerts, Audit UI) so they can land in separate PRs with their own brainstorm → spec → plan cycles. The Restore sub-phase is up first. The brainstorm ran on 2026-05-04 and locked the following decisions: - Single-host restore only this phase. P3-04 (cross-host restore) is moved to a new 'Future / unscheduled' section. Disaster recovery is already covered by re-enrolling a replacement host with the same repo creds; the remaining 'pull a file from host A onto host C' use case is genuinely different (file sharing / migration, not DR) and has no confirmed need. - Default target is /var/restic-restore/<job-id>/ with --no-ownership; in-place restore preserves uid/gid/mode and is gated by typed-confirmation of the host name (mirroring the repo re-init danger zone). - Tree browser is the path picker, lazy-loaded via a synchronous WS RPC (tree.list) over the existing correlation-ID infrastructure with a per-wizard-session in-memory cache (~30 min TTL). - Single-page wizard with progressively-enabled sections; entry is a top-level Restore button on host detail (or per-snapshot Restore action for direct deep-link). - Snapshot diff (P3-09) is a JobDiff JobKind, dispatched like every other agent operation; output streams to the standard live job log page. - Restore-specific live job page variant with files-restored / bytes-restored / current-file widget. - Single-flight per host across all kinds, plus a real cancel-job feature (command.cancel WS envelope, agent kills the restic subprocess via context cancel + SIGTERM/SIGKILL grace) so the operator can pre-empt a long-running backup if they need to restore urgently. Wires the existing job_detail Cancel button (which was a UI stub). - Audit row host.restore on every dispatch + a recent-restores panel on host detail. Role gate deferred to P4-03 RBAC. Wireframe at _diag/p3-restore-wizard/wireframe.html (gitignored — transient design artefact); screenshot reviewed and approved 2026-05-04.	2026-05-04 15:02:32 +01:00
steve	21d967a2cf	plan: P2 completion (P2R-09/10/11/12/13/14, P2-16/17/18)	2026-05-04 10:33:34 +01:00
steve	b640775a61	plan: P2 redesign Phase 5 (P2R-03..P2R-08)	2026-05-04 10:19:15 +01:00
steve	ee3ee241ea	P1 polish: agent-as-root, init-repo flow, rest creds passthrough, UX fixes Cohesive batch from a smoke-test session against a real rest-server. Themed bullets: * Agent runs as root, sandboxed via systemd. CapabilityBoundingSet drops to CAP_DAC_READ_SEARCH + restore caps; ProtectSystem=strict with ReadWritePaths confined to /etc + /var/lib/restic-manager; NoNewPrivileges blocks escalation. Install script no longer creates a service user. spec.md §4.2 / §14.1 / §14.3 explain the rationale (matches UrBackup / Veeam / Bareos defaults; trying to back up "everything" as an unprivileged user creates silent skips on /home, /root, /var/lib/* with no upside vs the threat model the agent already implies). * Init-repo end-to-end. New JobKind="init" wired through agent runner, restic.Env.RunInit, server dispatcher, and a UI button (red "Initialise repo" in the run-now panel). hosts.repo_initialised_at flips on init success, on backup success, or on a non-empty snapshots.report. The "Run now" / "Init" / "Retry" branching now drives both the dashboard host row and the host-detail panel. Migrations 0004 (column), 0005 (jobs.kind CHECK widened — using the safe create-new-then-rename pattern; first version corrupted job_logs.job_id FK), 0006 (cleans up job_logs FK on already-affected DBs). * rest-server creds embedded at exec time only. restic.Env gains RepoUsername; mergeRestCreds() builds the user:pass@-prefixed URL inside envSlice() and never assigns it back to the struct, so nothing slog-able ever sees the cleartext form. RedactURL helper for any future surface that needs to log a URL safely. Both helpers tested. * Add-host UX. Repo password is now optional — server mints a 24-byte URL-safe random one and surfaces it once, alongside an htpasswd snippet ("echo PASS | htpasswd -B -i ... USERNAME") so the operator pastes one command on the rest-server host and one on the endpoint. Result page also links the install snippet at /install/install.sh (was /install.sh — 404'd before) and pipes to bash (not sh — script uses set -o pipefail and other bashisms; on Debian/Ubuntu sh is dash). * Late-subscriber race in JobHub. A fast-failing job could finish (DB write + Broadcast) before the browser's HX-Redirect → page load → WS-connect path completed, so the JS sat forever waiting on a job.finished that already passed. JobHub split into Register + Send + Run; handleJobStream now subscribes first, re-fetches the job, and sends a synthetic job.finished if the state is already terminal. * HTMX error visibility. New toast partial listens to htmx:responseError and surfaces the response body as a bottom-right toast — every server-side validation error now becomes visible without per-handler JS wiring. Also handles custom rm:toast events for future server-pushed notifications via the HX-Trigger header. Themed via existing CSS vars. * Dashboard rows are now whole-row clickable to host detail (CSS card-link pattern: absolute-positioned anchor + .row-action z-index restoration so the action button stays clickable). "View →" on a running job links to /jobs/<id> rather than /hosts/<id> since the row click already covers the host page. * "Run first" / "Run first backup" → "Run now" everywhere for consistency. * runbook (docs/e2e-smoke.md) updated — live-log streaming step now reflects P1-26; mentions the browser-driven Run-now flow. * _diag/dump-creds — moved out of cmd/ so go build doesn't pick it up; .gitignore now excludes /_diag/ entirely. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 11:02:12 +01:00
steve	2418e585db	fix: enrollment FK race + log-when-rejected; runbook fixes from dry-run The smoke runbook caught a real bug: ConsumeEnrollmentToken was inserting into host_credentials (FK -> hosts) inside the same tx as the token burn, but the host row didn't exist yet — CreateHost runs in the next statement. The agent saw a generic 401 with no clue why. Fix: drop the host_credentials insert from ConsumeEnrollmentToken; the HTTP handler now does Consume -> CreateHost -> SetHostCredentials. SetHostCredentials failure is logged loudly but doesn't fail the enrol — operator recovers via PUT /api/hosts/{id}/repo-credentials. Adds slog.Warn lines on both 401 paths in handleAgentEnroll so the underlying cause is visible in server logs (the wire response stays generic to avoid leaking which step failed). Test: TestEnrollmentTransfersRepoCreds rewritten to mirror the new order (consume -> create host -> SetHostCredentials). Runbook (docs/e2e-smoke.md): rest-server moved off 8000 (commonly in use); URLs use trailing slash on the rest path; clarified that secrets_key is minted on first agent start, not at enrol time. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 14:01:59 +01:00
steve	5d1951ad94	P1-34: e2e smoke runbook + redacted GET /repo-credentials Adds docs/e2e-smoke.md — an ~5-minute runbook that walks the full P1 happy path against a sibling restic/rest-server: bootstrap admin, mint token with repo creds, enrol an agent, watch the config.update push land, run a backup, confirm the snapshot, edit creds and watch the second push fire. Per the design discussion this is a runbook (not a Go integration test); the Playwright version lands in P5-06. GET /api/hosts/{id}/repo-credentials returns the redacted view — {repo_url, repo_username, has_password} — so the UI can pre-fill the edit form without ever pulling the password out of the AEAD blob. Marks P1-32 / P1-33 / P1-34 done in tasks.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 13:49:34 +01:00

25 Commits
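
Commit ee3ee241ea describes embedding rest-server credentials into the repository URL only at exec time, plus a RedactURL helper for safe logging. The log does not show that code, so the following is only a minimal sketch of the idea. The function names (withCreds, redactURL, splitRest), the example host, and the choice to keep the username while dropping the password are all assumptions made for illustration, not the project's actual implementation.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// restPrefix is the prefix restic uses for rest-server repository URLs.
const restPrefix = "rest:"

// splitRest separates an optional "rest:" prefix from the underlying HTTP URL,
// because net/url would otherwise treat "rest" as the scheme.
func splitRest(repo string) (prefix, raw string) {
	if strings.HasPrefix(repo, restPrefix) {
		return restPrefix, strings.TrimPrefix(repo, restPrefix)
	}
	return "", repo
}

// withCreds (hypothetical) returns a new string with user:pass@ embedded.
// The caller's value is never modified, so the cleartext form only exists
// in the returned string, which would be handed to the child process.
func withCreds(repo, user, pass string) (string, error) {
	prefix, raw := splitRest(repo)
	u, err := url.Parse(raw)
	if err != nil {
		return "", fmt.Errorf("parse repo url: %w", err)
	}
	u.User = url.UserPassword(user, pass)
	return prefix + u.String(), nil
}

// redactURL (hypothetical) drops any password so the result is safe to log.
// Keeping the username is an assumption of this sketch.
func redactURL(repo string) string {
	prefix, raw := splitRest(repo)
	u, err := url.Parse(raw)
	if err != nil {
		// Never echo an unparseable input: it may contain a secret.
		return "<unparseable repo url>"
	}
	if u.User != nil {
		u.User = url.User(u.User.Username())
	}
	return prefix + u.String()
}

func main() {
	full, err := withCreds("rest:https://backup.example.com:8443/host1/", "alice", "s3cret")
	if err != nil {
		panic(err)
	}
	// Only the redacted form is printed or logged.
	fmt.Println(redactURL(full))
}
```

Returning a fresh string instead of writing the merged URL back into a struct field mirrors the property the commit message calls out: nothing that might later be passed to a logger ever holds the cleartext password.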