Commit Graph

5 Commits

Author SHA1 Message Date
steve 814e49cb93 spec: P4-05 — OIDC login design
Brainstormed shape locked: JIT-provision local rows on first OIDC
sign-in (auth_source='oidc'), YAML-only config (no UI), 'roles'
claim with deny-on-no-match default, preferred_username with email
fallback, refuse on local-user collision, single provider, login
page shows SSO above password (break-glass), front-channel logout
only, role re-evaluation at login only.

Migration 0019: users.auth_source + users.oidc_subject (partial
unique index), sessions.id_token (for end_session id_token_hint),
oidc_state table for the OAuth round-trip state, swept on the
existing alert-engine tick.

Composes with the user-management work from P4-03/04: admin can
disable OIDC users like local; last-admin guard catches IdP role-
mapping mistakes; audit trail covers JIT-provision via
user.created with auth_source payload + new user.oidc_login /
user.oidc_login_blocked actions.

Out of scope (deferred): back-channel logout, multi-provider,
UI-driven role mapping, refresh tokens / mid-session re-eval.
2026-05-05 12:04:09 +01:00
steve 282258e837 spec: P4-03/04 — RBAC + user management design
Brainstormed shape locked: chi route-group middleware, fail-closed
admin default; setup-token flow with 1h single-use tokens
(sha256-hashed at rest, raw shown to admin once); disable-only user
lifecycle with last-admin guard; self-service /settings/account
password change for every role; email field on users (metadata
v1); session re-validation on every authenticated request so
disable / role change land immediately.

Locked decisions captured in §Role taxonomy, §Schema changes,
§Setup-token flow, §RBAC enforcement, §Last-admin self-protection.
Deferred items in §Out of scope (OIDC, SMTP email-the-link,
hard delete, lockout).

Migrations 0017 (users extensions) + 0018 (user_setup_tokens)
both column-level ALTERs per CLAUDE.md preference.
2026-05-05 10:57:24 +01:00
steve 518c29ddb3 docs: P3 alerts spec — add SMTP as first-class v1 channel
Post-brainstorm change after operator review: overnight-digest /
"don't ping me at 03:00, email me in the morning" use case is poorly
served by ntfy (push) and clumsy via webhook → email-gateway. SMTP joins
webhook + ntfy as the third v1 channel; Apprise stays deferred.

Spec updates:
- Decision 5 reworded: three channels in v1.
- Channel iface gains smtpChannel using net/smtp + crypto/tls. 10s
  timeout vs 5s for HTTP — STARTTLS handshake + DATA over a slow link
  legitimately needs the headroom.
- Migration 0014 CHECK now allows 'smtp'. New smtpConfig struct: host,
  port, encryption (starttls/tls/none), username, password (AEAD), from,
  to. One channel = one To-address; multi-recipient = multiple channels
  (keeps failure attribution per-recipient).
- Body shape documented: hardcoded subject pattern
  '[restic-manager] [<sev>] <host>: <kind>', Message-ID includes the
  alert id so threading groups raised → ack → resolved cleanly. Plain
  text only in v1.
- Encryption defaults to STARTTLS on 465/587; PLAIN auth over TLS, no
  XOAUTH2 yet (app passwords recommended for Gmail / M365).
- Test plan adds MailHog step in the Playwright sweep.
- Non-goals expanded: HTML emails, OAuth2/XOAUTH2, multi-recipient
  channels are explicitly out of v1.

Wireframe updates (_diag/p3-alerts-wireframe/wireframe.html):
- Kind picker grows from 2 cards to 3 (Webhook / Ntfy / SMTP @). SMTP
  gets the --ok green colour family so it visually separates from
  webhook (accent) and ntfy (warm).
- New SMTP variant section (3c): host+port+encryption row, user+pass
  row, from+to row, test result, plus right-rail email shape preview
  showing the RFC 5322 layout.
- Channel list grows a third row: 'overnight-digest · smtp://… →
  ops-overnight@example.com'.
2026-05-04 18:48:15 +01:00
steve 6165e34f6f docs: P3 alerts design spec
Phase 3 sub-spec covering the alerts engine, notification channels, and
UI (P3-05/06/07). Brainstorm ran 2026-05-04; all ten design decisions
locked before this spec was written.

Key decisions captured:

- Hardcoded rule set, no operator-tunable thresholds in v1. Six rules:
  backup_failed, forget_failed, prune_failed, check_failed,
  stale_schedule, agent_offline.
- Hybrid engine cadence: event hooks at MarkJobFinished + offline-sweeper
  for immediate triggers; one 60s ticker for stale-schedule detection +
  auto-resolution sweeps.
- Auto-resolve when underlying condition clears; manual Resolve any time;
  Acknowledge as a separate I-have-seen-it intermediate state that does
  NOT close the alert.
- v1 channels: native ntfy + webhook. Apprise + SMTP deferred. Channel
  scope is global only — no per-host or per-severity routing.
- Webhook payload is one stable JSON envelope shape across raised /
  acknowledged / resolved / test events; ntfy uses the standard publish
  format with severity → priority mapping.
- Per-channel Send Test Notification button hits the real send path with
  a synthetic info-severity event; inline green-tick / red-cross result.
- Dedup by (host_id, kind, resolved_at IS NULL); last_seen_at bumped on
  every confirming tick so the UI can render still happening · Ns ago
  without re-notifying.
- Top-level /alerts page; Settings shell with Notifications sub-tab.
  Per-host vitals Open alerts cell deep-links into filtered list.
- Best-effort fire-and-forget delivery with 5s timeout; failures logged
  to a new notification_log table but never retried. Alert row in the DB
  is the source of truth.

Migrations:
- 0013 adds alerts.last_seen_at (column-level ALTER per CLAUDE.md)
- 0014 adds notification_channels + notification_log tables

Wireframe: _diag/p3-alerts-wireframe/wireframe.html
2026-05-04 18:39:26 +01:00
steve 454a2415dc docs: P3 restore design spec + scope-decompose Phase 3
Splits Phase 3 into three independently-shippable sub-phases (Restore,
Alerts, Audit UI) so they can land in separate PRs with their own brainstorm
→ spec → plan cycles. The Restore sub-phase is up first.

The brainstorm ran on 2026-05-04 and locked the following decisions:

- Single-host restore only this phase. P3-04 (cross-host restore) is moved
  to a new 'Future / unscheduled' section. Disaster recovery is already
  covered by re-enrolling a replacement host with the same repo creds; the
  remaining 'pull a file from host A onto host C' use case is genuinely
  different (file sharing / migration, not DR) and has no confirmed need.
- Default target is /var/restic-restore/<job-id>/ with --no-ownership;
  in-place restore preserves uid/gid/mode and is gated by typed-confirmation
  of the host name (mirroring the repo re-init danger zone).
- Tree browser is the path picker, lazy-loaded via a synchronous WS RPC
  (tree.list) over the existing correlation-ID infrastructure with a
  per-wizard-session in-memory cache (~30 min TTL).
- Single-page wizard with progressively-enabled sections; entry is a
  top-level Restore button on host detail (or per-snapshot Restore action
  for direct deep-link).
- Snapshot diff (P3-09) is a JobDiff JobKind, dispatched like every other
  agent operation; output streams to the standard live job log page.
- Restore-specific live job page variant with files-restored /
  bytes-restored / current-file widget.
- Single-flight per host across all kinds, plus a real cancel-job feature
  (command.cancel WS envelope, agent kills the restic subprocess via
  context cancel + SIGTERM/SIGKILL grace) so the operator can pre-empt a
  long-running backup if they need to restore urgently. Wires the existing
  job_detail Cancel button (which was a UI stub).
- Audit row host.restore on every dispatch + a recent-restores panel on
  host detail. Role gate deferred to P4-03 RBAC.

Wireframe at _diag/p3-restore-wizard/wireframe.html (gitignored —
transient design artefact); screenshot reviewed and approved 2026-05-04.
2026-05-04 15:02:32 +01:00