initial setup ready

2026-04-30 23:55:52 +01:00
commit ab02869d82
3 changed files with 1324 additions and 0 deletions
+721
@@ -0,0 +1,721 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>restic-manager · Phase 0 wireframes</title>
<style>
/* Wireframe-grade only. No brand. No polish.
Purpose: confirm information architecture & API coverage
before locking spec.md §6.1 (REST) and §6.2 (WS) shapes. */
:root {
--ink: #1a1a1a;
--mute: #666;
--line: #999;
--soft: #ddd;
--bg: #f5f5f4;
--panel: #fff;
--note: #b45309; /* annotations only, single accent so they read as "meta" */
}
* { box-sizing: border-box; }
html, body {
margin: 0;
background: var(--bg);
color: var(--ink);
font: 13px/1.5 ui-monospace, "SF Mono", Menlo, Consolas, monospace;
}
.page {
max-width: 1200px;
margin: 32px auto;
padding: 0 24px;
}
h1, h2, h3, h4 { font-weight: 600; margin: 0; }
.doc-header {
border-bottom: 1px solid var(--line);
padding-bottom: 16px;
margin-bottom: 32px;
}
.doc-header h1 { font-size: 18px; }
.doc-header p { color: var(--mute); margin: 8px 0 0; max-width: 760px; }
/* ---- screen frame ---- */
.screen {
background: var(--panel);
border: 1px dashed var(--line);
margin: 48px 0;
position: relative;
}
.screen-label {
position: absolute;
top: -10px; left: 16px;
background: var(--bg);
padding: 0 8px;
font-size: 11px;
text-transform: uppercase;
letter-spacing: 0.1em;
color: var(--mute);
}
.screen-body { padding: 32px; }
/* ---- block primitives ---- */
.box {
border: 1px dashed var(--line);
padding: 12px;
background: var(--panel);
}
.box.solid { border-style: solid; border-color: var(--soft); }
.box.placeholder {
background: repeating-linear-gradient(
45deg, transparent 0 8px, #f0efee 8px 16px
);
color: var(--mute);
text-align: center;
padding: 24px 12px;
}
.row { display: flex; gap: 12px; }
.row > * { flex: 1; }
.stack { display: flex; flex-direction: column; gap: 12px; }
.grid-3 { display: grid; grid-template-columns: repeat(3, 1fr); gap: 16px; }
.grid-2 { display: grid; grid-template-columns: repeat(2, 1fr); gap: 16px; }
.label { color: var(--mute); font-size: 11px; text-transform: uppercase; letter-spacing: 0.05em; }
.value { font-size: 14px; }
.small { font-size: 11px; color: var(--mute); }
.strong { font-weight: 600; }
.pill { display: inline-block; border: 1px solid var(--line); padding: 1px 8px; font-size: 11px; }
.btn { display: inline-block; border: 1px solid var(--ink); padding: 4px 12px; font-size: 12px; background: var(--panel); cursor: pointer; }
.btn.ghost { border-color: var(--line); color: var(--mute); }
.btn.danger { border-style: dashed; }
table { width: 100%; border-collapse: collapse; }
th, td { text-align: left; padding: 6px 8px; border-bottom: 1px dashed var(--soft); font-weight: normal; }
th { color: var(--mute); font-size: 11px; text-transform: uppercase; letter-spacing: 0.05em; }
/* ---- annotation callouts ----
Every element that depends on a backend source carries a [src] tag
so we can audit spec.md §6 coverage in one pass. */
.src {
display: inline-block;
margin-left: 6px;
padding: 1px 6px;
font-size: 10px;
color: var(--note);
border: 1px solid var(--note);
border-radius: 2px;
vertical-align: middle;
white-space: nowrap;
}
.src::before { content: ""; }
/* margin annotation lane */
.annotated { display: grid; grid-template-columns: 1fr 280px; gap: 24px; }
.ann-lane { font-size: 11px; color: var(--note); }
.ann-lane h4 { color: var(--note); font-size: 11px; text-transform: uppercase; letter-spacing: 0.05em; margin-bottom: 8px; }
.ann-lane ul { margin: 0; padding-left: 16px; }
.ann-lane li { margin-bottom: 6px; line-height: 1.45; }
/* ---- top app chrome ---- */
.chrome {
border-bottom: 1px solid var(--soft);
padding: 12px 32px;
display: flex; align-items: center; gap: 24px;
background: var(--panel);
}
.chrome .logo { font-weight: 600; }
.chrome nav { display: flex; gap: 16px; color: var(--mute); }
.chrome nav .active { color: var(--ink); border-bottom: 1px solid var(--ink); }
.chrome .right { margin-left: auto; color: var(--mute); font-size: 12px; }
/* ---- tabs ---- */
.tabs {
display: flex;
gap: 0;
border-bottom: 1px solid var(--soft);
margin-bottom: 24px;
}
.tabs a {
padding: 8px 16px;
border: 1px dashed var(--line);
border-bottom: none;
margin-right: -1px;
color: var(--mute);
text-decoration: none;
background: var(--bg);
}
.tabs a.active {
color: var(--ink);
background: var(--panel);
border-style: solid;
border-color: var(--soft);
}
/* status dots — unstyled, just outline */
.dot { display: inline-block; width: 8px; height: 8px; border: 1px solid var(--ink); border-radius: 50%; vertical-align: middle; margin-right: 4px; }
.dot.off { background: var(--panel); }
.dot.ok { background: var(--ink); }
.dot.degraded { background: repeating-linear-gradient(45deg, var(--ink) 0 2px, transparent 2px 4px); }
/* log stream */
.log {
background: #111;
color: #ddd;
font-size: 12px;
line-height: 1.5;
padding: 12px 16px;
height: 320px;
overflow: auto;
border: 1px solid var(--soft);
white-space: pre-line; /* keep one log entry per source line; otherwise the entries collapse into a single paragraph */
}
.log .ts { color: #888; }
.log .err { color: #f88; }
/* progress bar */
.progress {
background: var(--soft);
height: 8px;
position: relative;
overflow: hidden;
}
.progress > span {
display: block;
background: var(--ink);
height: 100%;
width: 38%;
}
/* collapsible details/summary blocks (schedule editor preview) */
details summary { cursor: pointer; color: var(--note); font-size: 11px; }
details[open] { margin-bottom: 8px; }
/* TOC */
.toc { background: var(--panel); border: 1px solid var(--soft); padding: 16px 20px; margin-bottom: 32px; }
.toc ol { margin: 8px 0 0; padding-left: 20px; }
.toc a { color: var(--ink); }
/* findings */
.findings { border: 1px solid var(--note); padding: 16px 20px; margin-top: 48px; background: #fffbeb; }
.findings h3 { color: var(--note); margin-bottom: 12px; }
.findings ol { padding-left: 20px; margin: 0; }
.findings li { margin-bottom: 8px; }
.findings code { background: rgba(180,83,9,.08); padding: 1px 4px; }
</style>
</head>
<body>
<div class="page">
<header class="doc-header">
<h1>restic-manager · Phase 0 wireframes</h1>
<p>
Low-fidelity wireframes for Phase 1/2 screens. Purpose: confirm the data each
screen needs before the API in spec.md §6.1 and the WS messages in §6.2 are
locked in. Grayscale on purpose &mdash; visual design is deferred to Phase 5
(and a focused hi-fi pass on the restore wizard in Phase 3).
</p>
<p>
<span class="src">[GET /api/...]</span> tags mark REST data sources.
<span class="src">[WS: ...]</span> tags mark WebSocket message dependencies.
See the &ldquo;Findings&rdquo; section at the bottom for spec gaps.
</p>
</header>
<nav class="toc">
<strong>Screens</strong>
<ol>
<li><a href="#dashboard">Dashboard &mdash; fleet overview</a></li>
<li><a href="#host-detail">Host detail &mdash; 5 tabs</a></li>
<li><a href="#job-detail">Job detail &mdash; live log</a></li>
<li><a href="#findings">Findings &mdash; gaps in spec.md §6</a></li>
</ol>
</nav>
<!-- ============================================================ -->
<!-- SCREEN 1 · DASHBOARD -->
<!-- ============================================================ -->
<section id="dashboard" class="screen">
<span class="screen-label">Screen 1 · Dashboard (/)</span>
<div class="chrome">
<div class="logo">restic-manager</div>
<nav>
<span class="active">Dashboard</span>
<span>Hosts</span>
<span>Jobs</span>
<span>Repos</span>
<span>Alerts</span>
<span>Audit</span>
<span>Settings</span>
</nav>
<div class="right">user: alice (admin) &middot; logout</div>
</div>
<div class="screen-body annotated">
<div>
<!-- Fleet summary strip -->
<div class="grid-3" style="margin-bottom:24px">
<div class="box solid">
<div class="label">Fleet status</div>
<div class="value strong">10 online &middot; 1 offline &middot; 1 degraded</div>
<div class="small">Last sync 12s ago</div>
</div>
<div class="box solid">
<div class="label">Storage (sum across repos)</div>
<div class="value strong">2.4 TB across 12 repos</div>
<div class="small">+18 GB last 24h</div>
</div>
<div class="box solid">
<div class="label">Open alerts</div>
<div class="value strong">3 &middot; 1 critical</div>
<div class="small">2 unacked</div>
</div>
</div>
<!-- Filter / search -->
<div class="row" style="margin-bottom:16px; align-items:center">
<div class="box" style="flex:3">[ search hosts &middot; filter by tag &middot; status ]</div>
<div style="flex:0">
<span class="btn">+ Add host</span>
</div>
</div>
<h3 style="margin: 24px 0 12px">Hosts</h3>
<!-- Host card grid -->
<div class="grid-3">
<!-- card: healthy -->
<div class="box solid">
<div style="display:flex; align-items:center; justify-content:space-between">
<div class="strong">prod-db-01 <span class="small">linux/amd64</span></div>
<span class="pill"><span class="dot ok"></span>online</span>
</div>
<hr style="border:none; border-top:1px dashed var(--soft); margin:8px 0">
<div class="label">Last backup</div>
<div class="value">2h ago &middot; success</div>
<div class="label" style="margin-top:8px">Repo</div>
<div class="value">412 GB &middot; 1,284 snapshots</div>
<div class="label" style="margin-top:8px">Alerts</div>
<div class="value">&mdash;</div>
<div style="margin-top:12px; display:flex; gap:8px">
<span class="btn">View</span>
<span class="btn ghost">Backup now</span>
</div>
</div>
<!-- card: failed last -->
<div class="box solid">
<div style="display:flex; align-items:center; justify-content:space-between">
<div class="strong">staging-app <span class="small">linux/arm64</span></div>
<span class="pill"><span class="dot degraded"></span>degraded</span>
</div>
<hr style="border:none; border-top:1px dashed var(--soft); margin:8px 0">
<div class="label">Last backup</div>
<div class="value">9h ago &middot; <span class="strong">failed</span></div>
<div class="label" style="margin-top:8px">Repo</div>
<div class="value">88 GB &middot; 412 snapshots</div>
<div class="label" style="margin-top:8px">Alerts</div>
<div class="value">2 &middot; 1 critical</div>
<div style="margin-top:12px; display:flex; gap:8px">
<span class="btn">View</span>
<span class="btn ghost">Retry</span>
</div>
</div>
<!-- card: offline -->
<div class="box solid">
<div style="display:flex; align-items:center; justify-content:space-between">
<div class="strong">laptop-bob <span class="small">windows/amd64</span></div>
<span class="pill"><span class="dot off"></span>offline</span>
</div>
<hr style="border:none; border-top:1px dashed var(--soft); margin:8px 0">
<div class="label">Last seen</div>
<div class="value">3d ago</div>
<div class="label" style="margin-top:8px">Repo</div>
<div class="value">142 GB &middot; 88 snapshots</div>
<div class="label" style="margin-top:8px">Alerts</div>
<div class="value">1</div>
<div style="margin-top:12px; display:flex; gap:8px">
<span class="btn">View</span>
<span class="btn ghost" style="opacity:.4">Backup now</span>
</div>
</div>
<!-- placeholder rest -->
<div class="box placeholder">&hellip; more host cards (12 total in target deployment)</div>
</div>
<!-- Recent jobs -->
<h3 style="margin: 32px 0 12px">Recent activity (fleet-wide)</h3>
<div class="box solid" style="padding:0">
<table>
<thead>
<tr>
<th>When</th><th>Host</th><th>Kind</th><th>Status</th><th>Duration</th><th></th>
</tr>
</thead>
<tbody>
<tr><td>2h ago</td><td>prod-db-01</td><td>backup</td><td>succeeded</td><td>00:14:22</td><td><span class="small">view</span></td></tr>
<tr><td>3h ago</td><td>web-02</td><td>backup</td><td>succeeded</td><td>00:08:11</td><td><span class="small">view</span></td></tr>
<tr><td>9h ago</td><td>staging-app</td><td>backup</td><td><span class="strong">failed</span></td><td>00:01:03</td><td><span class="small">view</span></td></tr>
<tr><td>1d ago</td><td>prod-db-01</td><td>check</td><td>succeeded</td><td>00:42:17</td><td><span class="small">view</span></td></tr>
<tr><td>1d ago</td><td>web-01</td><td>prune</td><td>succeeded</td><td>00:04:55</td><td><span class="small">view</span></td></tr>
</tbody>
</table>
</div>
</div>
<!-- annotation lane -->
<aside class="ann-lane">
<h4>Data sources</h4>
<ul>
<li><strong>Fleet summary strip</strong> &mdash; no endpoint in §6.1. Either (a) add <code>GET /api/fleet/summary</code> or (b) compute client-side from <code>GET /api/hosts</code> + <code>GET /api/alerts</code>. <em>Recommend (a)</em> &mdash; cheaper than fanout, and Prometheus already needs the rollup (§14.4).</li>
<li><strong>Host cards</strong> &mdash; <code>GET /api/hosts</code> must return: status, last_backup_at, last_backup_status, repo_size_bytes, snapshot_count, open_alert_count, agent_version. Domain model (§5) only has <code>status</code> + <code>last_seen_at</code>. Need to extend list response.</li>
<li><strong>"Backup now" button</strong> &mdash; <code>POST /api/hosts/:id/jobs</code> with <code>{kind: "backup"}</code>.</li>
<li><strong>Recent activity</strong> &mdash; <code>GET /api/jobs?limit=N&amp;order=desc</code>. The spec doesn't document query params; they need to be added.</li>
<li><strong>HTMX cadence</strong> &mdash; this page polls every ~10s with <code>hx-trigger="every 10s"</code> on the summary + cards. WS push isn't needed here.</li>
</ul>
</aside>
</div>
</section>
<!-- ============================================================ -->
<!-- SCREEN 2 · HOST DETAIL -->
<!-- ============================================================ -->
<section id="host-detail" class="screen">
<span class="screen-label">Screen 2 · Host detail (/hosts/:id)</span>
<div class="chrome">
<div class="logo">restic-manager</div>
<nav>
<span>Dashboard</span>
<span class="active">Hosts</span>
<span>Jobs</span>
<span>Repos</span>
<span>Alerts</span>
<span>Audit</span>
<span>Settings</span>
</nav>
<div class="right">user: alice (admin)</div>
</div>
<div class="screen-body annotated">
<div>
<!-- Host header -->
<div class="box solid" style="margin-bottom:24px">
<div style="display:flex; align-items:flex-start; justify-content:space-between; gap:16px">
<div>
<div class="small">&laquo; Dashboard / Hosts</div>
<h2 style="margin:4px 0">prod-db-01</h2>
<div class="small">linux/amd64 &middot; agent 0.4.2 &middot; restic 0.17.1 &middot; last seen 12s ago</div>
<div style="margin-top:8px">
<span class="pill"><span class="dot ok"></span>online</span>
<span class="pill">tag: prod</span>
<span class="pill">tag: db</span>
</div>
</div>
<div style="display:flex; flex-direction:column; gap:6px; align-items:flex-end">
<div class="small">Currently: <span class="strong">idle</span></div>
<div style="display:flex; gap:8px">
<span class="btn">Backup now</span>
<span class="btn ghost">Run check</span>
<span class="btn ghost">&hellip;</span>
</div>
</div>
</div>
</div>
<!-- Tabs -->
<div class="tabs">
<a href="#" class="active">Snapshots</a>
<a href="#">Schedules</a>
<a href="#">Jobs</a>
<a href="#">Repo</a>
<a href="#">Settings</a>
</div>
<!-- TAB: Snapshots (active) -->
<div>
<div class="row" style="margin-bottom:12px">
<div class="box" style="flex:3">[ filter by tag &middot; path &middot; date range ]</div>
<div class="box" style="flex:1">[ sort: newest first ]</div>
</div>
<div class="box solid" style="padding:0">
<table>
<thead>
<tr>
<th>Snapshot</th><th>Time</th><th>Paths</th><th>Tags</th><th>Size</th><th>Files</th><th></th>
</tr>
</thead>
<tbody>
<tr><td><code>3a8f1e</code></td><td>2h ago</td><td>/var/lib/postgres</td><td>auto, daily</td><td>412 GB</td><td>1.2M</td><td><span class="small">restore &middot; diff</span></td></tr>
<tr><td><code>8c7b22</code></td><td>1d ago</td><td>/var/lib/postgres</td><td>auto, daily</td><td>411 GB</td><td>1.2M</td><td><span class="small">restore &middot; diff</span></td></tr>
<tr><td><code>4f0a99</code></td><td>2d ago</td><td>/var/lib/postgres, /etc</td><td>auto, weekly</td><td>411 GB</td><td>1.2M</td><td><span class="small">restore &middot; diff</span></td></tr>
<tr><td colspan="7" class="small" style="text-align:center; padding:12px">&hellip; 1,281 more &middot; load more</td></tr>
</tbody>
</table>
</div>
</div>
<!-- Other tabs collapsed previews -->
<hr style="margin:32px 0; border:none; border-top:1px dashed var(--soft)">
<div class="small" style="margin-bottom:8px">Other tabs (preview, not navigated):</div>
<div class="grid-2">
<!-- TAB: Schedules -->
<div class="box solid">
<div class="strong" style="margin-bottom:8px">Tab · Schedules</div>
<table>
<thead>
<tr><th>Kind</th><th>Cron</th><th>Paths</th><th>Retention</th><th>Enabled</th></tr>
</thead>
<tbody>
<tr><td>backup</td><td>0 2 * * *</td><td>/var/lib/postgres</td><td>7d/4w/12m</td><td>[x]</td></tr>
<tr><td>forget+prune</td><td>0 4 * * 0</td><td>&mdash;</td><td>per policy</td><td>[x]</td></tr>
<tr><td>check</td><td>0 5 1 * *</td><td>&mdash;</td><td>&mdash;</td><td>[ ]</td></tr>
</tbody>
</table>
<div style="margin-top:12px"><span class="btn">+ New schedule</span></div>
<details style="margin-top:12px">
<summary>schedule editor (expanded form)</summary>
<div class="stack" style="margin-top:8px">
<div class="box">kind: [backup ▾]</div>
<div class="box">cron: [ 0 2 * * * ] &nbsp; <span class="small">human: every day at 02:00</span></div>
<div class="box">paths: [ /var/lib/postgres ] [+ add]</div>
<div class="box">excludes: [ *.tmp, /tmp ]</div>
<div class="box">tags: [ auto, daily ]</div>
<div class="box">retention: keep [7] daily, [4] weekly, [12] monthly &middot; keep-tag [ ]</div>
<div class="box">bandwidth: upload [ ] KB/s &middot; download [ ] KB/s &nbsp; <span class="small">§14.2</span></div>
<div class="box">pre-hook: [ pg_dump ... ] &nbsp; <span class="small">§14.3 admin-only</span></div>
<div class="box">post-hook: [ ... ]</div>
<div class="box">enabled: [x]</div>
</div>
</details>
</div>
<!-- TAB: Jobs -->
<div class="box solid">
<div class="strong" style="margin-bottom:8px">Tab · Jobs (host-scoped)</div>
<table>
<thead>
<tr><th>Started</th><th>Kind</th><th>Status</th><th>Duration</th><th>By</th></tr>
</thead>
<tbody>
<tr><td>2h ago</td><td>backup</td><td>succeeded</td><td>00:14:22</td><td>schedule</td></tr>
<tr><td>1d ago</td><td>check</td><td>succeeded</td><td>00:42:17</td><td>schedule</td></tr>
<tr><td>2d ago</td><td>backup</td><td>cancelled</td><td>00:00:42</td><td>alice</td></tr>
<tr><td>3d ago</td><td>backup</td><td>failed</td><td>00:01:09</td><td>schedule</td></tr>
</tbody>
</table>
</div>
<!-- TAB: Repo -->
<div class="box solid">
<div class="strong" style="margin-bottom:8px">Tab · Repo</div>
<div class="grid-2">
<div><div class="label">URL</div><div>rest:https://restic.lab&hellip;/prod-db-01</div></div>
<div><div class="label">Kind</div><div>rest (append-only)</div></div>
<div><div class="label">Total size</div><div>412 GB</div></div>
<div><div class="label">Dedup ratio</div><div>4.2&times;</div></div>
<div><div class="label">Snapshots</div><div>1,284</div></div>
<div><div class="label">Last check</div><div>1d ago &middot; clean</div></div>
<div><div class="label">Lock state</div><div>unlocked</div></div>
<div><div class="label">Credential</div><div>append-only &middot; rotated 14d ago</div></div>
</div>
<div style="margin-top:12px; display:flex; gap:8px">
<span class="btn">Run check</span>
<span class="btn ghost">Unlock</span>
<span class="btn ghost">Forget+prune (admin)</span>
</div>
</div>
<!-- TAB: Settings -->
<div class="box solid">
<div class="strong" style="margin-bottom:8px">Tab · Settings</div>
<div class="stack">
<div class="box"><div class="label">Tags</div><div>prod, db [+ add]</div></div>
<div class="box"><div class="label">Default pre-hook</div><div>(empty)</div></div>
<div class="box"><div class="label">Default post-hook</div><div>(empty)</div></div>
<div class="box"><div class="label">Hook shell</div><div>/bin/sh</div></div>
<div class="box"><div class="label">Default bandwidth caps</div><div>none</div></div>
<div class="box">
<div class="label">Enrollment</div>
<div>enrolled 42d ago &middot; <span class="btn ghost">Regenerate token</span></div>
</div>
<div class="box">
<div class="label">Agent</div>
<div>0.4.2 &middot; auto-update [x] &middot; <span class="btn ghost">Force update now</span></div>
</div>
<div class="box">
<div class="label danger" style="color:var(--note)">Danger zone</div>
<div><span class="btn danger">Remove host</span> <span class="small">does not touch repo data</span></div>
</div>
</div>
</div>
</div>
</div>
<!-- annotations -->
<aside class="ann-lane">
<h4>Data sources</h4>
<ul>
<li><strong>Host header</strong> &mdash; <code>GET /api/hosts/:id</code>. <em>Gap:</em> "currently running job" is not in the domain model. Either add a <code>current_job_id</code> to Host, or have the UI poll <code>GET /api/jobs?host_id=X&amp;status=running</code>.</li>
<li><strong>Snapshots tab</strong> &mdash; <code>GET /api/hosts/:id/snapshots</code>. Filtering needs server support: <code>?tag=</code>, <code>?path=</code>, <code>?since=</code>. Tag autocomplete needs distinct list &mdash; either client-derived or new endpoint.</li>
<li><strong>Schedules tab</strong> &mdash; <code>GET /api/hosts/:id/schedules</code> + <code>POST/PUT/DELETE</code>. Editor exposes §14.2 bandwidth and §14.3 hooks &mdash; both stored as JSON blobs on Schedule, but UI needs structured fields. Confirm <code>retention_policy</code> JSON shape.</li>
<li><strong>Jobs tab</strong> &mdash; <code>GET /api/jobs?host_id=X</code>. <em>Gap:</em> "By" column wants user-or-schedule attribution. AuditLog has it; Job table doesn't expose <code>actor</code> directly. Either denormalize onto Job or join.</li>
<li><strong>Repo tab</strong> &mdash; <code>GET /api/hosts/:id/repo</code>. <em>Gap:</em> spec lists size/last-check/lock state. Add: dedup ratio, snapshot count, credential rotation timestamp, append-only flag. (Some derive from <code>restic stats</code>.)</li>
<li><strong>Settings tab</strong> &mdash; mostly host-row edits. New: <code>POST /api/hosts/:id/agent/update</code> for force-update (§4.2 self-update). <em>Gap:</em> spec doesn't surface this.</li>
<li><strong>HTMX cadence</strong> &mdash; tab content swap via <code>?tab=jobs</code> hyperlinks (server renders partial). Header polls every 10s for currently-running state.</li>
</ul>
</aside>
</div>
</section>
<!-- ============================================================ -->
<!-- SCREEN 3 · JOB DETAIL -->
<!-- ============================================================ -->
<section id="job-detail" class="screen">
<span class="screen-label">Screen 3 · Job detail (/jobs/:id) &mdash; running state</span>
<div class="chrome">
<div class="logo">restic-manager</div>
<nav>
<span>Dashboard</span>
<span>Hosts</span>
<span class="active">Jobs</span>
<span>Repos</span>
<span>Alerts</span>
<span>Audit</span>
<span>Settings</span>
</nav>
<div class="right">user: alice (admin)</div>
</div>
<div class="screen-body annotated">
<div>
<!-- Header -->
<div class="box solid" style="margin-bottom:16px">
<div class="small">&laquo; prod-db-01 / Jobs</div>
<div style="display:flex; align-items:flex-start; justify-content:space-between; gap:16px; margin-top:4px">
<div>
<h2 style="margin:0">backup &middot; prod-db-01</h2>
<div class="small">job <code>j_01HJ8K7</code> &middot; started 4m12s ago &middot; triggered by alice</div>
<div style="margin-top:8px">
<span class="pill"><span class="dot ok"></span>running</span>
<span class="pill">schedule: nightly-pg</span>
</div>
</div>
<div style="display:flex; gap:8px">
<span class="btn danger">Cancel job</span>
</div>
</div>
</div>
<!-- Progress -->
<div class="grid-2" style="margin-bottom:16px">
<div class="box solid">
<div class="label">Progress</div>
<div class="value strong" style="margin:4px 0">38% &middot; ~6m remaining</div>
<div class="progress"><span></span></div>
<div class="small" style="margin-top:6px">156 GB of 412 GB &middot; 482k of 1.2M files</div>
</div>
<div class="box solid">
<div class="grid-2">
<div><div class="label">Files new</div><div>2,103</div></div>
<div><div class="label">Files changed</div><div>418</div></div>
<div><div class="label">Bytes added</div><div>2.4 GB</div></div>
<div><div class="label">Throughput</div><div>42 MB/s</div></div>
</div>
</div>
</div>
<!-- Live log -->
<div class="label" style="margin-bottom:6px">Live log <span class="small">(streaming via WS)</span></div>
<div class="log">
<span class="ts">14:02:11</span> [agent] starting restic backup --json
<span class="ts">14:02:11</span> [agent] pre_hook: pg_dump | gzip &gt; /tmp/dump.sql.gz
<span class="ts">14:02:48</span> [pre_hook] dump complete (1.2 GB)
<span class="ts">14:02:49</span> [restic] open repository
<span class="ts">14:02:50</span> [restic] lock repository
<span class="ts">14:02:50</span> [restic] load index files
<span class="ts">14:02:53</span> [restic] start scan
<span class="ts">14:02:55</span> [restic] start backup on /var/lib/postgres
<span class="ts">14:03:01</span> [restic] {"message_type":"status","percent_done":0.04,"total_files":1234567,"files_done":48234,"total_bytes":442000000000,"bytes_done":17600000000}
<span class="ts">14:04:22</span> [restic] {"message_type":"status","percent_done":0.18,"...}
<span class="ts">14:05:55</span> [restic] {"message_type":"status","percent_done":0.31,"...}
<span class="ts">14:06:23</span> <span class="err">[restic] warning: failed to lstat /var/lib/postgres/pg_wal/.lock</span>
<span class="ts">14:06:24</span> [restic] {"message_type":"status","percent_done":0.38,"...}
<span style="color:#888"></span>
</div>
<div class="row" style="margin-top:8px">
<div><span class="small">[ ] auto-scroll &nbsp; [ ] show stderr only &nbsp; download full log</span></div>
</div>
</div>
<aside class="ann-lane">
<h4>Data sources</h4>
<ul>
<li><strong>Header</strong> &mdash; <code>GET /api/jobs/:id</code>. Need: kind, host, started_at, actor (user / schedule / system), status, schedule_id, schedule_name. <em>Gap:</em> Job table has <code>scheduled_id</code> but no actor/user_id; need to join AuditLog or denormalize.</li>
<li><strong>Progress block</strong> &mdash; live updates from <code>WS /api/jobs/:id/stream</code>. The WS message <code>job.progress</code> (§6.2) needs a documented JSON shape: <code>{percent_done, files_done, total_files, bytes_done, total_bytes, eta_seconds, throughput_bps}</code>. Spec leaves this vague.</li>
<li><strong>Stats panel</strong> &mdash; on completion, mirrors the <code>restic backup --json</code> summary fields: <code>files_new</code>, <code>files_changed</code>, <code>files_unmodified</code>, <code>data_added</code>, <code>total_bytes_processed</code>, <code>total_duration</code>, <code>snapshot_id</code>. Lives in <code>Job.stats</code> JSON.</li>
<li><strong>Live log</strong> &mdash; <code>WS</code> messages of type <code>log.stream</code> (agent → server) fan out to browsers subscribed to <code>/api/jobs/:id/stream</code>. UI distinguishes <code>stdout</code> / <code>stderr</code> / <code>event</code> &mdash; the schema's <code>JobLog.stream</code> enum already covers this.</li>
<li><strong>Cancel</strong> &mdash; <code>POST /api/jobs/:id/cancel</code> &rarr; server emits <code>command.cancel</code> WS to agent (§6.2). UI should optimistically show "cancelling…" until WS confirms <code>job.finished</code>.</li>
<li><strong>HTMX caveat</strong> &mdash; this is the one screen where polling alone isn't enough; a usable live log needs WS. Plan: <code>hx-ext="ws"</code> with <code>ws-connect</code>; the server sends HTML fragments that are swapped into the progress + log areas. If the WS connection is unavailable, the page falls back to 2s polling (degraded, not live).</li>
</ul>
</aside>
</div>
</section>
<!-- ============================================================ -->
<!-- FINDINGS -->
<!-- ============================================================ -->
<section id="findings" class="findings">
<h3>Findings &mdash; gaps in spec.md §6 surfaced by Phase 0 wireframing</h3>
<ol>
<li>
<strong>Aggregate fleet endpoint missing.</strong> Dashboard summary strip and Prometheus metrics (§14.4) both need fleet rollups. Add <code>GET /api/fleet/summary</code> returning host counts by status, total repo bytes, open alert counts. Cheaper than client fanout and reused by /metrics.
</li>
<li>
<strong>Host list response is too thin.</strong> Domain model Host (§5) has status + last_seen_at; cards need <code>last_backup_at</code>, <code>last_backup_status</code>, <code>repo_size_bytes</code>, <code>snapshot_count</code>, <code>open_alert_count</code>, <code>current_job_id</code>. Either add columns or compute server-side and include in <code>GET /api/hosts</code>.
</li>
<li>
<strong>Job actor not modelled.</strong> Job table tracks <code>scheduled_id</code> but not <em>who</em> (user vs schedule vs system) triggered a run. The host Jobs tab ("By" column) and the Job detail header ("triggered by") both want this. Add <code>Job.actor_kind</code> + <code>Job.actor_id</code> &mdash; cheaper than joining AuditLog every time.
</li>
<li>
<strong>WS <code>job.progress</code> JSON shape is undefined.</strong> §6.2 lists the message name only. Lock the shape now: <code>{percent_done: float, files_done: int, total_files: int, bytes_done: int, total_bytes: int, eta_seconds: int, throughput_bps: int}</code>. Keeps client + agent in lockstep before Phase 1 codes against it.
</li>
<li>
<strong>Repo response needs more fields.</strong> §6.1 says size/last-check/lock state. Wireframe also wants: dedup ratio, snapshot count, credential rotation timestamp, append-only flag. Most derive from <code>restic stats</code> + Credential row &mdash; expose them through <code>GET /api/hosts/:id/repo</code>.
</li>
<li>
<strong>Snapshot filtering needs server support.</strong> Tag/path/date filters belong on the server (12-host fleets are small but a single host can hold thousands of snapshots). Add query params to <code>GET /api/hosts/:id/snapshots</code>: <code>?tag=</code>, <code>?path=</code>, <code>?since=</code>, <code>?limit=</code>. Distinct-tag list endpoint optional &mdash; could be derived client-side at first.
</li>
<li>
<strong>Job listing needs query params.</strong> Recent activity, host-scoped jobs, and the Jobs page all use <code>GET /api/jobs</code>. Lock down: <code>?host_id=</code>, <code>?kind=</code>, <code>?status=</code>, <code>?since=</code>, <code>?limit=</code>, <code>?order=</code>. Pagination too.
</li>
<li>
<strong>Agent self-update endpoint not in §6.1.</strong> §4.2 describes the mechanism but no REST endpoint exists. Settings tab wants a "Force update now" button &mdash; add <code>POST /api/hosts/:id/agent/update</code>.
</li>
<li>
<strong>Schedule retention/options JSON shape.</strong> §14.2 (bandwidth) and §14.3 (hooks) both extend <code>Schedule</code>. Document the canonical shape now (<code>retention_policy</code>, <code>options.limit_upload</code>, <code>options.limit_download</code>, <code>pre_hook</code>, <code>post_hook</code>) so the schedule editor and the agent can both target it.
</li>
<li>
<strong>HTMX-vs-WS responsibility split.</strong> Decision: only the Job detail screen needs WS. Dashboard, Hosts, Snapshots use HTMX polling (10s). This avoids fan-out complexity for v1; revisit if dashboard feels stale.
</li>
</ol>
</section>
</div>
</body>
</html>
+455
@@ -0,0 +1,455 @@
# restic-manager — Specification
## 1. Overview
**restic-manager** is a self-hosted, browser-based single pane of glass for managing [restic](https://restic.net) backups across a fleet of Linux and Windows endpoints. It provides visibility, scheduling, ad-hoc operations, restore workflows, and alerting from one UI.

It is built for small-to-medium fleets (initial target: ~12 endpoints) and is intentionally simple to deploy: one Docker Compose file on the control-plane host, one small agent binary on each endpoint.

**License:** PolyForm Noncommercial 1.0.0
## 2. Goals & Non-Goals
### Goals
- Central visibility into backup state for every endpoint
- Trigger any restic operation remotely (`backup`, `forget`, `prune`, `check`, `unlock`, `snapshots`, `stats`, `diff`, `restore`)
- Manage per-host backup schedules from the UI
- Live job progress streamed back to the UI
- Restore wizard (browse snapshots, pick paths, restore to original or alternate host)
- Repo health surfacing (size, dedup ratio, last check, lock state)
- Alerting on failure or staleness
- Cross-platform agent (Linux + Windows)
- Ransomware-resistant repo access via append-only credentials
### Non-Goals (initial release)
- Replacing restic itself or providing custom repo formats
- Managing non-restic backup tools
- Multi-tenancy / SaaS deployment
- High availability of the control plane (SQLite, single-instance)
- Mobile-native apps (responsive web only)
## 3. Architecture
### 3.1 Components
```
┌──────────────────────────────────────────────────────────────────┐
│ Proxmox cluster │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ docker compose: restic-manager │ │
│ │ - server (Go binary, REST + WS API, embedded HTMX UI) │ │
│ │ - SQLite volume │ │
│ └────────────────────────────────────────────────────────────┘ │
└────────────────────────▲─────────────────────────────────────────┘
│ HTTPS (control plane)
│ - agent → server: status, telemetry
│ - server → agent: commands, schedules
┌────────────────────────┴─────────────────────────────────────────┐
│ Endpoints (Linux + Windows) │
│ ┌──────────────────────┐ ┌────────────────────────────────┐ │
│ │ restic-manager- │ │ restic CLI │ │
│ │ agent (Go binary) │───▶│ invoked by agent │ │
│ │ - systemd / svc │ └─────────────┬──────────────────┘ │
│ │ - WS to server │ │ HTTPS │
│ └──────────────────────┘ │ (data plane) │
└─────────────────────────────────────────────┼────────────────────┘
┌──────────────────────────────────────────────────────────────────┐
│ Unraid │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Docker: restic/rest-server │ │
│ │ - per-host append-only credentials │ │
│ │ - one repo per host │ │
│ │ - storage: Unraid share │ │
│ └────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────┘
```
### 3.2 Data flow
- **Backup data:** endpoint → restic CLI → restic REST server on Unraid → Unraid share. The control plane *never* touches backup bytes.
- **Control plane:** agent maintains an outbound WebSocket to the server. Server pushes commands and schedule changes; agent pushes status, logs, live job progress, host metadata.
- **UI:** browser → server (HTTPS, session cookies). Server fans out commands to agents, streams progress back to browser.
### 3.3 Why agent (not SSH)
- Push model works through NAT/firewalls without inbound rules
- Native Windows support without OpenSSH service quirks
- Local scheduling survives controller restarts
- Self-contained `restic --json` parsing, no remote shell quoting hazards
### 3.4 Why per-host repos
- Isolates corruption / lock contention
- Append-only credentials per host = compromised endpoint can't delete other hosts' backups
- Simpler `prune` orchestration (no global lock coordination)
- Trivially easy to retire a host (delete its repo + credential)
## 4. Components in detail
### 4.1 Server
- **Language:** Go 1.22+
- **Storage:** SQLite (via `modernc.org/sqlite`, no CGo)
- **HTTP:** `net/http` + `chi` router
- **WebSocket:** `nhooyr.io/websocket`
- **UI:** HTMX + Tailwind, server-rendered Go templates, no Node build step
- **Distribution:** single static binary, packaged in a Docker image; published `docker-compose.yml`
- **Config:** YAML or env vars (`RM_LISTEN`, `RM_DATA_DIR`, `RM_BASE_URL`, `RM_TLS_CERT`, `RM_TLS_KEY`)
- **TLS:** terminate TLS in-process (cert from Caddy/Traefik sidecar acceptable; agents require HTTPS)
### 4.2 Agent
- **Language:** Go (cross-compiled for `linux/amd64`, `linux/arm64`, `windows/amd64`)
- **Service integration:** systemd unit (Linux), Windows service via `golang.org/x/sys/windows/svc`
- **Footprint goal:** ≤ 15 MB binary, ≤ 50 MB RSS idle
- **Persistence:** local config file + small state DB (BoltDB or JSON) for queued reports if server is unreachable
- **Restic invocation:** spawns `restic` with `--json`, parses streamed output, forwards to server in real time
- **Self-update:** server publishes signed agent binary; agent downloads, verifies signature, swaps binary, restarts service
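A minimal sketch of the restic invocation step, assuming the agent reads `restic backup --json` output one line at a time. The JSON field names follow restic's status messages; the `ResticStatus` type and the `parseStatusLine` helper are hypothetical names used only for illustration.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// ResticStatus holds the subset of a restic "status" message that the agent
// would forward as job progress. The type name is illustrative only.
type ResticStatus struct {
	MessageType string  `json:"message_type"`
	PercentDone float64 `json:"percent_done"`
	TotalFiles  uint64  `json:"total_files"`
	FilesDone   uint64  `json:"files_done"`
	TotalBytes  uint64  `json:"total_bytes"`
	BytesDone   uint64  `json:"bytes_done"`
}

// parseStatusLine decodes one output line. ok is false for lines that are not
// valid JSON or that carry a message type other than "status".
func parseStatusLine(line string) (ResticStatus, bool) {
	var st ResticStatus
	if err := json.Unmarshal([]byte(line), &st); err != nil {
		return ResticStatus{}, false
	}
	return st, st.MessageType == "status"
}

func main() {
	// Stand-in for the stdout pipe of a spawned restic process.
	out := `{"message_type":"status","percent_done":0.5,"total_files":10,"files_done":5,"total_bytes":2048,"bytes_done":1024}
not json
{"message_type":"summary","snapshot_id":"abc123"}`

	sc := bufio.NewScanner(strings.NewReader(out))
	for sc.Scan() {
		if st, ok := parseStatusLine(sc.Text()); ok {
			fmt.Printf("%.0f%% (%d/%d files)\n", st.PercentDone*100, st.FilesDone, st.TotalFiles)
		}
	}
}
```

Lines that fail to parse are skipped here; a real agent would more likely forward them as raw log lines so nothing is lost.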
### 4.3 Restic REST server (Unraid)
- Run `restic/rest-server` Docker container
- `--append-only` enabled
- `--private-repos` enabled (each user only sees their own subpath)
- htpasswd file with one user per host
- Storage path mapped to Unraid share
## 5. Domain model
```
Host
id, name, os, arch, agent_version, restic_version,
enrolled_at, last_seen_at, status (online/offline/degraded),
repo_id (FK), tags,
current_job_id (FK nullable),
last_backup_at, last_backup_status (succeeded|failed|cancelled|null),
repo_size_bytes, snapshot_count, open_alert_count
# Last six fields are denormalised projections, refreshed on
# job.finished, snapshots.report, repo.stats, and alert state changes.
Repo
id, name, url, kind (rest|s3|local), credential_id (FK),
password_secret_id (FK),
size_bytes, snapshot_count, dedup_ratio,
last_check_at, last_check_status, lock_state (locked|unlocked),
append_only (bool), credential_rotated_at
# Fields from size_bytes onward are a cached projection of `restic stats`
# and the Credential row, refreshed by repo.stats agent messages.
Credential
id, kind, username, secret_ref (encrypted),
rotated_at
Schedule
id, host_id (FK), kind (backup|forget|prune|check),
cron_expr, paths (json), excludes (json), tags (json),
retention_policy (json), options (json), pre_hook, post_hook,
enabled
# retention_policy: {keep_last, keep_hourly, keep_daily, keep_weekly,
# keep_monthly, keep_yearly, keep_tag: [...]}
# options: {limit_upload_kbps, limit_download_kbps}
# pre_hook/post_hook: see §14.3 (encrypted at rest)
Job
id, host_id (FK), kind, status (queued|running|succeeded|failed|cancelled),
schedule_id (FK nullable),
actor_kind (user|schedule|system), actor_id (nullable),
started_at, finished_at,
exit_code, stats (json), error
JobLog
job_id (FK), seq, ts, stream (stdout|stderr|event), payload
Snapshot (cached projection from `restic snapshots --json`)
id (restic id), host_id (FK), repo_id (FK),
time, hostname, paths, tags, size_bytes, file_count
Alert
id, host_id (FK nullable), kind, severity, message,
created_at, acknowledged_at, resolved_at
User
id, username, password_hash, role (admin|operator|viewer),
created_at, last_login_at
Session
id, user_id (FK), created_at, expires_at, ip, ua
AuditLog
id, user_id (FK nullable), actor (user|agent|system),
action, target_kind, target_id, ts, payload (json)
```
## 6. API surface (control plane)
### 6.1 UI/REST (browser → server)
```
POST /api/auth/login
POST /api/auth/logout
GET /api/fleet/summary (aggregate: host counts by status,
total bytes, open alerts; reused by /metrics)
GET /api/hosts ?tag=&status=&limit=&offset=
(returns Host rows incl. denormalised
last_backup_*, repo_size_bytes,
snapshot_count, open_alert_count,
current_job_id)
GET /api/hosts/:id
DELETE /api/hosts/:id
POST /api/hosts/:id/enrollment-token (regenerate)
POST /api/hosts/:id/agent/update (force agent self-update; see §4.2)
GET /api/hosts/:id/snapshots ?tag=&path=&since=&until=&limit=&offset=
GET /api/hosts/:id/repo (full Repo projection)
POST /api/hosts/:id/jobs (run-now: backup/forget/prune/check/unlock)
POST /api/hosts/:id/restore (restore wizard submit)
GET /api/hosts/:id/schedules
POST /api/hosts/:id/schedules
PUT /api/schedules/:id
DELETE /api/schedules/:id
GET /api/jobs ?host_id=&kind=&status=&since=&until=
&limit=&offset=&order=desc
GET /api/jobs/:id
GET /api/jobs/:id/logs (paginated: ?after_seq=&limit=)
WS /api/jobs/:id/stream (live progress; see §6.2 for shape)
POST /api/jobs/:id/cancel
GET /api/repos
GET /api/repos/:id
GET /api/alerts
POST /api/alerts/:id/ack
GET /api/audit
GET /api/users (admin)
POST /api/users (admin)
```
**Realtime strategy:** only `/api/jobs/:id/stream` uses WS. All other screens
(dashboard, hosts, snapshots) refresh via HTMX polling (~10s cadence). Revisit
if dashboard staleness becomes a problem in practice.
### 6.2 Agent ↔ Server
Single authenticated WebSocket per agent. Bidirectional JSON-RPC-ish messages.
**Agent → server:**
- `hello` (host metadata, agent version, restic version, OS)
- `heartbeat` (every 30s)
- `job.started` (job_id, kind, started_at)
- `job.progress` (job_id, percent_done, files_done, total_files,
bytes_done, total_bytes, eta_seconds, throughput_bps)
- `job.finished` (job_id, status, exit_code, stats, error, finished_at)
- `snapshots.report` (full list after each successful backup)
- `repo.stats` (size_bytes, snapshot_count, dedup_ratio, last_check_at,
last_check_status, lock_state)
- `log.stream` (live stdout/stderr lines while job running;
{job_id, seq, ts, stream: stdout|stderr|event, payload})
**Server → agent:**
- `command.run` (kind, args)
- `command.cancel` (job_id)
- `schedule.set` (full schedule list, agent reconciles local cron)
- `config.update`
- `agent.update` (new version available, URL + signature)
The server fans `job.progress` and `log.stream` for a given job to all
browsers subscribed to `WS /api/jobs/:id/stream` (§6.1) without
transformation, so the schema is shared end-to-end.
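The message list above names types and fields but not the framing. The sketch below assumes a simple `{"type": ..., "payload": ...}` envelope and shows how a `job.progress` message might be decoded. The `Envelope` shape and the `decodeProgress` helper are hypothetical; only the payload field names are taken from this section.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Envelope is an assumed wire frame: a type tag plus an undecoded payload.
type Envelope struct {
	Type    string          `json:"type"`
	Payload json.RawMessage `json:"payload"`
}

// JobProgress carries the job.progress fields listed in §6.2.
type JobProgress struct {
	JobID         string  `json:"job_id"`
	PercentDone   float64 `json:"percent_done"`
	FilesDone     uint64  `json:"files_done"`
	TotalFiles    uint64  `json:"total_files"`
	BytesDone     uint64  `json:"bytes_done"`
	TotalBytes    uint64  `json:"total_bytes"`
	ETASeconds    uint64  `json:"eta_seconds"`
	ThroughputBps uint64  `json:"throughput_bps"`
}

// decodeProgress unwraps the envelope and rejects any other message type.
func decodeProgress(raw []byte) (JobProgress, error) {
	var env Envelope
	if err := json.Unmarshal(raw, &env); err != nil {
		return JobProgress{}, err
	}
	if env.Type != "job.progress" {
		return JobProgress{}, fmt.Errorf("unexpected message type %q", env.Type)
	}
	var p JobProgress
	if err := json.Unmarshal(env.Payload, &p); err != nil {
		return JobProgress{}, err
	}
	return p, nil
}

func main() {
	frame := []byte(`{"type":"job.progress","payload":{"job_id":"j1","percent_done":0.42,"bytes_done":420,"total_bytes":1000}}`)
	p, err := decodeProgress(frame)
	if err != nil {
		panic(err)
	}
	fmt.Printf("job %s: %.0f%% (%d of %d bytes)\n", p.JobID, p.PercentDone*100, p.BytesDone, p.TotalBytes)
}
```

Keeping the payload as `json.RawMessage` lets the server relay frames to browser subscribers without re-encoding them, which fits the "without transformation" goal stated above.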
### 6.3 Enrollment
1. Operator clicks "Add host" → server generates one-time token (TTL 1h)
2. Operator runs install script on endpoint with token
3. Agent calls `POST /api/agents/enroll` with token + host metadata
4. Server creates the host record and issues a persistent agent credential (bearer token + TLS pin)
5. Agent stores credential, opens WS connection
## 7. Security
### 7.1 Authentication
- **Phase 1:** username + password (argon2id), HTTP-only secure session cookies, CSRF tokens on state-changing requests
- **Phase 2:** OIDC (Authelia, Keycloak, Authentik)
- **Agents:** bearer token over TLS; pin server cert fingerprint at enrollment time
### 7.2 Authorization (Phase 1: simple roles)
- **admin:** everything
- **operator:** trigger jobs, edit schedules, restore
- **viewer:** read-only
### 7.3 Secret handling
- Restic repo passwords and REST-server credentials encrypted at rest in SQLite using a server-side key (loaded from env or file at startup)
- Pushed to agents only over the authenticated WS, only when needed for a job
- Agent stores them in OS keyring where available (Windows DPAPI, Linux Secret Service / fallback to encrypted file with restricted perms)
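As an illustration of encryption at rest, the sketch below seals a secret with AES-256-GCM from the Go standard library and prepends the random nonce to the ciphertext. The choice of AES-GCM, the storage layout, and the `sealSecret`/`openSecret` names are assumptions; the spec only requires a server-side key.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// newGCM builds an AEAD from a 16-, 24-, or 32-byte key.
func newGCM(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// sealSecret returns nonce || ciphertext, using a fresh random nonce.
func sealSecret(key, plaintext []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// openSecret splits off the nonce, then authenticates and decrypts the rest.
func openSecret(key, sealed []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	n := gcm.NonceSize()
	if len(sealed) < n {
		return nil, errors.New("ciphertext too short")
	}
	return gcm.Open(nil, sealed[:n], sealed[n:], nil)
}

func main() {
	// All-zero demo key; a real key would be loaded from env or file at startup.
	key := make([]byte, 32)
	sealed, err := sealSecret(key, []byte("repo-password"))
	if err != nil {
		panic(err)
	}
	plain, err := openSecret(key, sealed)
	if err != nil {
		panic(err)
	}
	fmt.Printf("stored %d bytes, recovered %q\n", len(sealed), plain)
}
```

Because GCM authenticates the ciphertext, a modified database row fails to decrypt rather than yielding a corrupted password.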
### 7.4 Repo protection
- Restic REST server runs with `--append-only` for routine backups
- A separate non-append-only credential exists for `forget`/`prune` operations, used only when explicitly invoked from the UI by an admin/operator and audited
### 7.5 Audit
- Every state-changing UI action and every server→agent command logged with user, target, timestamp, and payload
## 8. UI
Stack: HTMX + Tailwind + Go html/templates. No SPA framework. Server-rendered, progressive enhancement.
**Pages:**
- **Login**
- **Dashboard:** fleet overview (host cards: status, last backup, repo size, alerts)
- **Host detail:** tabs for Snapshots / Schedules / Jobs / Repo / Settings
- **Job detail:** live log streaming via WS, cancel button
- **Restore wizard:** host → snapshot → paths → target → confirm
- **Repos:** aggregate view across hosts
- **Alerts:** list, acknowledge
- **Settings:** users (admin), notification channels, agent download
- **Audit log**
## 9. Alerting
- **Triggers:** backup failed, backup hasn't run in N hours past its schedule, repo `check` failed, agent offline > N minutes, repo size growth anomaly
- **Channels (Phase 1):** webhook, ntfy, email (SMTP)
- **Channels (Phase 2+):** Discord, Slack, Pushover
## 10. Deployment
### 10.1 Control plane (Proxmox host or LXC)
`docker-compose.yml`:
```yaml
services:
restic-manager:
image: ghcr.io/<owner>/restic-manager:latest
restart: unless-stopped
ports:
- "8443:8443"
volumes:
- ./data:/data
- ./certs:/certs:ro
environment:
- RM_DATA_DIR=/data
- RM_LISTEN=:8443
- RM_BASE_URL=https://restic.lab.example
- RM_TLS_CERT=/certs/fullchain.pem
- RM_TLS_KEY=/certs/privkey.pem
- RM_SECRET_KEY_FILE=/data/secret.key
```
### 10.2 Restic REST server (Unraid)
Standard `restic/rest-server` container, `--append-only`, `--private-repos`, htpasswd mounted, data path on the share.
### 10.3 Agent install
- **Linux:** `curl -fsSL https://restic.lab.example/install.sh | sudo RM_TOKEN=xxx sh`
- **Windows:** `iwr https://restic.lab.example/install.ps1 | iex` (with `$env:RM_TOKEN`)
- Installer drops binary + service unit, calls enroll endpoint, starts service
## 11. Testing strategy
- **Unit tests:** restic JSON parsing, schedule reconciliation, retention policy logic
- **Integration tests:** spin up real `restic` + `rest-server` in Docker, exercise full backup/snapshot/restore flows
- **End-to-end:** Playwright against a compose-up'd stack with one Linux agent in a sibling container
- **Cross-platform agent CI:** build matrix `linux/amd64`, `linux/arm64`, `windows/amd64`; smoke test on Windows runner
## 12. Repository layout
```
restic-manager/
├── cmd/
│ ├── server/
│ └── agent/
├── internal/
│ ├── api/ # shared API types
│ ├── server/
│ │ ├── http/
│ │ ├── ws/
│ │ └── ui/ # templates, handlers
│ ├── agent/
│ │ ├── service/ # systemd / windows service glue
│ │ ├── runner/ # restic invocation
│ │ └── scheduler/
│ ├── restic/ # restic CLI wrapper, --json parsing
│ ├── store/ # sqlite layer
│ ├── crypto/ # secret encryption
│ └── auth/
├── web/
│ ├── templates/
│ └── static/
├── deploy/
│ ├── docker-compose.yml
│ ├── Dockerfile.server
│ └── install/
│ ├── install.sh
│ └── install.ps1
├── docs/
├── LICENSE # PolyForm Noncommercial 1.0.0
├── README.md
├── spec.md
└── tasks.md
```
## 13. Phased delivery
- **Phase 1 (MVP):** server skeleton, agent skeleton, enrollment, host list, snapshot list, on-demand backup, live job log
- **Phase 2:** schedules, retention, run-now for `forget`/`prune`/`check`/`unlock`, repo stats
- **Phase 3:** restore wizard, alerts (webhook/ntfy/email), audit log
- **Phase 4:** agent self-update, OIDC, multi-user/RBAC polish, repo trends
- **Phase 5:** OSS readiness — docs site, contribution guide, screenshot tour
## 14. Confirmed extensions (in scope)
These were originally listed as open questions and have been confirmed for inclusion. Slotted into phases below.
### 14.1 Cross-host restore
Restore a snapshot taken on host A onto host B (e.g. recover a dead box onto a fresh one, clone a workload onto a sibling host, restore a developer's home dir onto a new laptop).
- **Credential model:** target host's agent receives a temporary, server-issued read credential for the source host's repo, scoped to a single restore job and revoked immediately after
- **Path remapping:** UI allows rewriting source paths to target paths (e.g. `/home/alice` → `/home/alice-new`)
- **Permissions:** restore runs as the agent's service user; UI surfaces a warning when source paths require root and target service user is non-root
- **Phase:** 3 (with the restore wizard)
### 14.2 Bandwidth limiting
Per-host upload/download caps for backup, restore, and prune jobs.
- Exposed on the schedule editor as optional `--limit-upload` / `--limit-download` (KB/s)
- Also overridable on run-now jobs via the UI
- Persisted in `Schedule.options` (JSON blob) so the schema stays stable
- **Phase:** 2 (with scheduling)
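A small sketch of how the agent might turn the persisted options into restic flags. `--limit-upload` and `--limit-download` are restic's own global flags; the `ScheduleOptions` type and the `bandwidthArgs` helper are hypothetical, and a zero value is assumed to mean "no cap".

```go
package main

import (
	"fmt"
	"strconv"
)

// ScheduleOptions mirrors the Schedule.options JSON blob.
type ScheduleOptions struct {
	LimitUploadKbps   int
	LimitDownloadKbps int
}

// bandwidthArgs returns the extra restic flags for the configured caps.
// Zero or negative values add no flag, leaving the transfer unlimited.
func bandwidthArgs(o ScheduleOptions) []string {
	var args []string
	if o.LimitUploadKbps > 0 {
		args = append(args, "--limit-upload", strconv.Itoa(o.LimitUploadKbps))
	}
	if o.LimitDownloadKbps > 0 {
		args = append(args, "--limit-download", strconv.Itoa(o.LimitDownloadKbps))
	}
	return args
}

func main() {
	base := []string{"restic", "backup", "--json", "/etc"}
	cmd := append(base, bandwidthArgs(ScheduleOptions{LimitUploadKbps: 512})...)
	fmt.Println(cmd)
}
```

A run-now override could reuse the same helper by passing the overriding values instead of the stored ones.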
### 14.3 Pre/post backup hooks
Per-host shell commands run before and after a backup job. Use cases: `mysqldump`/`pg_dump` to a staging path, stop/start Docker containers, quiesce a service, post-backup notifications.
- **Schema:** `Schedule.pre_hook` and `Schedule.post_hook` (string, optional). For more complex cases, `Host.pre_hook_default` / `Host.post_hook_default` apply to all schedules on that host unless overridden
- **Execution:** agent runs hooks via the host's default shell (`/bin/sh` Linux, `cmd.exe` or PowerShell Windows — host-configurable)
- **Failure semantics:** `pre_hook` non-zero exit aborts the backup and marks the job failed. `post_hook` runs on both success and failure (with `RM_JOB_STATUS` env var); its own exit code is recorded but does not change the backup job's final status
- **Stdout/stderr:** captured into `JobLog` like restic output, prefixed `pre_hook:` / `post_hook:`
- **Security:** hooks are stored encrypted; only admins can edit them; every edit audit-logged
- **Phase:** 2 (with scheduling)
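The failure semantics above can be sketched as a small function with the three steps injected as callbacks. The `step` and `result` types and the `runWithHooks` name are illustrative, and this sketch assumes `post_hook` also runs when `pre_hook` aborted the backup, which the text does not state outright.

```go
package main

import "fmt"

// step runs one phase with extra environment variables and returns its exit code.
type step func(env map[string]string) int

// result summarises one job run.
type result struct {
	Status       string // succeeded|failed
	BackupRan    bool
	PostHookExit int
}

// runWithHooks applies the hook rules: a failing pre_hook aborts the backup,
// post_hook always runs and sees RM_JOB_STATUS, and the post_hook exit code
// is recorded without changing the final status.
func runWithHooks(pre, backup, post step) result {
	res := result{Status: "succeeded"}
	if pre != nil && pre(nil) != 0 {
		res.Status = "failed"
	} else {
		res.BackupRan = true
		if backup(nil) != 0 {
			res.Status = "failed"
		}
	}
	if post != nil {
		res.PostHookExit = post(map[string]string{"RM_JOB_STATUS": res.Status})
	}
	return res
}

func main() {
	failingPre := func(map[string]string) int { return 1 }
	backup := func(map[string]string) int { return 0 }
	post := func(env map[string]string) int {
		fmt.Println("post_hook sees RM_JOB_STATUS =", env["RM_JOB_STATUS"])
		return 0
	}
	fmt.Printf("%+v\n", runWithHooks(failingPre, backup, post))
}
```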
### 14.4 Prometheus `/metrics` endpoint
Standard Prometheus exposition on `/metrics`, protected by either bearer token or IP allow-list.
- **Metrics (per host):**
- `restic_manager_last_backup_timestamp_seconds{host=...}`
- `restic_manager_last_backup_status{host=...}` (1=success, 0=failure)
- `restic_manager_repo_size_bytes{host=...}`
- `restic_manager_snapshot_count{host=...}`
- `restic_manager_agent_online{host=...}` (1/0)
- `restic_manager_job_duration_seconds_bucket{kind=...,host=...}` (histogram)
- **Server-level:** `restic_manager_jobs_total{kind=...,status=...}`, `restic_manager_alerts_active`, `restic_manager_build_info`
- **Phase:** 4 (alongside repo trend charts — both rely on the same time-series data)
## 15. Future considerations (not yet committed)
- Read-only share links for snapshot listings (auditor view) — out of scope for personal/lab use; revisit if multi-tenant or org use cases emerge
+148
@@ -0,0 +1,148 @@
# restic-manager — Tasks
Tasks are grouped by phase. Each task has an ID for cross-referencing, an estimated size (S/M/L), and acceptance criteria.
Sizes: **S** = under a day, **M** = 1–3 days, **L** = 3–7 days.
---
## Phase 0 — Project bootstrap
- [ ] **P0-01** (S) Initialize Go module, `cmd/server`, `cmd/agent`, baseline `internal/` packages
- [ ] **P0-02** (S) Add LICENSE (PolyForm Noncommercial 1.0.0), README stub, CONTRIBUTING placeholder
- [ ] **P0-03** (S) Set up `golangci-lint`, `gofumpt`, `goimports`; pre-commit config
- [ ] **P0-04** (S) GitHub Actions: build matrix (linux amd64/arm64, windows amd64), unit tests, lint
- [ ] **P0-05** (S) `Dockerfile.server` (multi-stage, distroless), `deploy/docker-compose.yml`
- [ ] **P0-06** (S) Makefile / `taskfile.yml` with common targets (`build`, `test`, `run`, `release`)
---
## Phase 1 — MVP: enrollment, visibility, on-demand backup
### Server foundations
- [ ] **P1-01** (M) HTTP server scaffolding (`chi`, structured logging via `slog`, graceful shutdown)
- [ ] **P1-02** (M) SQLite store layer (`modernc.org/sqlite`) + migrations (`golang-migrate` or hand-rolled)
- [ ] **P1-03** (M) Schema for `users`, `sessions`, `hosts`, `repos`, `credentials`, `jobs`, `job_logs`, `snapshots`, `audit_log`
- [ ] **P1-04** (M) Auth: argon2id password hashing, login/logout, session cookies, CSRF middleware
- [ ] **P1-05** (S) First-run admin bootstrap (printed one-time setup token in server logs)
- [ ] **P1-06** (M) Secret encryption helper (AEAD with key from `RM_SECRET_KEY_FILE`)
- [ ] **P1-07** (M) Audit log writer + middleware
### Agent ↔ server protocol
- [ ] **P1-08** (M) Define shared API types in `internal/api` (Go structs, JSON tags)
- [ ] **P1-09** (L) WebSocket transport (`nhooyr.io/websocket`), framed JSON envelopes, request/response correlation, ping/pong, reconnect with backoff
- [ ] **P1-10** (M) Enrollment flow: `POST /api/agents/enroll` with one-time token → returns persistent bearer + cert pin
- [ ] **P1-11** (M) Agent registration on connect (`hello` message → upsert host record, mark online)
- [ ] **P1-12** (S) Heartbeat handler (mark host offline after 90s without heartbeat)
### Agent foundations
- [ ] **P1-13** (M) Agent config file (`/etc/restic-manager/agent.yaml` / `%PROGRAMDATA%\restic-manager\agent.yaml`)
- [ ] **P1-14** (M) Service integration: systemd unit + Windows service entrypoint
- [ ] **P1-15** (M) Outbound WS client with reconnect, server cert pinning
- [ ] **P1-16** (M) Restic wrapper: locate `restic` binary, run with `--json`, stream parsed events
- [ ] **P1-17** (S) Host metadata collection (OS, arch, hostname, restic version, agent version)
### Run-now backup
- [ ] **P1-18** (L) Job lifecycle: queued → running → succeeded/failed/cancelled, persisted with logs
- [ ] **P1-19** (M) Server endpoint `POST /api/hosts/:id/jobs` to dispatch a `backup` command
- [ ] **P1-20** (M) Agent executes `restic backup`, streams stdout/stderr + parsed JSON events back as `job.progress` / `log.stream`
- [ ] **P1-21** (M) Server persists log stream to `job_logs`, exposes `WS /api/jobs/:id/stream` for live tailing
- [ ] **P1-22** (S) Snapshot listing: `restic snapshots --json`, cached projection table, refresh after each backup
### UI (HTMX + Tailwind)
- [ ] **P1-23** (M) Base layout, login page, session-aware nav
- [ ] **P1-24** (M) Dashboard: host cards (status dot, last backup, repo size)
- [ ] **P1-25** (M) Host detail page: snapshots tab + run-now button
- [ ] **P1-26** (M) Live job log viewer (WS-driven, auto-scroll, cancel button)
- [ ] **P1-27** (S) "Add host" flow: generate token, copy install command snippet
- [ ] **P1-28** (S) Tailwind build via `tailwindcss` standalone binary (no Node)
### Install scripts
- [ ] **P1-29** (M) `install.sh` (Linux): detects arch, downloads agent, installs systemd unit, enrolls
- [ ] **P1-30** (M) `install.ps1` (Windows): downloads agent, installs as service, enrolls
- [ ] **P1-31** (S) Server endpoint to serve agent binaries + install scripts (signed)
### Phase 1 acceptance
- One Linux + one Windows host can enroll, appear in the dashboard, and a backup can be triggered from the UI with live log streaming. Snapshots list updates after success.
---
## Phase 2 — Scheduling, retention, repo operations
- [ ] **P2-01** (M) Schedule schema + CRUD API
- [ ] **P2-02** (L) Server-pushed schedule reconciliation (server is source of truth; agent applies)
- [ ] **P2-03** (M) Agent local scheduler (`robfig/cron/v3`); persists next-fire times across restarts
- [ ] **P2-04** (M) Schedule editor UI (paths, excludes, tags, cron, retention)
- [ ] **P2-05** (M) `forget` command with retention policy (keep-last/daily/weekly/monthly/yearly)
- [ ] **P2-06** (M) `prune` command (admin-only, uses non-append-only credential)
- [ ] **P2-07** (S) `check` command (random subset + `--read-data-subset`)
- [ ] **P2-08** (S) `unlock` command
- [ ] **P2-09** (M) Repo stats panel: size, dedup ratio, snapshot count, last check time, lock state
- [ ] **P2-10** (S) Run-now buttons for forget/prune/check/unlock on host detail page
- [ ] **P2-11** (S) Schedule "next run" / "last run" surfaced on host card
- [ ] **P2-12** (S) Bandwidth limit fields on schedule editor (`--limit-upload`, `--limit-download`); also overridable on run-now jobs
- [ ] **P2-13** (M) Pre/post backup hooks: schema (`Schedule.pre_hook`, `Schedule.post_hook`, `Host.pre_hook_default`, `Host.post_hook_default`), encrypted at rest, admin-only edit, audit-logged
- [ ] **P2-14** (M) Agent execution of hooks: configurable shell per host, `pre_hook` failure aborts backup, `post_hook` always runs with `RM_JOB_STATUS` env var, stdout/stderr captured into `JobLog` with prefix
- [ ] **P2-15** (S) Hook editor UI on schedule + host pages, with sensible warnings (e.g. "this hook runs as the agent service user")
### Phase 2 acceptance
- Schedules created in UI run on agents on time; retention is applied; admin can prune from UI; repo health visible per host. Pre/post hooks fire correctly (verified with a Docker stop/start example and a `mysqldump` example). Bandwidth limits honoured.
---
## Phase 3 — Restore, alerts, audit
- [ ] **P3-01** (L) Restore wizard backend: snapshot tree browse via `restic ls --json`, path picker, target selection
- [ ] **P3-02** (L) Restore wizard UI (multi-step: host → snapshot → paths → target → confirm)
- [ ] **P3-03** (M) Restore execution: `restic restore` invocation, progress streaming
- [ ] **P3-04** (L) Cross-host restore: target agent receives a temporary scoped read credential for source host's repo (single-job, auto-revoked); UI supports source→target path remapping; warns when source paths need root and target service user is non-root
- [ ] **P3-05** (M) Alert engine: rule evaluation loop (failed backup, stale schedule, agent offline, check failed)
- [ ] **P3-06** (M) Notification channels: webhook, ntfy, SMTP email
- [ ] **P3-07** (S) Alert UI: list, acknowledge, resolve
- [ ] **P3-08** (S) Audit log UI with filters (user, action, target, time range)
- [ ] **P3-09** (S) `diff` between two snapshots in UI
### Phase 3 acceptance
- A file deleted on a host can be restored from the UI in under 2 minutes. A failed backup raises an alert via the configured channel within 60s.
---
## Phase 4 — Self-update, RBAC polish, OIDC
- [ ] **P4-01** (L) Agent self-update: signed binary published by server, agent downloads, verifies, swaps, restarts
- [ ] **P4-02** (M) Agent version reporting on dashboard; "update all" admin action
- [ ] **P4-03** (M) RBAC enforcement at API layer (admin / operator / viewer)
- [ ] **P4-04** (S) User management UI (create/edit/disable, role assignment, password reset)
- [ ] **P4-05** (L) OIDC login (generic provider config, group → role mapping)
- [ ] **P4-06** (M) Repo size trend graphs (sparkline on host card, full chart on repo page)
- [ ] **P4-07** (S) Per-host tags + dashboard filtering by tag
- [ ] **P4-08** (M) Prometheus `/metrics` endpoint: per-host gauges (last backup timestamp, last backup status, repo size, snapshot count, agent online), server gauges (active alerts, build info), job duration histograms; protected by bearer token or IP allow-list
- [ ] **P4-09** (S) Document Prometheus integration + sample Grafana dashboard JSON
### Phase 4 acceptance
- Non-admin users see an appropriately limited UI. Agents update themselves with one click. OIDC login works against at least one provider (Authelia or Authentik). Prometheus can scrape `/metrics` and the sample Grafana dashboard renders with live data.
---
## Phase 5 — OSS readiness
- [ ] **P5-01** (M) Documentation site (mdBook or similar) with install, concepts, security model, screenshots
- [ ] **P5-02** (S) `CONTRIBUTING.md`, `CODE_OF_CONDUCT.md`, issue + PR templates
- [ ] **P5-03** (S) Release automation: `goreleaser` for binaries + Docker image to GHCR
- [ ] **P5-04** (S) Demo screenshots / short Loom walkthrough in README
- [ ] **P5-05** (S) `SECURITY.md` with disclosure process
- [ ] **P5-06** (M) End-to-end test suite in CI (Playwright vs. compose stack with sibling Linux agent)
- [ ] **P5-07** (S) Sample `docker-compose.yml` with TLS via Caddy sidecar
- [ ] **P5-08** (S) Optional Prometheus `/metrics` endpoint
### Phase 5 acceptance
- A stranger can read the docs and stand up a working install in under 30 minutes.
---
## Cross-cutting / ongoing
- [ ] **X-01** Keep CHANGELOG.md updated (Keep-a-Changelog format)
- [ ] **X-02** Track restic version compatibility matrix
- [ ] **X-03** Periodic dependency updates (`dependabot` or `renovate`)
- [ ] **X-04** Threat-model review at end of each phase