P3-X1: cancel-job feature
Wires the existing job_detail Cancel button (which was a UI stub) into
real backend behaviour:
- internal/api already declared MsgCommandCancel + CommandCancelPayload;
promote those from forward-declarations to a working envelope. Agent
side: cmd/agent/main.go drops the TODO-stub and gains a per-job
ctx.CancelFunc map. runJob's switch is refactored around a small
spawn() helper so each kind's goroutine derives a per-job context,
registers the cancel, and removes itself on completion regardless of
outcome. command.cancel looks up the func and fires it.
- internal/agent/runner.sendFinished now takes ctx and rebadges
ctx.Canceled errors as JobCancelled (exit 130) rather than
JobFailed. All Run* call sites updated.
- internal/restic.resticCmd sets cmd.Cancel to send SIGTERM (via
build-tagged sigterm constant; os.Kill on Windows since SIGTERM
isn't deliverable there) and cmd.WaitDelay=5s for the SIGKILL
fallback. SIGTERM lets restic remove its lock file before exiting.
- New POST /api/jobs/{id}/cancel server endpoint validates the job
is non-terminal and the host is online, sends command.cancel via
the hub, writes a job.cancel audit row, returns 202. The agent's
resulting job.finished (status=cancelled) is what actually
transitions the row.
Tests:
- internal/server/http/cancel_test.go covers happy path (envelope
shape + audit row), 409 for terminal jobs, 404 for missing jobs,
503 for offline hosts.
- internal/agent/runner/cancel_test.go covers cancel mid-run: a fake
restic that exec'd into 'sleep 30' is canceled 150ms after start
and the resulting job.finished reports JobCancelled with exit 130
in well under the WaitDelay.
Foundational for P3 restore (operator needs to be able to cancel a
running backup if they need to restore urgently). Independently useful
for prune/check/backup that are stuck.
This commit is contained in:
@@ -72,11 +72,30 @@ func (e Env) globalArgs() []string {
|
||||
// before the supplied subcommand args. Centralizing this so every
|
||||
// command (backup/forget/prune/check/unlock/init/stats) honors
|
||||
// the caps without each call site having to remember.
|
||||
//
|
||||
// Cancellation: by default exec.CommandContext sends SIGKILL when
|
||||
// ctx is canceled, which leaves restic no chance to clean up its
|
||||
// repository lock. Override Cmd.Cancel to send SIGTERM first, and
|
||||
// set Cmd.WaitDelay so the process is force-killed if it doesn't
|
||||
// exit within five seconds. Restic responds to SIGTERM by removing
|
||||
// its lock file before exiting, which is what we want when an
|
||||
// operator cancels a long-running backup/restore from the UI.
|
||||
func (e Env) resticCmd(ctx context.Context, sub ...string) *exec.Cmd {
|
||||
args := append(e.globalArgs(), sub...)
|
||||
cmd := exec.CommandContext(ctx, e.Bin, args...)
|
||||
cmd.Env = e.envSlice()
|
||||
cmd.Dir = e.WorkDir
|
||||
cmd.Cancel = func() error {
|
||||
// Cmd.Process is set after Start; Cancel only fires post-Start
|
||||
// so the nil check is defensive against the documented but
|
||||
// unlikely race. Signal returns ErrProcessDone if the process
|
||||
// already exited; that's not a problem here either.
|
||||
if cmd.Process == nil {
|
||||
return nil
|
||||
}
|
||||
return cmd.Process.Signal(sigterm)
|
||||
}
|
||||
cmd.WaitDelay = 5 * time.Second
|
||||
return cmd
|
||||
}
|
||||
|
||||
|
||||
Reference in New Issue
Block a user