yatsu-eval-tab
The dashboard eval tab — a node's measurement timeline (verdict + expected + live freshness) with evidence (image or transcript) on expand, plus the spec-cli read API behind it.
raw source
The eval/loss engine (spec-yatsu, built by yatsu-core) records readings; this is the surface that
reads them back. Realize the founding "Evidence — one timeline, two sources" contract's first source: a
node's eval tab lists its measurements chronologically, each carrying its verdict, the scenario's
expected, and the freshness signal spex yatsu scan reports, with the captured evidence (an image or a
transcript) expanding inline. LOCAL readings only for now — the forge issue-events source is a later
sibling; leave a clean seam for it.
expanded spec
Two halves behind one tab. The read engine (spec-cli, in evaltab.ts) computes what only a live
read knows. A node's measurement timeline is every reading from its yatsu.evals.ndjson sidecar (scenario,
the read's codeSha, blob + blobKind, evaluator, verdict, ts) joined with the scenario's expected
(from the live yatsu.md — what zero loss looks like) and a freshness flag, derived live from git by the
same freshness machinery scan uses: a reading is current until its governed code, its scenario, or the
evaluator version moves past the sha it was taken at; after that it is stale, and the flag records which axis moved. Readings
come back newest-first.
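
The join described above can be sketched roughly as follows. This is a minimal illustration, not the code in evaltab.ts: the field names come from the paragraph, while the types, the function name buildTimeline, and the movedSince callback (a stand-in for the git-backed freshness machinery) are assumptions.

```typescript
// Hypothetical shapes: field names follow the spec text, everything else is assumed.
type Axis = "code" | "scenario" | "evaluator";

interface Reading {
  scenario: string;
  codeSha: string;
  blob: string; // content hash of the evidence
  blobKind: string; // e.g. an image or a transcript (exact values assumed)
  evaluator: string;
  verdict?: string;
  ts: string; // assumed to be an ISO-8601 timestamp
}

type Freshness = { state: "current" } | { state: "stale"; moved: Axis };

interface TimelineEntry extends Reading {
  expected: string | null; // from the live yatsu.md, null if the scenario is gone
  freshness: Freshness;
}

// Assumed stand-in for the live git check: returns the axis that moved
// past the reading's sha, or null when nothing moved.
type MovedSince = (reading: Reading) => Axis | null;

function buildTimeline(
  ndjson: string,
  expectedByScenario: Map<string, string>,
  movedSince: MovedSince,
): TimelineEntry[] {
  return ndjson
    .split("\n")
    .filter((line) => line.trim() !== "") // tolerate a trailing newline
    .map((line) => JSON.parse(line) as Reading)
    .map((reading): TimelineEntry => {
      const moved = movedSince(reading);
      return {
        ...reading,
        expected: expectedByScenario.get(reading.scenario) ?? null,
        freshness: moved === null ? { state: "current" } : { state: "stale", moved },
      };
    })
    .sort((a, b) => Date.parse(b.ts) - Date.parse(a.ts)); // newest-first
}
```

Passing freshness in as a callback keeps the sketch independent of git; the real engine derives it live.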
This timeline rides the board: buildBoard folds each node's evalTimeline onto it as the evals
field — the SAME single source as a node's issues / overlays / lastDiff — so the dashboard reads it from the
one /api/board poll, never a per-node fetch. Alongside the readings it folds the node's declared
scenarios (name, expected, optional code), so a consumer sees the WHOLE set — a never-measured
scenario has no reading but is still a countable unit of loss (yatsu-score-badge's tile count, the
focus-panel). To keep the attach cheap it reuses the board's specs + driftIndex and one shared yatsu
walk. The bytes are the one thing NOT folded: /api/yatsu/blob serves a reading's evidence by content hash
from the shared cache, fetched lazily on expand, with a MIME type sniffed from the content, or a
miss original file signal when the bytes are gone.
(A standalone /api/specs/:id/evals route exposes the same engine for one id.)
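
As an illustration of the blob half, the sketch below looks bytes up by content hash and sniffs a MIME type from the leading bytes. The function names, the in-memory Map standing in for the shared cache, the 404 status, and the set of recognised signatures are all assumptions; the spec only states that a miss is signalled and that the MIME comes from the content.

```typescript
// Sniff a MIME type from well-known file signatures.
// Which formats the real route recognises is an assumption.
function sniffMime(bytes: Uint8Array): string {
  const startsWith = (sig: number[]): boolean => sig.every((b, i) => bytes[i] === b);
  if (startsWith([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a])) return "image/png";
  if (startsWith([0xff, 0xd8, 0xff])) return "image/jpeg";
  if (startsWith([0x47, 0x49, 0x46, 0x38])) return "image/gif";
  return "text/plain; charset=utf-8"; // fall back to treating it as a transcript
}

type BlobResult =
  | { status: 200; mime: string; body: Uint8Array }
  | { status: 404; miss: true }; // hypothetical encoding of the miss signal

// `cache` is a stand-in for the shared content-addressed cache.
function serveBlob(cache: Map<string, Uint8Array>, hash: string): BlobResult {
  const body = cache.get(hash);
  if (body === undefined) return { status: 404, miss: true };
  return { status: 200, mime: sniffMime(body), body };
}
```

Because the lookup key is the content hash, two readings that captured identical evidence share one cache entry.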
The eval tab (spec-dashboard) is a fourth face on the node popup beside spec/history/issues, on the
same panesFor registry. Because the readings arrive on the board prop, the tab is instant and consistent
— never the previous node's readings on a switch. It is a thin consumer of the chronological-timeline
scaffold the history tab uses (see work-pane): the newest reading is expanded, older ones reveal on the down gesture. Each
row's header names its scenario, the verdict badge (✓ pass / ✗ fail, with its optional note annotation
shown beside; a legacy badge for a pre-verdict or note-only reading), and the per-reading score circle
(yatsu-score-badge), then its evaluator, codeSha, and time.
On expand, a row's evidence shows the scenario's expected above the captured proof (screenshot or
transcript), or miss original file when the blob was pruned.
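
The verdict badge rule in the row header could be expressed like this. It is only a sketch: the input shape, the function name verdictBadge, and the assumption that a pre-verdict reading simply has no verdict field are not confirmed by the spec.

```typescript
// Hypothetical reading slice: only verdict and note are named in the spec text.
interface BadgeInput {
  verdict?: "pass" | "fail"; // assumed absent on pre-verdict, note-only readings
  note?: string;
}

interface Badge {
  label: string;
  note: string | null; // shown beside the badge when present
}

function verdictBadge(reading: BadgeInput): Badge {
  const note = reading.note ?? null;
  if (reading.verdict === "pass") return { label: "✓ pass", note };
  if (reading.verdict === "fail") return { label: "✗ fail", note };
  return { label: "legacy", note }; // no verdict recorded
}
```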
The tab surfaces the whole declared set in one list, not only the readings. A declared scenario
with no reading leads that list as a blind-spot row — the empty score ring over its name, its
expected, and the files it tracks. The ring is the only distinction (no fenced-off band, no second
scrollbar): an unmeasured scenario is the node's outstanding loss and belongs where the attention is, so a
node's intent is legible inside the popup before a reading lands (the only place it shows once the popup
covers the focus-panel). No reading at all → those rows under a hint; some measured, some not → those
rows lead the timeline. The one presence-distinct empty state survives: a node with no scenarios
(no yatsu.md → no evals field) shows nothing.
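
The list rules above (blind-spot rows lead, readings follow newest-first, and a missing evals field shows nothing) can be sketched as below. The type names, the tabRows function, and the string-array type for a scenario's optional code are assumptions made for illustration.

```typescript
// Hypothetical shapes for the folded evals field.
interface Scenario {
  name: string;
  expected: string;
  code?: string[]; // files the scenario tracks (type assumed)
}

interface ReadingRow {
  scenario: string;
  ts: string; // assumed ISO-8601
}

interface Evals {
  scenarios: Scenario[];
  readings: ReadingRow[];
}

type Row =
  | { kind: "blind-spot"; scenario: Scenario } // declared, never measured
  | { kind: "reading"; reading: ReadingRow };

interface TabList {
  rows: Row[];
  hint: boolean; // true when no reading exists at all
}

function tabRows(evals: Evals | undefined): TabList | null {
  // No yatsu.md means no evals field: the tab shows nothing.
  if (evals === undefined) return null;

  const measured = new Set(evals.readings.map((r) => r.scenario));
  const blindSpots = evals.scenarios
    .filter((s) => !measured.has(s.name))
    .map((s): Row => ({ kind: "blind-spot", scenario: s }));

  const readings = [...evals.readings]
    .sort((a, b) => Date.parse(b.ts) - Date.parse(a.ts)) // newest-first
    .map((r): Row => ({ kind: "reading", reading: r }));

  // Unmeasured scenarios lead the single list; there is no separate band.
  return { rows: [...blindSpots, ...readings], hint: evals.readings.length === 0 };
}
```

Returning null for the no-scenarios case keeps that empty state distinct from a node whose scenarios are declared but unmeasured.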
The seam / out of scope: the forge issue-events half of the timeline — each tracked issue appearing twice (open, close) and linking out to its forge-hosted image rather than a local blob — arrives with the needs-yatsu-eval forge node; the tab joins it at read time then. Backend and computer-use evaluators, and the cache cleanup surface, stay with their own nodes.