spec-yatsu¶

yatsu is the evaluation in spec=loss / commits=optimizer — a spec carries how to measure its loss, the agent measures it, yatsu keeps score and flags stale.

The third SpexCode package, with spec-cli, spec-dashboard, and spec-forge. Read the system as one optimization: a spec is a loss-function design (what we want), issues/commits are the optimizer (driving the code toward it), and yatsu is the evaluation — the measured loss, how far live behavior sits from the spec.

A spec carries how to measure its loss¶

Beyond the target, a node's yatsu.md says how to measure the loss against it — one or more scenarios, each: a description (what to check), the expected result (what zero loss looks like), and optionally a test file beside it (a real playwright.spec.ts, a script — whatever runs). This is the measurement, written next to the loss function. yatsu defines no DSL and runs nothing.

The agent measures; yatsu keeps score¶

The agent is the evaluator. When a score is stale, the agent reads the scenario, runs it however — the test file, by hand, a computer-use pass — compares the actual result to the expected, and files the measurement: spex yatsu eval <node> with the evidence it captured (a screenshot, a transcript) and a verdict (met expected, or how far off). yatsu executes nothing; it only records the result.

yatsu keeps score. Measurements live in a flat git-tracked yatsu.evals.ndjson beside the spec — a second git-as-database axis: a measurement commit is an evaluation event, so history / attribution / drift apply unchanged. A score is stale when its governed code:, its scenario, or the evaluator moved since — derived live from git, no stored hashes. Evidence bytes are content-addressed under the shared git common dir (one blob per content, shared by every worktree, never committed; gone → "miss original file").

spex yatsu scan — which scores are stale or missing.
spex yatsu eval [.|<node>] — the agent files a measurement (evidence + verdict).
spex yatsu show [.|<node>] [--json] — read a node's scores; the same data the dashboard's eval tab renders (one engine, two faces).
spex yatsu clean [--keep-latest|--all] — prune the evidence cache.

Proactive — the optimizer keeps its scores fresh¶

yatsu is the loss signal the optimizer reads, so a stale score is a blind spot. The core contract tells every agent: changed a node that has a yatsu.md? re-measure it. The stop-gate surfaces a stale or missing score the way it surfaces code-drift, so the nudge lands in the flow, not on demand. Only nodes that declare a scenario are in scope — a node with no surface to measure simply has none.

What's next¶

The computer-use "stupid user" is the agent's most thorough measuring hand — it just looks. Backend yatsu measures loss through real APIs (freshness reconcile waiting). Nothing in yatsu ever learns how to test: the spec defines the loss, the agent measures it, the optimizer drives it down, yatsu keeps score.