Evaluator primitive

Evaluators are the primitive of progressive autonomy

The manifesto says autonomy is earned by evaluation. This page names the evaluator itself: the target, technique, oracle, and position that make a gate real.

An unvalidated evaluator is not a gate. It is a guess with a UI.

An evaluator is four choices

Most eval work only makes the technique choice explicit. PAA requires the other three as well.

Target

What layer you evaluate: input, process, output, or outcome.

Start at the cheapest layer that catches the failure, then move deeper only when needed.

Technique

What produces the verdict: rule, metric, classifier, judge, or human.

The technique catalog is the implementation choice, not the whole evaluator.

Oracle

What the verdict is measured against: invariant, label, rubric, or downstream result.

Without an oracle, a score is just a number with confidence styling.

Position

Where it runs: blocking gate, async monitor, or offline promotion evidence.

Position determines whether the evaluator belongs in HITL, HOTL, or both.
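The four choices can be written down as a record, so that none of them stays implicit. This is a minimal sketch, not a reference schema: `EvaluatorSpec`, the enum names, and the example `schema_gate` are hypothetical names chosen for illustration. Only the option values come from the definitions above.

```python
from dataclasses import dataclass
from enum import Enum


class Target(Enum):
    INPUT = "input"
    PROCESS = "process"
    OUTPUT = "output"
    OUTCOME = "outcome"


class Technique(Enum):
    RULE = "rule"
    METRIC = "metric"
    CLASSIFIER = "classifier"
    JUDGE = "judge"
    HUMAN = "human"


class Oracle(Enum):
    INVARIANT = "invariant"
    LABEL = "label"
    RUBRIC = "rubric"
    DOWNSTREAM_RESULT = "downstream result"


class Position(Enum):
    BLOCKING_GATE = "blocking gate"
    ASYNC_MONITOR = "async monitor"
    OFFLINE_EVIDENCE = "offline promotion evidence"


@dataclass(frozen=True)
class EvaluatorSpec:
    """Hypothetical record: an evaluator is incomplete until all four are set."""
    name: str
    target: Target
    technique: Technique
    oracle: Oracle
    position: Position


# Example: a schema check that blocks malformed input before any work starts.
schema_gate = EvaluatorSpec(
    name="schema_gate",
    target=Target.INPUT,
    technique=Technique.RULE,
    oracle=Oracle.INVARIANT,
    position=Position.BLOCKING_GATE,
)
```

Because every field is required, a spec that names only the technique fails at construction time instead of shipping as a half-defined gate.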

Target

Gate at the cheapest layer that actually catches the failure.

Evaluator target layers
Layer	Question	Cost	Note
Input	Was the right context present and well-formed before acting?	Cheapest	Catches structural failures before work starts.
Process	Were the present inputs used correctly during generation?	Hard / often sealed	Usually inferred from output and outcome.
Output	Is the produced artifact good against some standard?	Needs an oracle	The default eval layer.
Outcome	Did it work downstream?	Truest, most lagged	Noisy and delayed, but the real numerator.

Technique catalog

Ordered by cost, cheapest first.

Evaluator technique catalog
Technique	Examples	Cost / latency	Best use
Deterministic / rule-based	Schema, types, bounds, regex, allow/deny lists, assertions	Near zero	Structural correctness and hard invariants
Reference-based metrics	Exact match, F1, numeric tolerance, tests against a reference	Cheap	Knowable-correct tasks with gold labels
Learned classifier	Logreg, SVM, small net on embeddings, SetFit, ModernBERT	Cheap inference	The mature high-volume gate
LLM-as-judge	A model scores output against a rubric, with or without a reference	Expensive / high latency	Bootstrap labels before distilling to cheaper gates
Human evaluation	Expert raters or reviewers	Highest	Ground truth for ambiguous or high-stakes cases
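The cheapest row of the catalog can be sketched in a few lines. This is an illustrative deterministic gate, assuming a made-up invoice record: the field names, the `INV-` ID format, the amount bounds, and the currency allow list are all hypothetical.

```python
import re

# Hypothetical invariants for an illustrative invoice record.
ALLOWED_CURRENCIES = {"USD", "EUR", "GBP"}
INVOICE_ID = re.compile(r"INV-\d{6}")
MAX_AMOUNT = 10_000.0


def rule_check(record: dict) -> list[str]:
    """Return a list of violations; an empty list means the record passes."""
    violations = []

    # Schema and types: every field must be present with the expected type.
    for field, expected in (("invoice_id", str), ("amount", float), ("currency", str)):
        if not isinstance(record.get(field), expected):
            violations.append(f"{field}: expected {expected.__name__}")
    if violations:
        return violations  # value checks are meaningless on a malformed record

    # Regex, bounds, and allow-list checks on well-typed values.
    if not INVOICE_ID.fullmatch(record["invoice_id"]):
        violations.append("invoice_id: bad format")
    if not 0 < record["amount"] <= MAX_AMOUNT:
        violations.append("amount: out of bounds")
    if record["currency"] not in ALLOWED_CURRENCIES:
        violations.append("currency: not on allow list")
    return violations
```

A gate like this needs no model call and no labels. Its oracle is the invariant itself, which is why it sits first in the cost order.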

Gate economics

The evaluator must cost less than the action it guards.

Evaluator economics
Principle	Rule	Implication
Gate economics	The evaluator must cost less than the action it guards.	If the gate is more expensive than the task, the economics fail.
Cost curve	Expensive at cold start, cheap at volume.	A reviewer-heavy phase can fund itself because the volume is still low.
Distillation	Move from judge to classifier once labels exist.	The accumulated review stream pays for the cheaper gate you will need later.
Boundary discipline	Instrument from the start.	The data must already be there when the gate needs to get cheaper.
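The first two rows reduce to simple arithmetic. The sketch below assumes per-item costs in a single currency; the function names and every number in the usage note are hypothetical, chosen only to show the shape of the comparison.

```python
def gate_is_economical(gate_cost_per_item: float, task_cost_per_item: float) -> bool:
    """Gate economics: the evaluator must cost less than the task it guards."""
    return gate_cost_per_item < task_cost_per_item


def gate_spend(cost_per_item: float, volume: int) -> float:
    """Total evaluator spend for a period: per-item cost times item volume."""
    return cost_per_item * volume


# Hypothetical per-item costs, for illustration only.
HUMAN_REVIEW_COST = 2.00
CLASSIFIER_COST = 0.001

cold_start_spend = gate_spend(HUMAN_REVIEW_COST, 50)            # low volume
human_at_scale_spend = gate_spend(HUMAN_REVIEW_COST, 100_000)   # same gate, high volume
classifier_at_scale_spend = gate_spend(CLASSIFIER_COST, 100_000)
```

With these assumed numbers, human review is affordable at 50 items and prohibitive at 100,000, while the distilled classifier at 100,000 items costs about what the reviewers cost at cold start. That is the cost curve in the table: the expensive gate is viable only while volume is low.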

Maturity curve

The evaluator descends the cost curve as the task climbs the autonomy curve.

Evaluator maturity curve
Stage	Evaluator setup	Autonomy level	Note
Cold start	Human plus LLM judge	Low autonomy	Expensive but affordable because volume is low.
Growing evidence	Distilled classifier begins to replace the judge	Rising autonomy	Review labels turn into training data.
Mature volume	Cheap classifier on the hot path	High autonomy	The gate is fast enough to run at scale.
Regression	Fallback to the more expensive evaluator or human	Demotion	Agreement or calibration drift exceeded the bar.

Metric choice

Optimize according to failure cost, not preference.

Metric choice by failure profile
Failure profile	Metric to favor	Why
Rare, high-cost failures	Recall	Misses are catastrophic, so escalate liberally.
Common, low-cost failures	Precision and throughput	False escalations become the expensive error.
Bootstrap labels	Human agreement	Use the judge to create the training set, not to pretend the gate is already proven.
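The recall versus precision trade can be made concrete with a score threshold. This is a minimal sketch, assuming the evaluator emits a failure score per item and that a label of 1 marks a real failure; the scores, labels, and thresholds are invented for illustration.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall of an escalation gate that flags score >= threshold.

    labels: 1 for a real failure, 0 for a good item (assumed convention).
    """
    flagged = [s >= threshold for s in scores]
    tp = sum(1 for f, y in zip(flagged, labels) if f and y == 1)
    fp = sum(1 for f, y in zip(flagged, labels) if f and y == 0)
    fn = sum(1 for f, y in zip(flagged, labels) if not f and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall


# Invented data: six items, three of them real failures.
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 0, 1, 0]

strict = precision_recall(scores, labels, threshold=0.5)   # fewer escalations
liberal = precision_recall(scores, labels, threshold=0.2)  # escalate liberally
```

On this toy data the liberal threshold catches every real failure at the price of more false escalations. That is the right trade when misses are catastrophic, and the wrong one when false escalations are the expensive error.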

Blocking vs monitoring

The same technique can live in different positions in the loop.

Gate position and loop behavior
Position	Timing	Loop mode	Note
Blocking gate	Pre-execution	HITL	Halts the write until the verdict clears.
Async monitor	Post-hoc	HOTL	Watches outputs and triggers demotion or alert.
Offline promotion evidence	Batch / aggregate	Promotion governance	Measures whether the task has earned more autonomy.
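To illustrate the first two positions, the sketch below wires one evaluator function into a blocking gate and into an async-style monitor. All names are hypothetical, the "write" is a list append standing in for a real side effect, and the monitor runs inline here where a real one would run off the hot path.

```python
from typing import Callable

Evaluator = Callable[[str], bool]  # True means the output passes


def no_secrets(output: str) -> bool:
    """Hypothetical rule: the output must not contain a credential marker."""
    return "API_KEY" not in output


def blocking_write(output: str, evaluator: Evaluator, sink: list) -> bool:
    """Blocking gate: the write happens only if the verdict clears."""
    if not evaluator(output):
        return False  # halted; the caller escalates to a human
    sink.append(output)
    return True


def monitored_write(output: str, evaluator: Evaluator, sink: list, alerts: list) -> None:
    """Monitor: the write always happens; a failed verdict raises an alert afterwards."""
    sink.append(output)
    if not evaluator(output):
        alerts.append(output)


gated_sink, monitored_sink, alerts = [], [], []
blocked = blocking_write("token=API_KEY123", no_secrets, gated_sink)
monitored_write("token=API_KEY123", no_secrets, monitored_sink, alerts)
```

The technique is identical in both calls. Only the position changes what a failed verdict does: in the first it prevents the write, in the second it flags a write that has already happened.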

How an evaluator gets promoted and demoted

The evaluator itself runs the same loop. It becomes eligible once it is instrumented and paired with human gold labels. A cheaper evaluator is promoted when its agreement with that gold clears the bar, is then monitored continuously for calibration drift, and is demoted when agreement degrades.
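The loop can be sketched as an agreement check plus a small state transition. This is one possible reading, not a specified protocol: the status names, the `next_status` function, and the 0.95 bar are assumptions made for illustration.

```python
AGREEMENT_BAR = 0.95  # assumed bar; the real value is a governance decision


def agreement(candidate: list, gold: list) -> float:
    """Fraction of items where the candidate evaluator matches human gold."""
    if not gold or len(candidate) != len(gold):
        raise ValueError("candidate and gold must be non-empty and equal length")
    return sum(c == g for c, g in zip(candidate, gold)) / len(gold)


def next_status(status: str, agreement_rate: float, bar: float = AGREEMENT_BAR) -> str:
    """Hypothetical transitions: eligible -> promoted -> demoted."""
    if status == "eligible" and agreement_rate >= bar:
        return "promoted"   # the cheaper evaluator cleared the bar
    if status == "promoted" and agreement_rate < bar:
        return "demoted"    # agreement degraded under monitoring
    return status           # otherwise nothing changes


# Invented verdicts: the candidate disagrees with human gold on one of four items.
rate = agreement([True, True, False, True], [True, True, False, False])
```

Running `next_status` on each fresh batch of human-labeled samples gives the continuous monitoring step: a promoted evaluator keeps its position only while agreement stays at or above the bar.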

That recursion is the whole architecture. The task and the gate are both moving, each on its own clock, each measured against the one above it.