Evaluator primitive

Evaluators are the primitive of progressive autonomy

The manifesto says autonomy is earned by evaluation. This page names the evaluator itself: the target, technique, oracle, and position that make a gate real.

An unvalidated evaluator is not a gate. It is a guess with a UI.

An evaluator is four choices

Most eval work makes only the technique choice explicit. PAA requires the other three as well.

Target

What layer you evaluate: input, process, output, or outcome.

Start at the cheapest layer that catches the failure, then move deeper only when needed.

Technique

What produces the verdict: rule, metric, classifier, judge, or human.

The technique catalog is the implementation choice, not the whole evaluator.

Oracle

What the verdict is measured against: invariant, label, rubric, or downstream result.

Without an oracle, a score is just a number with confidence styling.

Position

Where it runs: blocking gate, async monitor, or offline promotion evidence.

Position determines whether the evaluator belongs in a human-in-the-loop (HITL) flow, a human-on-the-loop (HOTL) flow, or both.
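As a rough illustration of the four choices above, an evaluator can be written down as a record with four required fields. This is a minimal sketch, not a PAA-defined schema: `EvaluatorSpec`, `schema_gate`, and the enum class names are assumptions made for this example. Only the option values come from the text above.

```python
from dataclasses import dataclass
from enum import Enum


class Target(Enum):
    INPUT = "input"
    PROCESS = "process"
    OUTPUT = "output"
    OUTCOME = "outcome"


class Technique(Enum):
    RULE = "rule"
    METRIC = "metric"
    CLASSIFIER = "classifier"
    JUDGE = "judge"
    HUMAN = "human"


class Oracle(Enum):
    INVARIANT = "invariant"
    LABEL = "label"
    RUBRIC = "rubric"
    DOWNSTREAM_RESULT = "downstream result"


class Position(Enum):
    BLOCKING_GATE = "blocking gate"
    ASYNC_MONITOR = "async monitor"
    OFFLINE_PROMOTION_EVIDENCE = "offline promotion evidence"


@dataclass(frozen=True)
class EvaluatorSpec:
    """Hypothetical record: all four choices are required, none has a default."""
    name: str
    target: Target
    technique: Technique
    oracle: Oracle
    position: Position


# Example: a schema check on produced output, run as a blocking gate.
schema_gate = EvaluatorSpec(
    name="output-schema-check",
    target=Target.OUTPUT,
    technique=Technique.RULE,
    oracle=Oracle.INVARIANT,
    position=Position.BLOCKING_GATE,
)
```

Because no field has a default, leaving out the oracle or the position raises a `TypeError` at construction time, which is one cheap way to keep the other three choices from staying implicit.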

Target

Gate at the cheapest layer that actually catches the failure.

Evaluator target layers
Layer | Question | Cost | Note
Input | Was the right context present and well-formed before acting? | Cheapest | Catches structural failures before work starts.
Process | Were the present inputs used correctly during generation? | Hard / often sealed | Usually inferred from output and outcome.
Output | Is the produced artifact good against some standard? | Needs an oracle | The default eval layer.
Outcome | Did it work downstream? | Truest, most lagged | Noisy and delayed, but the real numerator.
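To make the input layer concrete, here is a minimal sketch of a check that runs before any work starts and skips the action when the context is structurally broken. The field names in `REQUIRED_FIELDS` and the helper names are hypothetical, chosen only for this example.

```python
from typing import Callable, List

# Hypothetical required context for a support-ticket task.
REQUIRED_FIELDS = ("customer_id", "ticket_text")


def input_violations(context: dict) -> List[str]:
    """Input-layer check: is each required field present and non-empty?"""
    problems = []
    for field in REQUIRED_FIELDS:
        value = context.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    return problems


def run_if_inputs_ok(context: dict, act: Callable[[dict], str]) -> dict:
    """Call the (possibly expensive) action only when the input check passes."""
    problems = input_violations(context)
    if problems:
        return {"ran": False, "problems": problems, "output": None}
    return {"ran": True, "problems": [], "output": act(context)}
```

The point of the sketch is ordering: the structural failure is caught without paying for generation, so deeper layers only see cases the cheap check could not reject.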

Technique catalog

Ordered by cost, cheapest first.

Evaluator technique catalog
Technique | Examples | Cost / latency | Best use
Deterministic / rule-based | Schema, types, bounds, regex, allow/deny lists, assertions | Near zero | Structural correctness and hard invariants
Reference-based metrics | Exact match, F1, numeric tolerance, tests against a reference | Cheap | Knowable-correct tasks with gold labels
Learned classifier | Logistic regression, SVM, small net on embeddings, SetFit, ModernBERT | Cheap inference | The mature high-volume gate
LLM-as-judge | A model scores output against a rubric, with or without a reference | Expensive / high latency | Bootstrap labels before distilling to cheaper gates
Human evaluation | Expert raters or reviewers | Highest | Ground truth for ambiguous or high-stakes cases
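The two cheapest techniques can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not a reference implementation: the record shape, the `ORD-` ID pattern, the bounds, and the `min_f1` threshold are all hypothetical, and token-level F1 is just one common reference-based metric.

```python
import re
from collections import Counter
from typing import List

ORDER_ID = re.compile(r"ORD-\d{6}")  # hypothetical ID format


def rule_violations(record: dict) -> List[str]:
    """Deterministic checks: a type check, a regex, and a numeric bound."""
    violations = []
    order_id = record.get("order_id")
    if not isinstance(order_id, str) or not ORDER_ID.fullmatch(order_id):
        violations.append("order_id must look like ORD-123456")
    amount = record.get("amount")
    is_number = isinstance(amount, (int, float)) and not isinstance(amount, bool)
    if not is_number or not 0 < amount <= 10_000:
        violations.append("amount must be a number in (0, 10000]")
    return violations


def token_f1(prediction: str, reference: str) -> float:
    """Reference-based metric: F1 over whitespace-separated, lowercased tokens."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    if not pred or not ref:
        return float(pred == ref)  # both empty counts as a match
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def evaluate(record: dict, reference_summary: str, min_f1: float = 0.5) -> dict:
    """Cheapest first: the metric is only computed if the rules pass."""
    violations = rule_violations(record)
    if violations:
        return {"passed": False, "stage": "rule", "detail": violations}
    score = token_f1(str(record.get("summary", "")), reference_summary)
    return {"passed": score >= min_f1, "stage": "metric", "detail": score}
```

Classifier, judge, and human review would slot in after these two in the same pattern, each one only seeing the cases the cheaper technique could not settle.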

Gate economics

The evaluator must cost less than the action it guards.

Evaluator economics
Principle | Rule | Implication
Gate economics | The evaluator must cost less than the action it guards. | If the gate is more expensive than the task, the economics fail.
Cost curve | Expensive at cold start, cheap at volume. | A reviewer-heavy phase can fund itself because the volume is still low.
Distillation | Move from judge to classifier once labels exist. | The accumulated review stream pays for the cheaper gate you will need later.
Boundary discipline | Instrument from the start. | The data must already be there when the gate needs to get cheaper.
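The arithmetic behind these principles is simple enough to sketch. All prices below are invented for illustration and are not measurements; the function names are likewise hypothetical.

```python
import math

# Hypothetical per-verdict costs in dollars (illustrative only).
HUMAN_COST = 2.00
JUDGE_COST = 0.02
CLASSIFIER_COST = 0.0001


def gate_is_economic(gate_cost_per_item: float, action_cost_per_item: float) -> bool:
    """Gate economics: the evaluator must cost less than the action it guards."""
    return gate_cost_per_item < action_cost_per_item


def daily_gate_cost(items_per_day: int, cost_per_verdict: float) -> float:
    """Cost curve: total spend scales with volume."""
    return items_per_day * cost_per_verdict


def distillation_break_even(one_time_cost: float, judge_cost: float, classifier_cost: float) -> int:
    """Distillation: verdicts needed before the cheaper gate has paid for itself."""
    saving = judge_cost - classifier_cost
    if saving <= 0:
        raise ValueError("the classifier must be cheaper per verdict than the judge")
    return math.ceil(one_time_cost / saving)


cold_start = daily_gate_cost(50, HUMAN_COST)          # 50 items/day under human review
at_volume = daily_gate_cost(100_000, HUMAN_COST)      # same gate at 100,000 items/day
distilled = daily_gate_cost(100_000, CLASSIFIER_COST)
```

With these made-up numbers, human review is affordable at 50 items a day and untenable at 100,000, which is why the labels collected during the cheap-volume phase need to exist before the volume arrives.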

Maturity curve

The evaluator descends the cost curve as the task climbs the autonomy curve.

Evaluator maturity curve
Stage | Evaluator setup | Autonomy level | Meaning
Cold start | Human plus LLM judge | Low autonomy | Expensive but affordable because volume is low.
Growing evidence | Distilled classifier begins to replace the judge | Rising autonomy | Review labels turn into training data.
Mature volume | Cheap classifier on the hot path | High autonomy | The gate is fast enough to run at scale.
Regression | Fallback to the more expensive evaluator or human | Demotion | Agreement or calibration drift exceeded the bar.
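One way to read the maturity curve is as a small state machine in which regression is a transition rather than a resting state. The sketch below assumes a single agreement score and two hypothetical thresholds (`promote_bar`, `demote_bar`); the source does not specify the bar or how it is measured.

```python
from enum import Enum


class Stage(Enum):
    COLD_START = 0        # human plus LLM judge
    GROWING_EVIDENCE = 1  # distilled classifier begins to replace the judge
    MATURE_VOLUME = 2     # cheap classifier on the hot path


def step(stage: Stage, agreement: float,
         promote_bar: float = 0.95, demote_bar: float = 0.90) -> Stage:
    """Return the next stage given the cheaper evaluator's agreement score."""
    if agreement < demote_bar and stage is not Stage.COLD_START:
        return Stage(stage.value - 1)  # regression: fall back to the costlier setup
    if agreement >= promote_bar and stage is not Stage.MATURE_VOLUME:
        return Stage(stage.value + 1)  # evidence cleared the bar: move up one stage
    return stage                       # between the bars: hold position
```

The gap between the two thresholds is a design choice in this sketch: it keeps an evaluator hovering near the bar from flapping between stages on every measurement.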

Metric choice

Optimize according to failure cost, not preference.

Metric choice by failure profile
Failure profile | Metric to favor | Why
Rare, high-cost failures | Recall | Misses are catastrophic, so escalate liberally.
Common, low-cost failures | Precision and throughput | False escalations become the expensive error.
Bootstrap labels | Human agreement | Use the judge to create the training set, not to pretend the gate is already proven.
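The trade-off between recall and precision can be made explicit by pricing each error type and choosing the escalation threshold that minimizes total cost on labeled data. A minimal sketch: the scores, labels, and costs are invented, and `cheapest_threshold` is a hypothetical helper, not part of any library.

```python
from typing import List, Sequence, Tuple


def confusion(scores: Sequence[float], labels: Sequence[int], threshold: float) -> Tuple[int, int, int]:
    """Count (true escalations, false escalations, misses); label 1 means a real failure."""
    tp = fp = fn = 0
    for score, is_failure in zip(scores, labels):
        escalated = score >= threshold
        if escalated and is_failure:
            tp += 1
        elif escalated:
            fp += 1
        elif is_failure:
            fn += 1
    return tp, fp, fn


def precision_recall(scores, labels, threshold) -> Tuple[float, float]:
    tp, fp, fn = confusion(scores, labels, threshold)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


def cheapest_threshold(scores, labels, candidates: List[float],
                       miss_cost: float, false_escalation_cost: float) -> float:
    """Pick the candidate threshold with the lowest total error cost."""
    def total_cost(threshold: float) -> float:
        _, fp, fn = confusion(scores, labels, threshold)
        return fn * miss_cost + fp * false_escalation_cost
    return min(candidates, key=total_cost)


scores = [0.1, 0.4, 0.35, 0.8, 0.65, 0.9]   # evaluator's failure scores
labels = [0, 0, 1, 1, 0, 1]                 # human gold: 1 = real failure
candidates = [0.3, 0.5, 0.7]

high_stakes = cheapest_threshold(scores, labels, candidates, miss_cost=100, false_escalation_cost=1)
low_stakes = cheapest_threshold(scores, labels, candidates, miss_cost=1, false_escalation_cost=5)
```

On this toy data the high-stakes pricing selects the lowest threshold (0.3, full recall at 0.6 precision), while the low-stakes pricing selects the highest (0.7, full precision with one miss).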

Blocking vs monitoring

The same technique can live in different positions in the loop.

Gate position and loop behavior
Position | Timing | Loop mode | Behavior
Blocking gate | Pre-execution | HITL | Halts the write until the verdict clears.
Async monitor | Post-hoc | HOTL | Watches outputs and triggers demotion or alert.
Offline promotion evidence | Batch / aggregate | Promotion governance | Measures whether the task has earned more autonomy.
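To show the same technique in two positions, the sketch below wires one rule-based evaluator into a blocking write path and into a post-hoc check over what was already written. Everything here is hypothetical: the `no_inline_secret` rule, the list standing in for a data store, and the 5% alert rate. The monitor is written as a plain function for brevity; in practice it would run on its own schedule.

```python
from typing import Callable, Dict, List

Evaluator = Callable[[str], bool]  # True means the verdict clears


def no_inline_secret(text: str) -> bool:
    """Hypothetical rule-based evaluator: reject text that embeds an API key."""
    return "API_KEY=" not in text


def blocking_write(text: str, evaluator: Evaluator, store: List[str]) -> bool:
    """Blocking gate: the write only happens if the verdict clears."""
    if not evaluator(text):
        return False
    store.append(text)
    return True


def monitor_batch(written: List[str], evaluator: Evaluator,
                  alert_rate: float = 0.05) -> Dict[str, object]:
    """Post-hoc monitor: nothing is halted; a high failure rate raises an alert."""
    failures = sum(1 for text in written if not evaluator(text))
    rate = failures / len(written) if written else 0.0
    return {"failure_rate": rate, "alert": rate > alert_rate}
```

The evaluator function is identical in both call sites. What changes is what a failing verdict does: prevent one write, or feed an aggregate signal that can trigger demotion.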

How an evaluator gets promoted and demoted

The evaluator itself runs the same loop. It becomes eligible once it is instrumented and paired with human gold labels. A cheaper evaluator is promoted when its agreement clears the bar, is then monitored continuously for calibration drift, and is demoted when agreement degrades.

That recursion is the whole architecture. The task and the gate are both moving, each on its own clock, each measured against the one above it.
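The promote, monitor, demote loop for an evaluator can be sketched as a rolling agreement check against the verdicts of the evaluator above it. This is a minimal sketch under assumptions: raw agreement rate stands in for whatever agreement or calibration measure is actually used, and the window size, bars, and minimum sample count are hypothetical defaults.

```python
from collections import deque
from typing import Optional


class AgreementMonitor:
    """Tracks how often a cheaper evaluator matches the reference verdicts."""

    def __init__(self, window: int = 200, promote_bar: float = 0.95,
                 demote_bar: float = 0.90, min_samples: int = 50) -> None:
        self.matches = deque(maxlen=window)  # oldest comparisons fall off the window
        self.promote_bar = promote_bar
        self.demote_bar = demote_bar
        self.min_samples = min_samples

    def record(self, candidate_verdict: bool, reference_verdict: bool) -> None:
        self.matches.append(candidate_verdict == reference_verdict)

    def agreement(self) -> Optional[float]:
        if not self.matches:
            return None
        return sum(self.matches) / len(self.matches)

    def decision(self) -> str:
        """Return 'promote', 'demote', or 'hold' based on the current window."""
        if len(self.matches) < self.min_samples:
            return "hold"  # not enough paired evidence yet
        rate = self.agreement()
        if rate >= self.promote_bar:
            return "promote"
        if rate < self.demote_bar:
            return "demote"
        return "hold"
```

Because the window is bounded, recent disagreement outweighs a long clean history, which is what lets the monitor notice drift after promotion.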