An open, vendor-neutral standard for autonomy that is earned, scoped, and reversible.
Evaluator primitive
The manifesto says autonomy is earned by evaluation. This page names the evaluator itself: the target, technique, oracle, and position that make a gate real.
An unvalidated evaluator is not a gate. It is a guess with a UI.
Most eval work makes only the technique choice explicit. PAA requires the other three as well.
Target
What layer you evaluate: input, process, output, or outcome.
Start at the cheapest layer that catches the failure, then move deeper only when needed.
Technique
What produces the verdict: rule, metric, classifier, judge, or human.
The technique catalog is the implementation choice, not the whole evaluator.
Oracle
What the verdict is measured against: invariant, label, rubric, or downstream result.
Without an oracle, a score is just a number with confidence styling.
Position
Where it runs: blocking gate, async monitor, or offline promotion evidence.
Position determines whether the evaluator belongs in human-in-the-loop (HITL) review, human-on-the-loop (HOTL) monitoring, or both.
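
The four parts above can be written down as a single record, so that an evaluator with a missing oracle or position cannot be declared at all. This is a minimal sketch, not part of the standard: the class names, enum values, and the `invoice-schema-check` example are all hypothetical, chosen only to mirror the vocabulary on this page.

```python
from dataclasses import dataclass
from enum import Enum


class Target(Enum):
    INPUT = "input"
    PROCESS = "process"
    OUTPUT = "output"
    OUTCOME = "outcome"


class Technique(Enum):
    RULE = "rule"
    METRIC = "metric"
    CLASSIFIER = "classifier"
    JUDGE = "judge"
    HUMAN = "human"


class Oracle(Enum):
    INVARIANT = "invariant"
    LABEL = "label"
    RUBRIC = "rubric"
    DOWNSTREAM_RESULT = "downstream_result"


class Position(Enum):
    BLOCKING_GATE = "blocking_gate"
    ASYNC_MONITOR = "async_monitor"
    OFFLINE_PROMOTION = "offline_promotion"


@dataclass(frozen=True)
class EvaluatorSpec:
    """Hypothetical record: all four fields are required, none has a default."""
    name: str
    target: Target
    technique: Technique
    oracle: Oracle
    position: Position


# Illustrative instance: a schema check on the output, run as a blocking gate.
schema_gate = EvaluatorSpec(
    name="invoice-schema-check",
    target=Target.OUTPUT,
    technique=Technique.RULE,
    oracle=Oracle.INVARIANT,
    position=Position.BLOCKING_GATE,
)
```

Because no field has a default, omitting the oracle or the position raises a `TypeError` at construction time instead of producing a half-specified gate.
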
Gate at the cheapest layer that actually catches the failure.
| Layer | Question | Cost | Note |
|---|---|---|---|
| Input | Was the right context present and well-formed before acting? | Cheapest | Catches structural failures before work starts. |
| Process | Were the present inputs used correctly during generation? | Hard / often sealed | Usually inferred from output and outcome. |
| Output | Is the produced artifact good against some standard? | Needs an oracle | The default eval layer. |
| Outcome | Did it work downstream? | Truest, most lagged | Noisy and delayed, but the real numerator. |
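
One way to read the layer table is as an ordered list of checks that stops at the first failure, so the cheap input check runs before anything that needs an oracle. The sketch below assumes a case is a plain dict with hypothetical `context` and `output` keys; the check functions are illustrative stand-ins, not prescribed checks.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A check returns a failure reason, or None when the case passes.
Check = Callable[[Dict[str, str]], Optional[str]]


def input_check(case: Dict[str, str]) -> Optional[str]:
    """Input layer: was the context present before acting?"""
    if not case.get("context", "").strip():
        return "context missing"
    return None


def output_check(case: Dict[str, str]) -> Optional[str]:
    """Output layer: is there an artifact to judge at all?"""
    if not case.get("output", "").strip():
        return "empty artifact"
    return None


# Ordered cheapest first; deeper layers would be appended after these.
LAYERS: List[Tuple[str, Check]] = [
    ("input", input_check),
    ("output", output_check),
]


def first_failure(case: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Return (layer, reason) for the cheapest layer that catches a failure."""
    for layer, check in LAYERS:
        reason = check(case)
        if reason is not None:
            return layer, reason
    return None
```

A case with no context is rejected at the input layer and the output check never runs, which is the point of ordering by cost.
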
Ordered by cost, cheapest first.
| Technique | Examples | Cost / latency | Best use |
|---|---|---|---|
| Deterministic / rule-based | Schema, types, bounds, regex, allow/deny lists, assertions | Near zero | Structural correctness and hard invariants |
| Reference-based metrics | Exact match, F1, numeric tolerance, tests against a reference | Cheap | Knowable-correct tasks with gold labels |
| Learned classifier | Logreg, SVM, small net on embeddings, SetFit, ModernBERT | Cheap inference | The mature high-volume gate |
| LLM-as-judge | A model scores output against a rubric, with or without a reference | Expensive / high latency | Bootstrap labels before distilling to cheaper gates |
| Human evaluation | Expert raters or reviewers | Highest | Ground truth for ambiguous or high-stakes cases |
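
The two cheapest rows of the catalog fit in a few lines. The sketch below shows a deterministic check (types, bounds, and a regex) and a reference-based metric (token-overlap F1). The record fields `amount` and `date`, the bound of 10,000, and the whitespace tokenization are assumptions made for illustration.

```python
import re
from collections import Counter

ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def passes_rules(record: dict) -> bool:
    """Deterministic gate: type, bounds, and format checks on hypothetical fields."""
    amount = record.get("amount")
    date = record.get("date")
    return (
        isinstance(amount, (int, float))
        and not isinstance(amount, bool)  # bool is a subclass of int; reject it
        and 0 <= amount <= 10_000
        and isinstance(date, str)
        and ISO_DATE.fullmatch(date) is not None
    )


def token_f1(prediction: str, reference: str) -> float:
    """Reference-based metric: F1 over whitespace tokens, case-insensitive."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

The rule needs no labels at all, while the metric needs a gold reference for each case. That difference in oracle is what separates the first two rows.
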
The evaluator must cost less than the action it guards.
| Principle | Rule | Implication |
|---|---|---|
| Gate economics | The evaluator must cost less than the action it guards. | If the gate is more expensive than the task, the economics fail. |
| Cost curve | Expensive at cold start, cheap at volume. | A reviewer-heavy phase can fund itself because the volume is still low. |
| Distillation | Move from judge to classifier once labels exist. | The accumulated review stream pays for the cheaper gate you will need later. |
| Boundary discipline | Instrument from the start. | The data must already be there when the gate needs to get cheaper. |
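
The gate-economics and distillation rows reduce to simple arithmetic. As a sketch, assume a fixed one-time cost to train a distilled classifier and a constant per-item cost for both the judge and the classifier; the function names and all figures are hypothetical.

```python
import math
from typing import Optional


def gate_is_economic(evaluator_cost: float, action_cost: float) -> bool:
    """The evaluator must cost less than the action it guards."""
    return evaluator_cost < action_cost


def break_even_volume(
    training_cost: float,
    judge_cost_per_item: float,
    classifier_cost_per_item: float,
) -> Optional[int]:
    """Smallest item count at which training plus classifier inference is
    strictly cheaper than running the judge on every item.

    Returns None when the classifier is not cheaper per item, because
    distillation then never pays back.
    """
    saving = judge_cost_per_item - classifier_cost_per_item
    if saving <= 0:
        return None
    return math.floor(training_cost / saving) + 1


# Hypothetical figures: 100 units to train, 0.50 per judge call,
# 0.25 per classifier call.
volume = break_even_volume(100.0, 0.50, 0.25)  # 401
```

Below the break-even volume the expensive judge is still the cheaper option overall, which is why a reviewer-heavy cold start can be the economical choice while volume is low.
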
The evaluator descends the cost curve as the task climbs the autonomy curve.
| Stage | Evaluator setup | Autonomy level | Meaning |
|---|---|---|---|
| Cold start | Human plus LLM judge | Low autonomy | Expensive but affordable because volume is low. |
| Growing evidence | Distilled classifier begins to replace the judge | Rising autonomy | Review labels turn into training data. |
| Mature volume | Cheap classifier on the hot path | High autonomy | The gate is fast enough to run at scale. |
| Regression | Fallback to the more expensive evaluator or human | Demotion | Agreement or calibration drift exceeded the bar. |
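
The regression row depends on measuring agreement between the cheap evaluator and human labels. The page does not prescribe a statistic; the sketch below uses raw agreement and Cohen's kappa as one reasonable choice, and the fallback bar of 0.6 is a hypothetical value.

```python
from collections import Counter
from typing import Hashable, Sequence


def raw_agreement(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Fraction of items on which the two label sequences agree."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    observed = raw_agreement(a, b)
    n = len(a)
    counts_a, counts_b = Counter(a), Counter(b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; treated as full
        # agreement here (a convention assumed for this sketch).
        return 1.0
    return (observed - expected) / (1 - expected)


def should_fall_back(human: Sequence[Hashable], cheap: Sequence[Hashable],
                     bar: float = 0.6) -> bool:
    """True when the cheap evaluator's agreement with humans drops below the bar."""
    return cohen_kappa(human, cheap) < bar
```

Kappa is used here instead of raw agreement because a gate that passes almost everything can agree with humans most of the time purely by chance.
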
Optimize according to failure cost, not preference.
| Failure profile | Metric to favor | Why |
|---|---|---|
| Rare, high-cost failures | Recall | Misses are catastrophic, so escalate liberally. |
| Common, low-cost failures | Precision and throughput | False escalations become the expensive error. |
| Bootstrap labels | Human agreement | Use the judge to create the training set, not to pretend the gate is already proven. |
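
For the first row of the table, favoring recall can be made concrete by choosing the escalation threshold from labeled data. The sketch assumes the evaluator emits a risk score per item and escalates when the score is at or above a threshold; the function names and the sample data are illustrative.

```python
from typing import List, Sequence, Tuple


def recall_precision(scores: Sequence[float], is_failure: Sequence[bool],
                     threshold: float) -> Tuple[float, float]:
    """Recall and precision of 'escalate when score >= threshold'."""
    flagged: List[bool] = [s >= threshold for s in scores]
    tp = sum(f and y for f, y in zip(flagged, is_failure))
    fp = sum(f and not y for f, y in zip(flagged, is_failure))
    fn = sum((not f) and y for f, y in zip(flagged, is_failure))
    recall = tp / (tp + fn) if tp + fn else 1.0
    precision = tp / (tp + fp) if tp + fp else 1.0
    return recall, precision


def threshold_for_recall(scores: Sequence[float], is_failure: Sequence[bool],
                         min_recall: float) -> float:
    """Highest threshold that still meets the recall target, which keeps
    false escalations as low as the target allows."""
    if not scores:
        raise ValueError("need at least one scored item")
    for t in sorted(set(scores), reverse=True):
        recall, _ = recall_precision(scores, is_failure, t)
        if recall >= min_recall:
            return t
    return min(scores)  # fallback; the lowest threshold flags every item


scores = [0.9, 0.8, 0.6, 0.4, 0.2]
is_failure = [True, False, True, False, False]
strict = threshold_for_recall(scores, is_failure, min_recall=1.0)   # 0.6
relaxed = threshold_for_recall(scores, is_failure, min_recall=0.5)  # 0.9
```

The strict setting catches both failures at the price of one false escalation; the relaxed setting escalates only the top-scored item. Which one is right depends on the failure cost, as the table says.
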
The same technique can live in different positions in the loop.
| Position | Timing | Loop mode | Behavior |
|---|---|---|---|
| Blocking gate | Pre-execution | HITL | Halts the write until the verdict clears. |
| Async monitor | Post-hoc | HOTL | Watches outputs and triggers demotion or alert. |
| Offline promotion evidence | Batch / aggregate | Promotion governance | Measures whether the task has earned more autonomy. |
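
To illustrate that position is independent of technique, the sketch below runs one rule-based evaluator in all three positions. The `no_secrets` check, the list used as a write sink, and the function names are hypothetical, and the monitor runs inline here for simplicity where a real one would run out of band.

```python
from typing import Callable, List

Evaluator = Callable[[str], bool]  # True means the verdict clears


def no_secrets(text: str) -> bool:
    """Illustrative rule: the payload must not contain an API key marker."""
    return "API_KEY" not in text


def blocking_gate(payload: str, evaluate: Evaluator, sink: List[str]) -> bool:
    """Pre-execution: the write happens only if the verdict clears."""
    if not evaluate(payload):
        return False
    sink.append(payload)
    return True


def async_monitor(payload: str, evaluate: Evaluator,
                  sink: List[str], alerts: List[str]) -> None:
    """Post-hoc: the write always happens; a failed verdict raises an alert."""
    sink.append(payload)
    if not evaluate(payload):
        alerts.append(payload)


def promotion_evidence(batch: List[str], evaluate: Evaluator) -> float:
    """Offline: aggregate pass rate over a batch of past outputs."""
    if not batch:
        raise ValueError("batch must not be empty")
    return sum(evaluate(item) for item in batch) / len(batch)
```

The blocking gate prevents the bad write, the monitor records it after the fact, and the offline pass rate says nothing about any single write but feeds the promotion decision.
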
The evaluator itself runs the same loop. It becomes eligible once it is instrumented and paired with human gold labels, is promoted when a cheaper evaluator clears the agreement bar, is monitored continuously for calibration drift, and is demoted when agreement degrades.
That recursion is the whole architecture. The task and the gate are both moving, each on its own clock, each measured against the one above it.
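
The evaluator's own loop can be sketched as a small state machine driven by agreement with human gold labels. The state names, the promote bar of 0.90, and the lower demote bar of 0.80 are assumptions; the gap between the two bars is a design choice added here to avoid flapping, not something this page specifies.

```python
from enum import Enum


class State(Enum):
    INELIGIBLE = "ineligible"
    ELIGIBLE = "eligible"
    PROMOTED = "promoted"
    DEMOTED = "demoted"


def next_state(state: State, *, instrumented: bool, has_human_gold: bool,
               agreement: float, promote_bar: float = 0.90,
               demote_bar: float = 0.80) -> State:
    """One step of the evaluator lifecycle (hypothetical thresholds)."""
    if not (instrumented and has_human_gold):
        return State.INELIGIBLE  # nothing to measure against yet
    if state is State.INELIGIBLE:
        return State.ELIGIBLE
    if state in (State.ELIGIBLE, State.DEMOTED) and agreement >= promote_bar:
        return State.PROMOTED
    if state is State.PROMOTED and agreement < demote_bar:
        return State.DEMOTED
    return state


# Replay a hypothetical series of agreement readings.
history = []
state = State.INELIGIBLE
for reading in [0.85, 0.93, 0.91, 0.78]:
    state = next_state(state, instrumented=True, has_human_gold=True,
                       agreement=reading)
    history.append(state)
```

The same function could drive the task's autonomy level with a different evidence signal, which is the sense in which the task and the gate run the same loop.
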