Canonical flow

Support response agent

Evaluator maturity pipeline for support replies.

Abstract pattern

Evaluator maturity pipeline

This is the evaluator-maturity example: it shows how the gate itself can evolve from an expensive LLM judge into a distilled classifier while the worker stays the same.

  1. Assisted: Human-led with automation support.
  2. HITL: Human approves each action.
  3. HOTL: Human samples or monitors.
  4. Autonomous: Automation acts within guardrails.
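
The four levels above can be read as an ordered ladder that a task moves along one step at a time. The following is a minimal sketch, not part of the original flow definition: the names Autonomy, promote, and demote are hypothetical, and the ceiling parameter is an assumption used to model a flow that is only allowed to climb as far as HOTL.

```python
from enum import IntEnum


class Autonomy(IntEnum):
    """Ordered autonomy levels; a higher value means less human involvement."""
    ASSISTED = 1    # human-led with automation support
    HITL = 2        # human approves each action
    HOTL = 3        # human samples or monitors
    AUTONOMOUS = 4  # automation acts within guardrails


def promote(level: Autonomy, ceiling: Autonomy = Autonomy.AUTONOMOUS) -> Autonomy:
    """Move one step up the ladder, never past the flow's ceiling (assumed parameter)."""
    return Autonomy(min(level + 1, ceiling))


def demote(level: Autonomy, floor: Autonomy = Autonomy.ASSISTED) -> Autonomy:
    """Move one step down the ladder, never below the floor."""
    return Autonomy(max(level - 1, floor))
```

For the support response flow, calling promote(Autonomy.HITL, ceiling=Autonomy.HOTL) would model the move to sampled monitoring, and a second promotion would leave the level unchanged.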

Task contract

Support response agent


Worker
The LLM that drafts the response.
Boundary
Input is the ticket, customer context, and knowledge base. Output is a response draft.
Evidence log
Ticket, draft, judge verdict, human edit or approval, and customer reaction such as resolved, reopened, or escalated.
Evaluator
The evaluator matures from an LLM judge to a distilled classifier.
Promotion rule
Judge agreement with human approvals stays above a set threshold and the reopen rate stays low, which allows sampled monitoring instead of blocking every send.
Demotion rule
Reopen or escalation rate rises, or the distilled classifier drifts from fresh human gold, which demotes both the task and the evaluator.
Fallback
Human approval stays in place for anything outside the routine ticket classes.
Lives
HITL -> HOTL with sampling
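
The promotion and demotion rules in this contract can be sketched as a single decision over a window of evidence-log counts. This is an illustration under stated assumptions, not the flow's actual implementation: the WindowStats fields, the decide function, and every threshold value are hypothetical, and "rate rises" is modeled here as a rate exceeding a fixed ceiling.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical thresholds; the source names no specific values.
MIN_JUDGED = 200                      # minimum evidence before promotion is considered
MIN_AGREEMENT = 0.95                  # judge vs. human approval agreement
MAX_REOPEN_RATE = 0.05
MAX_ESCALATION_RATE = 0.02
MIN_CLASSIFIER_GOLD_AGREEMENT = 0.90  # distilled classifier vs. fresh human gold


@dataclass
class WindowStats:
    judged: int      # drafts with both a judge verdict and a human decision
    agreed: int      # of those, drafts where the judge matched the human
    sent: int        # responses sent in the window
    reopened: int    # sent responses whose ticket was reopened
    escalated: int   # sent responses whose ticket was escalated
    classifier_gold_agreement: Optional[float] = None  # None if no classifier is live


def decide(stats: WindowStats) -> str:
    """Return 'promote', 'demote', or 'hold' for the current window."""
    reopen_rate = stats.reopened / stats.sent if stats.sent else 0.0
    escalation_rate = stats.escalated / stats.sent if stats.sent else 0.0
    drifted = (
        stats.classifier_gold_agreement is not None
        and stats.classifier_gold_agreement < MIN_CLASSIFIER_GOLD_AGREEMENT
    )
    # Demotion is checked first so that bad outcomes always win over good agreement.
    if reopen_rate > MAX_REOPEN_RATE or escalation_rate > MAX_ESCALATION_RATE or drifted:
        return "demote"
    if stats.judged < MIN_JUDGED:
        return "hold"  # not enough evidence to promote
    if stats.agreed / stats.judged >= MIN_AGREEMENT:
        return "promote"
    return "hold"
```

Checking demotion before promotion is a design choice in this sketch: a window with high judge agreement but a high reopen rate is treated as a failure of the gate, not as a success of the judge.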

Evaluator detail

What the gate actually checks

Target
Output
Technique
LLM judge first, then a classifier distilled from the judge's labels once enough examples accumulate.
Oracle
Human gold, meaning support-agent edits and approvals, validates the judge before the judge's labels are used as training data for distillation.
Position
HOTL
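
The ordering described above, where the judge is validated against human gold before its labels feed distillation, can be sketched as a simple guard. This is a minimal illustration with assumed names: agreement, distillation_labels, and the min_agreement default are hypothetical, verdicts are simplified to a boolean approve/reject per ticket ID, and raw agreement stands in for whatever metric the real pipeline uses.

```python
from typing import Dict, Optional


def agreement(judge: Dict[str, bool], gold: Dict[str, bool]) -> float:
    """Fraction of tickets, among those with human gold, where the judge matches."""
    shared = judge.keys() & gold.keys()
    if not shared:
        raise ValueError("no tickets have both a judge verdict and human gold")
    return sum(judge[t] == gold[t] for t in shared) / len(shared)


def distillation_labels(
    judge: Dict[str, bool],
    gold: Dict[str, bool],
    min_agreement: float = 0.9,  # hypothetical threshold
) -> Optional[Dict[str, bool]]:
    """Release judge labels for classifier training only if the judge is validated.

    Returns None when the judge disagrees with human gold too often, so an
    unvalidated judge can never silently become the classifier's teacher.
    """
    if agreement(judge, gold) < min_agreement:
        return None
    return dict(judge)
```

Returning None instead of a partial label set keeps the failure loud: the caller has to handle the unvalidated case before any classifier training can start.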

Teaching point

What this flow proves

Eval-the-eval is the point here, because the evaluator is validated, promoted, demoted, and eventually made cheap enough to own the gate.

Six questions

How this flow governs autonomy

Without PAA
Responses go out without a systematic quality gate; the model's quality is invisible until a customer complains; there is no audit trail distinguishing what the model produced from what a human approved.
What gets gated
Every response draft initially. Sampled monitoring then replaces blocking review once judge agreement and reopen rate have proven stable.
What is logged
Ticket, draft, judge verdict, human edit or approval, and customer reaction including resolved, reopened, or escalated to a human agent.
Earns promotion
Judge agreement with human approvals stays above a set threshold and the reopen rate stays low over a window, allowing the flow to shift from blocking every send to sampled monitoring.
Triggers demotion
Reopen or escalation rate rises, or the distilled classifier drifts from fresh human gold, which demotes both the task and the evaluator.
Never full-auto
Anything outside the routine ticket classes. Novel, sensitive, or escalating tickets stay human-gated as a permanent design constraint, not because of a classifier gap.
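
Taken together, the answers above describe a routing decision for each draft. The sketch below shows one way that decision could look. Everything named here is an assumption for illustration: the Route enum, the route function, the example ticket classes in ROUTINE_CLASSES, the 10% sample rate, and the choice to send judge-rejected drafts to a human.

```python
import random
from enum import Enum
from typing import Callable


class Route(Enum):
    HUMAN_APPROVAL = "human_approval"    # blocking review before send
    SEND_AND_SAMPLE = "send_and_sample"  # send now, queue for human review
    SEND = "send"                        # send without review


# Hypothetical routine ticket classes; the source does not enumerate them.
ROUTINE_CLASSES = {"password_reset", "shipping_status"}


def route(
    ticket_class: str,
    mode: str,                 # "HITL" or "HOTL"
    judge_pass: bool,
    sample_rate: float = 0.1,  # assumed share of sends that humans review
    rng: Callable[[], float] = random.random,
) -> Route:
    """Decide how a response draft leaves the gate."""
    # Permanent constraint: non-routine tickets are never sent unreviewed.
    if ticket_class not in ROUTINE_CLASSES:
        return Route.HUMAN_APPROVAL
    # Under HITL every draft blocks; a failed judge verdict also blocks (assumption).
    if mode == "HITL" or not judge_pass:
        return Route.HUMAN_APPROVAL
    # Under HOTL, a random sample of sends is reviewed after the fact.
    return Route.SEND_AND_SAMPLE if rng() < sample_rate else Route.SEND
```

The rng parameter exists only so the sampling branch can be exercised deterministically, for example by passing lambda: 0.0 to force a sampled send.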

This page is linked from the canonical card set on the flows index.