Canonical flow

Support response agent

Evaluator maturity pipeline for support replies.

Abstract pattern

Evaluator maturity pipeline

This evaluator-maturity example shows how the gate itself can evolve from an expensive judge into a distilled classifier while the worker stays the same.

Assisted: Human-led with automation support.
HITL: Human approves each action.
HOTL: Human samples or monitors.
Autonomous: Automation acts within guardrails.

Worker: The LLM that drafts the response.
Boundary: Input is the ticket, customer context, and knowledge base. Output is a response draft.
Evidence log: Ticket, draft, judge verdict, human edit or approval, and customer reaction such as resolved, reopened, or escalated.
Evaluator: The evaluator matures from an LLM judge to a distilled classifier.
Promotion rule: Judge agreement with human approvals stays above threshold and reopen rate stays low, which allows sampled monitoring instead of blocking every send.
Demotion rule: Reopen or escalation rate rises, or the distilled classifier drifts from fresh human gold, which demotes both the task and the evaluator.
Fallback: Human approval stays in place for anything outside the routine ticket classes.
Lives: HITL -> HOTL with sampling
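
The promotion and demotion rules above can be sketched as a small state transition. This is a minimal illustration under stated assumptions, not the flow's actual implementation: the source gives no numbers, so the three threshold constants are placeholders, "rate rises" is modeled as a rate exceeding a fixed ceiling, and the names `GateState`, `WindowStats`, and `next_state` are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical values; the source says only "above threshold" and "stays low".
AGREEMENT_THRESHOLD = 0.95
MAX_REOPEN_RATE = 0.05
MAX_ESCALATION_RATE = 0.02


@dataclass(frozen=True)
class GateState:
    mode: str       # "HITL" (blocking approval) or "HOTL" (sampled monitoring)
    evaluator: str  # "llm_judge" or "distilled_classifier"


@dataclass(frozen=True)
class WindowStats:
    """Aggregates over one evaluation window (field names are illustrative)."""
    judge_human_agreement: float  # share of drafts where the judge matched the human
    reopen_rate: float            # share of sent replies whose ticket was reopened
    escalation_rate: float        # share of sent replies escalated to a human agent
    classifier_drifted: bool      # classifier disagrees with fresh human gold


def next_state(current: GateState, stats: WindowStats) -> GateState:
    """Apply the demotion rule first, then the promotion rule."""
    demote = (
        stats.reopen_rate > MAX_REOPEN_RATE
        or stats.escalation_rate > MAX_ESCALATION_RATE
        or stats.classifier_drifted
    )
    if demote:
        # Demotion hits both the task (back to blocking approval)
        # and the evaluator (back to the LLM judge).
        return GateState(mode="HITL", evaluator="llm_judge")
    if current.mode == "HITL" and stats.judge_human_agreement >= AGREEMENT_THRESHOLD:
        # Promotion changes only how the task is gated; moving the evaluator
        # to a distilled classifier is a separate step.
        return GateState(mode="HOTL", evaluator=current.evaluator)
    return current
```

Checking demotion before promotion is a design choice in this sketch: a window with high judge agreement but a rising reopen rate should not promote.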

Evaluator detail

What the gate actually checks

Target: Output
Technique: LLM judge first, then a classifier distilled from the judge's labels once enough examples accumulate.
Oracle: Human-gold support-agent edits and approvals validate the judge before the judge is used to label data for distillation.
Position: HOTL
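
The ordering in the Oracle line (validate the judge against human gold first, and only then let it label data for distillation) can be sketched as a simple precondition check. This is a hedged illustration: the function names, the boolean pass/fail verdict format, and the default values for `min_agreement` and `min_examples` are assumptions, not values from the source.

```python
from typing import Optional


def agreement_rate(judge: list[bool], human: list[bool]) -> float:
    """Fraction of gold examples where the judge verdict matches the human verdict."""
    if not judge or len(judge) != len(human):
        raise ValueError("need two non-empty verdict lists of equal length")
    matches = sum(j == h for j, h in zip(judge, human))
    return matches / len(judge)


def labels_for_distillation(
    judge_on_gold: list[bool],
    human_gold: list[bool],
    judge_on_unlabeled: list[bool],
    min_agreement: float = 0.9,  # hypothetical threshold
    min_examples: int = 1000,    # hypothetical "enough examples" cutoff
) -> Optional[list[bool]]:
    """Return judge labels for classifier training, or None if the judge is not yet trusted."""
    if agreement_rate(judge_on_gold, human_gold) < min_agreement:
        return None  # judge not validated against human gold
    if len(judge_on_unlabeled) < min_examples:
        return None  # not enough accumulated examples to distill from
    return list(judge_on_unlabeled)
```

Returning `None` rather than a partial label set keeps an unvalidated judge from quietly seeding the classifier's training data.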

Teaching point

What this flow proves

Eval-the-eval is the point here: the evaluator itself is validated, promoted, demoted, and eventually made cheap enough to own the gate.

Six questions

How this flow governs autonomy

Without PAA: Responses go out without a systematic quality gate; the model's quality is invisible until a customer complains; there is no audit trail distinguishing what the model produced from what a human approved.
What gets gated: Every response draft initially; sampled monitoring then replaces blocking review once judge agreement and reopen rate are proven stable.
What is logged: Ticket, draft, judge verdict, human edit or approval, and customer reaction including resolved, reopened, or escalated to a human agent.
Earns promotion: Judge agreement with human approvals stays above threshold and reopen rate stays low over a window, allowing the flow to shift from blocking every send to sampled monitoring.
Triggers demotion: Reopen or escalation rate rises, or the distilled classifier drifts from fresh human gold, which demotes both the task and the evaluator.
Never full-auto: Anything outside the routine ticket classes. Novel, sensitive, or escalating tickets stay human-gated as a permanent design constraint, not a classifier gap.
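
The six answers above can be combined into a per-draft routing decision plus a log record. The sketch below is illustrative only: the routine class names, the `sample_rate` default, the route labels, and the choice to block judge-rejected drafts even under HOTL are all assumptions that the source does not specify.

```python
import random
from dataclasses import dataclass
from typing import Optional

# Hypothetical routine ticket classes; the source does not enumerate them.
ROUTINE_CLASSES = {"password_reset", "shipping_status"}


@dataclass
class EvidenceRecord:
    """One row of the evidence log, using the fields listed under 'What is logged'."""
    ticket_id: str
    draft: str
    judge_verdict: bool
    human_action: Optional[str] = None       # "approved" or "edited", if reviewed
    customer_reaction: Optional[str] = None  # "resolved", "reopened", or "escalated"


def review_route(
    ticket_class: str,
    mode: str,
    judge_verdict: bool,
    sample_rate: float = 0.1,
    rng: Optional[random.Random] = None,
) -> str:
    """Return "block" (human approves before send), "sample" (sent, then
    queued for human review), or "send" (sent without review)."""
    rng = rng or random.Random()
    if ticket_class not in ROUTINE_CLASSES:
        return "block"  # never full-auto outside the routine classes
    if mode == "HITL":
        return "block"  # every draft is gated before promotion
    if not judge_verdict:
        return "block"  # assumption: a failed verdict still blocks under HOTL
    return "sample" if rng.random() < sample_rate else "send"
```

For example, `review_route("legal_threat", "HOTL", True)` returns `"block"` regardless of the sampling rate, which mirrors the permanent human gate on non-routine tickets.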

This page is linked from the canonical card set on the flows index.