Canonical flow
Support response agent
Evaluator maturity pipeline for support replies.
Abstract pattern
Evaluator maturity pipeline
This evaluator-maturity example shows how the gate itself can evolve from an expensive judge into a distilled classifier while the worker stays the same.
- Assisted: Human-led with automation support.
- HITL (human in the loop): Human approves each action.
- HOTL (human on the loop): Human samples or monitors.
- Autonomous: Automation acts within guardrails.
Task contract
Support response agent
- Worker: The LLM that drafts the response.
- Boundary: Input is the ticket, customer context, and knowledge base. Output is a response draft.
- Evidence log: Ticket, draft, judge verdict, human edit or approval, and customer reaction such as resolved, reopened, or escalated.
- Evaluator: The evaluator matures from an LLM judge to a distilled classifier.
- Promotion rule: Judge agreement with human approvals stays above threshold and reopen rate stays low, which allows sampled monitoring instead of blocking every send.
- Demotion rule: Reopen or escalation rate rises, or the distilled classifier drifts from fresh human gold, which demotes both the task and the evaluator.
- Fallback: Human approval stays in place for anything outside the routine ticket classes.
- Lives: HITL -> HOTL with sampling
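The promotion and demotion rules in this contract can be sketched as one decision over a window of evidence-log records. This is a minimal illustration under stated assumptions, not the flow's actual implementation: the `EvidenceRecord` field names, the `decide_level` function, and both threshold values are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    HITL = "hitl"  # human approves each send
    HOTL = "hotl"  # human samples or monitors


@dataclass
class EvidenceRecord:
    """One evidence-log entry (hypothetical field names)."""
    judge_approved: bool  # evaluator verdict on the draft
    human_approved: bool  # human verdict on the same draft
    outcome: str          # "resolved", "reopened", or "escalated"


def decide_level(window, min_agreement=0.95, max_bad_rate=0.05):
    """Pick the autonomy level for the next window.

    Both thresholds are illustrative assumptions, not values from the flow.
    """
    if not window:
        return Level.HITL  # no evidence yet: fail closed to blocking review
    n = len(window)
    agreement = sum(r.judge_approved == r.human_approved for r in window) / n
    bad_rate = sum(r.outcome in ("reopened", "escalated") for r in window) / n
    if bad_rate > max_bad_rate or agreement < min_agreement:
        return Level.HITL  # demotion: bad outcomes rose or evaluator drifted
    return Level.HOTL      # promotion: sampled monitoring is allowed
```

Returning `Level.HITL` on an empty window is a deliberate choice in this sketch: with no evidence, the flow stays at blocking review rather than assuming the evaluator can be trusted.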
Evaluator detail
What the gate actually checks
- Target: Output
- Technique: LLM judge first, then a classifier distilled from the judge's labels once enough examples accumulate.
- Oracle: Human-gold support-agent edits and approvals validate the judge before the judge is used to label data for distillation.
- Position: HOTL
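The oracle step above, validating the judge against human gold before its labels are used for distillation, can be sketched as a chance-corrected agreement check. The source does not say which agreement metric is used; Cohen's kappa is swapped in here as one common choice. The function names and the `min_kappa` value of 0.8 are assumptions.

```python
def cohen_kappa(judge, human):
    """Chance-corrected agreement between two equal-length label lists."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    labels = set(judge) | set(human)
    # Agreement expected if both raters labeled independently at their own rates.
    expected = sum((judge.count(l) / n) * (human.count(l) / n) for l in labels)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def judge_may_label_for_distillation(judge, human_gold, min_kappa=0.8):
    """Allow distillation only if the judge tracks human gold closely.

    min_kappa is an illustrative threshold, not a value from the flow.
    """
    return cohen_kappa(judge, human_gold) >= min_kappa
```

The same check can be rerun on fresh human gold against the distilled classifier's labels, which is one way to detect the drift that the demotion rule refers to.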
Teaching point
What this flow proves
Evaluating the evaluator is the point here: the evaluator itself is validated, promoted, demoted, and eventually made cheap enough to own the gate.
Six questions
How this flow governs autonomy
- Without PAA: Responses go out without a systematic quality gate; the model's quality is invisible until a customer complains; and there is no audit trail distinguishing what the model produced from what a human approved.
- What gets gated: Every response draft initially. Sampled monitoring replaces blocking review once judge agreement and reopen rate are proven stable.
- What is logged: Ticket, draft, judge verdict, human edit or approval, and customer reaction, including resolved, reopened, or escalated to a human agent.
- Earns promotion: Judge agreement with human approvals stays above threshold and reopen rate stays low over a window, allowing the flow to shift from blocking every send to sampled monitoring.
- Triggers demotion: Reopen or escalation rate rises, or the distilled classifier drifts from fresh human gold, which demotes both the task and the evaluator.
- Never full-auto: Anything outside the routine ticket classes. Novel, sensitive, or escalating tickets stay human-gated as a permanent design constraint, not a classifier gap.
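The gating answers above can be sketched as a per-draft routing function. This is an illustration under stated assumptions rather than the flow's real gate: the ticket class names, the `route_draft` function, the returned action strings, and the 10% sample rate are all hypothetical.

```python
import random

# Hypothetical routine ticket classes; the real list is not given in the flow.
ROUTINE_CLASSES = {"password_reset", "shipping_status"}


def route_draft(ticket_class, level, evaluator_pass, sample_rate=0.1, rng=random):
    """Return "human_review", "send_and_sample", or "send" for one draft."""
    if ticket_class not in ROUTINE_CLASSES:
        # Permanent design constraint: non-routine tickets are never full-auto.
        return "human_review"
    if level == "hitl" or not evaluator_pass:
        # Blocking review: every draft at HITL, and any draft the gate rejects.
        return "human_review"
    # HOTL: the draft is sent, and a random share is pulled for monitoring.
    if rng.random() < sample_rate:
        return "send_and_sample"
    return "send"
```

The routine-class check comes first so that no level or evaluator verdict can override the human gate for novel, sensitive, or escalating tickets.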
This page is linked from the canonical card set on the flows index.