Canonical flow

PR review agent

Stacked semantic review for merge safety.

Abstract pattern

Stacked semantic review

The engineer-legible flagship flow: it shows how deterministic checks can sit underneath semantic judgment and keep review on the right side of the merge gate.

Assisted: Human-led with automation support.
HITL: Human approves each action.
HOTL: Human samples or monitors.
Autonomous: Automation acts within guardrails.

Worker: LLM + tools
Boundary: Input is the PR diff plus context like the linked issue and test results. Output is a structured review verdict of approve, request-changes, or escalate with rationale.
Evidence log: Diff, test results, lint results, judge verdict, human decision when gated, and whether a later merge needed a revert or caused an incident.
Evaluator: A stacked gate that combines deterministic checks with an LLM judge.
Promotion rule: Judge agreement with senior reviewers stays above threshold over a window and deterministic escapes stay near zero, allowing the flow to shift from blocking review to sampled monitoring.
Demotion rule: A merged auto-approved PR causes a revert or incident, or judge-vs-human agreement drifts down, which sends the flow back to blocking review.
Fallback: Human review stays on the merge path for substantive changes.
Lives: HITL -> HOTL
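
The boundary and evidence log above can be read as a small set of record types. The following is a minimal sketch, not the flow's actual schema: every class and field name (ReviewInput, ReviewOutput, EvidenceRecord, judge_agrees_with_human, and so on) is hypothetical, chosen only to mirror the items listed above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Verdict(Enum):
    """The three verdicts named at the boundary."""
    APPROVE = "approve"
    REQUEST_CHANGES = "request-changes"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class ReviewInput:
    """What crosses the boundary into the worker (field names are assumptions)."""
    diff: str
    linked_issue: Optional[str]
    tests_passed: bool
    lint_passed: bool


@dataclass(frozen=True)
class ReviewOutput:
    """What crosses the boundary out of the worker: a verdict plus rationale."""
    verdict: Verdict
    rationale: str


@dataclass
class EvidenceRecord:
    """One evidence-log entry per reviewed PR."""
    review_input: ReviewInput
    judge_output: ReviewOutput
    human_decision: Optional[Verdict] = None      # set only when the PR was gated
    reverted_or_incident: Optional[bool] = None   # unknown until after the merge

    def judge_agrees_with_human(self) -> Optional[bool]:
        """Return None when no human decision exists to compare against."""
        if self.human_decision is None:
            return None
        return self.human_decision == self.judge_output.verdict
```

Keeping the human decision and the post-merge outcome as optional fields reflects that both arrive later than the verdict, if at all. Those two fields are what the promotion and demotion rules would read.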

Evaluator detail

What the gate actually checks

Target: Output
Technique: Deterministic tests and lint run first, then an LLM judge handles the semantic call the deterministic layer cannot make.
Oracle: Tests are the reference for the deterministic layer; senior reviewer decisions are the human-gold labels for judge validation.
Position: HITL
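
The ordering described under Technique can be sketched as a function that short-circuits on the deterministic layer and calls the judge only when that layer passes. This is an illustrative sketch under stated assumptions: stacked_gate, GateResult, and the judge callable are hypothetical names. Mapping a test or lint failure to request-changes, and an unrecognised judge verdict to escalate, are also assumptions and are not specified by the flow.

```python
from typing import Callable, NamedTuple, Tuple

VALID_VERDICTS = {"approve", "request-changes", "escalate"}


class GateResult(NamedTuple):
    verdict: str
    rationale: str
    judge_called: bool  # False when the deterministic layer short-circuited


def stacked_gate(
    diff: str,
    tests_passed: bool,
    lint_passed: bool,
    judge: Callable[[str], Tuple[str, str]],
) -> GateResult:
    """Run cheap deterministic checks first, then the semantic judge."""
    # Layer 1: deterministic checks. A failure here never reaches the judge.
    if not tests_passed:
        return GateResult("request-changes", "deterministic layer: tests failed", False)
    if not lint_passed:
        return GateResult("request-changes", "deterministic layer: lint failed", False)

    # Layer 2: the semantic call the deterministic layer cannot make.
    # `judge` stands in for an LLM call returning (verdict, rationale).
    verdict, rationale = judge(diff)
    if verdict not in VALID_VERDICTS:
        # Assumption: a malformed verdict is handed to a human, not guessed at.
        return GateResult("escalate", f"judge returned unrecognised verdict {verdict!r}", True)
    return GateResult(verdict, rationale, True)
```

Passing the judge in as a callable keeps the sketch testable with a stub. It also makes the cost ordering explicit: the expensive call happens at most once, and only after the cheap checks pass.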

Teaching point

What this flow proves

Cheap deterministic checks can stack under an expensive semantic judge, and the artifact boundary itself is the thing being reviewed.

Six questions

How this flow governs autonomy

Without PAA: Merges rely on whoever happens to review the PR; semantic quality is informal and inconsistent; the merge gate has no systematic bar and no audit trail when something ships that should not have.
What gets gated: The merge decision. At HITL the agent's verdict is held at the gate until a human reviewer clears it; at HOTL only flagged or high-risk PRs re-enter blocking review.
What is logged: Diff, test results, lint results, judge verdict, human decision when gated, and whether the merged PR later triggered a revert or incident.
Earns promotion: Judge agreement with senior reviewers stays above threshold over a window and deterministic escapes stay near zero.
Triggers demotion: A merged auto-approved PR causes a revert or incident, or judge-vs-human agreement drifts below threshold.
Never full-auto: Substantive semantic judgment. An LLM judge can own the first-pass review, but not the final merge gate without ongoing human validation.
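
The promotion and demotion rules above amount to a two-state controller over a rolling window of evidence. The sketch below is one possible reading, not the flow's implementation. AutonomyController and its methods are hypothetical names, and the window size, agreement threshold, and escape budget are placeholder values, since the source says only "above threshold", "over a window", and "near zero". A "deterministic escape" is assumed here to mean a defect that the tests or lint should have caught but did not.

```python
from collections import deque


class AutonomyController:
    """Tracks whether the flow runs at HITL (blocking) or HOTL (sampled)."""

    def __init__(self, window: int = 50, agreement_threshold: float = 0.9,
                 max_escapes: int = 0):
        self.level = "HITL"  # the flow starts in blocking review
        self.agreement_threshold = agreement_threshold
        self.max_escapes = max_escapes
        self._agreed = deque(maxlen=window)   # judge matched the senior reviewer?
        self._escaped = deque(maxlen=window)  # deterministic escape observed?

    def record_review(self, judge_agreed: bool,
                      deterministic_escape: bool = False) -> str:
        """Log one human-checked review and re-evaluate the autonomy level."""
        self._agreed.append(judge_agreed)
        self._escaped.append(deterministic_escape)
        if len(self._agreed) < self._agreed.maxlen:
            return self.level  # not enough evidence to judge either way yet
        rate = sum(self._agreed) / len(self._agreed)
        if self.level == "HOTL" and rate < self.agreement_threshold:
            self._demote()  # agreement drifted below threshold
        elif (self.level == "HITL" and rate >= self.agreement_threshold
              and sum(self._escaped) <= self.max_escapes):
            self.level = "HOTL"  # promotion earned over a full window
        return self.level

    def record_post_merge_failure(self) -> str:
        """A merged PR caused a revert or incident: demote immediately."""
        self._demote()
        return self.level

    def _demote(self) -> None:
        self.level = "HITL"
        # Clear the window so promotion has to be re-earned from scratch.
        self._agreed.clear()
        self._escaped.clear()
```

Clearing the window on demotion is a design choice in this sketch: it forces a full window of fresh agreement before sampled monitoring resumes, so one good review after an incident cannot restore HOTL.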

This page is linked from the canonical card set on the flows index.