Canonical flow

PR review agent

Stacked semantic review for merge safety.

Abstract pattern

Stacked semantic review

This is the engineer-legible flagship flow: it shows how deterministic checks can sit underneath semantic judgment and keep review on the right side of the merge gate.

  1. Assisted: human-led with automation support.
  2. HITL (human in the loop): a human approves each action.
  3. HOTL (human on the loop): a human samples or monitors.
  4. Autonomous: automation acts within guardrails.

Task contract

PR review agent

Worker
LLM + tools
Boundary
Input is the PR diff plus context like the linked issue and test results. Output is a structured review verdict of approve, request-changes, or escalate with rationale.
Evidence log
Diff, test results, lint results, judge verdict, human decision when gated, and whether a later merge needed a revert or caused an incident.
Evaluator
A stacked gate that combines deterministic checks with an LLM judge.
Promotion rule
Judge agreement with senior reviewers stays above threshold over a window and deterministic escapes stay near zero, allowing the flow to shift from blocking review to sampled monitoring.
Demotion rule
A merged auto-approved PR causes a revert or incident, or judge-vs-human agreement drifts down, which sends the flow back to blocking review.
Fallback
Human review stays on the merge path for substantive changes.
Lives
HITL -> HOTL
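
The promotion and demotion rules above can be sketched as a small state machine. This is a minimal illustration, not the flow's actual implementation: the threshold value, window size, escape budget, and every name (`FlowState`, `record_review`, `record_revert_or_incident`) are assumptions, since the contract only says "above threshold over a window" and "near zero". Clearing the window on demotion is also an assumed design choice, made so that promotion has to be re-earned.

```python
from collections import deque
from dataclasses import dataclass, field

# Hypothetical values: the contract does not state concrete numbers.
AGREEMENT_THRESHOLD = 0.9   # minimum judge-vs-senior-reviewer agreement rate
WINDOW_SIZE = 50            # number of most recent reviews considered
MAX_ESCAPES = 0             # deterministic escapes tolerated inside the window


@dataclass
class FlowState:
    """Tracks where the flow lives: 'HITL' (blocking) or 'HOTL' (sampled)."""
    position: str = "HITL"
    agreements: deque = field(default_factory=lambda: deque(maxlen=WINDOW_SIZE))
    escapes: deque = field(default_factory=lambda: deque(maxlen=WINDOW_SIZE))

    def agreement_rate(self) -> float:
        """Share of windowed reviews where judge and human agreed."""
        if not self.agreements:
            return 0.0
        return sum(self.agreements) / len(self.agreements)

    def record_review(self, judge_verdict: str, human_verdict: str,
                      deterministic_escape: bool = False) -> None:
        """Log one human-validated review and re-check both rules."""
        self.agreements.append(judge_verdict == human_verdict)
        self.escapes.append(deterministic_escape)
        self._reevaluate()

    def record_revert_or_incident(self) -> None:
        """Demotion trigger: an auto-approved merge was reverted or caused an incident."""
        self.position = "HITL"
        # Assumed choice: reset the window so promotion must be earned again.
        self.agreements.clear()
        self.escapes.clear()

    def _reevaluate(self) -> None:
        rate = self.agreement_rate()
        if self.position == "HITL":
            window_full = len(self.agreements) == WINDOW_SIZE
            if (window_full and rate >= AGREEMENT_THRESHOLD
                    and sum(self.escapes) <= MAX_ESCAPES):
                self.position = "HOTL"   # promotion: blocking -> sampled monitoring
        elif rate < AGREEMENT_THRESHOLD:
            self.position = "HITL"       # demotion: agreement drifted down
```

Under these assumed numbers, a flow needs a full window of agreeing reviews with no deterministic escapes before it moves to HOTL, and a single revert or incident sends it straight back to HITL.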

Evaluator detail

What the gate actually checks

Target
Output
Technique
Deterministic tests and lint run first, then an LLM judge handles the semantic call the deterministic layer cannot make.
Oracle
Tests serve as the reference for the deterministic layer; senior reviewer decisions serve as the human gold standard for validating the judge.
Position
HITL
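
The stacking order described under Technique can be illustrated with a short sketch: the cheap deterministic layer runs first and short-circuits, and the judge is only consulted when tests and lint pass. All names here (`PullRequest`, `ReviewVerdict`, `stacked_gate`) are hypothetical, and the `judge` parameter is a plain callable standing in for whatever LLM call the real flow makes. Treating an unrecognized judge output as an escalation is an assumption as well.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# The three verdicts named in the task contract's Boundary.
VERDICTS = ("approve", "request-changes", "escalate")


@dataclass
class PullRequest:
    """Hypothetical input shape: the diff plus deterministic check results."""
    diff: str
    tests_passed: bool
    lint_passed: bool
    linked_issue: str = ""


@dataclass
class ReviewVerdict:
    """Structured output: verdict, rationale, and which layer decided."""
    verdict: str
    rationale: str
    decided_by: str  # "deterministic" or "judge"


# A judge takes the PR and returns (verdict, rationale). In a real flow this
# would wrap an LLM call; here it is just a function supplied by the caller.
Judge = Callable[[PullRequest], Tuple[str, str]]


def stacked_gate(pr: PullRequest, judge: Judge) -> ReviewVerdict:
    # Layer 1: deterministic checks. Failing here never reaches the judge.
    if not pr.tests_passed:
        return ReviewVerdict("request-changes", "tests failed", "deterministic")
    if not pr.lint_passed:
        return ReviewVerdict("request-changes", "lint failed", "deterministic")

    # Layer 2: the semantic call the deterministic layer cannot make.
    verdict, rationale = judge(pr)
    if verdict not in VERDICTS:
        # Assumed behavior: malformed judge output goes to a human.
        return ReviewVerdict(
            "escalate", f"judge returned unknown verdict {verdict!r}", "judge"
        )
    return ReviewVerdict(verdict, rationale, "judge")
```

Recording `decided_by` alongside the verdict is one way to keep the evidence log able to distinguish deterministic rejections from judge decisions when agreement with senior reviewers is measured later.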

Teaching point

What this flow proves

Cheap deterministic checks can stack under an expensive semantic judge, and the artifact boundary itself is the thing being reviewed.

Six questions

How this flow governs autonomy

Without PAA
Merges rely on whoever happens to review the PR; semantic quality is informal and inconsistent; the merge gate has no systematic bar and no audit trail when something ships that should not have.
What gets gated
The merge decision: at HITL the agent's verdict is held at the gate until a human reviewer clears it; at HOTL only flagged or high-risk PRs re-enter blocking review.
What is logged
Diff, test results, lint results, judge verdict, human decision when gated, and whether the merged PR later triggered a revert or incident.
Earns promotion
Judge agreement with senior reviewers stays above threshold over a window and deterministic escapes stay near zero.
Triggers demotion
A merged auto-approved PR causes a revert or incident, or judge-vs-human agreement drifts below threshold.
Never full-auto
Substantive semantic judgment — an LLM judge can own the first-pass review but not the final merge gate without ongoing human validation.
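
As a rough illustration of the "What gets gated" and "What is logged" answers, the sketch below pairs a routing check with an evidence record. The field names, the `flagged` and `high_risk` inputs, and the JSON-lines serialization are all assumptions; the page lists what is logged but does not specify a schema or how risk is determined.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class EvidenceRecord:
    """One log entry per reviewed PR, mirroring the logged items listed above."""
    diff: str
    test_results: str
    lint_results: str
    judge_verdict: str
    human_decision: Optional[str] = None          # present only when gated
    reverted_or_incident: Optional[bool] = None   # filled in after the merge


def needs_blocking_review(position: str, flagged: bool, high_risk: bool) -> bool:
    """Decide whether a human must clear the verdict before merge."""
    if position == "HITL":
        return True                # every verdict is held at the gate
    return flagged or high_risk    # HOTL: only flagged or high-risk PRs block


def to_log_line(record: EvidenceRecord) -> str:
    """Serialize a record as one JSON line (assumed storage format)."""
    return json.dumps(asdict(record), sort_keys=True)
```

Leaving `human_decision` and `reverted_or_incident` empty until they are known keeps one record per PR that can be completed later, which is what the demotion rule needs in order to tie a revert back to the verdict that allowed the merge.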

This page is linked from the canonical card set on the flows index.