Multi-Pass Code Review: Plan, Build, Review, Fix
Stage One: Planning the Review Scope
The planning stage determines what to review and how deeply to review it. When a developer submits a pull request, the pipeline first enumerates all changed files, categorizes them by type and risk level, and estimates the total review scope. This estimation guides resource allocation for subsequent stages, ensuring that high-risk changes receive proportionally more scrutiny.
File categorization assigns risk levels based on the file type and the nature of the changes. Authentication modules, payment processing code, database migration files, and infrastructure configuration files are flagged as high-risk. UI component changes, documentation updates, and test file modifications receive lower risk scores. The categorization can be customized per project, with teams adding their own risk rules based on which parts of their codebase are most sensitive.
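One way to express such rules is as an ordered list of path patterns. The sketch below is illustrative only: the patterns, the risk labels, and the `categorize` function are assumptions made for this example, not the interface of any particular tool.

```python
from fnmatch import fnmatchcase

# Hypothetical default rules as (glob pattern, risk level). The first match wins.
DEFAULT_RULES = [
    ("*auth*", "high"),
    ("*payment*", "high"),
    ("*migrations/*", "high"),
    ("*.tf", "high"),
    ("*test*", "low"),
    ("*.md", "low"),
    ("*components/*", "low"),
]


def categorize(path, project_rules=(), default="medium"):
    """Return the risk level of the first rule that matches the path.

    Project-specific rules are checked before the defaults, so a team can
    override or extend the built-in categorization.
    """
    for pattern, risk in list(project_rules) + DEFAULT_RULES:
        if fnmatchcase(path, pattern):
            return risk
    return default


print(categorize("src/auth/login.py"))
print(categorize("src/billing/invoice.py", [("*billing*", "high")]))
```

Because the first match wins, rule order matters: a test file for an authentication module would match the `*auth*` rule before the `*test*` rule in this sketch. A team that wants the opposite behavior could pass its own rule ahead of the defaults.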
Dependency mapping runs during the planning stage as well. The system identifies which unchanged files interact with the changed code through imports, function calls, type references, and shared state. This dependency map determines which additional files the review needs to examine beyond the direct change set. Without dependency mapping, the review would miss bugs that emerge from the interaction between changed and unchanged code.
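As a rough illustration, the import portion of such a map can be built with Python's standard `ast` module. This sketch assumes Python sources held in a dictionary keyed by module name, and it only follows import statements; function calls, type references, and shared state would need additional analysis. The `review_set` name and the sample modules are hypothetical.

```python
import ast


def imported_modules(source):
    """Collect the module names imported by a piece of Python source."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            names.add(node.module)
    return names


def review_set(sources, changed):
    """Return unchanged modules that import, or are imported by, a changed module."""
    known = set(sources)
    graph = {name: imported_modules(src) & known for name, src in sources.items()}
    related = set()
    for name, deps in graph.items():
        if name in changed:
            related |= deps          # modules the changed code relies on
        elif deps & changed:
            related.add(name)        # modules that rely on the changed code
    return related - changed


sources = {
    "billing": "import db\n",
    "db": "x = 1\n",
    "api": "from billing import charge\n",
    "ui": "import json\n",
}
print(sorted(review_set(sources, {"billing"})))
```

Here `ui` stays out of the review because it neither imports nor is imported by the changed module.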
Token budget estimation completes the planning stage. The system calculates how many tokens the review will consume based on file sizes, dependency depth, and the configured number of review passes. If the total exceeds the budget, the planner prioritizes high-risk files and reduces analysis depth for low-risk changes. This budgeting ensures the review completes within cost and time constraints while focusing effort where it matters most.
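The prioritization step might look something like the following sketch. The cost model (tokens multiplied by passes), the `shallow_factor` parameter, and the file tuples are all assumptions chosen to keep the example small; a real planner would also account for dependency depth.

```python
def plan_budget(files, budget, passes=2, shallow_factor=0.25):
    """Allocate a token budget across files.

    files: list of (path, tokens, risk) tuples, risk in {"high", "medium", "low"}.
    Returns a dict mapping each path to the tokens allotted to it.
    """
    full_cost = {path: tokens * passes for path, tokens, _ in files}
    if sum(full_cost.values()) <= budget:
        return full_cost  # everything fits at full depth

    order = {"high": 0, "medium": 1, "low": 2}
    plan, remaining = {}, budget
    # Serve high-risk files first; lower-risk files get reduced depth.
    for path, _, risk in sorted(files, key=lambda f: order[f[2]]):
        if risk == "high":
            wanted = full_cost[path]
        else:
            wanted = int(full_cost[path] * shallow_factor)
        plan[path] = min(wanted, remaining)
        remaining -= plan[path]
    return plan


files = [
    ("auth.py", 1000, "high"),
    ("README.md", 2000, "low"),
    ("view.py", 1000, "medium"),
]
print(plan_budget(files, budget=4000))
```

With a budget of 4000 tokens the full plan (8000 tokens) does not fit, so the high-risk file keeps its full allocation and the other two are reviewed at reduced depth.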
Stage Two: Building Context
The context-building stage gathers all the information the AI model needs to perform a thorough review. This goes beyond simply reading the changed files. The system loads function signatures from imported modules, reads type definitions referenced by the changed code, pulls in configuration values that affect behavior, and retrieves test cases that cover the modified functions.
Codebase-specific context includes coding standards, naming conventions, error handling patterns, and architectural guidelines. These are typically encoded in the system prompt or loaded from configuration files. The goal is to give the AI model the same context that a human reviewer on the team would have: knowledge of how things are done in this particular project.
Historical context from the version control system provides additional signal. The commit history shows which parts of the code change frequently (indicating instability or active development), which changes have previously introduced bugs (indicating areas that need extra scrutiny), and which developers authored the original code (enabling the model to understand the design intent). Some systems also load previous review findings for the same files, tracking whether past issues have been addressed.
The assembled context is organized into coherent chunks that fit within the model's context window. Related files are grouped together, with the most relevant context placed closest to the changed code in the input sequence. This organization maximizes the model's ability to draw connections between the changes and their surrounding context, improving the quality of findings.
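A minimal sketch of this packing step, assuming each context item already has a token count and a relevance score (both hypothetical inputs here), and assuming the changed code is placed last in the prompt:

```python
def assemble_context(changed_tokens, items, window):
    """Pick context items that fit the window and order them for the prompt.

    items: list of (name, tokens, relevance) tuples.
    Returns item names in prompt order, ending with a marker for the changed code.
    """
    remaining = window - changed_tokens
    chosen = []
    # Greedily take the most relevant items that still fit.
    for name, tokens, _ in sorted(items, key=lambda i: i[2], reverse=True):
        if tokens <= remaining:
            chosen.append(name)
            remaining -= tokens
    # Least relevant first, so the most relevant item sits next to the changed code.
    return chosen[::-1] + ["<changed code>"]


items = [
    ("types.py", 300, 0.9),
    ("config.yaml", 200, 0.4),
    ("big_history.log", 5000, 0.6),
    ("test_x.py", 400, 0.8),
]
print(assemble_context(changed_tokens=1000, items=items, window=2000))
```

The oversized item is skipped rather than truncated in this sketch; a real system might summarize it instead.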
Stage Three: Iterative Review Passes
The review stage runs the actual analysis, executing multiple passes over the code with increasing depth at each iteration. The first pass performs broad, fast analysis covering all changed files. Subsequent passes narrow their focus to areas where the first pass identified potential issues or where the code complexity warrants deeper investigation.
The first pass typically catches 55 to 65 percent of all detectable issues. These are the straightforward bugs, style violations, well-known insecure patterns, and obvious logic errors that any competent reviewer would notice. The value of the first pass is efficiency: it covers a lot of ground quickly, flagging the easy wins that do not require deep analysis.
The second pass adds substantial value by examining the issues flagged in the first pass with more context. Some first-pass flags turn out to be false positives when the surrounding code is considered more carefully. The second pass also traces data flows end-to-end, following values from input to output through all the functions they pass through. This tracing catches sanitization gaps, type conversion errors, and error handling inconsistencies that the broader first pass could not detect.
The third pass focuses on interaction effects and edge cases. It examines how the changed code interacts with concurrent operations, how it behaves under error conditions, and whether boundary values (empty arrays, null values, maximum integers, Unicode characters) are handled correctly. This pass catches the subtle bugs that cause intermittent production failures, the kind that are hardest to diagnose and most expensive to fix.
Convergence detection runs after each pass. The system compares findings from the current pass against all previous findings. If a pass produces no new findings, the review has converged and further passes would not add value. Most routine code changes converge after two passes. Complex algorithmic changes may require three or four passes. The convergence mechanism prevents wasting compute on reviews that have already been thorough enough.
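The convergence check can be sketched as a loop that stops as soon as a pass adds nothing new. The `run_pass` callback and the representation of a finding as a `(file, line, rule)` tuple are assumptions for illustration.

```python
def run_until_converged(run_pass, max_passes=4):
    """Run review passes until one produces no new findings.

    run_pass(pass_number, seen) must return an iterable of hashable findings.
    Returns (all_findings, passes_run).
    """
    seen = set()
    for pass_number in range(1, max_passes + 1):
        new = set(run_pass(pass_number, frozenset(seen))) - seen
        if not new:
            return seen, pass_number  # converged: this pass added nothing
        seen |= new
    return seen, max_passes


# Hypothetical canned results standing in for real model output.
canned = {
    1: [("a.py", 10, "sql-injection")],
    2: [("a.py", 10, "sql-injection"), ("b.py", 3, "null-deref")],
    3: [("b.py", 3, "null-deref")],
}
findings, passes_run = run_until_converged(lambda n, seen: canned.get(n, []))
print(passes_run, sorted(findings))
```

Note that detecting convergence costs one extra pass: the third pass in this example finds nothing new, which is what signals that the review can stop.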
Stage Four: Fix Verification
The fix verification stage runs after developers address findings from previous review rounds. Rather than re-reviewing the entire change set from scratch, this stage performs targeted analysis of the specific areas that were modified in response to review feedback. This targeted approach is both faster and more focused than a full re-review.
The verification checks three things: whether the fix correctly addresses the original finding, whether the fix introduces any new issues, and whether the fix is consistent with the patterns used elsewhere in the codebase. For example, a fix that resolves a null pointer dereference by adding a null check, but then throws a generic exception instead of handling the null case properly, would be flagged for further improvement.
Regression detection is a critical capability of the fix verification stage. Changes made to fix one issue sometimes break something else, especially in tightly coupled code. The verification stage traces the impact of each fix through the dependency graph, checking that functions that depend on the modified code still receive the inputs they expect and produce the outputs their callers rely on.
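Tracing impact through the dependency graph amounts to a traversal over reverse edges. The sketch below assumes the graph is available as a dictionary from each function to its direct callers; the function names are hypothetical. It only identifies which functions to re-check, not whether they actually broke.

```python
from collections import deque


def impacted_by(fix_targets, dependents):
    """Return every function that directly or transitively depends on a fixed one.

    dependents: dict mapping a function name to the functions that call it.
    """
    targets = set(fix_targets)
    seen = set(targets)
    queue = deque(targets)
    while queue:
        current = queue.popleft()
        for caller in dependents.get(current, ()):
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return seen - targets


dependents = {
    "parse_amount": ["charge"],
    "charge": ["checkout", "refund"],
    "checkout": [],
    "format_date": ["render_invoice"],
}
print(sorted(impacted_by({"parse_amount"}, dependents)))
```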
The fix verification stage also updates the issue tracking state. Findings that have been addressed are marked as resolved. Findings where the fix is incomplete or introduces new problems are updated with new information. New findings discovered during verification are added to the issue list. This tracking provides a clear audit trail showing how each finding was handled.
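The state transitions described above could be modeled roughly as follows. The `Finding` record, the `Status` values, and the shape of the verification results are assumptions for this sketch, not a prescribed schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    OPEN = "open"
    RESOLVED = "resolved"
    NEEDS_WORK = "needs_work"


@dataclass
class Finding:
    id: str
    summary: str
    status: Status = Status.OPEN
    history: list = field(default_factory=list)  # audit trail of verification notes


def apply_verification(findings, results, new_findings=()):
    """Update finding states from verification results.

    results: dict mapping finding id to (fixed, note).
    Findings not mentioned in results are left unchanged.
    """
    for finding in findings:
        if finding.id in results:
            fixed, note = results[finding.id]
            finding.status = Status.RESOLVED if fixed else Status.NEEDS_WORK
            finding.history.append(note)
    findings.extend(new_findings)
    return findings


findings = [
    Finding("F1", "Null dereference in parse_amount"),
    Finding("F2", "Unsanitized input in search"),
]
results = {
    "F1": (True, "Null case now handled"),
    "F2": (False, "Fix misses the POST path"),
}
apply_verification(findings, results, [Finding("F3", "Fix for F1 changed the return type")])
print([(f.id, f.status.value) for f in findings])
```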
Pipeline Configuration Best Practices
Configuring the multi-pass pipeline requires balancing thoroughness against cost and speed. The optimal configuration depends on the team's risk tolerance, budget, and development velocity. Teams deploying safety-critical software typically run four passes on every change. Teams building consumer web applications might run two passes by default, with three passes triggered for changes in sensitive areas.
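The consumer web application case could be captured in a small configuration like the one below. The keys, the path patterns, and the `passes_for` helper are hypothetical and only illustrate the idea of escalating depth for sensitive paths.

```python
from fnmatch import fnmatchcase

# Hypothetical configuration: two passes by default, three for sensitive areas.
CONFIG = {
    "default_passes": 2,
    "sensitive_passes": 3,
    "sensitive_paths": ["*auth*", "*payment*", "*migrations/*"],
}


def passes_for(changed_paths, config=CONFIG):
    """Return how many review passes a change set should receive."""
    sensitive = any(
        fnmatchcase(path, pattern)
        for path in changed_paths
        for pattern in config["sensitive_paths"]
    )
    return config["sensitive_passes"] if sensitive else config["default_passes"]


print(passes_for(["src/ui/button.tsx"]))
print(passes_for(["src/ui/button.tsx", "src/payment/stripe.ts"]))
```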
Model selection for each pass affects both quality and cost. A common pattern uses a smaller, faster model for the first pass (broad, inexpensive scanning) and a larger, more capable model for the deep analysis passes (thorough but expensive reasoning). This tiered approach achieves most of the quality benefit of using the best model for every pass while reducing total cost by 40 to 60 percent.
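A short calculation shows how the savings arise. The prices and token counts below are invented for the example and are not measurements; the actual saving depends on the price ratio between models and on how many tokens each pass consumes.

```python
def review_cost(pass_tokens, pass_models, price_per_1k):
    """Total cost of a review given tokens and model choice for each pass."""
    return sum(
        tokens / 1000 * price_per_1k[model]
        for tokens, model in zip(pass_tokens, pass_models)
    )


# Hypothetical prices (cost units per 1,000 tokens) and per-pass token usage.
prices = {"small": 1.0, "large": 5.0}
pass_tokens = [100_000, 40_000]  # broad first pass, narrower second pass

all_large = review_cost(pass_tokens, ["large", "large"], prices)
tiered = review_cost(pass_tokens, ["small", "large"], prices)
savings = 1 - tiered / all_large

print(all_large, tiered, round(savings, 2))
```

Under these assumed numbers the tiered setup costs 300 units instead of 700. The saving is large because the first pass processes the most tokens, so moving it to the cheaper model removes most of the cost.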
Custom rules and suppression lists should be maintained as living documents. As the team discovers false positive patterns specific to their codebase, these patterns should be added to the suppression list. As new coding standards are adopted, the rules should be updated to reflect them. Regular review of the pipeline configuration, at least quarterly, keeps the tool aligned with the team's evolving practices.
Metrics collection provides the data needed to optimize pipeline configuration over time. Track the number of findings per pass, the false positive rate, the time from PR submission to review completion, and the number of production bugs that the review missed but could have caught. These metrics reveal whether the pipeline is configured correctly or needs adjustment.
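Several of these metrics can be aggregated from simple per-review records. The record fields used below are assumptions for this sketch; production bug counts are omitted because they come from incident data rather than from the pipeline itself.

```python
def summarize(reviews):
    """Aggregate per-review records into pipeline-level metrics.

    Each record is a dict with findings_per_pass (list of ints),
    false_positives (int), and minutes_to_complete (number).
    """
    max_passes = max((len(r["findings_per_pass"]) for r in reviews), default=0)
    per_pass = [
        sum(
            r["findings_per_pass"][i]
            for r in reviews
            if i < len(r["findings_per_pass"])
        )
        for i in range(max_passes)
    ]
    total = sum(per_pass)
    false_positives = sum(r["false_positives"] for r in reviews)
    mean_minutes = (
        sum(r["minutes_to_complete"] for r in reviews) / len(reviews)
        if reviews
        else 0.0
    )
    return {
        "findings_per_pass": per_pass,
        "false_positive_rate": false_positives / total if total else 0.0,
        "mean_minutes_to_complete": mean_minutes,
    }


reviews = [
    {"findings_per_pass": [6, 3, 1], "false_positives": 2, "minutes_to_complete": 12},
    {"findings_per_pass": [4, 0], "false_positives": 1, "minutes_to_complete": 8},
]
summary = summarize(reviews)
print(summary["findings_per_pass"], summary["mean_minutes_to_complete"])
```

A later pass that consistently contributes few findings, as the third pass does in this sample data, is a candidate for being made conditional rather than run on every change.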