What AI Code Review Misses: Current Limits
Business Logic and Domain Correctness
The most significant blind spot in AI code review is business logic validation. An AI model can verify that a function correctly implements the logic expressed in its code, but it cannot determine whether that logic is correct for the business problem it is supposed to solve. A pricing function that applies a 10% discount when the order total exceeds $100 will pass AI review perfectly, even if the business requirement was a 15% discount above $150.
This limitation exists because AI review operates on code, not on business requirements. The model can see what the code does, but it cannot see what the code should do. Business requirements live in product documents, user stories, conversations between product managers and developers, and organizational knowledge that is not accessible from the code alone. Until AI systems can reliably connect code behavior to business intent, this gap will remain.
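The pricing example can be made concrete with a minimal sketch. The function names and the threshold values (100 and 150) are assumptions chosen for illustration; the point is only that the first function is clean and internally consistent, while the requirement it should satisfy lives outside the code.

```python
from decimal import Decimal

# Hypothetical values: this is what the code under review implements.
IMPLEMENTED_THRESHOLD = Decimal("100")
IMPLEMENTED_RATE = Decimal("0.10")


def apply_discount(order_total: Decimal) -> Decimal:
    """Return the total after discount, as the code (not the spec) defines it."""
    if order_total > IMPLEMENTED_THRESHOLD:
        return order_total * (Decimal("1") - IMPLEMENTED_RATE)
    return order_total


def required_total(order_total: Decimal) -> Decimal:
    """The rule as a product document might state it (assumed: 15% above 150).

    A reviewer who only sees apply_discount has no access to this function.
    """
    if order_total > Decimal("150"):
        return order_total * Decimal("0.85")
    return order_total


if __name__ == "__main__":
    for total in (Decimal("120"), Decimal("200")):
        print(total, apply_discount(total), required_total(total))
```

Nothing in apply_discount is wrong in isolation: it is typed, readable, and does exactly what it says. The mismatch only appears once the requirement is written down somewhere a reviewer, human or automated, can compare against.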
Domain-specific validation is a related gap. A financial application that calculates tax liability must follow specific tax rules that vary by jurisdiction. A healthcare application that processes patient data must comply with regulatory requirements like HIPAA. A trading system that executes orders must follow exchange rules and market regulations. AI review cannot validate compliance with these domain-specific requirements unless they are encoded as explicit rules in the review configuration, which requires significant domain expertise to set up.
The practical impact is that teams must maintain human review for business logic, even when AI handles the mechanical aspects of code quality. The most effective division of labor assigns AI review to catch bugs, security vulnerabilities, and code quality issues while reserving human attention for validating that the code correctly implements business requirements.
Architectural and Design Decisions
AI code review evaluates code at the function and file level, but it cannot assess architectural decisions that span the entire system. Is a microservices architecture the right choice for a particular application? Does the data model support the planned feature roadmap? Will the caching strategy handle the projected growth? These are design questions that require understanding the full system context and future plans.
An AI reviewer might approve a pull request that introduces tight coupling between two services because the code within the PR is well-written. A human architect would reject the same PR because the coupling violates the team's architectural principles and will make future service extraction impossible. The AI sees correct code. The human sees a design decision that will cause technical debt.
API design is another area where AI review falls short. An AI model can check that API endpoints follow RESTful conventions, handle errors properly, and validate input. It cannot evaluate whether the API design is intuitive for consumers, whether the resource hierarchy reflects the domain model accurately, or whether the versioning strategy will support backwards compatibility as the API evolves.
Performance architecture decisions are similarly beyond AI review capability. Choosing between caching strategies (local cache vs. distributed cache vs. CDN), database architectures (SQL vs. NoSQL vs. graph), and communication patterns (synchronous vs. asynchronous vs. event-driven) requires understanding the application's load patterns, consistency requirements, and scaling constraints, none of which are visible in the code of any single pull request.
Novel and Complex Vulnerability Classes
AI code review reliably catches known vulnerability patterns that are well-represented in model training data: SQL injection, XSS, CSRF, insecure deserialization, and the other OWASP Top 10 categories. However, it struggles with novel vulnerability classes, complex multi-step attack chains, and application-specific security logic.
Zero-day vulnerability patterns are by definition not in training data. When a new class of vulnerability is discovered, AI review systems cannot detect it until they are updated with knowledge of the new pattern. During the gap between discovery and model update, the vulnerability class is invisible to AI review. Human security researchers who understand the underlying principles can often recognize novel vulnerabilities by reasoning from first principles, a capability that current AI models do not reliably possess.
Multi-step attack chains that span multiple services are another gap. An attacker might exploit a combination of individually innocuous behaviors across three different microservices to achieve unauthorized access. Each service's code passes security review individually, but the combination creates a vulnerability. AI review operates on individual code changes and does not model the full attack surface of a distributed system.
Application-specific security requirements often fall outside standard patterns. A multi-tenant application must ensure that data from one tenant is never visible to another. A financial application must enforce transaction ordering constraints. An access control system must correctly implement complex permission hierarchies. These requirements are specific to the application and must be encoded as custom rules or validated by domain-aware human reviewers.
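As one illustration of encoding such a requirement, the tenant isolation rule can be written as an executable check rather than left to pattern matching. This is a minimal sketch under stated assumptions: the Invoice type, the two fetch functions, and leaked_rows are all hypothetical names, and a real system would run the check against its actual data access layer.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Invoice:
    tenant_id: str
    invoice_id: int


# A fetch function takes a tenant id and a data store and returns that tenant's rows.
Fetch = Callable[[str, List[Invoice]], List[Invoice]]


def fetch_scoped(tenant_id: str, store: List[Invoice]) -> List[Invoice]:
    """Correct: only returns rows owned by the requesting tenant."""
    return [inv for inv in store if inv.tenant_id == tenant_id]


def fetch_unscoped(tenant_id: str, store: List[Invoice]) -> List[Invoice]:
    """Buggy but tidy-looking: the tenant filter was forgotten."""
    return list(store)


def leaked_rows(fetch: Fetch, store: List[Invoice], tenant_ids: Iterable[str]) -> List[Invoice]:
    """Return every row that a tenant received but does not own."""
    leaks: List[Invoice] = []
    for tenant_id in tenant_ids:
        leaks.extend(inv for inv in fetch(tenant_id, store) if inv.tenant_id != tenant_id)
    return leaks


if __name__ == "__main__":
    store = [Invoice("acme", 1), Invoice("globex", 2)]
    print(leaked_rows(fetch_scoped, store, ["acme", "globex"]))
    print(leaked_rows(fetch_unscoped, store, ["acme", "globex"]))
```

Both fetch functions are short and readable, so style alone gives a reviewer little to object to. The invariant has to be stated explicitly before anything can check it.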
Performance and Algorithmic Optimization
AI code review provides limited guidance on performance optimization. A model might approve a solution with O(n^2) time complexity when an O(n log n) solution exists, especially if the less efficient solution is clearly written and correct. The model does not know the expected data sizes, performance requirements, or bottleneck tolerance of the application, so it cannot evaluate whether the performance characteristics of the code are acceptable.
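A small sketch shows why the less efficient version can pass review. Both functions below are correct and clearly written; they differ only in how they scale. The duplicate-detection task and the function names are assumptions chosen for illustration.

```python
from typing import List


def has_duplicate_quadratic(values: List[int]) -> bool:
    """Compare every pair of elements: O(n^2) comparisons in the worst case."""
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            if values[i] == values[j]:
                return True
    return False


def has_duplicate_sorted(values: List[int]) -> bool:
    """Sort a copy, then compare neighbors: O(n log n) dominated by the sort."""
    ordered = sorted(values)
    return any(a == b for a, b in zip(ordered, ordered[1:]))


if __name__ == "__main__":
    sample = [7, 3, 9, 3, 1]
    print(has_duplicate_quadratic(sample), has_duplicate_sorted(sample))
```

For a list of twenty items the difference is irrelevant; for a list of several million it may decide whether the job finishes. Which case applies is exactly the information that is missing from the diff.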
Database query optimization is a specific performance area where AI review is weak. An AI model can verify that a query is syntactically correct and returns the right results, but it cannot evaluate the query execution plan, identify missing indices, or predict how the query will perform as the table grows to millions of rows. These assessments require knowledge of the database schema, data distribution, and query patterns that are not available from the code alone.
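The dependence on schema state can be sketched with SQLite, which ships with Python. The table, column, and index names here are hypothetical, and the exact wording of plan output varies between SQLite versions, so the example prints the plans rather than asserting specific text.

```python
import sqlite3
from typing import List, Sequence


def query_plan(conn: sqlite3.Connection, sql: str, params: Sequence = ()) -> List[str]:
    """Return the human-readable detail column of EXPLAIN QUERY PLAN for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)

# The query text is identical in both cases; only the schema changes.
sql = "SELECT id, total FROM orders WHERE customer_id = ?"

plan_before = query_plan(conn, sql, (42,))  # no index: a full table scan
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(conn, sql, (42,))   # index available: an index search

if __name__ == "__main__":
    print("before:", plan_before)
    print("after: ", plan_after)
```

The query string never changes, yet its cost profile does. A review that sees only the pull request containing the query has no way to know which of the two plans production will use.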
Caching decisions similarly require context beyond the code. Whether to cache a particular computation, which cache invalidation strategy to use, and how long to set the cache TTL depend on the freshness requirements of the data, the cost of recomputation, and the access patterns of the application. AI review cannot make these judgments because the relevant information exists in system architecture documents and operational metrics rather than in the source code.
Memory allocation patterns, garbage collection impact, and system resource consumption are performance dimensions that AI review does not reliably evaluate. A function that creates temporary objects in a tight loop might cause excessive garbage collection pressure, but this effect is only visible at runtime under load, not during static code analysis.
Undocumented Conventions and Team Knowledge
Every development team accumulates conventions and practices that are not fully documented: preferred library choices, deprecated patterns that should not be used in new code, known workarounds for third-party library bugs, and historical context about why certain architectural decisions were made. AI review systems do not have access to this tribal knowledge.
The gap is most visible when AI review approves code that uses a deprecated internal library, introduces a dependency that the team has decided to avoid, or implements a pattern that the team has moved away from. A human reviewer on the team would catch these issues immediately because they have the context that the AI lacks. The AI sees syntactically correct, well-structured code and approves it.
This gap can be narrowed gradually by encoding team conventions in the review configuration: adding custom rules for deprecated patterns, preferred libraries, and architectural guidelines. But this encoding is always incomplete because team conventions evolve continuously, and the effort required to maintain comprehensive documentation of all conventions often exceeds what teams are willing to invest.
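One way such a custom rule might look is a small static check built on Python's standard ast module. This is a sketch, not a description of any particular review tool: the module name legacy_http and the advice string are hypothetical stand-ins for whatever a team has actually deprecated.

```python
import ast
from typing import Dict, List, Tuple

# Hypothetical team convention: deprecated module -> advice for the author.
DEPRECATED_MODULES: Dict[str, str] = {
    "legacy_http": "use the team's current HTTP client instead",
}


def find_deprecated_imports(
    source: str, deprecated: Dict[str, str] = DEPRECATED_MODULES
) -> List[Tuple[int, str, str]]:
    """Return (line number, imported name, advice) for each deprecated import."""
    findings: List[Tuple[int, str, str]] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue  # not an import, or a relative import with no module name
        for name in names:
            root = name.split(".")[0]
            if root in deprecated:
                findings.append((node.lineno, name, deprecated[root]))
    return sorted(findings)


if __name__ == "__main__":
    snippet = "import os\nimport legacy_http.client\nfrom legacy_http import get\n"
    for finding in find_deprecated_imports(snippet):
        print(finding)
```

The rule is only as current as the dictionary at the top of the file. Every time the team retires another library or pattern, someone has to remember to add it, which is the maintenance cost described above.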