
The Eval Gate: No Pass, No Progress

Carlos Aggio · February 11, 2026 · 3 min read

If you take one thing from this book, let it be this: evaluations are the difference between an agentic system that produces reliable software and one that produces confident-looking garbage at high speed. I've seen teams invest heavily in agent design and orchestration while treating evaluation as an afterthought. It never ends well.

The principle is simple: every agent produces an artifact (a file, not a conversational response), and that artifact must pass evaluation before the workflow advances. The evaluation examines the actual output, not the agent's explanation of its output. Did the artifact meet its definition of done? That's the only question that matters.

Two Evaluation Layers

Evaluations follow a layered approach that mirrors the overall system architecture:

Deterministic validation runs first. These are fast, repeatable, zero-ambiguity checks. For a requirement artifact: does it contain the mandatory metadata fields? Are all required sections populated? Do internal references (like links to parent requirements or dependent tasks) point to artifacts that actually exist in the repository? For code: does it compile? Does the linter pass? Do the tests execute and pass? Does code coverage meet the project's minimum threshold? Does the import structure respect the architectural boundaries (verified through static analysis of the dependency graph)?
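For the artifact-level checks, a minimal sketch might look like the following. It assumes a requirement artifact is a JSON-like dict with an illustrative schema (the field names `id`, `title`, `acceptance_criteria`, and `parent` are my assumptions, not a prescribed format), and it returns a list of failures so the gate can hand specific feedback back to the producer:

```python
from pathlib import Path

# Assumed metadata schema for a requirement artifact (illustrative only).
REQUIRED_FIELDS = {"id", "title", "acceptance_criteria", "parent"}

def validate_requirement(artifact: dict, repo_root: Path) -> list[str]:
    """Layer-1 deterministic checks. Empty list means the gate passes."""
    errors = []

    # Mandatory metadata fields must all be present.
    missing = REQUIRED_FIELDS - artifact.keys()
    if missing:
        errors.append(f"missing metadata fields: {sorted(missing)}")

    # Internal references must point to artifacts that exist in the repo.
    parent = artifact.get("parent")
    if parent and not (repo_root / f"{parent}.json").exists():
        errors.append(f"parent reference {parent!r} not found in repository")

    # Required sections must actually be populated, not just declared.
    if not artifact.get("acceptance_criteria"):
        errors.append("acceptance_criteria section is empty")

    return errors
```

For code artifacts the same shape holds: each check (compile, lint, test run, coverage threshold, dependency-graph analysis) is a function that returns concrete failures, and the gate passes only when every list comes back empty.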

An evaluator agent runs second. This is a separate, dedicated AI agent whose sole purpose is quality assessment. It makes the judgment calls that rule-based checks can't handle. For requirements: are the acceptance criteria genuinely testable, or are they vague enough to be interpreted multiple ways? For architecture: does the proposed approach align with established project patterns? Are there obvious coverage gaps that would surface during implementation? For code: does the implementation actually fulfill the task specification? Are there security exposures that static analysis wouldn't catch?
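The evaluator layer can be sketched as a single structured model call. Everything here is an assumption for illustration: the prompt wording, the JSON verdict shape, and the `call_model` callable (swap in your provider's actual client). The important property is that the verdict is machine-readable, so the workflow can branch on it, and the feedback is specific enough for the producing agent to act on:

```python
import json
from dataclasses import dataclass

@dataclass
class Verdict:
    passed: bool
    feedback: str  # specific, actionable notes for the producing agent

# Illustrative prompt; doubled braces escape the literal JSON template.
EVAL_PROMPT = """You are a quality evaluator. Judge the artifact below \
against its definition of done. Reply with JSON only:
{{"passed": true/false, "feedback": "..."}}

Definition of done:
{dod}

Artifact:
{artifact}"""

def evaluate(artifact: str, definition_of_done: str, call_model) -> Verdict:
    """Layer-2 judgment call: a dedicated model assesses quality."""
    raw = call_model(EVAL_PROMPT.format(dod=definition_of_done, artifact=artifact))
    data = json.loads(raw)
    return Verdict(passed=bool(data["passed"]), feedback=data.get("feedback", ""))
```

Keeping the evaluator as a separate agent with its own narrow prompt, rather than asking the producer to self-grade, is what makes the judgment independent.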

If either layer fails the artifact, the producing agent revises and resubmits. Most implementations allow three to five revision cycles before escalating to a human. This cap prevents runaway loops while giving agents meaningful space to self-correct based on specific feedback.
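The loop itself is simple to express. In this sketch, `produce`, `deterministic_checks`, and `evaluator` are stand-ins for the producing agent and the two layers described above; the cap and the escalation result are the parts that matter:

```python
MAX_ATTEMPTS = 4  # most implementations allow three to five cycles

def run_gate(produce, deterministic_checks, evaluator, max_attempts=MAX_ATTEMPTS):
    """Revise-and-resubmit loop: no pass, no progress."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        artifact = produce(feedback)  # producer sees the prior cycle's feedback
        errors = deterministic_checks(artifact)  # layer 1: fast, rule-based
        if errors:
            feedback = "; ".join(errors)
            continue
        passed, notes = evaluator(artifact)  # layer 2: AI quality judgment
        if passed:
            return {"status": "passed", "artifact": artifact, "attempts": attempt}
        feedback = notes
    # Cap reached: stop the loop and hand the artifact to a human.
    return {"status": "escalate_to_human", "last_feedback": feedback,
            "attempts": max_attempts}
```

Note that layer 1 short-circuits layer 2: there is no point spending an evaluator-agent call on an artifact that fails cheap deterministic checks.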

Production platforms are starting to formalize these evaluation patterns as managed services. Google's Agent Builder, for example, includes an Example Store (a centralized repository of few-shot examples that steer agent behavior on specific task types without retraining the underlying model) and an Evaluation Service (a feedback loop system that enables scaled review of agent outputs against quality metrics). These aren't revolutionary concepts individually, but their appearance as platform-level primitives signals that the industry recognizes evaluation infrastructure as essential, not optional. The pattern I described above (deterministic validation first, then AI-powered quality assessment) maps directly to what these managed services provide. The difference is that building it yourself gives you full control over the evaluation criteria, while managed services reduce operational overhead at the cost of some customizability.

Changing How Humans Spend Their Time

The practical effect of eval gates is transformative for how senior engineers invest their review capacity. Without automated evaluation, human reviewers spend most of their energy on low-value catches: missing metadata, inconsistent naming, test coverage gaps, formatting issues. With eval gates handling those mechanically, the human review focuses exclusively on high-leverage questions: is the architectural reasoning sound? Do the business logic choices make sense? Were the right tradeoffs made?

At a mining operations client, we saw code review time per feature drop by roughly 60 percent after implementing eval gates, while the quality of the feedback that human reviewers provided actually improved. They were catching real design issues instead of spending their attention budget on formatting and structural compliance.


This article is from The Agentic SDLC by Carlos Aggio.