Where my old test instincts started missing
The first time I tried to write a unit test for an LLM feature, I did the obvious thing. Same prompt, expected string, assert output == "...". It worked the first run and broke the second, even at temperature 0. The model came back with “Yes, that’s correct” instead of “Correct, yes.” Semantically the same, but the assertion didn’t care.
That was the moment I felt the gap. The tests I’d been writing for ten years assume a function, given the same input, returns the same output. LLMs don’t. The Advisor360 line I keep coming back to is, “you can’t assert your way out of non-determinism.” Once you accept that, you stop asking “is the output exactly this?” and start asking “is the output in the acceptable distribution of good answers?”
That’s what evals are. They’re how I’ve been thinking about quality for AI features since I started shipping them.
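A minimal sketch of that shift, assuming a stand-in model and a made-up acceptance check. Instead of asserting one exact string, it samples the feature several times and measures how often the output lands in the acceptable set. The names is_affirmative and pass_rate, and the canned outputs, are hypothetical.

```python
import itertools
import re
from typing import Callable


def normalize(text: str) -> str:
    """Lowercase and drop punctuation so trivial rewordings compare equal."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()


def is_affirmative(output: str) -> bool:
    """Hypothetical acceptance check: the answer affirms, in any word order."""
    words = set(normalize(output).split())
    affirms = bool(words & {"yes", "correct"})
    negates = bool(words & {"no", "not", "incorrect"})
    return affirms and not negates


def pass_rate(generate: Callable[[], str], check: Callable[[str], bool], n: int = 20) -> float:
    """Call the feature n times and return the fraction of acceptable outputs."""
    return sum(check(generate()) for _ in range(n)) / n


# Stand-in for a real model call: cycles through canned outputs.
canned = itertools.cycle(
    ["Yes, that's correct.", "Correct, yes.", "Yes.", "No, that's wrong."]
)
rate = pass_rate(lambda: next(canned), is_affirmative, n=20)
print(f"pass rate: {rate:.2f}")  # 15 of the 20 canned outputs pass
```

The test then becomes a threshold on the rate (say, fail the run below 0.9) rather than an equality on a string. Where to set that threshold is a product decision, not something the code can answer.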
The three tiers I keep reaching for
When I look at how the eval ecosystem has settled in 2026, most teams I’ve talked to land on three kinds of checks, in increasing order of flexibility and decreasing order of speed.
Code-based assertions. Exact match, regex, JSON schema validation, did-the-agent-call-the-right-tool. Anthropic’s own docs put them first for a reason: they’re the fastest, cheapest, and least fussy option. If the output is structured (a tool call, a JSON blob, a classification label), this is where I start.
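A rough sketch of what code-based assertions look like for a structured output, using only the standard library. The label set, the confidence field, and the tool-call shape are assumptions made up for illustration, not any framework's schema.

```python
import json

ALLOWED_LABELS = {"billing", "technical", "other"}  # hypothetical label set


def check_classification(raw: str) -> list[str]:
    """Return a list of failure reasons; an empty list means the output passes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]

    failures = []
    label = data.get("label")
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        failures.append(f"unexpected label: {label!r}")

    conf = data.get("confidence")
    is_number = isinstance(conf, (int, float)) and not isinstance(conf, bool)
    if not is_number or not 0 <= conf <= 1:
        failures.append("confidence must be a number in [0, 1]")
    return failures


def called_tool(tool_calls: list[dict], name: str) -> bool:
    """Did the agent call the named tool at least once?"""
    return any(call.get("name") == name for call in tool_calls)


print(check_classification('{"label": "billing", "confidence": 0.92}'))
print(check_classification('{"label": "sales", "confidence": 1.4}'))
print(called_tool([{"name": "lookup_order", "args": {"id": "123"}}], "lookup_order"))
```

Returning reasons instead of a bare boolean makes the failure readable in a CI log without rerunning anything.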
Statistical scoring. BLEU, ROUGE, BERTScore, embedding cosine similarity. The legacy NLP toolkit. Eugene Yan’s writeup on patterns is fairly direct that these correlate poorly with human judgment on open-ended outputs, and I’ve felt that too. I still use them for sanity checks, but I wouldn’t gate a release on them.
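To make the weakness concrete, here is a small sketch of two statistical scores in plain Python: a unigram-overlap F1 (in the spirit of ROUGE-1, not the reference implementation) and cosine similarity over vectors you would get from an embedding model. The example sentences are invented.

```python
import math
import re
from collections import Counter


def tokens(text: str) -> list[str]:
    """Lowercased word tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z0-9']+", text.lower())


def unigram_f1(candidate: str, reference: str) -> float:
    """F1 over overlapping unigrams, counting repeated words at most as often as they appear."""
    cand, ref = Counter(tokens(candidate)), Counter(tokens(reference))
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


reference = "The refund was approved."
# Opposite meaning, high overlap.
print(unigram_f1("The refund was denied.", reference))
# Same meaning, zero overlap.
print(unigram_f1("Your money is coming back.", reference))
```

The contradicting answer scores 0.75 and the paraphrase scores 0.0, which is the failure mode in miniature: surface overlap rewards the wrong answer.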
LLM-as-judge. Use a strong model to grade the output of your application against a rubric. This is where most of the action has moved.
LLM-as-judge, the part I had to learn the hard way
The framing that finally clicked for me came from Hamel Husain’s guide. The judge is a model you’re training. The rubric is the prompt. Humans are the ground truth. Cohen’s kappa is the loss function. Once you frame it that way, the practice falls out.
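For reference, Cohen's kappa is observed agreement corrected for the agreement you would expect by chance: kappa = (p_o - p_e) / (1 - p_e). A minimal sketch in plain Python, with invented labels. Unlike a real loss function, higher is better here, so the analogy only goes as far as "the number you tune the rubric against."

```python
from collections import Counter


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if not human or len(human) != len(judge):
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    # p_o: fraction of items where the raters agree.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # p_e: agreement expected if each rater labeled at random with their own label frequencies.
    h_counts, j_counts = Counter(human), Counter(judge)
    expected = sum(h_counts[label] * j_counts[label] for label in h_counts) / (n * n)
    if expected == 1:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented example: 10 items, raters agree on 8.
human = ["pass"] * 6 + ["fail"] * 4
judge = ["pass"] * 5 + ["fail"] + ["fail"] * 3 + ["pass"]
print(round(cohens_kappa(human, judge), 3))
```

Raw agreement here is 80%, but kappa is about 0.58, because two raters who each say "pass" 60% of the time would already agree on about half the items by chance.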
A few things I’ve internalized:
Single-output scoring vs pairwise. Single-output gives the judge one answer and a rubric. Pairwise gives it two answers and asks which is better. I default to single-output for objective things (faithfulness, toxicity), and pairwise when I’m comparing model or prompt versions. Eugene Yan’s evaluator post found pairwise gives more stable agreement with humans on subjective tasks.
G-Eval. The Liu et al. paper from EMNLP 2023 is the one most modern frameworks build on. You give the judge a task description and criteria, ask it to first reason through chain-of-thought evaluation steps, then score in a form-filling pass. With GPT-4 they hit a Spearman correlation of 0.514 with humans on summarization, well ahead of anything that came before. DeepEval and Promptfoo both ship configurable G-Eval metrics as a primitive now.
Pass/fail beats Likert. Husain is uncompromising on this and I’ve come around to his view: “if your evaluations consist of a bunch of metrics that LLMs score on a 1-5 scale, you’re doing it wrong.” People don’t know what to do with a 3. Binary forces you to define what good means, and the judge writes a short critique alongside the verdict, which is what you read when something regresses.
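One way to hold the judge to that contract is to reject anything that is not a binary verdict with a critique attached. A sketch, assuming the judge is prompted to reply with a JSON object containing "verdict" and "critique" keys; that reply format is an assumption for illustration, not a convention from Husain's guide.

```python
import json
from dataclasses import dataclass


@dataclass
class Verdict:
    passed: bool
    critique: str


def parse_verdict(raw: str) -> Verdict:
    """Parse a judge reply, refusing scales, blanks, and missing critiques."""
    data = json.loads(raw)  # invalid JSON raises json.JSONDecodeError (a ValueError)
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")
    verdict = str(data.get("verdict", "")).strip().lower()
    if verdict not in {"pass", "fail"}:
        raise ValueError(f"verdict must be 'pass' or 'fail', got {verdict!r}")
    critique = str(data.get("critique", "")).strip()
    if not critique:
        raise ValueError("a critique is required alongside the verdict")
    return Verdict(passed=(verdict == "pass"), critique=critique)


reply = '{"verdict": "fail", "critique": "Cites a refund policy that is not in the context."}'
result = parse_verdict(reply)
print(result.passed, "-", result.critique)
```

A reply of 3 out of 5 fails to parse, which is the point: the ambiguity gets pushed back into the rubric instead of into a dashboard.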
The biases are real. The MT-Bench paper (Zheng et al., 2023) names three I see often: position bias (judges favor the first option, sometimes 70% of the time), verbosity bias (judges prefer longer answers, over 90% in adversarial tests), and self-preference bias (models favor their own outputs; the paper calls this self-enhancement bias). The mitigations aren’t exotic: randomize order, control response length in the rubric, use a different model family as the judge than the one being evaluated.
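A sketch of the position-bias mitigation for pairwise judging. Rather than randomizing order once, this variant asks the judge in both orders and only counts a win when the two runs agree, which is stricter and costs twice the judge calls. The Judge signature and the stub judges are hypothetical; a real judge would wrap a model call.

```python
from typing import Callable

# Hypothetical judge interface: (question, first_answer, second_answer) -> "first" or "second".
Judge = Callable[[str, str, str], str]


def debiased_preference(judge: Judge, question: str, answer_a: str, answer_b: str) -> str:
    """Return "a", "b", or "tie" after judging both presentation orders."""
    forward = judge(question, answer_a, answer_b)
    backward = judge(question, answer_b, answer_a)
    for result in (forward, backward):
        if result not in {"first", "second"}:
            raise ValueError(f"unexpected judge output: {result!r}")

    a_wins_forward = forward == "first"
    a_wins_backward = backward == "second"
    if a_wins_forward and a_wins_backward:
        return "a"
    if not a_wins_forward and not a_wins_backward:
        return "b"
    return "tie"  # the verdict flipped with the order, so treat it as position bias


def always_first(question: str, first: str, second: str) -> str:
    """Stub judge with pure position bias."""
    return "first"


def prefers_longer(question: str, first: str, second: str) -> str:
    """Stub judge with pure verbosity bias."""
    return "first" if len(first) >= len(second) else "second"


print(debiased_preference(always_first, "q", "short", "a much longer answer"))
print(debiased_preference(prefers_longer, "q", "short", "a much longer answer"))
```

The swap neutralizes the position-biased stub (it comes back as a tie) but does nothing about the verbosity-biased one, which still picks the longer answer. Each bias needs its own control.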
The thing I didn’t expect when I started building judges: the act of writing the rubric, labeling 100 examples, and arguing with my judge’s verdicts taught me what I wanted from the system. Shankar et al.’s “Who Validates the Validators?” calls this criteria drift, the idea that you can’t fully define your eval criteria before grading outputs. The grading is how the criteria emerge. I’ve found that to be true every time.
Metrics I check on most projects
For a RAG application or anything grounded in retrieved context, the vocabulary that’s stuck is mostly from the Ragas paper (Es et al., EACL 2024) and what DeepEval added on top:
- Faithfulness. Every claim in the output is supported by the retrieved context. The usual computation is to extract claims from the answer and check each one against the source.
- Answer relevancy. Does the response address the question, or does it wander?
- Hallucination. Output contains claims that aren’t in the context. The HaluEval benchmark found GPT-3.5 only managed 58.5% accuracy distinguishing factual from hallucinated summaries, which is humbling.
- Context grounding (precision and recall). Did the retriever return the right passages? Precision is signal-to-noise; recall is whether you got the chunks you needed at all.
- Bias and toxicity. Demographic skew and harmful content. Most frameworks ship reasonable defaults here.
In my experience, factual inconsistency lands somewhere in the 5-10% range even with decent RAG, and getting below 2% takes real work.
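The arithmetic behind two of these metrics is small enough to sketch. Faithfulness here is supported claims over total claims, with the support check passed in as a function (in practice an LLM or NLI call; the substring check below is only a stand-in). The retrieval scores are a set-based simplification; frameworks such as Ragas and DeepEval define their context metrics in their own ways, so treat this as the idea rather than their implementation.

```python
from typing import Callable


def faithfulness(claims: list[str], context: str, is_supported: Callable[[str, str], bool]) -> float:
    """Fraction of extracted claims that the retrieved context supports."""
    if not claims:
        raise ValueError("no claims to check")
    return sum(is_supported(claim, context) for claim in claims) / len(claims)


def retrieval_precision_recall(retrieved_ids: list[str], relevant_ids: list[str]) -> tuple[float, float]:
    """Set-based precision (signal-to-noise) and recall (did we get what we needed)."""
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def naive_supported(claim: str, context: str) -> bool:
    """Stand-in support check: case-insensitive substring match."""
    return claim.lower() in context.lower()


context = "Paris is the capital of France. It sits on the Seine."
claims = ["Paris is the capital of France", "It sits on the Thames"]
print(faithfulness(claims, context, naive_supported))

precision, recall = retrieval_precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(precision, round(recall, 3))
```

Claim extraction is the hard part and is left out on purpose; the score is only as good as the claims fed into it.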
Golden datasets and the data flywheel
The eval suite is only as good as the dataset you grade against. The pattern I keep coming back to:
- Start with 20-30 hand-written cases from someone who knows the domain. Husain suggests writing more until new failure modes stop appearing.
- Use synthetic generation to scale that 10-100x. DeepEval, Ragas, and Promptfoo all have synthesizers. Treat synthetic data as “silver” and promote to “gold” after a human reviews it.
- Version the dataset like code. Pin every eval run to a dataset version so you can diff results across commits.
- Wire production traces back into the dataset. This is the part I underestimated for a long time. The failures users hit in prod are the cases your synthetic data didn’t anticipate, and they’re the ones worth turning into regression tests. Braintrust advertises turning traces into eval cases with one click, LangSmith has a similar flow, and the LangChain framing of a “data flywheel” describes it well.
The gap between what you imagined users would do and what they’re doing is where most production failures live. Closing that loop is the work.
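A small sketch of the bookkeeping behind that pattern: cases carry a silver or gold tier, the dataset version is a content hash so any change produces a new ID, and a production trace becomes a case once a human supplies the corrected answer. The EvalCase fields and the trace dictionary shape are assumptions for illustration, not any tool's format.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    expected: str
    tier: str = "silver"      # "silver" until a human has reviewed it
    source: str = "synthetic"  # "handwritten", "synthetic", or "production"


def dataset_version(cases: list[EvalCase]) -> str:
    """Order-independent content hash; pin every eval run to this value."""
    canonical = sorted(json.dumps(asdict(case), sort_keys=True) for case in cases)
    return hashlib.sha256("\n".join(canonical).encode("utf-8")).hexdigest()[:12]


def promote(case: EvalCase) -> EvalCase:
    """Mark a case as human-reviewed."""
    return replace(case, tier="gold")


def case_from_trace(trace: dict, corrected_output: str) -> EvalCase:
    """Turn a failing production trace into a regression case.

    The trace supplies the input; the expected answer comes from a reviewer,
    not from the output the model actually produced.
    """
    return EvalCase(prompt=trace["input"], expected=corrected_output, tier="gold", source="production")


cases = [EvalCase("What is the refund window?", "30 days"), EvalCase("Do you ship abroad?", "Yes")]
before = dataset_version(cases)
cases.append(case_from_trace({"input": "refund window for sale items??"}, "14 days"))
after = dataset_version(cases)
print(before, "->", after)
```

Because the version is derived from content, two runs with the same ID were graded against the same cases, which is what makes a score diff across commits meaningful.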
Evals in CI/CD
Once the dataset and judge are stable enough to trust, evals slot into CI in roughly the same shape as any other test suite. The patterns I’ve seen converge on:
- A small, fast eval suite that runs on every PR, mostly code-based assertions and a handful of judge checks on a curated subset.
- The full golden dataset run on merge to main, or on a nightly schedule.
- Deploy gates that compare aggregate scores against the last known-good baseline. If faithfulness drops 5 points, the deploy doesn’t go out.
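The deploy gate reduces to a comparison against the stored baseline. A minimal sketch, assuming scores on a 0-100 scale and a 5-point tolerance per metric; the metric names and numbers are invented, and a real gate would load both dictionaries from wherever eval results are stored.

```python
def gate(baseline: dict[str, float], current: dict[str, float], max_drop: float = 5.0) -> list[str]:
    """Return the regressions that should block a deploy; empty means ship."""
    regressions = []
    for metric, base_score in baseline.items():
        score = current.get(metric)
        if score is None:
            regressions.append(f"{metric}: missing from current run")
        elif base_score - score >= max_drop:
            regressions.append(f"{metric}: {base_score:.1f} -> {score:.1f}")
    return regressions


baseline = {"faithfulness": 92.0, "answer_relevancy": 88.0}
current = {"faithfulness": 86.5, "answer_relevancy": 88.4}

problems = gate(baseline, current)
for line in problems:
    print("REGRESSION", line)
# In CI, a non-empty list would end the job with a non-zero exit code.
```

Treating a missing metric as a failure is deliberate: an eval that silently stopped running looks the same as one that passed unless the gate says otherwise.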
DeepEval extends pytest, so deepeval test run looks and feels like any other Python test job in GitHub Actions. Braintrust posts a PR comment with side-by-side diffs of model outputs across commits, which is closer to a Vercel preview deploy than a pass/fail. Promptfoo keeps its config in YAML next to the code, so the eval suite lives in the repo and reviews like any other change.
The part I like is that this is the same idea as the rest of shift-left: catch the regression before it ships, not after a user files a ticket.
Where humans still belong
A judge that’s been calibrated to one task on one model drifts. The model behind the judge changes, the application changes, the users find new things to ask. The teams I’ve seen do this well don’t ever fully step away from looking at data. They build automation for the 90% of cases that are stable and route the rest to expert review.
Husain’s line on this is the one I keep quoting: “you can never stop looking at data, no free lunch exists.” The human-in-the-loop part isn’t a fallback for when automation fails. It’s the input that keeps the automation honest.
The tooling I’ve been reaching for
A rough snapshot of where the ecosystem sits in mid-2026:
- DeepEval (Apache 2.0, by Confident AI). Pytest-native, 14+ built-in metrics, runs locally, framework-agnostic. My default if the team is already in Python and wants evals to feel like unit tests.
- Braintrust. Polished UI, a playground for prototyping, and a pipeline from production traces to evals. A decent fit for orgs with non-technical stakeholders in the eval loop.
- Promptfoo (MIT, acquired by OpenAI earlier this year). YAML config, CLI-first, 60+ providers, strong on cross-model comparison and red-teaming.
- Ragas. RAG-specific, reference-free metrics, the paper that seeded the modern vocabulary. Worth using directly if your eval needs are mostly RAG-shaped.
None of them is the obvious winner, and most teams I know end up using more than one. Promptfoo for prompt iteration in the editor, DeepEval in CI, Braintrust or Langfuse for production tracing. The seams between them aren’t always clean.
What I keep landing on
Evals are how I’ve stopped feeling like I’m shipping AI features blind. They aren’t unit tests in the old sense, because the assertion model doesn’t survive non-determinism, but they fill the same hole in the loop: a fast, repeatable signal that something I just changed made things better or worse.
The work that buys the signal is the part nobody talks about enough. Labeling examples, calibrating a judge against humans, versioning a dataset, wiring prod traces back in. None of it is glamorous. All of it is what makes the next change you ship safer than the last.