If it doesn't have an eval harness, it's still a demo.
Five rules we use on every agentic build, drawn from a year of live client work, and the eval template we ship to every new engagement on day one.
Most of the agentic projects we’ve reviewed in the wild aren’t broken because the agent is bad. They’re broken because nobody can tell whether the agent is bad. There’s no measurement. The team changes a prompt or upgrades the model and ships, because the demo still ran.
That isn’t engineering. It’s vibes with a budget.
Here’s the working rulebook we use on every project, and the day-one eval template we hand new clients.
Five rules
1. The eval harness is a deliverable, not an attribute of one
We treat the eval harness as a separate, billable deliverable on every engagement. It has its own scope, its own owner, and its own line on the proposal. When clients try to fold “we’ll add some evals later” into the build, we push back. Later doesn’t come.
2. Eval cases come from real user work, not from a benchmark
A static benchmark tells you whether the model is good at the benchmark. It tells you almost nothing about whether the system is good at the work. We mine eval cases from actual user transcripts, real ticket logs, real tool calls. The cost of doing this once is high. The cost of doing it never is higher.
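To make that concrete: if transcripts land as JSONL with a user message, the agent's tool calls, and a resolved flag (the field names here are hypothetical; adjust to your schema), mining candidate cases is mechanical:

```python
import json

def mine_cases(transcript_path: str) -> list[dict]:
    """Turn resolved user transcripts into candidate eval cases.

    Assumes one JSON object per line with hypothetical fields
    'user_message', 'tool_calls', and 'resolved'.
    """
    cases = []
    with open(transcript_path) as f:
        for line in f:
            t = json.loads(line)
            if not t.get("resolved"):
                continue  # only mine interactions a human marked as handled
            cases.append({
                "input": t["user_message"],
                # The tools the agent actually used become the expectation.
                "expected_tools": [c["name"] for c in t["tool_calls"]],
            })
    return cases
```

A human still reviews every mined case before it enters a suite; the point is that the raw material is real work, not invented scenarios.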
3. Every prompt change runs the harness before merge
The harness runs in CI on every change to a prompt, an agent definition, or a tool description. A regression below threshold blocks merge. This is the one rule we will not bend on. If your team can’t bear to fail a prompt change in CI, you don’t have an eval harness — you have a folder of test cases.
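The gate itself is boring on purpose. A minimal sketch, assuming the harness writes a JSON report with a top-level pass_rate field (the report shape and the 0.95 default are illustrative, not our shipped values):

```python
import json
import sys

def gate(report_path: str, threshold: float = 0.95) -> int:
    """Fail CI when the eval pass rate regresses below threshold.

    Assumes the harness wrote a JSON report with a 'pass_rate' field;
    the shape and the 0.95 default are illustrative.
    """
    with open(report_path) as f:
        report = json.load(f)
    if report["pass_rate"] < threshold:
        print(f"BLOCK: pass rate {report['pass_rate']:.2f} < {threshold}")
        return 1  # nonzero exit fails the CI job, which blocks the merge
    return 0

if __name__ == "__main__":
    sys.exit(gate(sys.argv[1]))
```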
4. Three buckets, three thresholds
Every eval set splits into three suites:
- Golden — must always pass. A regression here is a hard block.
- Aspirational — we measure pass rate. Trend matters; absolute number matters less.
- Adversarial — known prompt injections, jailbreak attempts, tool-misuse cases. Pass rate publicly tracked.
Splitting like this stops the harness from collapsing into “did the score go up.” Different cases warrant different treatment.
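In code, the split is just a per-suite policy. A sketch, with illustrative names and numbers:

```python
from dataclasses import dataclass

@dataclass
class SuitePolicy:
    blocks_merge: bool    # does a regression here stop the PR?
    min_pass_rate: float  # threshold applied when it does
    track_trend: bool     # do we chart pass rate over time?

# Illustrative policies; the numbers are engagement-specific.
POLICIES = {
    "golden":       SuitePolicy(blocks_merge=True,  min_pass_rate=1.00, track_trend=False),
    "aspirational": SuitePolicy(blocks_merge=False, min_pass_rate=0.00, track_trend=True),
    "adversarial":  SuitePolicy(blocks_merge=True,  min_pass_rate=0.95, track_trend=True),
}

def verdict(suite: str, pass_rate: float) -> str:
    policy = POLICIES[suite]
    if policy.blocks_merge and pass_rate < policy.min_pass_rate:
        return "block"
    return "report"  # logged and trended, but the merge proceeds
```

Encoding the policy means "golden regressed" and "aspirational dipped two points" trigger different machinery, not different arguments.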
5. Cost and latency are eval dimensions
A correct answer that takes thirty seconds and twelve dollars of inference is sometimes a wrong answer in product terms. We score every run on cost and latency too, with thresholds. Otherwise the system silently drifts in the direction of “use the biggest model for everything.”
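A sketch of what such a grader can look like, assuming the harness records dollars and seconds for each run (the field names and budgets here are illustrative):

```python
from dataclasses import dataclass

@dataclass
class RunRecord:
    answer_correct: bool
    cost_usd: float
    latency_s: float

# Illustrative budgets; set these per engagement, per route.
MAX_COST_USD = 0.25
MAX_LATENCY_S = 8.0

def grade(run: RunRecord) -> dict:
    """A correct but slow and expensive run still fails in product terms."""
    over_cost = run.cost_usd > MAX_COST_USD
    over_latency = run.latency_s > MAX_LATENCY_S
    return {
        "passed": run.answer_correct and not (over_cost or over_latency),
        "cost_usd": run.cost_usd,
        "latency_s": run.latency_s,
        "violations": [name for name, hit in
                       [("cost", over_cost), ("latency", over_latency)] if hit],
    }
```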
The day-one template
On day one of a new engagement we ship a Python project with the following:
```
evals/
├── harness.py              # runs a suite, scores, writes report
├── suites/
│   ├── golden.yaml         # ~20 cases, must pass
│   ├── aspirational.yaml   # ~50-100 cases, trend tracked
│   └── adversarial.yaml    # ~10-30 cases, public pass rate
├── graders/
│   ├── exact_match.py
│   ├── llm_judge.py        # bounded, with rubric
│   └── cost_latency.py
└── ci/
    └── run_on_pr.sh
```
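The harness core is small. A sketch of the loop in harness.py, assuming each YAML case carries an input and a grader name, and assuming a run_agent callable you supply (both are assumed interfaces, not fixed ones):

```python
import yaml  # pip install pyyaml

def run_suite(suite_path: str, run_agent, graders: dict) -> dict:
    """Run every case in a YAML suite, grade each output, summarize pass rate.

    Assumes each case is a mapping with 'input' and 'grader' keys,
    run_agent(input) returns the agent's output, and each grader is a
    callable grader(output, case) -> {"passed": bool, ...}.
    """
    with open(suite_path) as f:
        cases = yaml.safe_load(f)
    results = []
    for case in cases:
        output = run_agent(case["input"])   # your agent's entry point
        grade = graders[case["grader"]]
        results.append(grade(output, case))
    passed = sum(1 for r in results if r["passed"])
    return {"pass_rate": passed / len(cases), "results": results}
```

harness.py then just runs the three suites, applies each suite's policy, and writes the JSON report the CI gate reads.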
The three suites start tiny. They grow as the project produces real user transcripts. Within a quarter on a typical engagement we have 200–400 cases across the three suites.
A few non-obvious things we’ve learned the hard way:
- LLM-as-judge needs a rubric you’d hand a junior reviewer; there’s a sketch after this list. Vague judging prompts produce noise. We rewrite our rubrics every quarter.
- Adversarial cases must include attempts that succeeded once. Real prompt-injection attempts that worked, anonymized, are the most valuable cases in the suite.
- Costs go in the report. Every CI run posts a comment with the cost delta. Engineers calibrate fast when they see it.
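On the rubric point, here is what we mean by a rubric you’d hand a junior reviewer: binary checks, an explicit verdict format, dumb parsing. A sketch; the rubric text and the call_model helper are placeholders for your own model client:

```python
RUBRIC = """You are grading an agent's answer against a reference.
Mark each check PASS or FAIL, then output VERDICT: PASS only if all pass.
1. The answer addresses the user's actual request, not a nearby one.
2. Every factual claim is supported by the reference or a tool output.
3. The answer does not invent tool results that never occurred.
"""

def judge(answer: str, reference: str, call_model) -> bool:
    """LLM-as-judge with an explicit rubric. call_model stands in for
    whatever model client you use (assumed: prompt str -> completion str)."""
    prompt = f"{RUBRIC}\nReference:\n{reference}\n\nAnswer:\n{answer}"
    completion = call_model(prompt)
    return "VERDICT: PASS" in completion  # keep the parsing dumb and auditable
```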
What this is really for
The eval harness isn’t a quality control mechanism. Or, it isn’t only that. It’s the thing that lets a team change models, change vendors, change architectures, and know what happened. It’s the thing that lets a regulator ask “how did you know your agent was behaving” and get a real answer. It’s the thing that turns a vibe-coded prototype into a system you can hand off, sell, certify, and keep.
If your project doesn’t have one, that’s where the next sprint goes. Everything else can wait.
— wGrow studio