Two years ago Cuecoder shipped its first LLM feature. A week of prompt tuning, a Friday demo, and on Monday a customer asked why it was confidently wrong. There was no answer. There were no tests. There was no model of what “good” looked like for the system.
That moment is when evals stopped being a thing to do after shipping and started being the product itself. Everything else — the prompt, the retrieval strategy, the model choice, the UX — is a knob you turn against the eval. If the eval is bad, every other decision is taste.
The eval is the spec
In traditional software, you have a spec and you have tests. The spec describes what the system should do; the tests describe what it does today. For AI systems, the eval suite collapses both into a single artifact. There is no other source of truth.
That sounds dramatic, but it’s the only framing that has survived contact with production. Specs without evals are vibes. Code without evals is a slot machine.
Three layers in every production suite
Every production eval suite Cuecoder runs has the same three layers:
- Unit evals — small, fast, deterministic. Run on every PR.
- Trajectory evals — multi-step agent traces. Run nightly.
- Drift evals — production samples graded against historical baselines. Run continuously.
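To make the two slower layers concrete, here is a rough sketch in plain Python. It is not Cuecoder's implementation: the function names, the trace shape, the step limit, and the 0.05 tolerance are all assumptions made for illustration, and the scores are invented numbers.

```python
from statistics import mean


def trajectory_ok(trace, required_tool, max_steps=8):
    """Trajectory eval: check properties of a whole multi-step trace.

    `trace` is assumed to be a list of dicts like {"tool": "search", "ok": True}.
    """
    tools = [step["tool"] for step in trace]
    return (
        len(trace) <= max_steps                  # the agent did not wander
        and required_tool in tools               # it used the tool it needed
        and all(step["ok"] for step in trace)    # no step failed
    )


def drift_alert(baseline_scores, recent_scores, tolerance=0.05):
    """Drift eval: flag when the recent mean score drops more than
    `tolerance` below the historical baseline mean."""
    if not baseline_scores or not recent_scores:
        raise ValueError("both score lists must be non-empty")
    return mean(baseline_scores) - mean(recent_scores) > tolerance


# Illustrative data only, not real measurements.
trace = [{"tool": "lookup_plan", "ok": True}, {"tool": "answer", "ok": True}]
baseline = [0.92, 0.90, 0.94, 0.91]
this_week = [0.85, 0.80, 0.83, 0.84]

print(trajectory_ok(trace, "lookup_plan"))
print(drift_alert(baseline, this_week))
```

Both checks return a plain boolean, so the same functions can sit behind a nightly job or a continuous monitor without changes.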
What a unit eval looks like
    from cue import eval, score

    @eval("pricing-agent.classify")
    def test_classify_simple(model):
        out = model("How much is the enterprise plan?")
        assert score.contains(out, "enterprise")    # mentions the right plan
        assert score.no_hallucinated_price(out)     # does not invent a price
        assert score.latency(out) < 800             # fast enough (budget of 800, unit not stated)
Notice there’s no exact-string assertion. The assertions are properties of the output — it mentions the right plan, it doesn’t invent a price, it’s fast enough. The eval encodes what “correct” means in a way that survives prompt edits and model swaps.
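The `score` helpers themselves are not shown above, so here is one hypothetical way property checks like these could be written in plain Python. The function bodies, the regex, and the `KNOWN_PRICES` set are assumptions for illustration, not the real `cue` API.

```python
import re

# Hypothetical price list; a real suite would load this from the pricing source of truth.
KNOWN_PRICES = {"$49", "$199"}

# Matches dollar amounts such as $49, $1,299 or $19.99.
PRICE_PATTERN = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?")


def contains(output: str, needle: str) -> bool:
    """Case-insensitive substring check."""
    return needle.lower() in output.lower()


def no_hallucinated_price(output: str) -> bool:
    """True when every dollar amount in the output is a known price.

    An output that quotes no price at all also passes.
    """
    return all(price in KNOWN_PRICES for price in PRICE_PATTERN.findall(output))


good = "The Enterprise plan is $199, billed per seat."
bad = "Enterprise costs $1,299 a month."
print(contains(good, "enterprise"), no_hallucinated_price(good))
print(no_hallucinated_price(bad))
```

Neither check depends on the exact wording of the answer, which is why checks of this kind keep working after a prompt edit or a model swap.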
Where this gets you
Once the eval suite is the product, the workflow inverts. You don’t ship a prompt change to see if it feels better. You ship a prompt change to see what the eval delta is. The eval delta is the answer. Everything else is interpretation.
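An eval delta can be as simple as a per-eval difference in pass rate between the current prompt and the candidate. The sketch below assumes pass rates are already computed and stored as dicts; the eval names and numbers are made up.

```python
def eval_delta(baseline: dict, candidate: dict) -> dict:
    """Per-eval change in pass rate (candidate minus baseline).

    Only evals present in both runs are compared.
    """
    shared = baseline.keys() & candidate.keys()
    return {
        name: round(candidate[name] - baseline[name], 4)
        for name in sorted(shared)
    }


# Illustrative pass rates, not real results.
baseline = {"classify": 0.90, "refund_policy": 0.75}
candidate = {"classify": 0.88, "refund_policy": 0.85}

delta = eval_delta(baseline, candidate)
regressions = [name for name, change in delta.items() if change < 0]

print(delta)
print("regressions:", regressions)
```

Listing regressions separately keeps a net improvement from hiding the evals that got worse.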
Cuecoder is building Axon precisely to make this loop tighter for other teams. But you don’t need Axon to start — you need a one-page Python file, a CSV of inputs, and 30 minutes of discipline. Do it before you ship. Do it especially when you don’t want to.
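For a sense of scale, the one-page file might look something like this. It is a minimal sketch using only the standard library: `fake_model` stands in for a real model call, the CSV columns `input` and `must_mention` are invented, and the CSV is inlined here so the example runs on its own.

```python
import csv
import io


def run_suite(model, rows, checks):
    """Run every check on the model output for each row.

    Returns the fraction of rows where all checks passed.
    """
    if not rows:
        raise ValueError("no input rows")
    passed = 0
    for row in rows:
        output = model(row["input"])
        if all(check(output, row) for check in checks):
            passed += 1
    return passed / len(rows)


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call."""
    return "Enterprise pricing: please contact sales."


def mentions_expected(output: str, row: dict) -> bool:
    """Property check: the output mentions what the row says it must."""
    return row["must_mention"].lower() in output.lower()


# In practice this would be open("inputs.csv", newline="").
SAMPLE_CSV = (
    "input,must_mention\n"
    "How much is the enterprise plan?,enterprise\n"
    "Is there a free tier?,free\n"
)
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

pass_rate = run_suite(fake_model, rows, [mentions_expected])
print(f"pass rate: {pass_rate:.0%}")
```

Swapping `fake_model` for a real client call and the inline string for a file on disk is most of the remaining work.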