TL;DR

  • A firm cannot get better than its evaluator. The eval is the ceiling.
  • What is on trial is the firm as a system. Deming's old fight, now in silicon.
  • Tests check known answers. Evals navigate gradients no one can predict in advance.
  • In the judge's chair, the cost economics flip. The next horizon is letting reality itself grade the run.

Good firms have good performance evaluation systems. You know the drill. Annual reviews, quarterly check-ins, OKRs, values, missions, 360 feedback, calibration meetings, and at least one off-site to align the bar.

So, naturally, I had to add this to firmd. My agents now enjoy a nice off-site in the Alps... Just kidding. (Sort of.)

A brief literature excursion. Bear with me ;-)

W. Edwards Deming would hate perf evals. He spent decades arguing that the annual performance appraisal is one of management's deadly diseases, because "the system that people work in and the interaction with people may account for 90 or 95 percent of performance." Judge the individual, mismeasure the system, every time.

Peter Drucker would push back here. He built MBO on sharp targets per role, a discipline an agentic firm cannot skip, or nothing converges. Deming's reply was that targets set on individuals inside a broken system measure the system through the individual.

What the eval actually grades is what emerges from both - behaviour at the system level that neither sharp objectives nor capable agents can produce alone.

Judging from my own leadership experience, both have a point. The Drucker half - how to set sharp role objectives well in an agentic firm - is another article. The Deming half is what I am wrestling with here, which is why the eval I am building for firmd grades the firm as a system. The individual agents (their LLM brains, their role prompts) are easy to swap; the system around them - how phases are bounded, how roles are scoped, which signals the orchestrator listens to - is what either compounds value or quietly leaks it.

The serious labs hire an armada of humans to grade outputs and feed the model. I have one human (me) and a lot of unsupervised learning. The question is how one person can run a credible performance review on a firm - albeit an agentic one in its startup embryo stage - without going broke (token$) or going crazy.

I have spent the last few days leveling up firmd's eval system within those constraints. It is, finally, starting to feel like a first-class discipline of firmd development rather than a bugfix ritual.

Is it a test, or is it an eval?

The question sounds pedantic, but it is essential. The difference is the difference between knowing that your code is correct and knowing that your firm is any good.

A test verifies that a piece of code behaves to specification. The author knows the answer in advance, encodes it as an assertion, and the system reports green or red. The same input always produces the same output. Tests are how you keep yesterday's bugs from coming back tomorrow. Determinism is bliss.

An evaluation is something else. It measures behaviour that is designed to vary (adieu, determinism, you will be missed...), on a spectrum that is only partially defined. The "right answer" is a shape: relevance, coherence, action bias, role discipline, strategic clarity, delivered value. These are my current performance eval criteria; they might (will) change. No single assertion captures any of them. The evaluator forms a view across multiple dimensions, and can legitimately disagree with another evaluator looking at the very same run.

Tests fail loudly and discretely. Evals reveal gradients.

My first eval setup was a test in eval's clothing. A run counted as green if the agentic crew reached the Synthesis phase of a strategy mission within the budgeted constraints (today: the number of agent "meetings" in the chat system and the number of speaking turns; soon: token budgets per mission phase). That told me a useful thing - that the mission moved forward at all - and almost nothing about whether the synthesis was any good. Was a green run on a mediocre synthesis better than a red run on a brilliant one that took 10 more turns and a judge nudge? I had to develop an opinion, a gradient opened up in my brain, and the test gradually evolved into an eval.
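
To make the dilemma concrete, here is a minimal sketch in Python. Every name in it (Run, reached_synthesis_within_budget, the fields, the budget numbers, the quality scores) is a hypothetical illustration, not firmd's actual code. The gate returns one boolean, so it has no way to rank a mediocre green run against a brilliant red one.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """Hypothetical run record; the field names are assumptions."""
    reached_synthesis: bool
    meetings: int
    turns: int
    synthesis_quality: float  # 0.0-1.0, a judge's opinion, not a measurement


def reached_synthesis_within_budget(run: Run, max_meetings: int, max_turns: int) -> bool:
    """The 'test in eval's clothing': green or red, with no notion of quality."""
    return (run.reached_synthesis
            and run.meetings <= max_meetings
            and run.turns <= max_turns)


# Two invented runs: one stays in budget, the other needs 10 more turns.
mediocre = Run(reached_synthesis=True, meetings=4, turns=40, synthesis_quality=0.4)
brilliant = Run(reached_synthesis=True, meetings=4, turns=50, synthesis_quality=0.9)

green = reached_synthesis_within_budget(mediocre, max_meetings=5, max_turns=40)
red = reached_synthesis_within_budget(brilliant, max_meetings=5, max_turns=40)
```

The gate passes the mediocre run and fails the brilliant one. The information needed to disagree with that verdict lives in a field the gate never reads.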

The promotion from test to eval needed four things that a green/red script could never give me:

  1. A scorecard with multiple dimensions
  2. One or more judges with opinions
  3. A way for them to coexist without overwriting each other
  4. A UX I could actually sit with for ten minutes a day instead of scanning logs and JSON reports.
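
The first three items can be sketched as a small data structure. This is an illustration under assumptions, not firmd's schema: the class names, the judge ids, and the 1-5 scale are all hypothetical; only the dimension names come from the criteria listed above. Verdicts are keyed by judge, so recording one opinion cannot overwrite another.

```python
from dataclasses import dataclass, field

# Dimension names follow the criteria named in the text; the 1-5 scale is an assumption.
DIMENSIONS = ("relevance", "coherence", "action_bias",
              "role_discipline", "strategic_clarity", "delivered_value")


@dataclass(frozen=True)
class Verdict:
    judge_id: str   # hypothetical ids such as "human" or "llm-judge-1"
    scores: dict    # dimension name -> score
    note: str = ""


@dataclass
class Scorecard:
    run_id: str
    verdicts: dict = field(default_factory=dict)  # judge_id -> Verdict

    def record(self, verdict: Verdict) -> None:
        unknown = set(verdict.scores) - set(DIMENSIONS)
        if unknown:
            raise ValueError(f"unknown dimensions: {sorted(unknown)}")
        # Keyed by judge: a new verdict only replaces that same judge's earlier one.
        self.verdicts[verdict.judge_id] = verdict

    def disagreement(self, a: str, b: str) -> dict:
        """Per-dimension difference (a minus b) on dimensions both judges scored."""
        sa, sb = self.verdicts[a].scores, self.verdicts[b].scores
        return {d: sa[d] - sb[d] for d in DIMENSIONS if d in sa and d in sb}


card = Scorecard("run-001")
card.record(Verdict("human", {"relevance": 4, "action_bias": 2}))
card.record(Verdict("llm-judge-1", {"relevance": 5, "action_bias": 4}))
gaps = card.disagreement("human", "llm-judge-1")
```

Keeping the disagreement as a first-class query, rather than averaging the judges into one number, is the design choice that lets two opinions coexist.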

Why multiple judges? Today I run with two judges - me, the only human in the house, and one LLM judge. The plan is to add more LLM judges over time, each pointed at the same scorecard. Whichever model's taste consistently lines up with mine earns the chief-judge seat by track record. The others may be allowed to stay on the panel (if token budget allows) as second opinions, and a model that disagrees with me in a useful direction is more valuable than one that quietly nods along.

[Screenshot: LLM-scored vs human-scored]
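
"Earning the seat by track record" needs some measure of closeness to the human scores. One simple candidate, offered as an assumption rather than as firmd's method, is the mean absolute score gap across runs and dimensions. The function names, judge names, and scores below are invented for illustration.

```python
from statistics import mean


def mean_abs_gap(human: list[dict], judge: list[dict]) -> float:
    """Average absolute score gap over all runs and shared dimensions."""
    gaps = [abs(h[d] - j[d])
            for h, j in zip(human, judge)
            for d in h if d in j]
    return mean(gaps)


def rank_judges(human: list[dict], panel: dict[str, list[dict]]) -> list[tuple[str, float]]:
    """Sort judges by closeness to the human scores, closest first."""
    scored = [(name, mean_abs_gap(human, scores)) for name, scores in panel.items()]
    return sorted(scored, key=lambda pair: pair[1])


# Two invented runs, scored by the human and by two hypothetical LLM judges.
human = [{"relevance": 4, "action_bias": 2}, {"relevance": 3, "action_bias": 3}]
panel = {
    "judge_a": [{"relevance": 5, "action_bias": 4}, {"relevance": 4, "action_bias": 4}],
    "judge_b": [{"relevance": 4, "action_bias": 3}, {"relevance": 3, "action_bias": 2}],
}
ranking = rank_judges(human, panel)
```

A single gap number only captures agreement. It says nothing about whether a judge disagrees in a useful direction, so it is a reason to look closer at a judge, not a reason to drop one from the panel.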

The scorecard

Once you accept that quality is multi-dimensional, the centre of gravity of the eval system moves from the test runner to the scorecard.

The eval has to capture two different kinds of statement, and they live in different places.

The cost/benefit ratio of the run - what it consumed in tokens and turns, what it produced in deliverables - answers itself from the run record and telemetry. Tokens are easy to count.

The scorecard is the harder half: the judgement calls about how well the firm actually did the work. And judgement is hard.

The radar chart is a small thing visually, and a big thing operationally.

One glance, a handful of axes, and you can tell whether the firm was strong on convergence and weak on action bias, or whether tactical shippability fell off a cliff in the second cycle. The shape tells the story before the numbers do. Stack ten runs as a radar heatmap and you start to see what your firm is consistently good at, consistently bad at, and where the variance is hiding.

And the variance, almost always, lives in the system. That is the Deming move applied to silicon: when the shape is wrong, change the system the agents are working in.

[Screenshot: Scorecard with radar chart]
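
Stacking runs amounts to asking, per axis, what the typical score is and how much it moves. A minimal sketch, assuming each run is reduced to one score per dimension; the function name and the numbers are made up for illustration.

```python
from statistics import mean, pstdev


def profile(runs: list[dict]) -> dict:
    """Per-dimension (mean, population standard deviation) across runs."""
    dims = runs[0].keys()
    return {d: (mean([r[d] for r in runs]), pstdev([r[d] for r in runs]))
            for d in dims}


# Three invented runs: steady on convergence, erratic on action bias.
runs = [
    {"convergence": 4, "action_bias": 1},
    {"convergence": 4, "action_bias": 5},
    {"convergence": 4, "action_bias": 3},
]
stats = profile(runs)

# The axis with the widest spread is where the variance is hiding.
widest = max(stats, key=lambda d: stats[d][1])
```

The mean says what the firm is consistently good or bad at; the spread points at the axis where the system, not the individual agent, deserves a second look.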

Frontier judges, weaker firms

The cost economics of eval flip the production intuition in an interesting way.

The initial intuition might be to throw the best and most expensive frontier model at the production problem. Actually, I am moving towards using the cheapest model that still does the job. In some cases firmd's discourse runs happily on a mid-tier open model and has even produced wow-moments on my Mac Mini; that is the bet. In eval, you do the opposite. You let a frontier model grade the cheap one.

The reason might be obvious to some, but it is still subtle: a model cannot reliably grade itself unaided. As a grader it has the same blind spots it had as an author, and it carries a measurable preference for its own outputs - the LLM-as-judge literature keeps rediscovering this self-preference effect, e.g. in "LLM Evaluators Recognize and Favor Their Own Generations". The bias bends a bit when you ask the same model to step into a different lens - the strict reviewer, the skeptical customer, the auditor. A panel of differently-prompted lenses, which is what firmd's judges already are, mitigates it further. But stronger models still align better with human preference than weaker ones, even after the lens trick.

So the cheapest path to a credible verdict is asymmetric: a smarter judge with a sharper lens, applied to the output of the model that did the actual production work. A weaker model grading a peer is the worst case; it confidently rubber-stamps mistakes it would have made itself. Senior reviewer for the junior team. That is my current default, and I plan to stress-test it.

BUT: weaker models pointed at a single, narrow dimension - quasi one-dimensional experts looking only at, e.g., action bias, or only at role discipline - might match a frontier model on that one sharp lens. This is to be evaluated, but it is not my current working hypothesis.

My bet is that a token invested in evaluation is worth more than a token invested in discourse, because it changes the prompt, the criteria, the gate placement, and the model selection that will shape the next hundred missions. The leverage is multiplicative.

Human in the seat, LLM at the scale

The natural assumption is that the judge is human. In firmd, I start from the opposite default: the judge is an LLM, and humans are needed for a sharper, narrower job.

A human grading a few runs by hand is irreplaceable. It captures taste no prompt has yet articulated. It picks up tone, ownership, the moments where the firm sounded confident without being right. It is also slow and biased toward what I happened to read last (not unlike LLMs - but in contrast to LLMs and Leonard, I cannot forget). And I cannot scale.

The loop that works for me is staged. I grade by hand, write down what I noticed, distil it into the judge's heuristic prompt, let the LLM judge handle the volume, and spot-check its verdict against my own on the next batch. When the LLM judge agrees with me consistently, I trust it for that dimension. When it does not, either the criteria are too vague or my own judgement is unstable - and both are useful to know.
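
The spot-check step can be sketched as a per-dimension agreement rate. The tolerance of one point and the 80% threshold are assumptions picked for illustration, as are the function name and the scores; the text does not say what "consistently" means in numbers.

```python
def trusted_dimensions(human: list[dict], llm: list[dict],
                       tolerance: int = 1, min_agreement: float = 0.8) -> dict:
    """Map each dimension to True when the LLM judge lands within `tolerance`
    of the human score on at least `min_agreement` of the spot-checked runs."""
    result = {}
    for dim in human[0]:
        hits = [abs(h[dim] - m[dim]) <= tolerance for h, m in zip(human, llm)]
        result[dim] = sum(hits) / len(hits) >= min_agreement
    return result


# Five invented spot-checked runs, scored on an assumed 1-5 scale.
human = [{"relevance": 4, "action_bias": 2}, {"relevance": 3, "action_bias": 3},
         {"relevance": 5, "action_bias": 1}, {"relevance": 4, "action_bias": 4},
         {"relevance": 2, "action_bias": 2}]
llm = [{"relevance": 4, "action_bias": 4}, {"relevance": 4, "action_bias": 5},
       {"relevance": 5, "action_bias": 3}, {"relevance": 3, "action_bias": 4},
       {"relevance": 2, "action_bias": 2}]

trust = trusted_dimensions(human, llm)
```

Here the LLM judge would be trusted on relevance and sent back for prompt work on action bias. A False does not say who is wrong: it flags either vague criteria or an unstable human.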

Crucially, the LLM verdict never overwrites the human one. They sit side by side on the same run, as independent opinions, the way two managers in a calibration meeting hold different views of the same employee. The disagreement is signal, not noise.

The heuristic prompts are configurable on purpose.

[Screenshot: Configurable judge prompts per dimension]

The criteria are not final, and I do not want them to be. Configurable prompts give me a way to sharpen the lens as I learn what "good" actually means in this firm. And there's no one size fits all - this can and should be customized by firmd's tenants. The first version of a judge prompt is almost always too generous.
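
One way to picture per-tenant configurability is a set of default prompts with tenant overrides layered on top. This is a sketch under assumptions: the prompt wording, the function name, and the rule that tenants may only override known dimensions are all invented for illustration.

```python
# Hypothetical default judge prompts, one per dimension (wording is invented).
DEFAULT_JUDGE_PROMPTS = {
    "action_bias": "Score 1-5: does the output commit to concrete next steps?",
    "role_discipline": "Score 1-5: did each agent stay within its role?",
}


def prompts_for_tenant(overrides: dict) -> dict:
    """Return the default prompts with a tenant's overrides applied on top."""
    unknown = set(overrides) - set(DEFAULT_JUDGE_PROMPTS)
    if unknown:
        # Assumption: a typo in a dimension name should fail loudly.
        raise ValueError(f"unknown dimensions: {sorted(unknown)}")
    return {**DEFAULT_JUDGE_PROMPTS, **overrides}


# A tenant tightens one lens and inherits the rest.
strict = prompts_for_tenant({
    "action_bias": "Score 1-5. Give 3 or less unless each next step has an owner.",
})
```

Because the merge returns a new dict, sharpening one tenant's lens leaves the defaults, and every other tenant, untouched.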

Done is not done

Definition of done is not solved. Annual reviews in human firms have the same problem in slower motion: a deliberate cadence and a fairly arbitrary point in time at once. The calendar year is convenient; reality rarely cooperates. A strategy can spawn multiple tactics cycles, each of which can spawn multiple deliveries. A bet is not really closed until reality has proved or falsified it, and reality runs on its own clock (I am thinking about how to simulate reality for firmd, ... so expect realityd.ai coming soon ;-)).

Eval needs a clean start and a clean end so the scorecard can settle. Right now I call a run "done" when the signal has come back up to the strategy flight level for consideration - the full plan-do-learn-act loop has closed at least once. That is a defensible boundary, but it is not the same as the bet being settled. A firm that ships and observes for a quarter has a very different evidence base than one that ships and walks away. The eval cannot pretend otherwise. So this one is still open.

Why bother

Human organizations have been wrestling with performance evaluation since at least the 1920s and 30s (General Electric), with mixed results. For an agentic firm the question shows up every night, because every night an eval run drains real tokens and produces real evidence. There is no off-site to calibrate the bar in.

The bet behind investing in eval as a discipline is simple: the firm cannot get better than its evaluator. And the evaluator's job, as Deming would insist, is to grade the system the agents are operating in.

The next horizon is closing the loop with reality. Did I or an LLM score a run as good when it produced no outcome? Did I rate a run as bad when it quietly worked in the end? Business telemetry, fed back into the scorecard, is the only honest calibration of the judge. And that probably leads to continuous performance evaluation... That is the next level.
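
The two questions above are the off-diagonal cells of a small confusion table between judge verdict and real outcome. A minimal sketch, assuming each run can eventually be reduced to a pair of booleans; the labels, the function name, and the data are hypothetical.

```python
from collections import Counter

# (judge_said_good, outcome_was_good) -> label; names are invented for illustration.
LABELS = {
    (True, True): "good_and_worked",
    (True, False): "good_but_no_outcome",
    (False, True): "bad_but_worked",
    (False, False): "bad_and_failed",
}


def calibration(records: list[tuple[bool, bool]]) -> Counter:
    """Count how often the judge's verdict matched what reality delivered."""
    return Counter(LABELS[pair] for pair in records)


# Four invented runs, each with a verdict and a later business outcome.
records = [(True, True), (True, False), (False, True), (True, True)]
table = calibration(records)
```

The "good_but_no_outcome" and "bad_but_worked" counts are the judge's misses. Tracked over time, they are the calibration signal that no amount of judge-to-judge agreement can supply.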

The agents will keep me honest in the meantime. (They are getting good at it.)