What does a good judgement actually need?

When I started building firmd, I expected governance to be the tricky part, and the part that makes the difference, as it is the case with real firms. Two companies with the same people, the same product and the same market can perform very differently depending on how decisions get made, challenged, and closed. I expected the same to be true for a firm run by agents.

Two ideas turned out to matter more than I expected (not only for my bill). They are the same two ideas every human organization runs on.

Deterministic moderation

The first one looks boring and turns out to be the foundation everything else stands on.

A lot of what a firm does should never be re-argued. Whether a meeting can start, whether a phase can advance when nobody has spoken yet, or whether a plan is allowed into delivery without acceptance criteria, these are settled rules rather than opinions to be weighed afresh each time.

In firmd, those rules live in a deterministic moderator that runs without an LLM and without probabilities, leaving no room for the model to improvise differently from one day to the next. If a participant has not acknowledged the strategic intent, the next phase is blocked, and if a delivery plan is missing telemetry, the gate fails, so the same input always produces the same output.

This is the kind of plumbing that makes the rest of the system trustworthy, and human organizations get it right without thinking about it. The fire alarm does not debate whether to go off. Payroll does not negotiate the 25th of the month. Years ago I introduced process automation into corporate governance, and the word I kept reaching for was the German "Handlungssicherheit" (try to translate that!). The point was less about saving time through automation and more about the quiet confidence that the right things happen at the right moments, without anyone having to check. A deterministic process takes care... that trust is what deterministic moderation buys.

And here is the part I underestimated: the secret sauce is neither the hard rules nor the soft judgement. It is the balance between them. Too many rules and the firm is rigid and cannot think. Too few and it is a chatroom with "good intentions" (although, frankly, intention is the one thing i believe AI will never have - anyway, let's not drift...). Most of the craft is deciding which decisions are settled law and which ones still deserve an opinion.

Which brings us to the decisions that are not rules.

Epistemic judgement

Did the agentic team actually converge, or are they just tired? Is this plan tight enough to ship, or are we hiding scope behind a confident tone? Is everyone staying in their professional lane, or has the engineer started designing the pricing model again? You cannot write a regex for that. You need judgement.

I did not design this from a whiteboard. It came out of watching evaluation runs, where three things kept happening. First, the discussion often would not converge; it circled, and nobody called the end. Second, when it did converge, the conclusion was frequently not actionable; the favourite non-decision was "we need more research", all agents were nodding ... great, we have converged ... to do nothing. Which, btw, is fair enough and I accept if humans decide to do nothing. I don't like if my agentic firm comes to the same conclusion ;) Third, agents drifted out of their roles, with the engineer doing pricing and the strategist redesigning the database. A key design principle of firmd is diversity and adversity of opinions. I am betting that anchoring and priming agents on certain lenses improves the outcome. Not unlike human organizations that outperform if there is diversity of thought (and inclusion).

So in firmd, that work of epistemic judgement is done by judge agents, each looking at the discourse through a single lens: a convergence judge, an action-bias judge, a role-boundaries judge. None of them owns the answer. Together they nudge the discourse the way a good leadership team does, applying pressure from different angles so that no single voice carries the whole truth.

This is the epistemic layer of the discourse engine. It is probabilistic, opinionated, and explicitly fallible. Which is exactly why it sits next to the deterministic ("moderator") one. Determinism for the bones, judgement for the muscle.

When to judge?

Judges are powerful, and like any senior reviewer their attention (and their token cost) is worth spending deliberately. Not every phase of a discussion benefits from every lens. Early exploration is probably where convergence judge is probably less needed (design thinking, opening up...), while a role-boundaries judge tends to earn its keep to ensure diversity of thought. In later rounds, convergence is more important, as is action bias (see below).

So firmd lets a tenant admin decide, per judge, when it shows up. Each lens can be switched on or off, scoped to the flight levels where it belongs (strategy, tactics), and limited to the phases where its question is worth asking (exploration, refinement, planning, and so on). It is the same instinct behind a good operating rhythm, which sends the right reviewers to the right stage of work instead of pulling every stakeholder into every meeting.

What does a judge actually need to see?

The next question follows naturally, and it follows from the AI bill ;-). In the eval runs, the judges were eating 30-50% of all the tokens the discourse spent. Judgement was the single most expensive thing the firm did. So: if a judge is going to render a useful opinion, how much of the world does it actually need to see?

The lazy answer I started with was everything. Every turn, every artifact, every memo, every old disagreement, every previous judge call. Let the model sort it out. I thought i was a safe bet as this was not about overloading the substantial discourse about the next feature to build. I did not want to overload that discourse with too much context (I did...), so my point was: how much can it hurt, let the judges see all the world to make their call.

That answer was expensive, and it may also be the worse one. A judge buried in context tends to lose the thread, rereading discussions it has already read instead of focusing on the question in front of it.

Here is where a courtroom analogy started helping me. A judge hearing a dispute over, let's take a simple case, a late delivery does not call for every email the two companies ever exchanged. They ask for the contract, the agreed dates, the shipping records, and the messages around the moment things went wrong. The judge needs relevant evidence: material selected without bias, structured, and contextualized for the specific question being asked. Relevance is the opposite of cherry-picking, where evidence is chosen for the conclusion it supports rather than for its bearing on the question. A different question draws a different evidence file from the very same history.

The same is true for firmd's judges. A convergence judge does not need the full philosophical debate about the market; it needs the strategic intent, the current turns, and whether anyone has actually closed the loop. A role-boundaries judge, on the other hand, genuinely needs the raw (maybe sampled) participant turns, because drift shows up in the wording. Different lens, different evidence.

Managers do this every day

Leadership reporting has the same shape. If a manager asks "are you on track?", the worst possible answer is a raw dump of every Slack thread, every commit, every doubt that crossed someone's mind that week. That is not transparency, it is overload.

But the interesting half of the parallel is not the report, it is the reader. A good manager does not just receive a status update; they read between its lines. They infer whether the confidence is earned, whether "almost done" really means actually done or the famous last 5% that will take 95% of time, whether the silence on a risk is itself the risk. That reading between the lines, probing, interpolation is the actual judgement work.

That is exactly what a judge has to do. To assess convergence, it has to sense whether a discussion reached a real conclusion or simply ran out of energy. To assess action-bias, it has to tell a genuine plan from sophisticated procrastination. To assess role-boundaries, it has to notice tone and ownership, not just keywords. Give it too little and it cannot interpolate. Give it everything and it drowns. The real question is what minimum set of signal lets a judge read between the lines reliably. But, the nice thing about this is: it can be inferred from words.

Evidence packs

The design bet is simple: less can be more if the selection is deterministic, inspectable, and measured against the full-context baseline.

What I did not want was another LLM call to summarize the discussion before the judge sees it. That just moves the cost around and adds one more lossy interpretation. If the summary is wrong, the judge is now judging the summary, not the firm (reminds me of what is going on in large orgaization. A C-level report or board book passes so many layers of aggregation and interpretation that it drifts from reality. This is why good boards know how to find their evidence...)

Instead of layering judgement calls, each judge lens gets a structured evidence pack built from known sources by deterministic rules: the artifacts, the current phase turns, the relevant tool calls, the protocol gate results, the prior feedback for that lens, and a bounded amount of older context where it genuinely helps. No semantic search pretending to understand the whole discussion. Just the rules. A court works the same way: the clerk hands the judge evidence, not a verdict. Interpretation is the judge's job, and you do not want it quietly done upstream.

And every judge call leaves a trace of which evidence mode it used, which sources it saw, how many turns were included or omitted, and whether it had to fall back to full context. Token economy without observability is just hope with a smaller bill.

Why this matters

Agentic systems need governance: roles, gates, reviews, escalation paths, institutional memory. But if every governance step requires the whole world in context, the organization becomes too expensive to think.

Human organizations solved this a long time ago, imperfectly (which makes me hope it can be solved by LLMs, too) but usefully. Courts use admissible evidence, managers use status reports, boards use board packs, and product reviews lean on metrics, customer stories, decisions, and risks. Nobody serious asks every reviewer to reread the entire company history before making every call.

The craft is in selecting the right evidence while preserving enough raw truth to challenge the conclusion.

That is the bet behind adding judge evidence packs in firmd. Not that less context is always better, but that better-selected context can be cheaper, clearer, and sometimes more reliable than dumping everything into the model and hoping attention finds the right thread.

The next evals will tell whether the bet holds. My agents will keep me honest ;)