How we know the firm is working

Evaluation

A firm of agents that does not measure itself drifts into theatre. Evaluation is the discipline that keeps the rest of firmd honest — the place every recent rebalance and every protocol gate came from.

What it is

Evaluation is a per-mission report, generated via API after the fact, that opens up what the firm did and asks questions: did the system run, did it deliberate well, did it produce something useful, is it learning across missions, and how much did it cost in tokens?

The pilot tenant is debunkd.social — a real product brief that lets us exercise Strategy, Tactics, and Delivery end to end without inventing a synthetic one.
Most runs use either a local Ollama model, `qwen3.5:27b`, or Ollama with `qwen3.5:397b-cloud`; frontier models are reserved for interpreting eval reports.

Why it exists

firmd uses a mix of deterministic and epistemic measures to steer both the discourse and the work. We need evals to measure whether the system is increasing the number of "successful bets".

How it works today

Four layers per report

Each report covers four layers: operational reliability (latency, retries, token spend), deliberation quality (challenge density, convergence, signal loss in compaction), outcome quality (clarity, completeness, execution-readiness of artifacts), and longitudinal learning (whether repeated missions on the same tenant get better).
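To make the four layers concrete, here is a minimal sketch of how one report could be shaped as data. Every class name, field name, unit, and value below is a hypothetical chosen for illustration from the layer descriptions above; it is not firmd's actual report schema.

```python
from dataclasses import dataclass, asdict
from typing import Optional


# All names and units here are assumptions, not firmd's real schema.
@dataclass
class OperationalReliability:
    latency_ms_p95: float
    retries: int
    tokens_spent: int


@dataclass
class DeliberationQuality:
    challenge_density: float       # challenges per deliberation turn
    converged: bool
    compaction_signal_loss: float  # 0.0 = nothing lost, 1.0 = everything lost


@dataclass
class OutcomeQuality:
    clarity: float
    completeness: float
    execution_readiness: float


@dataclass
class LongitudinalLearning:
    tenant: str
    prior_missions: int
    outcome_delta_vs_previous: Optional[float]  # None on a tenant's first mission


@dataclass
class EvalReport:
    mission_id: str
    operational: OperationalReliability
    deliberation: DeliberationQuality
    outcome: OutcomeQuality
    longitudinal: LongitudinalLearning

    def flat(self) -> dict:
        """Flatten the four layers into 'layer.metric' keys for comparison."""
        out = {}
        for layer, metrics in asdict(self).items():
            if isinstance(metrics, dict):  # skips scalar fields such as mission_id
                for name, value in metrics.items():
                    out[f"{layer}.{name}"] = value
        return out


# Illustrative values only; these are not measured results.
report = EvalReport(
    mission_id="mission-001",
    operational=OperationalReliability(latency_ms_p95=840.0, retries=2, tokens_spent=15000),
    deliberation=DeliberationQuality(challenge_density=0.4, converged=True, compaction_signal_loss=0.1),
    outcome=OutcomeQuality(clarity=0.8, completeness=0.7, execution_readiness=0.6),
    longitudinal=LongitudinalLearning(tenant="debunkd.social", prior_missions=0, outcome_delta_vs_previous=None),
)
flat = report.flat()
```

Flattening to `layer.metric` keys is one way to keep the layers separate in the report while still allowing a simple key-by-key comparison between missions.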

Telemetry is the source, not opinion

Reports are built from OpenTelemetry signals the firm already emits — judge decisions, moderation events, collaboration edges, token usage. The report is structured enough to compare two runs side by side without re-reading transcripts.
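As a rough illustration of a side-by-side comparison, the sketch below summarizes two runs from already-exported telemetry records and reports the difference per metric. It assumes events arrive as plain dictionaries with a `kind` field and an optional `tokens` field; those field names, the event kinds, and the sample numbers are hypothetical, and no OpenTelemetry SDK calls are used.

```python
from collections import Counter


def summarize(events):
    """Count events by kind and sum token usage across one run."""
    summary = Counter()
    for event in events:
        summary[f"events.{event['kind']}"] += 1
        summary["tokens.total"] += event.get("tokens", 0)
    return dict(summary)


def compare_runs(a, b):
    """Return {metric: (value_a, value_b, value_b - value_a)} over all metrics."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k, 0), b.get(k, 0), b.get(k, 0) - a.get(k, 0)) for k in keys}


# Hypothetical exported records for two runs of the same mission brief.
run_a = [
    {"kind": "judge_decision", "tokens": 1200},
    {"kind": "moderation_event"},
    {"kind": "judge_decision", "tokens": 900},
]
run_b = [
    {"kind": "judge_decision", "tokens": 700},
    {"kind": "moderation_event"},
    {"kind": "moderation_event"},
]

diff = compare_runs(summarize(run_a), summarize(run_b))
for metric, (before, after, delta) in diff.items():
    print(f"{metric:28} {before:>6} {after:>6} {delta:>+7}")
```

In this made-up example the second run leans less on judge decisions and more on moderation events, and spends fewer tokens, which is the kind of shift a reader could spot without opening a transcript.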

Findings cash out as concrete changes

An eval finding does not stay in the report. Recent examples include prompt-economy work to drop repeated input tokens against frontier models, a deterministic protocol gate that blocks the Tactics handoff when the Product Owner has only described a plan in prose, and a cleaner separation between the judge panel and the moderator.
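A deterministic gate of the kind described above could look roughly like the following. This is a sketch under stated assumptions, not firmd's implementation: it assumes the handoff arrives as a string that should parse as a JSON object, and the required keys (`goal`, `tasks`, `acceptance_criteria`) and the function name are hypothetical.

```python
import json

# Hypothetical required fields for a structured plan; firmd's real contract may differ.
REQUIRED_PLAN_KEYS = {"goal", "tasks", "acceptance_criteria"}


def tactics_handoff_gate(message: str):
    """Return (allowed, reason). Allow only a structured plan, never prose."""
    try:
        plan = json.loads(message)
    except json.JSONDecodeError:
        return False, "blocked: handoff is prose, not a structured plan"
    if not isinstance(plan, dict):
        return False, "blocked: plan must be a JSON object"
    missing = REQUIRED_PLAN_KEYS - plan.keys()
    if missing:
        return False, f"blocked: missing fields {sorted(missing)}"
    if not isinstance(plan["tasks"], list) or not plan["tasks"]:
        return False, "blocked: plan has no tasks"
    return True, "ok"


prose = "We will research the audience first and then build a landing page."
structured = json.dumps({
    "goal": "Validate demand for the landing page",
    "tasks": ["draft copy", "ship page", "measure signups"],
    "acceptance_criteria": ["page is live", "signup tracking works"],
})

print(tactics_handoff_gate(prose))
print(tactics_handoff_gate(structured))
```

The check needs no model call, so it costs no tokens and fails the same way every time, which is what makes a gate like this easy to debug.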

The rebalance, in plain language

The deepest lesson so far: when the agents misbehave, the fix is rarely "a smarter judge" — it is usually a deterministic check the moderator should have done. The system has steadily moved load from epistemic judges toward procedural gates. Cheaper, more predictable, easier to debug.

What this is not

This is not a benchmark. firmd is not trying to win a leaderboard. The pilot is one product, the eval set is small by design, and the failure modes that turn up are interesting precisely because we cannot anticipate them. Some of what we ship will work. Some of it will document, in retrospect, a specific way that an agentic firm gets things wrong. Both are useful.

In the product

[Screenshot pending] Token economy: cached vs billable

[Screenshot pending] Judge decisions across a mission
