firmd.ai
Tech note

Implementation substrate

The pieces firmd is built on. One chapter per piece.

Multi-Tenant SaaS

firmd is multi-tenant by construction. Each customer firm gets its own Kubernetes namespace, its own Mattermost pod, its own firmd engine pod, and its own sandbox worker. The shared substrate — Postgres (one database per tenant on a shared cluster), Dolt (versioned MySQL with tenant-id row partitioning), and Directus (a shared CMS pod with per-tenant role-based access) — is isolated at the application layer rather than the infrastructure layer. A tenant-lifecycle control-plane service provisions all of this idempotently from a single API call.
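As a rough illustration of what "idempotent" means for provisioning, here is a minimal sketch. It is not firmd's control plane: `FakeCluster`, `provision_tenant`, and the resource names are hypothetical, and an in-memory set stands in for the real Kubernetes, Postgres, and Directus APIs. The point is only that each step is "ensure it exists", so a retried call changes nothing.

```python
from dataclasses import dataclass, field


@dataclass
class FakeCluster:
    """In-memory stand-in for the real cluster and database APIs (hypothetical)."""
    resources: set[str] = field(default_factory=set)

    def ensure(self, name: str) -> bool:
        """Create the resource if absent; return True only if it was created."""
        if name in self.resources:
            return False
        self.resources.add(name)
        return True


def provision_tenant(cluster: FakeCluster, tenant_id: str) -> list[str]:
    """Ensure every per-tenant resource exists; safe to call repeatedly."""
    wanted = [
        f"namespace/{tenant_id}",
        f"pod/{tenant_id}/mattermost",
        f"pod/{tenant_id}/engine",
        f"pod/{tenant_id}/sandbox-worker",
        f"postgres-db/{tenant_id}",
        f"directus-role/{tenant_id}",
    ]
    # Report only what this call actually created, so a retry reports nothing new.
    return [name for name in wanted if cluster.ensure(name)]


cluster = FakeCluster()
first = provision_tenant(cluster, "acme")
second = provision_tenant(cluster, "acme")
print(len(first), len(second))  # 6 0
```

The second call finds everything already in place and creates nothing, which is the property that makes a single provisioning API call safe to retry after a partial failure.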

Two boundaries are absolute: one firm's data, code, and secrets never reach another firm, and agents doing implementation work cannot see firmd's own source, environment, or secrets.

Containerized

firmd is built on Kubernetes and runs on the major hyperscalers — AWS, GCP, Azure — through standard manifests with no provider-specific tricks.

Versioned Storage

Two persistence layers, on purpose. Dolt — Git-for-data, version-controlled MySQL — holds institutional memory: artefacts, hypotheses, decisions, measured outcomes. Postgres handles operational state: tenant lifecycle, queues, secrets metadata.

Dolt is what makes a firm's decisions branchable. You can wargame a strategy on a side branch without losing the main trajectory, and you can ask "what was our hypothesis on 2026-01-15?" and get a real answer. Postgres handles the noisier transactional load Dolt is not built for.
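The two Dolt operations described above map onto SQL that Dolt exposes: the `DOLT_BRANCH` and `DOLT_CHECKOUT` stored procedures for branching, and `AS OF` for reading a table at a past point in time. The sketch below only builds the statements; the `hypotheses` table name and the helper names are assumptions for illustration, not firmd's schema.

```python
import re
from datetime import date

# Conservative allow-list so a branch name can be embedded in SQL safely.
_BRANCH_RE = re.compile(r"[A-Za-z0-9._/-]+")


def wargame_statements(branch: str) -> list[str]:
    """SQL to create a side branch and switch the current session to it."""
    if not _BRANCH_RE.fullmatch(branch):
        raise ValueError(f"unsafe branch name: {branch!r}")
    return [
        f"CALL DOLT_BRANCH('{branch}')",
        f"CALL DOLT_CHECKOUT('{branch}')",
    ]


def hypotheses_as_of(day: str) -> str:
    """SQL to read the (hypothetical) hypotheses table as it stood on a date."""
    parsed = date.fromisoformat(day)  # raises ValueError on anything but YYYY-MM-DD
    return f"SELECT * FROM hypotheses AS OF TIMESTAMP('{parsed.isoformat()}')"


print(wargame_statements("wargame/price-cut"))
print(hypotheses_as_of("2026-01-15"))
```

Because Dolt speaks the MySQL wire protocol, statements like these can be sent through any MySQL client; work done on the side branch stays off the main branch until it is merged.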

Python

firmd's engine is Python. Agent calls are typed through PydanticAI so participant responses are structured artefacts, not loose strings; phase transitions and turn-taking run through LangGraph as an explicit state machine, persisted at every step.

The result: agentic behaviour that an operator can debug, replay, and reason about the same way they would any other long-running asyncio process.
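To make the two ideas concrete without reproducing either library's API, here is a standard-library sketch: responses are parsed into a typed object instead of being passed around as strings, and phases advance through an explicit transition table with the full state checkpointed after every step. It deliberately does not use PydanticAI or LangGraph; the phase names, the `Response` fields, and the in-memory checkpoint list are all assumptions.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum


class Phase(str, Enum):
    PROPOSE = "propose"
    CRITIQUE = "critique"
    DECIDE = "decide"
    DONE = "done"


# Explicit transition table: every non-terminal phase has exactly one successor.
NEXT = {
    Phase.PROPOSE: Phase.CRITIQUE,
    Phase.CRITIQUE: Phase.DECIDE,
    Phase.DECIDE: Phase.DONE,
}


@dataclass
class Response:
    """Structured participant response (fields are illustrative)."""
    participant: str
    phase: str
    content: str


def parse_response(raw: str) -> Response:
    """Reject loose strings: the payload must be a JSON object with exactly these keys.

    Field types are not checked here; a validation library would do that.
    """
    return Response(**json.loads(raw))  # TypeError on missing or unexpected keys


def run(raw_by_phase: dict[Phase, str]) -> list[str]:
    """Walk the phases in order, checkpointing the full state after each step."""
    checkpoints: list[str] = []
    phase, history = Phase.PROPOSE, []
    while phase is not Phase.DONE:
        history.append(asdict(parse_response(raw_by_phase[phase])))
        phase = NEXT[phase]
        # A real engine would write this to durable storage; here it is a list.
        checkpoints.append(json.dumps({"phase": phase.value, "history": history}))
    return checkpoints


raw = {
    p: json.dumps({"participant": "analyst", "phase": p.value, "content": "draft"})
    for p in NEXT
}
checkpoints = run(raw)
print(len(checkpoints))  # 3
```

Each checkpoint is a complete snapshot, so a run can be inspected or resumed from any step, which is the property that makes replay and debugging tractable.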

LLM

Bring-your-own-key. firmd talks to any OpenAI-compatible endpoint — frontier models from OpenAI, Anthropic, and others; local Ollama on a Mac; or any other OpenAI-API-compatible provider.

A mock-llm mode runs pre-scripted responses so tests stay deterministic and zero-cost. Provider, base URL, and model live in a per-tenant LLM Settings row; switching providers does not require a redeploy. The model-selection boundary keeps the rest of the engine provider-agnostic.
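One way to picture the model-selection boundary is as a single factory that turns a settings row into a callable, so nothing else in the engine needs to know which provider is behind it. The sketch below is an assumption-laden illustration: `LLMSettings`, its field names, and `make_completer` are hypothetical, and the remote branch is left unimplemented rather than guessing at firmd's client code.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class LLMSettings:
    """One per-tenant settings row (field names are assumptions)."""
    provider: str
    base_url: str
    model: str


def make_completer(
    settings: LLMSettings, scripted: Optional[list[str]] = None
) -> Callable[[str], str]:
    """Return a prompt -> text callable; callers see only this interface."""
    if settings.provider == "mock-llm":
        replies = iter(scripted or [])

        def mock(prompt: str) -> str:
            # Pre-scripted replies in order, ignoring the prompt:
            # deterministic and free of network calls.
            return next(replies)

        return mock

    def remote(prompt: str) -> str:
        # Placeholder: a real implementation would POST to
        # f"{settings.base_url}/chat/completions" using settings.model.
        raise NotImplementedError("network call omitted from this sketch")

    return remote


settings = LLMSettings(provider="mock-llm", base_url="", model="scripted")
complete = make_completer(settings, scripted=["first reply", "second reply"])
print(complete("any prompt"))  # first reply
```

Switching a tenant to another provider then amounts to updating the row and building a new callable from it, with no code change on the calling side.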

Evaluation

firmd evaluates itself through per-mission reports built from OpenTelemetry signals the engine already emits. Each report covers four layers: operational reliability, deliberation quality, outcome quality, and longitudinal learning.

The instrumentation is OTel-native. Traces, metrics, and structured business events flow over OTLP into a Grafana LGTM stack — Tempo for traces, Mimir for metrics, Loki for logs and business events. We deliberately did not bolt the Traceloop/OpenLLMetry SDK on top: PydanticAI already provides GenAI-convention instrumentation through OpenTelemetry, and a second SDK would double-count without adding signal.

Most defaults on this site — judge lenses, protocol gates, prompt-economy tweaks — started as findings in one of these reports. Full writeup: Evaluation.
