If you run AI agents in production, you almost certainly have observability. Whether it is Langfuse, Helicone, Arize Phoenix, LangSmith, or Datadog's LLM Observability, you have traces, token counts, latencies, dashboards. You can open any request and see what happened. These are good tools, doing a real job well.
Then a different kind of question arrives. A customer disputes a denied refund. An auditor asks you to show what the agent decided, and on what basis, six months ago. A regulator wants the record of a specific automated decision — one they can check themselves. And your observability stack, for all its detail, has no answer. It can show you. It cannot prove anything.
Two different questions.
Observability answers "what is happening?" Its whole design — traces, spans, sampling, aggregation — is built to monitor a fleet: the p99 latency, the error rate, the cost curve, the regression in aggregate. That is genuinely hard and genuinely useful.
Audit answers a different question: "can you prove what happened?" Not describe it — prove it, to someone who has no reason to take your word. That is not a dashboard feature. It is a different primitive.
Three things a trace can't do.
Hold a single agent decision up against what an auditor, a counterparty, or a court actually needs:
- Reproduce it. Given the same recorded inputs, can you re-run that exact decision and get the same result, byte for byte? A trace records that something happened; it does not let you replay it.
- Show it wasn't altered. The trace lives in a system you control, in mutable storage. Nothing in it proves it hasn't been edited or back-dated since. "Our logs say so" is a claim, not evidence.
- Let someone verify it without trusting you. To check a trace, a third party has to trust your platform and your vendor. The whole point of an audit is that they shouldn't have to.
Observability tells you what your agents are doing in aggregate. It can't account for any single one of them to someone who doesn't trust you.
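The first gap, replay, can be made concrete with a small sketch. It assumes the decision is a deterministic function of its recorded inputs; for an LLM-backed agent that means the model's recorded response is stored as one of the inputs rather than re-sampled. The names `decide_refund`, `capture`, and `replay_matches`, and all field names, are hypothetical and chosen only for illustration.

```python
import hashlib
import json


def canonical_bytes(obj):
    """Serialize to a stable byte string: sorted keys, fixed separators."""
    return json.dumps(
        obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def decide_refund(inputs):
    """Hypothetical decision logic, deterministic given its recorded inputs."""
    if inputs["days_since_purchase"] > inputs["policy_window_days"]:
        return {"approved": False, "reason": "outside_policy_window"}
    if inputs["model_recommendation"] != "approve":
        return {"approved": False, "reason": "model_recommended_denial"}
    return {"approved": True, "reason": "within_policy"}


def capture(inputs):
    """Store the inputs alongside a digest of the output they produced."""
    output = decide_refund(inputs)
    return {
        "inputs": inputs,
        "output": output,
        "output_sha256": hashlib.sha256(canonical_bytes(output)).hexdigest(),
    }


def replay_matches(record):
    """Re-run the decision from the recorded inputs and compare digests."""
    replayed = decide_refund(record["inputs"])
    replayed_digest = hashlib.sha256(canonical_bytes(replayed)).hexdigest()
    return replayed_digest == record["output_sha256"]


record = capture({
    "days_since_purchase": 45,
    "policy_window_days": 30,
    "model_recommendation": "approve",  # recorded model output, not re-sampled
})
print(replay_matches(record))  # True
```

A trace would typically keep the output and perhaps the prompt. What makes replay possible here is that every input the decision depended on is captured in a form the decision function can consume again.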
It's a different shape, not a missing checkbox.
You can't bolt proof onto a tracing tool without it becoming a different product. Observability's data model is the trace; its primitives are spans and metrics; its workload is aggregation. A verifiable audit layer has the opposite shape: its data model is the record of one decision, its primitives are canonical hashes and anchors, its workload is capture, verify, replay. Same raw inputs — your agent's calls — but a different machine.
The object at the centre of that machine is a Verifiable Decision Record: each decision becomes a portable, canonical object with a cryptographic receipt. Recompute its digest and a single altered byte is exposed. Anchor that digest to public infrastructure and its existence in time is provable. Hand it to anyone and they can verify it with maths alone: no account on your platform, no trust in your vendor, none in us. Only the digest is ever published; the payload never leaves your environment.
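The digest-and-verify step can be sketched in a few lines. This is an illustration of the idea, not the Verifiable Decision Record format itself: the field names are hypothetical, and the choice of sorted-key JSON with SHA-256 is an assumption made for the example. Anchoring is left out; the sketch only shows that whoever holds the record and the published digest can check one against the other.

```python
import copy
import hashlib
import json


def digest(record):
    """SHA-256 over a canonical serialization (sorted keys, no extra whitespace)."""
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def verify(record, published_digest):
    """Recompute the digest locally; no access to the original platform is needed."""
    return digest(record) == published_digest


# Hypothetical record of one decision; field names are illustrative only.
record = {
    "decision_id": "refund-0001",
    "inputs": {"order_total": "49.00", "days_since_purchase": 45},
    "output": {"approved": False},
}

published = digest(record)  # only this hex string would ever be published

tampered = copy.deepcopy(record)
tampered["output"]["approved"] = True  # change one field after the fact

print(verify(record, published))    # True
print(verify(tampered, published))  # False
```

Sorting keys makes the digest independent of field order, which is the minimum needed for two parties to arrive at the same bytes. Sorted-key JSON is not a complete canonicalization, though (number formatting is one gap); a scheme such as the JSON Canonicalization Scheme in RFC 8785 exists to close gaps of that kind.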
Use both.
This is not "replace your observability." Keep it — it answers the question it was built for. Determs sits underneath, for the decisions that carry consequences: the ones you'll one day need to reproduce, prove unaltered, and hand to someone who doesn't trust you. Observability is for watching the fleet. A verifiable record is for accounting for the single decision that ends up in a dispute.
It is an open standard with a reference implementation you can use today: wrap your client, capture a record, verify one yourself. If you've ever been asked to prove what an agent did and had only a dashboard to point at, that's the gap this fills.