Logs are not proofs

Your agent did something yesterday at 14:23. A customer is on the phone. They want to know why their refund was denied — or routed to the wrong team, or escalated, or auto-approved. The trace ID is in your logs. You pull it up.

You see the prompt. You see the model name. You see the response. You see, in a structured-enough form, what the agent did. You can describe what happened. You cannot prove it.

If a regulator asks tomorrow whether that decision was reproducible, the honest answer is no. If your prompt has changed since, the honest answer is no. If you replay the same input through the same agent right now, the honest answer is: probably similar, but not the same, and you cannot tell which differences are noise and which are real.

That is the gap Determs sits in.

Observability is not replay. Replay is not proof.

LLM observability tools are everywhere now. Langfuse, Helicone, Arize Phoenix, Braintrust, Datadog's LLM module. They are good tools. They do one thing well: they show you traces — sequences of model calls, latencies, token counts, error rates. They aggregate. They graph. They alert.

None of them give you replay. None of them give you proof.

Replay is the property that, given the same recorded input, you can re-run the same decision and get the same output, byte for byte. Proof is the property that nobody can modify that record between when it happened and when you read it back without the change being detectable. These are different properties from observability, and they require a different primitive.

Logs let you describe an event. Traces let you correlate events. Metrics let you aggregate events. None of them let you reproduce a single event with the rigor of a cryptographic receipt.

"But the LLM is stochastic." Yes. And no.

The standard objection is: deterministic replay of a language model is meaningless, because the model itself is non-deterministic. Same prompt, different output. End of story.

That objection misreads what needs to be deterministic. The model is only one input to your agent. The rest is yours:

the prompt you constructed, including system instructions and few-shot examples
the messages you assembled from history
the tools you exposed
the model name, version, and parameters (temperature, top_p, max_tokens, seed)
the order in which tool calls were resolved
the rules you applied after the model returned

Every one of those is deterministic, or can be made deterministic. The agent as a whole is a pipeline; the LLM is one step. The pipeline can carry a verifiable record around the non-deterministic step.

A deterministic replay layer does not pretend to reproduce the model's output from scratch. It records the actual output that occurred, anchors it to the inputs that produced it, hashes the whole thing, and gives you back a record you can verify and replay independently. The LLM's stochasticity becomes a property of one column in a row, not a property that disqualifies the entire row.
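A minimal sketch of that idea, assuming a stubbed model and a toy routing rule (both hypothetical, neither is part of Determs): the live run calls the stochastic step once and records what came back; the replay never calls the model and re-runs only the deterministic steps on the recorded response.

```python
import random

def stub_model(prompt: str) -> str:
    """Stand-in for a stochastic LLM call (hypothetical, not a real client)."""
    return random.choice(["refund: approve", "refund: deny", "refund: escalate"])

def apply_rules(response: str) -> str:
    """Deterministic post-model rule: map the response text to a routing decision."""
    verdict = response.split(":", 1)[1].strip()
    return {"approve": "auto_approve", "deny": "notify_customer"}.get(verdict, "human_review")

def capture(prompt: str) -> dict:
    """Live run: call the model once and record what it actually returned."""
    response = stub_model(prompt)
    return {"prompt": prompt, "response": response, "decision": apply_rules(response)}

def replay(record: dict) -> str:
    """Replay: skip the model and re-run the deterministic steps on the recorded response."""
    return apply_rules(record["response"])

record = capture("Customer asks for a refund on a late order")
assert replay(record) == record["decision"]
```

Run `capture` twice and you may get two different decisions. Replay either record and you get that record's decision back every time. The randomness lives in the data, not in the replay.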

The primitive: capsule, receipt, replay.

Determs is built on three things.

A capsule is a unit of execution with a typed input and a typed output. For agents, the capsule's input is the full record of an action: the agent identifier, the model, the parameters, the messages, the tools, and the response the model produced. The capsule's output is a structured summary plus three SHA-256 digests:

input_digest — hash of canonical(model + params + input)
output_digest — hash of canonical(model response)
record_digest — hash of canonical(the whole record)
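To make the three digests concrete, here is a rough sketch using only the Python standard library. It assumes that canonical means JSON with sorted keys and compact separators, and that the record uses the field names shown; the actual canonical form and record schema may well differ, and this sketch hashes the record before any digests are attached to it.

```python
import hashlib
import json

def canonical(value) -> bytes:
    """One possible canonical form: sorted keys, no insignificant whitespace, UTF-8."""
    text = json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")

def sha256_hex(value) -> str:
    """Hex-encoded SHA-256 of the canonical bytes."""
    return hashlib.sha256(canonical(value)).hexdigest()

def digests(record: dict) -> dict:
    """Compute the three digests for a record (field names are assumptions)."""
    model_input = {k: record[k] for k in ("model", "params", "input")}
    return {
        "input_digest": sha256_hex(model_input),
        "output_digest": sha256_hex(record["response"]),
        "record_digest": sha256_hex(record),
    }

record = {
    "agent_id": "support-triage",
    "model": "claude-3-5-sonnet-20241022",
    "params": {"max_tokens": 512},
    "input": [{"role": "user", "content": "My order is late"}],
    "response": {"role": "assistant", "content": "Sorry about the delay."},
}

# Key order must not matter: the same record built in a different order hashes the same.
reordered = dict(reversed(list(record.items())))
assert digests(record) == digests(reordered)
```

Canonicalisation is the part that does the work. Without it, two serialisations of the same record would hash differently and the digests would be useless for comparison. Sorted-key JSON is a simplification here; a production format would need to pin down number and string encoding as well.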

The receipt is what comes out of executing a capsule on a record. It binds the capsule version, the input, and the output to a stable identifier. It is reproducible bit for bit: any two executions of the same capsule version on the same input produce the same digests, today and a year from now.

Replay is the operation of running the capsule again on a stored record and comparing. If the digests match, the record has not been tampered with and the logic still produces the same answer. If they diverge, something changed — either the record was modified or the capsule logic has evolved. Both are useful signals; both are observable.
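The two divergence cases can be illustrated with a toy capsule. This is a sketch under stated assumptions, not the engine's logic: `capsule_v1`, `capsule_v2`, the record fields, and the receipt shape are all hypothetical, and the receipt here omits the capsule version that a real one binds.

```python
import hashlib
import json

def digest(value) -> str:
    """SHA-256 over a sorted-key, compact JSON encoding (an assumed canonical form)."""
    blob = json.dumps(value, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

def capsule_v1(record: dict) -> dict:
    """Hypothetical capsule logic: summarise a recorded action."""
    return {"agent": record["agent_id"], "stop_reason": record["response"]["stop_reason"]}

def capsule_v2(record: dict) -> dict:
    """The same capsule after a logic change: the summary gains a field."""
    return {**capsule_v1(record), "n_messages": len(record["messages"])}

def execute(capsule, record: dict) -> dict:
    """Run a capsule on a record and return a receipt binding input to output."""
    return {"record_digest": digest(record), "output_digest": digest(capsule(record))}

def replay(capsule, record: dict, receipt: dict) -> dict:
    """Re-run the capsule and report, per digest, whether it still matches the receipt."""
    fresh = execute(capsule, record)
    return {name: fresh[name] == receipt[name] for name in receipt}

record = {
    "agent_id": "support-triage",
    "messages": [{"role": "user", "content": "My order is late"}],
    "response": {"stop_reason": "end_turn", "text": "Sorry to hear that."},
}
receipt = execute(capsule_v1, record)
tampered = {**record, "agent_id": "billing"}

print(replay(capsule_v1, record, receipt))    # untouched record, same logic
print(replay(capsule_v1, tampered, receipt))  # record modified after capture
print(replay(capsule_v2, record, receipt))    # capsule logic evolved
```

In this sketch the two causes leave different fingerprints. A modified record fails the record digest. An evolved capsule leaves the record digest intact and fails only the output digest.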

What it looks like in practice.

The Python SDK wraps your existing LLM client. Every call produces a record:

# Drop-in around the Anthropic SDK.
import anthropic
from determs.anthropic import wrap as wrap_anthropic
from determs.storage import FileStorage

client = wrap_anthropic(
    anthropic.Anthropic(),
    agent_id="support-triage",
    storage=FileStorage("./records"),
)

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=512,
    messages=[{"role": "user", "content": "My order is late"}],
)
# A record landed in ./records/<action_id>.json.

The record is plain JSON. It contains everything needed to reproduce the decision pipeline — except the call to the model itself, which is replaced by the model's actual response. You can store it in S3, in Postgres, in a tar archive, or hand it to a customer's compliance team on a USB key. It is a self-contained artefact.

The CLI verifies it:

$ determs verify --record ./record.json

{
  "verified": true,
  "profile": "ai.agent.action",
  "record_digest": "44f2b54990f7df71a16ff603b606225bcf2800004fa107c42950076c4b581fe4",
  "checks": {
    "subject_digest": true,
    "record_digest": true,
    "input_digest": true,
    "output_digest": true
  }
}

If a single character of the record has been edited after capture — a content string, a tool argument, a token count — the digests diverge and verify exits non-zero. There is no quiet corruption.
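Because the result is an exit code plus JSON, it is easy to gate a script or CI step on it. The sketch below assumes the report keeps the shape shown above and that the JSON is written to stdout even when verification fails; both are assumptions, and `verify_record` and `failed_checks` are illustrative helpers, not part of the SDK.

```python
import json
import subprocess

def failed_checks(report: dict) -> list[str]:
    """Names of the checks that did not pass in a verify report."""
    return sorted(name for name, ok in report.get("checks", {}).items() if not ok)

def verify_record(path: str) -> list[str]:
    """Run `determs verify` on one record and return the failed checks (empty if verified).

    Assumes the JSON report is printed to stdout even on a non-zero exit.
    """
    proc = subprocess.run(
        ["determs", "verify", "--record", path],
        capture_output=True, text=True, check=False,
    )
    report = json.loads(proc.stdout)
    if proc.returncode == 0 and report.get("verified"):
        return []
    return failed_checks(report) or ["unknown"]

# Offline example: a report shaped like the one above, with two checks flipped.
sample = {
    "verified": False,
    "checks": {
        "subject_digest": True,
        "record_digest": False,
        "input_digest": True,
        "output_digest": False,
    },
}
print(failed_checks(sample))  # ['output_digest', 'record_digest']
```

Knowing which check failed narrows the search: a failed output digest with an intact input digest says the recorded response was touched, not the request.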

Logs tell you something happened. Receipts let you prove it.

This is a category, not a feature.

A reasonable response at this point is: fine, but couldn't an observability tool just add this? The answer is no, not without becoming a different product.

Observability is shaped around aggregation. Its data model is the trace, its primitives are spans and metrics, its workloads are sampling and downsampling. The whole stack is optimized for "give me the p99 latency of this agent's responses last week", not for "give me a verifiable receipt of this single action".

A replay-and-proof layer has the opposite shape. Its data model is the record. Its primitives are hashes and capsules. Its workloads are capture, replay, verify. Aggregation is a secondary concern; faithful single-record reproduction is the primary one.

The two product categories use the same raw inputs (your agent calls) but solve different problems. Observability tells you what your agents are doing in aggregate. A replay layer lets you account for any single one of them.

Why now.

Two forces are converging.

The first is operational. AI agents are now in production at scale. Teams are running them in customer support, in onboarding, in compliance review, in code review, in dispute resolution. These are decisions with real consequences. The first incident — and there is always a first incident — exposes the gap between "we have logs" and "we can reproduce what happened". The teams who get bitten by this once start asking what a replay layer would look like.

The second is regulatory. The EU AI Act, internal audit requirements, SOC 2 frameworks expanding to include AI, customer questionnaires asking how you reproduce a denied decision — all of these treat algorithmic decisions as auditable artefacts. A trace is not enough; the auditor wants a record they can reproduce. A log file does not pass that bar.

The combination means that the gap Determs sits in is no longer theoretical. Teams who deploy agents seriously will eventually need a replay layer. The question is whether they get one designed for the job or bolt one together from logs and prayer.

What Determs is and is not.

Determs is a deterministic replay and verifiable audit layer for AI agents. The engine is a Rust binary; the Python SDK wraps the Anthropic and OpenAI clients, both sync and async, with and without streaming. Records are local files by default; nothing leaves your environment unless you choose to send it somewhere.

Determs is not an observability tool. It is not a prompt evaluator. It is not an orchestration framework. It is not another log vault. It does not aggregate, does not graph, does not alert. Those are good products to have, and we are not them.

Determs is an open standard — the Verifiable Decision Record — plus an open-core reference implementation (engine, CLI, SDK). The point of an open, trustless format is that you don't have to take our word for anything: verification is maths over public infrastructure, not faith in a vendor. A neutral registry and a compliance layer are built on top; those are the commercial surface.

If you recognise the question this post opened with — can you actually prove what your agent did? — and you want to use or implement the format, get in touch.