Core concepts
Lumni normalizes every agent execution into a small, consistent model so the same detectors and workflows apply whether the trace came from LangGraph, a raw OpenTelemetry dump, or a pasted JSON blob.
A run is one end-to-end agent execution. It carries the metadata that matters for forensics:
- `userRequestText`: what the user actually asked for
- `success` / `outcomeSummary`: what the agent claims happened
- `metadata`: model, cost (`costUsd`), latency, and any custom fields
A step (RunStep) is a single event inside a run — a tool call, a model
call, a retrieval, a policy check, an approval. Steps carry:
- `stepKind`: `tool` / `function` / `action`, `model` / `llm` / `assistant`, etc.
- `status`, `errorCode`, `errorMessage`
- `inputSummary`, `outputSummary`, `retryCount`
- `metadata`: tool name, tokens (`inputTokens` / `promptTokens`), HTTP status, model
Lumni classifies steps into tool steps (they perform a side effect) and model steps (they generate text). This distinction is what lets detectors reason about “a tool failed but the model still answered.”
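The tool/model split can be pictured as a small mapping from `stepKind` to a class. This is a minimal sketch, not Lumni's implementation: the `RunStep` shape is reduced to the two fields needed here, and `StepClass`, `classifyStep`, and the `"other"` bucket are hypothetical names introduced for illustration. The list of kinds in the text ends in "etc.", so the mapping below is knowingly incomplete.

```typescript
// Hypothetical, reduced step shape; only the field names come from the docs.
interface RunStep {
  stepKind: string;
  status?: string;
}

// "other" is an assumed bucket for kinds such as retrievals or approvals.
type StepClass = "tool" | "model" | "other";

function classifyStep(step: RunStep): StepClass {
  // Normalize casing and whitespace so "LLM" and " llm " are treated alike.
  switch (step.stepKind.trim().toLowerCase()) {
    case "tool":
    case "function":
    case "action":
      return "tool"; // performs a side effect
    case "model":
    case "llm":
    case "assistant":
      return "model"; // generates text
    default:
      return "other";
  }
}
```

With a classification like this in hand, a check such as "a tool failed but the model still answered" reduces to looking for a failed step in the tool class followed by a step in the model class.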
Detector
A detector inspects one normalized run plus its steps and decides whether a specific class of silent failure is present. Detectors are:
- Trace-only — they need no external integration, so they run on anonymous pasted traces and on every ingested run alike.
- Deterministic and conservative — they decide whether a failure is present, cheaply. They don’t guess wildly; they’d rather stay quiet than cry wolf.
Lumni ships five: False Success, Expensive Loop, Schema Ghost, Missing Tool Call, and Context Overflow. See the overview.
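To make "trace-only, deterministic, conservative" concrete, here is a sketch of what a detector could look like as a pure function over a run and its steps. Everything in it is an assumption for illustration: the `Detector` and `Finding` types, the `id` field, the `"error"` status value, and the rule itself, which is not Lumni's actual False Success logic.

```typescript
// Hypothetical shapes: the field names appear in the docs, but the types
// and the `id` field are assumptions made for this sketch.
interface Run {
  success: boolean;
  outcomeSummary?: string;
}

interface RunStep {
  id: string;
  stepKind: string;
  status?: string;
  errorMessage?: string;
}

interface Finding {
  detectorKey: string;
  primaryStepId: string;
  failureSummary: string;
}

// A detector reads only the trace: no network calls, no clock, no randomness.
type Detector = (run: Run, steps: RunStep[]) => Finding | null;

function isToolStep(step: RunStep): boolean {
  const kind = step.stepKind.toLowerCase();
  return kind === "tool" || kind === "function" || kind === "action";
}

// Illustrative rule only: the run claims success although a tool step failed.
const falseSuccessSketch: Detector = (run, steps) => {
  if (!run.success) return null; // the run admits failure, so nothing is silent
  for (const step of steps) {
    // Assumption: a failed step has status "error" or a non-empty errorMessage.
    const failed = step.status === "error" || Boolean(step.errorMessage);
    if (isToolStep(step) && failed) {
      return {
        detectorKey: "false_success",
        primaryStepId: step.id,
        failureSummary: `Run reported success, but tool step ${step.id} failed: ${step.errorMessage ?? "no error message"}`,
      };
    }
  }
  return null; // conservative default: no hard evidence, no finding
};
```

Because the function depends only on its arguments, the same input always yields the same result, which is what allows it to run on a pasted trace exactly as it would on an ingested one.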
Failure
When a detector fires, it produces a failure with a stable shape:
| Field | Meaning |
|---|---|
| `detectorKey` | Which detector fired (`false_success`, `expensive_loop`, …); used for clustering |
| `failureTitle` | One-line, human-readable headline |
| `failureSummary` | Plain-English explanation with the offending values |
| `primaryStepId` | The step that broke |
| `initialFailureSource` | Category: `tool`, `runtime`, `context`, `coordination` |
| `severity` | `low` \| `medium` \| `high` |
| `suspicionScore` | 0–100 confidence (see below) |
Failures are returned most-suspicious first, so a teardown always leads with its strongest finding.
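Lumni already returns failures in this order, so the following is only a sketch of the ordering itself, useful if you merge failures from several runs on your side. The `Failure` interface lists a subset of the fields with assumed types, and `mostSuspiciousFirst` is a hypothetical helper name; the sample titles and the score of 70 are made up.

```typescript
// Assumed types for a subset of the failure fields described in the docs.
interface Failure {
  detectorKey: string;
  failureTitle: string;
  severity: "low" | "medium" | "high";
  suspicionScore: number; // 0 to 100
}

// Returns a new array ordered by descending suspicionScore; the input is not mutated.
function mostSuspiciousFirst(failures: Failure[]): Failure[] {
  return failures.slice().sort((a, b) => b.suspicionScore - a.suspicionScore);
}

// Sample data is illustrative only.
const ordered = mostSuspiciousFirst([
  { detectorKey: "expensive_loop", failureTitle: "Repeated tool retries", severity: "medium", suspicionScore: 70 },
  { detectorKey: "false_success", failureTitle: "Success claimed after timeout", severity: "high", suspicionScore: 95 },
]);
console.log(ordered.map((f) => f.detectorKey)); // [ 'false_success', 'expensive_loop' ]
```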
Suspicion score
The suspicion score (0–100) is how confident Lumni is that a real silent
failure occurred. It is deliberately calibrated per detector — for example, a
false success backed by a hard error word (timeout, 500, refused) scores
higher (95) than one inferred from a soft failure (88). Treat it as a triage
signal: sort by it, and set your own threshold for what’s worth a human’s time.
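A threshold of your own can be applied with a simple filter. This is a sketch under stated assumptions: `ScoredFailure` and `worthHumanTime` are hypothetical names, and the cutoff of 90 is an arbitrary example rather than a recommended value. The two sample scores are the ones quoted above for hard-error and soft-failure false successes.

```typescript
// Reduced failure shape with assumed types; only the field names come from the docs.
interface ScoredFailure {
  detectorKey: string;
  suspicionScore: number; // 0 to 100
}

// Keeps only the failures at or above the caller's chosen cutoff.
function worthHumanTime(failures: ScoredFailure[], minScore: number): ScoredFailure[] {
  return failures.filter((f) => f.suspicionScore >= minScore);
}

const escalated = worthHumanTime(
  [
    { detectorKey: "false_success", suspicionScore: 95 }, // hard error word
    { detectorKey: "false_success", suspicionScore: 88 }, // soft failure
  ],
  90, // example cutoff, not a Lumni default
);
console.log(escalated.length); // 1
```

Since calibration is per detector, a single global cutoff is the simplest possible policy; a per-detector cutoff is a natural refinement if one detector proves noisier than another in your traces.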
Investigation
An investigation packages one or more related failures with their evidence, baseline comparison, and a hypothesis loop into a single object your team can work and resolve. Investigations are where root-cause analysis, the evidence ledger, and replay come together. This is also Lumni’s billable unit: ingestion is free; you’re charged when Lumni does work you couldn’t do yourself.