Core concepts

Lumni normalizes every agent execution into a small, consistent model so the same detectors and workflows apply whether the trace came from LangGraph, a raw OpenTelemetry dump, or a pasted JSON blob.

Run

A run is one end-to-end agent execution. It carries the metadata that matters for forensics:

userRequestText — what the user actually asked for
success / outcomeSummary — what the agent claims happened
metadata — model, cost (costUsd), latency, and any custom fields
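As a rough illustration, the run fields above could be modeled like this in Python. This is a sketch, not Lumni's actual schema: the field names come from the list above, while the types, the `headline` helper, and the `latencyMs` key are assumptions made for the example.

```python
from typing import Any, Dict, TypedDict


class Run(TypedDict):
    """One end-to-end agent execution, using the field names listed above."""
    userRequestText: str       # what the user actually asked for
    success: bool              # assumed to be a boolean: the agent's own claim
    outcomeSummary: str        # the agent's description of what happened
    metadata: Dict[str, Any]   # model, costUsd, latency, custom fields


def headline(run: Run) -> str:
    """Hypothetical helper: one line with the request, the claim, and the cost."""
    claim = "claims success" if run["success"] else "reports failure"
    cost = run["metadata"].get("costUsd")
    cost_part = f", cost ${cost:.4f}" if cost is not None else ""
    return f'"{run["userRequestText"]}": agent {claim}{cost_part}'


example_run: Run = {
    "userRequestText": "Refund order 1234",
    "success": True,
    "outcomeSummary": "Refund issued.",
    # "latencyMs" is an illustrative key name; the text above only says "latency".
    "metadata": {"model": "example-model", "costUsd": 0.0125, "latencyMs": 840},
}

print(headline(example_run))
```

Note that `success` and `outcomeSummary` record what the agent claims, which is why they are kept separate from the evidence carried by individual steps.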

Step

A step (RunStep) is a single event inside a run — a tool call, a model call, a retrieval, a policy check, an approval. Steps carry:

stepKind — tool / function / action, model / llm / assistant, etc.
status, errorCode, errorMessage
inputSummary, outputSummary, retryCount
metadata — tool name, tokens (inputTokens / promptTokens), HTTP status, model

Lumni classifies steps into tool steps (they perform a side effect) and model steps (they generate text). This distinction is what lets detectors reason about “a tool failed but the model still answered.”
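The tool/model split described above might be sketched as follows. The alias sets mirror the `stepKind` values listed earlier and are probably incomplete (the list ends in "etc."); the `"other"` bucket and both function names are assumptions for illustration, not part of Lumni's API.

```python
# Aliases taken from the stepKind examples above; the real list may be longer.
TOOL_KINDS = {"tool", "function", "action"}
MODEL_KINDS = {"model", "llm", "assistant"}


def classify_step(step: dict) -> str:
    """Return "tool", "model", or "other" based on the step's stepKind."""
    kind = str(step.get("stepKind", "")).strip().lower()
    if kind in TOOL_KINDS:
        return "tool"
    if kind in MODEL_KINDS:
        return "model"
    # Retrievals, policy checks, approvals and unknown kinds land here
    # in this sketch.
    return "other"


def split_steps(steps: list) -> dict:
    """Group steps by class so tool and model steps can be compared."""
    groups = {"tool": [], "model": [], "other": []}
    for step in steps:
        groups[classify_step(step)].append(step)
    return groups


steps = [
    {"stepKind": "function", "status": "error"},
    {"stepKind": "LLM", "status": "ok"},
    {"stepKind": "retrieval", "status": "ok"},
]
groups = split_steps(steps)
print(len(groups["tool"]), len(groups["model"]), len(groups["other"]))
```

With the steps grouped this way, a check such as "a tool failed but the model still answered" reduces to comparing the two groups.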

Detector

A detector inspects one normalized run plus its steps and decides whether a specific class of silent failure is present. Detectors are:

Trace-only: they need no external integration, so they run on anonymous pasted traces and on every ingested run alike.
Deterministic and conservative: the same trace always produces the same verdict, and the check is cheap to run. When the evidence is ambiguous, a detector stays quiet rather than cry wolf.

Lumni ships five: False Success, Expensive Loop, Schema Ghost, Missing Tool Call, and Context Overflow. See the overview.
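To make the detector properties concrete, here is a minimal sketch of what a False Success style check could look like. It is not Lumni's implementation: the function name, the `"error"` status value, and the step `id` key are assumptions. It does show the intended shape: a pure function of the run and its steps (trace-only, deterministic) that returns nothing unless the evidence is clear (conservative).

```python
from typing import Optional

TOOL_KINDS = {"tool", "function", "action"}  # aliases from the Step section


def detect_false_success(run: dict, steps: list) -> Optional[dict]:
    """Flag a run that claims success even though a tool step failed."""
    # Conservative: only an explicit success claim can be a *false* success.
    if run.get("success") is not True:
        return None
    for step in steps:
        is_tool = str(step.get("stepKind", "")).lower() in TOOL_KINDS
        # "error" as a status value is an assumption for this sketch.
        if is_tool and step.get("status") == "error":
            return {
                "detectorKey": "false_success",
                "primaryStepId": step.get("id"),  # hypothetical step id key
            }
    return None  # nothing clearly wrong, so stay quiet


run = {"success": True, "outcomeSummary": "Refund issued."}
steps = [
    {"id": "s1", "stepKind": "tool", "status": "error", "errorMessage": "timeout"},
    {"id": "s2", "stepKind": "llm", "status": "ok"},
]
print(detect_false_success(run, steps))
```

Because the function reads only the normalized run and steps, it behaves the same on a pasted trace as on an ingested one.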

Failure

When a detector fires it produces a failure with a stable shape:

| Field | Meaning |
| --- | --- |
| `detectorKey` | Which detector fired (`false_success`, `expensive_loop`, …); used for clustering |
| `failureTitle` | One-line, human-readable headline |
| `failureSummary` | Plain-English explanation with the offending values |
| `primaryStepId` | The step that broke |
| `initialFailureSource` | Category: `tool`, `runtime`, `context`, `coordination` |
| `severity` | `low` \| `medium` \| `high` |
| `suspicionScore` | 0–100 confidence (see below) |

Failures are returned most-suspicious first, so a teardown always leads with its strongest finding.
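If you merge failures from several runs, you may need to restore that ordering yourself. The sketch below re-sorts by `suspicionScore` and groups by `detectorKey`, the field the table marks as used for clustering. The helper names and the sample values are illustrative assumptions.

```python
from collections import defaultdict


def most_suspicious_first(failures: list) -> list:
    """Return failures ordered by suspicionScore, highest first."""
    return sorted(failures, key=lambda f: f["suspicionScore"], reverse=True)


def cluster_by_detector(failures: list) -> dict:
    """Group failures by the detector that produced them."""
    clusters = defaultdict(list)
    for failure in failures:
        clusters[failure["detectorKey"]].append(failure)
    return dict(clusters)


# Illustrative failures; only a subset of the fields is shown.
example_failures = [
    {"detectorKey": "expensive_loop", "severity": "medium", "suspicionScore": 72},
    {"detectorKey": "false_success", "severity": "high", "suspicionScore": 95},
    {"detectorKey": "false_success", "severity": "high", "suspicionScore": 88},
]

ranked = most_suspicious_first(example_failures)
clusters = cluster_by_detector(example_failures)
print([f["suspicionScore"] for f in ranked])
print({key: len(items) for key, items in clusters.items()})
```

`sorted` is stable, so failures with equal scores keep their original relative order.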

Suspicion score

The suspicion score (0–100) is how confident Lumni is that a real silent failure occurred. It is deliberately calibrated per detector — for example, a false success backed by a hard error word (timeout, 500, refused) scores higher (95) than one inferred from a soft failure (88). Treat it as a triage signal: sort by it, and set your own threshold for what’s worth a human’s time.
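The calibration example and the threshold advice can be sketched as follows. The 95 and 88 values and the hard error words come from the paragraph above. The substring matching, the function names, and the default threshold of 90 are assumptions for illustration; Lumni's actual scoring logic is not described here.

```python
HARD_ERROR_WORDS = ("timeout", "500", "refused")  # examples from the text above


def false_success_score(error_text: str) -> int:
    """Score a false success: 95 with a hard error word, 88 otherwise."""
    text = error_text.lower()
    # Naive substring match: "500" would also match "1500 ms", so a real
    # detector would need stricter matching than this sketch uses.
    if any(word in text for word in HARD_ERROR_WORDS):
        return 95
    return 88


def worth_review(failures: list, threshold: int = 90) -> list:
    """Keep only failures at or above your own triage threshold."""
    return [f for f in failures if f["suspicionScore"] >= threshold]


scored = [
    {"failureTitle": "Refund tool timed out",
     "suspicionScore": false_success_score("Upstream timeout after 30s")},
    {"failureTitle": "Lookup returned nothing",
     "suspicionScore": false_success_score("empty result set")},
]
print([f["failureTitle"] for f in worth_review(scored)])
```

Lowering the threshold to 85 would surface both findings; the right value depends on how much reviewer time you can spend.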

Investigation

An investigation packages one or more related failures with their evidence, baseline comparison, and a hypothesis loop into a single object your team can work and resolve. Investigations are where root-cause analysis, the evidence ledger, and replay come together. This is also Lumni’s billable unit — ingestion is free; you’re charged when Lumni does work you couldn’t do yourself.