Core concepts

Lumni normalizes every agent execution into a small, consistent model so the same detectors and workflows apply whether the trace came from LangGraph, a raw OpenTelemetry dump, or a pasted JSON blob.

A run is one end-to-end agent execution. It carries the metadata that matters for forensics:

  • userRequestText — what the user actually asked for
  • success / outcomeSummary — what the agent claims happened
  • metadata — model, cost (costUsd), latency, and any custom fields
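
As a rough illustration, a normalized run could be represented as the record below. Only userRequestText, success, outcomeSummary, metadata, and costUsd are names taken from the list above; the id, model, latencyMs, and team keys, along with every value, are assumptions made up for this sketch.

```python
# Hypothetical normalized run. The names userRequestText, success,
# outcomeSummary, metadata and costUsd follow the list above; the other
# keys (id, model, latencyMs, team) and all values are assumed.
run = {
    "id": "run_001",  # assumed identifier, not named in the text
    "userRequestText": "Refund order 1234 and email the customer",
    "success": True,
    "outcomeSummary": "Refund issued and confirmation email sent.",
    "metadata": {
        "model": "example-model",
        "costUsd": 0.042,
        "latencyMs": 8120,       # assumed key name for latency
        "team": "support-bots",  # example of a custom field
    },
}


def describe(run: dict) -> str:
    """One-line forensic summary: the request, the agent's claim, the cost."""
    status = "claimed success" if run["success"] else "reported failure"
    cost = run["metadata"].get("costUsd", 0.0)
    return f'{run["userRequestText"]!r}: {status}, ${cost:.3f}'


print(describe(run))
```

Note that success and outcomeSummary record what the agent claims, which is why they are kept separate from the step-level evidence.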

A step (RunStep) is a single event inside a run — a tool call, a model call, a retrieval, a policy check, an approval. Steps carry:

  • stepKind: tool / function / action, model / llm / assistant, etc.
  • status, errorCode, errorMessage
  • inputSummary, outputSummary, retryCount
  • metadata — tool name, tokens (inputTokens / promptTokens), HTTP status, model

Lumni classifies steps into tool steps (they perform a side effect) and model steps (they generate text). This distinction is what lets detectors reason about “a tool failed but the model still answered.”
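
The tool/model split can be sketched as a lookup on stepKind. This is a minimal sketch, not Lumni's classifier: the kind groupings mirror the examples in the step list (which ends in "etc."), and the status values "error" and "ok" are assumptions.

```python
# Kind groupings follow the stepKind examples in the text; real traces
# may carry other kinds, which fall through to "other" here.
TOOL_KINDS = {"tool", "function", "action"}
MODEL_KINDS = {"model", "llm", "assistant"}


def classify(step: dict) -> str:
    """Map a step to 'tool', 'model', or 'other' based on its stepKind."""
    kind = str(step.get("stepKind", "")).lower()
    if kind in TOOL_KINDS:
        return "tool"
    if kind in MODEL_KINDS:
        return "model"
    return "other"


def tool_failed_but_model_answered(steps: list) -> bool:
    """True if a failed tool step is later followed by a model step with output."""
    failed_tool_seen = False
    for step in steps:
        role = classify(step)
        if role == "tool" and step.get("status") == "error":  # "error" is assumed
            failed_tool_seen = True
        elif role == "model" and failed_tool_seen and step.get("outputSummary"):
            return True
    return False


steps = [
    {"stepKind": "tool", "status": "error", "errorCode": "TIMEOUT",
     "errorMessage": "upstream timeout", "retryCount": 2},
    {"stepKind": "llm", "status": "ok",
     "outputSummary": "Your refund has been issued."},
]
print(tool_failed_but_model_answered(steps))
```

Without the classification, the two steps above are just two events; with it, the second step reads as an answer produced after a side effect failed.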

A detector inspects one normalized run plus its steps and decides whether a specific class of silent failure is present. Detectors are:

  • Trace-only — they need no external integration, so they run on anonymous pasted traces and on every ingested run alike.
  • Deterministic and conservative: the same trace always produces the same verdict, and producing it is cheap. When the evidence is ambiguous, a detector stays quiet rather than cry wolf.

Lumni ships five: False Success, Expensive Loop, Schema Ghost, Missing Tool Call, and Context Overflow. See the overview.
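
To make the detector contract concrete, here is a minimal sketch of a trace-only check in the spirit of False Success. It is not Lumni's implementation: the rule itself, the "error" status value, the toolName metadata key, and the step id field are all assumptions. detectorKey and primaryStepId are field names Lumni uses on failures.

```python
from typing import Optional

TOOL_KINDS = {"tool", "function", "action"}


def detect_false_success(run: dict, steps: list) -> Optional[dict]:
    """Return a finding only when the evidence is unambiguous, else None."""
    if not run.get("success"):
        return None  # the run never claimed success, so nothing is "false"

    # Track, per tool, the most recent failure that no later call repaired.
    unrecovered = {}
    for step in steps:
        if str(step.get("stepKind", "")).lower() not in TOOL_KINDS:
            continue
        tool = step.get("metadata", {}).get("toolName", "unknown")  # assumed key
        if step.get("status") == "error":  # assumed status value
            unrecovered[tool] = step
        else:
            unrecovered.pop(tool, None)  # a later success clears the failure

    if not unrecovered:
        return None  # conservative: stay quiet when every failure recovered
    step = next(iter(unrecovered.values()))
    return {"detectorKey": "false_success", "primaryStepId": step.get("id")}


run = {"success": True}
failed = {"id": "s1", "stepKind": "tool", "status": "error",
          "metadata": {"toolName": "refund_api"}}
retried_ok = {"id": "s2", "stepKind": "tool", "status": "ok",
              "metadata": {"toolName": "refund_api"}}

print(detect_false_success(run, [failed]))
print(detect_false_success(run, [failed, retried_ok]))
```

The function reads only the run and its steps, has no randomness or network calls, and returns None whenever a later success makes the failure ambiguous.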

When a detector fires it produces a failure with a stable shape:

Field                  Meaning
detectorKey            Which detector fired (false_success, expensive_loop, …); used for clustering
failureTitle           One-line, human-readable headline
failureSummary         Plain-English explanation with the offending values
primaryStepId          The step that broke
initialFailureSource   Category: tool, runtime, context, coordination
severity               low | medium | high
suspicionScore         0–100 confidence (see below)

Failures are returned most-suspicious first, so a teardown always leads with its strongest finding.
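
The ordering can be reproduced on the client side with an ordinary sort. In this sketch the field names come from the failure shape above, while the titles, step ids, and the score of 72 are invented for illustration.

```python
# Hypothetical failures for a single run; every value is made up.
failures = [
    {"detectorKey": "false_success", "failureTitle": "Lookup returned nothing",
     "primaryStepId": "s4", "initialFailureSource": "tool",
     "severity": "medium", "suspicionScore": 88},
    {"detectorKey": "expensive_loop", "failureTitle": "Search retried repeatedly",
     "primaryStepId": "s7", "initialFailureSource": "runtime",
     "severity": "medium", "suspicionScore": 72},
    {"detectorKey": "false_success", "failureTitle": "Refund API timed out",
     "primaryStepId": "s9", "initialFailureSource": "tool",
     "severity": "high", "suspicionScore": 95},
]


def most_suspicious_first(failures: list) -> list:
    """Sort by suspicionScore, highest first; ties keep their original order."""
    return sorted(failures, key=lambda f: f["suspicionScore"], reverse=True)


for failure in most_suspicious_first(failures):
    print(failure["suspicionScore"], failure["detectorKey"], failure["failureTitle"])
```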

The suspicion score (0–100) is how confident Lumni is that a real silent failure occurred. It is deliberately calibrated per detector — for example, a false success backed by a hard error word (timeout, 500, refused) scores higher (95) than one inferred from a soft failure (88). Treat it as a triage signal: sort by it, and set your own threshold for what’s worth a human’s time.
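
A sketch of how the calibration and a triage threshold might fit together. The scores 95 and 88 and the example error words are from the paragraph above; the substring matching, the function names, and the threshold of 90 are assumptions, and Lumni's actual scoring may differ.

```python
# Example hard-error words from the text. Substring matching is a
# simplification: "500" would also match "15000".
HARD_ERROR_WORDS = ("timeout", "500", "refused")


def false_success_score(error_text: str) -> int:
    """95 when the error text contains a hard error word, otherwise 88."""
    text = error_text.lower()
    return 95 if any(word in text for word in HARD_ERROR_WORDS) else 88


def worth_review(failures: list, threshold: int = 90) -> list:
    """Keep failures at or above the threshold (90 is an arbitrary example)."""
    return [f for f in failures if f["suspicionScore"] >= threshold]


scored = [
    {"failureTitle": "Refund API refused connection",
     "suspicionScore": false_success_score("Connection refused by upstream")},
    {"failureTitle": "Lookup returned nothing",
     "suspicionScore": false_success_score("no results returned")},
]
print([f["failureTitle"] for f in worth_review(scored)])
```

Lowering the threshold to 88 would surface both findings; the right cutoff depends on how much reviewer time a team wants to spend per run.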

An investigation packages one or more related failures with their evidence, baseline comparison, and a hypothesis loop into a single object your team can work and resolve. Investigations are where root-cause analysis, the evidence ledger, and replay come together. This is also Lumni’s billable unit — ingestion is free; you’re charged when Lumni does work you couldn’t do yourself.