Detectors overview

Detectors are the heart of Lumni. Each one inspects a single normalized run plus its steps and decides whether one specific class of silent failure is present — the kind where no exception was thrown, so evals and generic observability miss it.

The five detectors

Detector	Catches	Typical score
False Success	A step failed, but the agent claimed the task completed	88–95
Expensive Loop	The same tool retried ≥3× with no progress or backoff	70–92
Schema Ghost	A tool returned nothing, then the model fabricated specific data	82
Missing Tool Call	User asked for an action, agent claimed success, no tool ever fired	80
Context Overflow	Input tokens exceeded the model’s context window	76
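
To make one row of the table concrete, here is a minimal sketch of the kind of check the Expensive Loop detector describes. It is not Lumni's implementation: the `Step` shape, the function names, and the reading of "no progress" as byte-identical calls and outputs are all assumptions made for illustration, and backoff is not modeled at all.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Step:
    # Hypothetical step shape; Lumni's real normalized schema may differ.
    tool: str
    arguments: str
    output: str


def longest_retry_streak(steps: list[Step]) -> int:
    """Length of the longest run of consecutive steps that call the same
    tool with the same arguments and get the same output back."""
    best = 0
    current = 0
    previous: Optional[Step] = None
    for step in steps:
        if previous is not None and step == previous:
            current += 1
        else:
            current = 1
        best = max(best, current)
        previous = step
    return best


def is_expensive_loop(steps: list[Step], min_repeats: int = 3) -> bool:
    # The default mirrors the table's ">=3x" threshold. Treating "no progress"
    # as identical calls and outputs is an assumption of this sketch.
    return longest_retry_streak(steps) >= min_repeats
```

Three identical `search` calls in a row would trip this check; two would not.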

Why they’re safe to run everywhere

Every detector is trace-only: everything it needs is derived from the trace itself, with no external integration. That has two consequences:

A detector is safe to run on anonymous, unauthenticated input, which is exactly what powers the public playground.
A detector runs on every ingested run with no extra setup or credentials.
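
One way to picture "trace-only" is a detector written as a pure function of the trace. The sketch below uses Context Overflow as the example. The `Run` and `Finding` types, their field names, and the idea that the context window size is recorded on the run are assumptions for illustration, not Lumni's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Run:
    # Hypothetical fields; both are assumed to be recorded in the trace.
    input_tokens: int
    context_window: int


@dataclass(frozen=True)
class Finding:
    detector: str
    score: int


def detect_context_overflow(run: Run) -> Optional[Finding]:
    """Pure function of the trace: no network, no credentials, no clock."""
    if run.input_tokens > run.context_window:
        # 76 is the typical score listed for Context Overflow in the table.
        return Finding(detector="context_overflow", score=76)
    return None
```

Because the function reads nothing but its argument, it can run on untrusted input without touching any external system.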

Design principles

Deterministic. Given the same trace, a detector always returns the same result. No sampling, no flakiness.
Conservative. A detector only decides, cheaply, whether a failure is present, and it would rather stay silent than raise a false alarm. Narrative explanations and suggested fixes are produced downstream (optionally by an LLM judge), not by the detector itself.
Ranked. When multiple detectors fire on one run, Lumni returns the failures ordered by suspicion score, highest first, so a teardown always leads with its strongest finding.
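
The ranking principle can be sketched as a plain sort. The `Finding` type and the `rank` function are hypothetical names, and the alphabetical tie-break is an assumption added here so that the ordering stays deterministic when two scores are equal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    detector: str
    score: int


def rank(findings: list[Finding]) -> list[Finding]:
    # Highest score first; the detector name breaks ties so the order is
    # reproducible (the tie-break rule is an assumption of this sketch).
    return sorted(findings, key=lambda f: (-f.score, f.detector))
```

Given the same set of findings, this returns the same order regardless of the order in which the detectors fired.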

Reading a suspicion score

The suspicion score (0–100) is tuned per detector. For example, a false success backed by a hard error word (timeout, 500, refused, exception) scores 95, while a softer signal scores 88. Expensive Loop scores scale with how many times the step repeated. Use the score to triage: sort by it, set a threshold, and route the top of the list to a human or an investigation.
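
The False Success example and the triage advice can be sketched together. Only the two scores (95 and 88) and the four error words come from the text above. The function names, the whole-word case-insensitive matching, and the run IDs are hypothetical, and the threshold is whatever value suits your own routing.

```python
import re

# The four hard error words named in the text. Matching case-insensitively
# and on whole words is an assumption of this sketch.
HARD_ERROR = re.compile(r"\b(timeout|500|refused|exception)\b", re.IGNORECASE)


def false_success_score(failed_step_output: str) -> int:
    """Score a run that has already been judged a false success."""
    return 95 if HARD_ERROR.search(failed_step_output) else 88


def triage(scores: dict[str, int], threshold: int) -> list[str]:
    """Run IDs at or above the threshold, most suspicious first."""
    flagged = [(run_id, s) for run_id, s in scores.items() if s >= threshold]
    return [run_id for run_id, _ in sorted(flagged, key=lambda p: (-p[1], p[0]))]
```

A call such as `triage({"run_a": 76, "run_b": 95, "run_c": 88}, threshold=80)` keeps the two runs at or above 80 and lists the 95 ahead of the 88.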