Detectors overview
Detectors are the heart of Lumni. Each one inspects a single normalized run plus its steps and decides whether one specific class of silent failure is present — the kind where no exception was thrown, so evals and generic observability miss it.
The five detectors
| Detector | Catches | Typical score |
|---|---|---|
| False Success | A step failed, but the agent claimed the task completed | 88–95 |
| Expensive Loop | The same tool retried ≥3× with no progress or backoff | 70–92 |
| Schema Ghost | A tool returned nothing, then the model fabricated specific data | 82 |
| Missing Tool Call | User asked for an action, agent claimed success, no tool ever fired | 80 |
| Context Overflow | Input tokens exceeded the model’s context window | 76 |
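To make one row of the table concrete, here is a minimal sketch of how an Expensive Loop check might look. The `Step` shape and the function names are hypothetical, not Lumni's actual schema or API. "No progress" is approximated as the same tool called with the same arguments and returning the same output. Backoff is not modeled at all.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Step:
    # Hypothetical step shape; the real normalized schema may differ.
    tool: str
    args: str    # serialized tool arguments
    output: str  # serialized tool result


def longest_stalled_streak(steps: list[Step]) -> int:
    """Length of the longest run of consecutive identical steps."""
    best = current = 0
    previous: Optional[Step] = None
    for step in steps:
        # Frozen dataclasses compare field by field, so equality here means
        # same tool, same args, same output.
        current = current + 1 if step == previous else 1
        best = max(best, current)
        previous = step
    return best


def detect_expensive_loop(steps: list[Step], min_repeats: int = 3) -> bool:
    # Assumption: three or more identical consecutive steps counts as a loop,
    # matching the "≥3×" wording in the table.
    return longest_stalled_streak(steps) >= min_repeats
```

A retry whose output changes between attempts resets the streak in this sketch, which is one plausible reading of "progress".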
Why they’re safe to run everywhere
Every detector is trace-only: everything it needs is derived from the trace itself, with no external integration. That has two consequences:
- It’s safe to run on anonymous, unauthenticated input — which is exactly what powers the public playground.
- It runs on every ingested run with no extra setup or credentials.
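One way to picture "trace-only" is a detector written as a pure function of the run: no network calls, no credentials, no clock. The sketch below uses Context Overflow as the example. The `Run` and `Finding` types and their field names are assumptions made for illustration, and it presumes the trace itself records both token counts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Run:
    # Hypothetical fields; assumes the trace carries both numbers.
    input_tokens: int
    context_window: int


@dataclass(frozen=True)
class Finding:
    detector: str
    score: int


def detect_context_overflow(run: Run) -> Optional[Finding]:
    """Depends only on its argument, so it can run on untrusted input."""
    if run.input_tokens > run.context_window:
        # 76 is the typical score listed for Context Overflow.
        return Finding(detector="context_overflow", score=76)
    return None  # stay silent when the signal is absent
```

Because the function reads nothing outside its argument, the same call works in a public playground and in an ingestion pipeline.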
Design principles
- Deterministic. Given the same trace, a detector always returns the same result. No sampling, no flakiness.
- Conservative. A detector only decides, cheaply, whether a failure is present. It would rather stay silent than raise a false alarm. Narrative explanations and suggested fixes are produced downstream (optionally by an LLM judge), not by the detector itself.
- Ranked. When multiple detectors fire on one run, Lumni returns the failures sorted most suspicious first, so a teardown always leads with its strongest finding.
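The ranking principle can be sketched as a sort on the suspicion score. The `Finding` type and `rank` function are hypothetical names. The tie-break on detector name is an added assumption, chosen so that equal scores still produce a deterministic order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    detector: str
    score: int  # suspicion score, 0-100


def rank(findings: list[Finding]) -> list[Finding]:
    # Highest score first; ties broken alphabetically by detector name
    # (an assumption, not documented behavior).
    return sorted(findings, key=lambda f: (-f.score, f.detector))


findings = [
    Finding("context_overflow", 76),
    Finding("false_success", 95),
    Finding("schema_ghost", 82),
]
strongest = rank(findings)[0]  # the false_success finding
```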
Reading a suspicion score
The suspicion score (0–100) is tuned per detector. For example, a false success backed by a hard error word (timeout, 500, refused, exception) scores 95, while a softer signal scores 88. Expensive Loop scores scale with how many times the step repeated. Use the score to triage: sort by it, set a threshold, and route the top of the list to a human or an investigation.
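As a rough illustration of the False Success scoring and the triage step, the sketch below assumes a false success has already been detected and only picks between the two scores mentioned above. The function names, the case-insensitive substring matching, and the default threshold of 85 are all assumptions, not Lumni's actual behavior.

```python
HARD_ERROR_WORDS = ("timeout", "500", "refused", "exception")


def false_success_score(failed_step_output: str) -> int:
    # Assumption: a hard error word is matched as a case-insensitive substring.
    text = failed_step_output.lower()
    return 95 if any(word in text for word in HARD_ERROR_WORDS) else 88


def needs_review(scores: dict[str, int], threshold: int = 85) -> list[str]:
    """Run IDs at or above the threshold, highest score first."""
    flagged = [(run_id, score) for run_id, score in scores.items() if score >= threshold]
    return [run_id for run_id, _ in sorted(flagged, key=lambda pair: -pair[1])]
```

The threshold is the knob to tune: lowering it sends more runs to review, raising it keeps the queue short.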