Detectors overview

Detectors are the heart of Lumni. Each one inspects a single normalized run plus its steps and decides whether one specific class of silent failure is present — the kind where no exception was thrown, so evals and generic observability miss it.

| Detector | Catches | Typical score |
| --- | --- | --- |
| False Success | A step failed, but the agent claimed the task completed | 88–95 |
| Expensive Loop | The same tool retried ≥3× with no progress or backoff | 70–92 |
| Schema Ghost | A tool returned nothing, then the model fabricated specific data | 82 |
| Missing Tool Call | User asked for an action, agent claimed success, no tool ever fired | 80 |
| Context Overflow | Input tokens exceeded the model’s context window | 76 |
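To make the table concrete, here is a minimal sketch of what an Expensive Loop check could look like. It is not Lumni's implementation: the `Step` shape, its field names, and the `min_repeats` parameter are assumptions made for illustration, and "no progress" is approximated as consecutive calls with identical tool, arguments, and output. Backoff is not checked, because that would need timestamps this sketch does not model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Step:
    # Hypothetical normalized step; field names are assumptions.
    tool: Optional[str]  # tool name, or None for a plain model turn
    args: str            # serialized tool arguments
    output: str          # serialized tool result


def longest_identical_retry(steps: list[Step]) -> int:
    """Return the length of the longest run of consecutive identical tool calls."""
    best = 0
    run = 0
    prev = None
    for step in steps:
        if step.tool is None:
            # A non-tool step breaks any run in progress.
            run, prev = 0, None
            continue
        key = (step.tool, step.args, step.output)
        run = run + 1 if key == prev else 1
        prev = key
        best = max(best, run)
    return best


def is_expensive_loop(steps: list[Step], min_repeats: int = 3) -> bool:
    """Flag a run when the same call repeats at least `min_repeats` times."""
    return longest_identical_retry(steps) >= min_repeats
```

The check reads only the steps it is given, which is what lets a detector of this kind run on any trace without credentials or external lookups.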

Every detector is trace-only: everything it needs is derived from the trace itself, with no external integration. That has two consequences:

  • It’s safe to run on anonymous, unauthenticated input — which is exactly what powers the public playground.
  • It runs on every ingested run with no extra setup or credentials.

Detectors also share three properties:

  • Deterministic. Given the same trace, a detector always returns the same result. No sampling, no flakiness.
  • Conservative. A detector decides whether a failure is present — cheaply. It would rather stay silent than raise a false alarm. Narrative explanations and suggested fixes are produced downstream (optionally by an LLM judge), not by the detector itself.
  • Ranked. When multiple detectors fire on one run, Lumni returns the failures ordered most suspicious first, so a teardown always leads with its strongest finding.
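The ranking behavior can be sketched in a few lines. The `Failure` type and `rank` function below are hypothetical names, not part of Lumni's API, and the tie-break on detector name is an assumption added so that equal scores still produce one stable order, in keeping with the deterministic property.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Failure:
    # Hypothetical result record; field names are assumptions.
    detector: str
    score: int  # suspicion score, 0-100


def rank(failures: list[Failure]) -> list[Failure]:
    """Order failures most suspicious first, with a deterministic tie-break."""
    return sorted(failures, key=lambda f: (-f.score, f.detector))


found = [
    Failure("Context Overflow", 76),
    Failure("False Success", 95),
    Failure("Schema Ghost", 82),
]
ranked = rank(found)
print([f.detector for f in ranked])
```

With these inputs the list leads with False Success, followed by Schema Ghost and Context Overflow.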

The suspicion score (0–100) is tuned per detector. For example, a false success backed by a hard error word (timeout, 500, refused, exception) scores 95, while a softer signal scores 88. Expensive Loop scores scale with how many times the step repeated. Use the score to triage: sort by it, set a threshold, and route the top of the list to a human or an investigation.
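As a rough illustration of the scoring and triage described above, the sketch below applies the 95/88 rule for false successes and then filters by a threshold. The function names, the naive substring match, and the threshold of 85 are all assumptions for the example. A real check would likely need stricter matching, since a bare "500" also matches text such as "1500 rows".

```python
# Hard error words taken from the text above; matching is a naive substring test.
HARD_ERROR_WORDS = ("timeout", "500", "refused", "exception")


def false_success_score(failed_step_output: str) -> int:
    """Score a false success: 95 with a hard error word, 88 otherwise."""
    text = failed_step_output.lower()
    return 95 if any(word in text for word in HARD_ERROR_WORDS) else 88


def triage(scored: list[tuple[str, int]], threshold: int = 85) -> list[tuple[str, int]]:
    """Keep findings at or above the threshold, highest score first."""
    flagged = [item for item in scored if item[1] >= threshold]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


findings = [
    ("run-a", false_success_score("Connection refused by upstream")),
    ("run-b", 76),  # e.g. a Context Overflow finding
    ("run-c", false_success_score("no rows were updated")),
]
to_review = triage(findings)
print(to_review)
```

Here `run-a` and `run-c` pass the threshold and `run-b` is dropped. Raising or lowering `threshold` trades review volume against the risk of missing a weaker signal.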