Skip to content

Clustering & baselines

One failing run is a data point. Fifty failing runs that share a shape are a regression. Lumni’s clustering and baseline comparison turn a stream of individual detections into the story of an incident.

Failures are grouped by their detector key and pattern into clusters. This tells you at a glance:

  • How many runs are hitting the same silent failure
  • When it started (the first occurrence)
  • Whether it’s growing, steady, or fading

Because the detector key is stable, a cluster stays coherent over time — you can watch a false_success cluster form right after a deploy and dissolve after a fix.

A baseline is a recent known-good version of the run: the same task, before the failure appeared. Comparing the failing run to its baseline is how Lumni frames the central RCA question — what changed?

Baseline comparison lines up:

  • The step sequence (did a step appear, disappear, or change order?)
  • Tool inputs and outputs (did a schema or response shape change?)
  • Model, prompt, and context size
  • Any recorded changes — a deploy, a dependency bump, a policy revision — that landed between baseline and failure

A cluster with a clear baseline delta is the natural seed for an investigation: you pull the cluster, compare against baseline, form a hypothesis about the change that caused it, and then prove the fix with replay.