Clustering & baselines

One failing run is a data point. Fifty failing runs that share a shape are a regression. Lumni’s clustering and baseline comparison turn a stream of individual detections into the story of an incident.

Clustering

Failures are grouped by their detector key and pattern into clusters. This tells you at a glance:

How many runs are hitting the same silent failure
When it started (the first occurrence)
Whether it’s growing, steady, or fading

Because the detector key is stable, a cluster stays coherent over time — you can watch a false_success cluster form right after a deploy and dissolve after a fix.

Baselines

A baseline is a recent known-good version of the run: the same task, before the failure appeared. Comparing the failing run to its baseline is how Lumni frames the central RCA question — what changed?

Baseline comparison lines up:

The step sequence (did a step appear, disappear, or change order?)
Tool inputs and outputs (did a schema or response shape change?)
Model, prompt, and context size
Any recorded changes — a deploy, a dependency bump, a policy revision — that landed between baseline and failure

From cluster to investigation

A cluster with a clear baseline delta is the natural seed for an investigation: you pull the cluster, compare against baseline, form a hypothesis about the change that caused it, and then prove the fix with replay.