Clustering & baselines
One failing run is a data point. Fifty failing runs that share a shape are a regression. Lumni’s clustering and baseline comparison turn a stream of individual detections into the story of an incident.
Clustering
Section titled “Clustering”Failures are grouped by their detector key and pattern into clusters. This tells you at a glance:
- How many runs are hitting the same silent failure
- When it started (the first occurrence)
- Whether it’s growing, steady, or fading
Because the detector key is stable, a
cluster stays coherent over time — you can watch a false_success cluster form
right after a deploy and dissolve after a fix.
Baselines
Section titled “Baselines”A baseline is a recent known-good version of the run: the same task, before the failure appeared. Comparing the failing run to its baseline is how Lumni frames the central RCA question — what changed?
Baseline comparison lines up:
- The step sequence (did a step appear, disappear, or change order?)
- Tool inputs and outputs (did a schema or response shape change?)
- Model, prompt, and context size
- Any recorded changes — a deploy, a dependency bump, a policy revision — that landed between baseline and failure
From cluster to investigation
Section titled “From cluster to investigation”A cluster with a clear baseline delta is the natural seed for an investigation: you pull the cluster, compare against baseline, form a hypothesis about the change that caused it, and then prove the fix with replay.