False Success
Detector key: false_success · Source: tool · Severity: high · Suspicion: 88–95
This is the highest-signal silent failure and the reason Lumni exists: a step failed, but the agent’s final message claimed success. No crash, no error surfaced to the user — just a lie. “Your refund has been processed” when the refund tool returned a 502.
What trips it
The detector fires when both are true:
- A step looks failed: its status is failed, timed_out, blocked, or cancelled, or it has an errorCode or errorMessage, or its metadata carries an HTTP status ≥ 400.
- The agent’s final output claims success: the run’s success flag is set, or the outcome / last step output contains completion language like done, completed, success, processed, refunded, booked, scheduled, confirmed, sent, “you’re all set”, “taken care of”.
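The two conditions above can be sketched as a small predicate. This is an illustration, not Lumni's implementation: the field names follow the run payload used on this page, while the `httpStatus` metadata key, the whole-word matching strategy, and the function names are assumptions made for the example.

```python
import re
from typing import Any

FAILED_STATUSES = {"failed", "timed_out", "blocked", "cancelled"}

# Completion phrases copied from the list above; the real detector may use more.
COMPLETION_PHRASES = (
    "done", "completed", "success", "processed", "refunded", "booked",
    "scheduled", "confirmed", "sent", "you're all set", "taken care of",
)

# Assumption: phrases are matched as whole words, case-insensitively.
_COMPLETION_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(p) for p in COMPLETION_PHRASES) + r")\b"
)


def step_looks_failed(step: dict[str, Any]) -> bool:
    """True for a failed status, an error field, or an HTTP status >= 400."""
    if step.get("status") in FAILED_STATUSES:
        return True
    if step.get("errorCode") or step.get("errorMessage"):
        return True
    # Assumption: the HTTP status is stored under metadata["httpStatus"].
    http_status = (step.get("metadata") or {}).get("httpStatus")
    return isinstance(http_status, int) and http_status >= 400


def claims_success(run: dict[str, Any]) -> bool:
    """True if the success flag is set or the final text uses completion language."""
    if run.get("success") is True:
        return True
    steps = run.get("steps") or []
    last_output = (steps[-1].get("outputSummary") or "") if steps else ""
    text = f"{run.get('outcomeSummary') or ''} {last_output}".lower()
    text = text.replace("\u2019", "'")  # normalise curly apostrophes
    return _COMPLETION_RE.search(text) is not None


def is_false_success(run: dict[str, Any]) -> bool:
    """Both conditions must hold: a step looks failed AND the run claims success."""
    steps = run.get("steps") or []
    return any(step_looks_failed(s) for s in steps) and claims_success(run)
```

Whole-word matching is a deliberate choice in this sketch: a plain substring test would treat "abandoned" as containing "done". The trade-off is that inflected forms such as "successfully" are not matched unless they are added to the phrase list.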
Suspicion is scored from the failure detail:
- 95 when the failure detail contains a hard failure word: timeout, timed out, 500, 503, error, failed, refused, exception, panic.
- 88 otherwise.
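The scoring rule is a two-way branch, sketched below. The word list is taken from this page; the function name, the case-insensitive substring match, and the idea that the "failure detail" is a single string (for example the step's errorMessage) are assumptions, and Lumni's detector may assemble the detail differently.

```python
# Hard failure words as listed above.
HARD_FAILURE_WORDS = (
    "timeout", "timed out", "500", "503", "error",
    "failed", "refused", "exception", "panic",
)


def suspicion_score(failure_detail: str) -> int:
    """Return 95 if the detail contains a hard failure word, otherwise 88.

    Assumption: matching is a case-insensitive substring test.
    """
    detail = failure_detail.lower()
    if any(word in detail for word in HARD_FAILURE_WORDS):
        return 95
    return 88
```

Under these assumptions, a detail of "Connection refused" scores 95, while "502 Bad Gateway" scores 88 because neither 502 nor "bad gateway" is on the hard failure list.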
Example
    {
      "userRequestText": "Refund my last order",
      "success": true,
      "outcomeSummary": "Your refund has been processed.",
      "steps": [
        {
          "stepKind": "tool",
          "stepName": "issue_refund",
          "status": "failed",
          "errorMessage": "502 Bad Gateway"
        },
        {
          "stepKind": "model",
          "outputSummary": "Your refund has been processed."
        }
      ]
    }

Lumni reports:
False success: agent claimed completion after issue_refund failed — Step “issue_refund” reported “502 Bad Gateway”, but the agent’s final message claimed success (“Your refund has been processed.”). The user was told the action completed when it did not.
Why it matters
For money-touching and customer-facing agents, this is the failure that becomes a chargeback, a compliance question, and an angry customer, all while your dashboards stay green. Pair it with Payment Outcome Assurance to cross-check the claim against the system of record (e.g. Stripe).