Payment Error Threshold Intelligence
An AI-assisted, merchant-aware alerting system that turns noisy wallet-error streams into ranked, ownership-attributed incidents.

In production, I shipped a Python and Datadog pipeline that cut payment-failure detection time from 1.5 hours to under a minute. But naive rules ("alert when error rate > 5%") still flood ops teams, because some errors are normal, some are merchant-side config issues, and some signal a real operator outage. This build goes deeper: it detects genuine degradation and attributes ownership, without crying wolf.
The core is a two-tier, dual-condition threshold model. Pre-alerts fire when errors rise above a merchant/operator/error baseline AND the success rate trends unhealthy; full alerts escalate when both conditions breach stricter limits. It is explainable by design: an AI copilot drafts plain-English summaries and merchant-safe comms around the alerts, but never makes the detection decision itself.
Two conditions beat one threshold.
For each merchant / operator / error reason I baseline the mean and spread of the error rate and its correlation with success. A pre-alert needs error_rate ≥ mean + 1σ AND success ≤ a warning percentile; a full alert tightens both. The dual condition is what stops harmless error spikes from paging anyone.
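A minimal sketch of the dual condition, using only the Python standard library. The write-up fixes the pre-alert error threshold at mean + 1σ but does not say how far a full alert tightens it, so the 2σ multiplier, the 25th/10th percentile choices, and all names here (Baseline, fit_baseline, classify) are illustrative assumptions rather than the production values.

```python
import math
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class Baseline:
    """Baseline for one (merchant, operator, error reason) key. Names are illustrative."""
    error_mean: float
    error_std: float
    success_warn: float      # warning percentile of historical success rate
    success_critical: float  # stricter percentile for full alerts (assumed)


def percentile(values, q):
    """Nearest-rank percentile of values, with q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def fit_baseline(error_rates, success_rates, warn_q=25, critical_q=10):
    """Build a baseline from historical rates. Percentile choices are assumptions."""
    return Baseline(
        error_mean=mean(error_rates),
        error_std=pstdev(error_rates),
        success_warn=percentile(success_rates, warn_q),
        success_critical=percentile(success_rates, critical_q),
    )


def classify(baseline, error_rate, success_rate, full_sigma=2.0):
    """Return 'full_alert', 'pre_alert', or 'ok'. Both conditions must hold to alert."""
    pre_error = baseline.error_mean + baseline.error_std
    full_error = baseline.error_mean + full_sigma * baseline.error_std  # assumed tightening
    if error_rate >= full_error and success_rate <= baseline.success_critical:
        return "full_alert"
    if error_rate >= pre_error and success_rate <= baseline.success_warn:
        return "pre_alert"
    return "ok"
```

With a baseline of Baseline(0.02, 0.01, 0.95, 0.90), an error rate of 0.06 alongside a healthy 0.97 success rate stays "ok": the error spike alone is not enough to page anyone.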
Whose problem is it?
Isolated to one merchant with config/credit signals → merchant-side. Errors rising across many merchants as success drops broadly → operator-side. That attribution is what turns an alert into an action.
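The attribution rule can be sketched as a small decision function. This is an illustration under stated assumptions, not the shipped logic: the error-reason labels, the 20% spread threshold, and the "needs_review" fallback are all hypothetical, as are the names Alert and attribute_owner.

```python
from dataclasses import dataclass

# Hypothetical labels for config/credit signals; real reason codes will differ.
MERCHANT_SIDE_REASONS = frozenset({"invalid_config", "insufficient_credit"})


@dataclass(frozen=True)
class Alert:
    merchant_id: str
    operator_id: str
    error_reason: str


def attribute_owner(alerts, active_merchants, operator_success_dropped,
                    spread_threshold=0.2):
    """Attribute one operator's alerts in a time window.

    Returns 'operator', 'merchant', or 'needs_review'.
    spread_threshold is an assumed cut-off for "many merchants".
    """
    affected = {a.merchant_id for a in alerts}
    if not affected or active_merchants <= 0:
        return "needs_review"

    # Share of the operator's active merchants that are alerting.
    spread = len(affected) / active_merchants
    if spread >= spread_threshold and operator_success_dropped:
        return "operator"

    # A single merchant whose alerts all carry config/credit signals.
    if len(affected) == 1 and all(
        a.error_reason in MERCHANT_SIDE_REASONS for a in alerts
    ):
        return "merchant"

    return "needs_review"
```

Returning "needs_review" for ambiguous cases keeps the function from forcing an owner when neither pattern clearly applies.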
From 4,002 candidates to 175 real incidents.
On a 30-day holdout across 100 merchants and 2 operators, success-rate guardrails cut 4,002 single-condition candidates to 1,269 dual-condition alerts, a 68% reduction, and surfaced 175 full-alert incidents as a ranked triage queue with owner, spread, and impact score.
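As a quick check on the arithmetic, the funnel percentages follow directly from the three counts reported above. The variable names are illustrative.

```python
# Counts taken from the 30-day holdout described above.
single_condition_candidates = 4002
dual_condition_alerts = 1269
full_alert_incidents = 175

# Share of candidates removed by adding the success-rate guardrail.
reduction_pct = 100 * (single_condition_candidates - dual_condition_alerts) / single_condition_candidates

# Share of dual-condition alerts that escalated to full alerts.
full_share_pct = 100 * full_alert_incidents / dual_condition_alerts

print(f"Noise reduction: {reduction_pct:.1f}%")       # Noise reduction: 68.3%
print(f"Full-alert share: {full_share_pct:.1f}%")     # Full-alert share: 13.8%
```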
From raw error spikes to ranked incidents · 30-day backtest, 100 merchants
The trick is the dual condition: a page only fires when errors spike and success dips together, which cuts noise by 68% before anything reaches a person.