Simpaisa

Payment Error Threshold Intelligence

An AI-assisted, merchant-aware alerting system that turns noisy wallet-error streams into ranked, ownership-attributed incidents.

Role / Product Manager, design & automation
Year / 2025
Status / Shipped in production · deepened as a portfolio build
Anomaly detection and alert-funnel dashboard
68% ALERT-NOISE REDUCTION
100 MERCHANTS MONITORED
The problem

In production, I shipped a Python and Datadog pipeline that cut payment-failure detection time from 1.5 hours to under a minute. But naive rules ("alert when error rate > 5%") still flood ops teams, because some errors are normal, some come from merchant-side config, and some signal a real operator outage. This build goes deeper: detect genuine degradation and attribute ownership, without crying wolf.

What I did

A two-tier, dual-condition threshold model. Pre-alerts fire when errors rise above a merchant/operator/error baseline AND the success rate trends unhealthy; full alerts escalate when both breach harder. Explainable by design: the AI copilot drafts plain-English summaries and merchant-safe comms around the alerts, and never makes the detection decision itself.

Python / pandas · NumPy · Statistical thresholds · Datadog · AI copilot
THE MODEL

Two conditions beat one threshold.

For each merchant / operator / error reason I baseline the mean and spread of the error rate and its correlation with success. A pre-alert needs error_rate ≥ mean + 1σ AND success ≤ a warning percentile; a full alert tightens both. The dual condition is what stops harmless error spikes from paging anyone.
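The dual condition described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the production code: the `Baseline` class, the helper names, the 10th/5th percentile defaults, and the 2σ full-alert multiplier are all assumptions, since the write-up only says that a full alert "tightens both" conditions. The correlation term in the baseline is left out for brevity.

```python
import math
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class Baseline:
    """Baseline for one (merchant, operator, error_reason) key."""
    error_mean: float
    error_std: float
    success_warn: float      # success rate at the warning percentile
    success_critical: float  # success rate at a stricter percentile (assumed)


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty sequence."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def build_baseline(error_rates, success_rates, warn_pct=10, critical_pct=5):
    # warn_pct and critical_pct are illustrative defaults, not source values.
    return Baseline(
        error_mean=mean(error_rates),
        error_std=pstdev(error_rates),
        success_warn=percentile(success_rates, warn_pct),
        success_critical=percentile(success_rates, critical_pct),
    )


def classify(baseline, error_rate, success_rate, full_sigma=2.0):
    """Return 'full_alert', 'pre_alert', or 'ok' for one observation."""
    pre_error = error_rate >= baseline.error_mean + baseline.error_std
    # full_sigma=2.0 is an assumed tightening of the 1-sigma pre-alert rule.
    full_error = error_rate >= baseline.error_mean + full_sigma * baseline.error_std
    if full_error and success_rate <= baseline.success_critical:
        return "full_alert"
    if pre_error and success_rate <= baseline.success_warn:
        return "pre_alert"
    return "ok"
```

With a baseline of mean 2%, spread 1%, and success thresholds of 90% and 85%, an error rate of 3.5% alongside 95% success returns "ok", while the same error rate alongside 88% success returns "pre_alert". That is the harmless-spike case the second condition filters out.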

OWNERSHIP

Whose problem is it?

Isolated to one merchant with config/credit signals → merchant-side. Errors rising across many merchants as success drops broadly → operator-side. That attribution is what turns an alert into an action.
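A rough sketch of that attribution rule follows, under stated assumptions. The error-reason labels, the 20% spread threshold for "many merchants", and the "unattributed" fallback are hypothetical; the write-up does not specify them.

```python
# Hypothetical labels for config/credit signals; real reason codes will differ.
MERCHANT_SIDE_REASONS = frozenset({"invalid_config", "insufficient_credit"})


def attribute_owner(alerting_merchants, total_merchants, error_reasons,
                    broad_success_drop, spread_threshold=0.2):
    """Guess who owns an incident: 'operator', 'merchant', or 'unattributed'.

    alerting_merchants: set of merchant IDs currently alerting
    error_reasons:      set of error-reason labels seen in the incident
    broad_success_drop: True when success is falling across merchants
    spread_threshold:   assumed share of merchants that counts as "many"
    """
    spread = len(alerting_merchants) / total_merchants
    if spread >= spread_threshold and broad_success_drop:
        return "operator"
    if len(alerting_merchants) == 1 and error_reasons & MERCHANT_SIDE_REASONS:
        return "merchant"
    return "unattributed"
```

Checking the operator case first means a broad outage is never blamed on whichever single merchant happened to alert earliest.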

BACKTEST

From 4,002 candidates to 175 real incidents.

On a 30-day holdout across 100 merchants and 2 operators, success-rate guardrails cut 4,002 single-condition candidates to 1,269 dual-condition alerts, a 68% reduction, and surfaced 175 full-alert incidents as a ranked triage queue with owner, spread, and impact score.
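The funnel arithmetic can be checked directly from the three counts quoted above. The `reduction` helper is an illustrative name, not part of the original pipeline.

```python
def reduction(before, after):
    """Fraction of candidates removed when going from `before` to `after`."""
    return 1 - after / before


single_condition = 4002  # error-rate-only candidates
dual_condition = 1269    # alerts that also breach the success guardrail
full_alerts = 175        # alerts escalated to full incidents

noise_cut = reduction(single_condition, dual_condition)
full_share = full_alerts / dual_condition

print(f"alert-noise reduction: {noise_cut:.1%}")           # 68.3%
print(f"full alerts among dual-condition: {full_share:.1%}")  # 13.8%
```

So the headline 68% is the share of single-condition candidates removed by the success-rate guardrail, and roughly one in seven of the remaining alerts escalates to a full incident.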