AI in Fraud Ops: What's Hype and What Actually Works in 2025

Joe Tallett · 11 min read

There's a particular kind of vendor conversation that payment service provider (PSP) fraud ops teams have learned to navigate carefully. It goes: "AI is transforming fraud detection. Our platform uses large language models / graph neural networks / generative adversarial networks to stop fraud before it happens. Results speak for themselves."

We've had that conversation from the other side of the table — we're a vendor building AI fraud tooling, and we know how easy it is to make claims that sound compelling but dissolve under operational scrutiny. So here's our honest attempt to separate the functional from the overhyped, based on what we actually see working when we're in the weeds with PSP fraud teams.

What Genuinely Works: Supervised ML for Transaction Scoring

Supervised machine learning for transaction risk scoring has a real, documented track record. Models trained on labelled historical fraud data — with features drawn from transaction attributes, account age, velocity, counterparty network characteristics, device and session context — consistently outperform rules-based systems at surface-level fraud detection. The improvement over rules-based approaches on card-not-present fraud and account takeover is well-demonstrated.

The caveats are important. These models work best on fraud types where the signal is in the transaction itself — where the payment instruction looks different from legitimate transactions in ways that the model can learn. Account takeover fits this profile: the device is wrong, the session behaviour is wrong, the payment destination is unusual. The model has something to learn from.

For authorised push payment (APP) fraud, where the payment instruction is identical to a legitimate payment because the genuine account holder is making it, the transaction-layer signal is thin. Supervised ML on transaction features gives only marginal improvement over rules-based scoring for APP fraud. The baseline detection rate isn't great, and adding more model complexity doesn't substantially change that, because the features that would discriminate are not in the transaction data.

This is a technical limitation, not a vendor quality problem. The models are correctly doing what they were designed to do. The problem is the feature set, not the modelling approach.

What Works Conditionally: Graph Network Analysis

Graph-based fraud detection — building network representations of payment flows and entity relationships, then detecting anomalous network patterns — has genuine utility in the APP fraud context, specifically for mule account network detection.

Mule accounts that receive multiple inbound transfers from unconnected sources over a short window create a distinctive network signature. Graph models that track payment network topology can identify these patterns faster than rules-based counterparty monitoring. Several major UK banking groups have deployed graph-based mule detection and report meaningful uplift in identifying mule account networks within days of activation, rather than weeks.
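
As a rough illustration of the fan-in signature described above, the sketch below flags recipient accounts that receive transfers from several distinct senders inside a sliding time window. It is a deliberately simplified stand-in for a real graph model: the function name `flag_fan_in_accounts`, the 24-hour window and the three-sender threshold are all illustrative assumptions, and it only counts distinct senders rather than checking whether those senders are genuinely unconnected.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def flag_fan_in_accounts(transfers, window=timedelta(hours=24), min_senders=3):
    """Return recipients paid by at least `min_senders` distinct senders
    within any span of length `window`.

    `transfers` is an iterable of (sender, recipient, timestamp) tuples.
    Window and threshold defaults are illustrative, not recommendations.
    """
    by_recipient = defaultdict(list)
    for sender, recipient, ts in transfers:
        by_recipient[recipient].append((ts, sender))

    flagged = set()
    for recipient, events in by_recipient.items():
        events.sort()  # chronological order
        start = 0
        in_window = defaultdict(int)  # sender -> transfers currently in window
        for ts, sender in events:
            in_window[sender] += 1
            # Drop events that have fallen out of the window.
            while ts - events[start][0] > window:
                old_sender = events[start][1]
                in_window[old_sender] -= 1
                if in_window[old_sender] == 0:
                    del in_window[old_sender]
                start += 1
            if len(in_window) >= min_senders:
                flagged.add(recipient)
                break
    return flagged


# Hypothetical data: "M" is paid by three different senders within five hours.
t0 = datetime(2025, 1, 1, 9, 0)
transfers = [
    ("A", "M", t0),
    ("B", "M", t0 + timedelta(hours=2)),
    ("C", "M", t0 + timedelta(hours=5)),
    ("A", "X", t0),
    ("A", "X", t0 + timedelta(hours=1)),
    ("A", "X", t0 + timedelta(hours=2)),
    ("D", "Y", t0),
    ("E", "Y", t0 + timedelta(hours=30)),
    ("F", "Y", t0 + timedelta(hours=60)),
]
print(sorted(flag_fan_in_accounts(transfers)))  # prints ['M']
```

Account "X" is not flagged because all its inbound transfers come from one sender, and "Y" is not flagged because its senders are spread over several days. A production system would add the relationship checks and network context that this sketch leaves out.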

The limitation is that graph analysis is reactive to transaction data. It identifies mule networks after they've started receiving funds — which means the first victim in a new campaign typically isn't protected by mule account graph detection, though subsequent victims benefit. The earlier in a campaign's lifecycle you can detect the mule network, the greater the protection.

Graph analysis also requires significant data volume to build meaningful network representations. For smaller PSPs with fewer transactions, the graph is sparse and signal-to-noise is lower.

Where the Hype Runs Ahead: Generative AI for Fraud Detection

Here's where we want to be honest about our own domain. There's a narrative in the vendor market — and we contribute to it, including in our own messaging — that generative AI models can simulate scammer conversation behaviour and therefore generate synthetic training data, detect scam conversation patterns, and engage with scammers conversationally to extract fingerprints.

Some of that is true. Some of it is harder than the narrative suggests.

Generating synthetic scammer conversation data using large language models is feasible and genuinely useful for training detection classifiers. The synthetic data isn't perfect: it doesn't capture the full distribution of real scammer linguistic behaviour, particularly for scammers whose primary language is not English, and it tends to over-represent the more linguistically sophisticated scam scripts. But it supplements real data in useful ways and allows faster iteration on classifier training than waiting for new confirmed fraud cases.

Using generative AI to detect scam conversation patterns in real time is harder in practice than in concept. The challenge is precision: a generative model that flags a conversation as potentially fraudulent needs to do so with high enough specificity that the fraud ops queue isn't overwhelmed with false positives from legitimate high-urgency communications. Investment conversations, HMRC notices and insurance communications all contain legitimate urgency language that a loosely calibrated scam detector will flag. Getting the precision high enough to be operationally useful takes significantly more fine-tuning than vendors often acknowledge.
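
One way to make the precision requirement concrete is to choose the alert threshold from a labelled validation set rather than by feel. The sketch below assumes the detector emits a numeric score per conversation and that confirmed labels exist for a validation sample; the function name and the toy data are hypothetical. It returns the lowest threshold that still meets a minimum precision, which maximises recall subject to that constraint.

```python
def lowest_threshold_for_precision(scores, labels, min_precision):
    """Return (threshold, precision, recall) for the lowest threshold whose
    flagged set (score >= threshold) has precision >= min_precision.
    Returns None if no threshold qualifies. Labels: 1 = scam, 0 = legitimate.
    """
    ranked = sorted(zip(scores, labels), reverse=True)
    total_scams = sum(labels)
    best = None
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every conversation tied at this score before evaluating.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        precision = tp / (tp + fp)
        if precision >= min_precision:
            recall = tp / total_scams if total_scams else 0.0
            best = (threshold, precision, recall)
    return best


# Toy validation set: six scored conversations, three confirmed scams.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40]
labels = [1, 1, 0, 1, 0, 0]
print(lowest_threshold_for_precision(scores, labels, min_precision=0.75))
# prints (0.7, 0.75, 1.0)
```

The minimum precision is an operational decision, set by how many false positives the fraud ops queue can absorb, not a property of the model.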

Engaging with scammers conversationally to extract fingerprints — which is essentially what AVIEL's honeybot does — is technically achievable but operationally constrained. The honeybot needs to maintain contextual coherence across a multi-message conversation, probe without revealing that it's probing, and generate a useful fingerprint rather than just a conversation transcript. Building something that does all three reliably required substantially more iteration than we expected when we started. We're being direct about this because we think the honest account is more useful to PSP teams evaluating technology than a polished pitch.

What Definitely Doesn't Work as Advertised

A few specific claims we've seen in the market that deserve scrutiny:

"Our AI detects APP fraud with 95% accuracy." Accuracy is a misleading metric for imbalanced datasets. If 0.1% of transactions are APP fraud, a model that classifies everything as not-fraud achieves 99.9% accuracy. What matters is precision and recall at a specific operating threshold that the fraud ops team has defined. Ask for precision-recall curves, not accuracy figures.

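The arithmetic behind the accuracy claim is easy to check. The sketch below uses made-up counts (10,000 transactions, 10 of them fraud, matching the 0.1% base rate above) and a hypothetical helper, `classification_metrics`, to compare a model that never flags anything with one that catches most of the fraud at the cost of some false positives.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision and recall for binary labels (1 = fraud)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Precision is undefined when nothing is flagged; report 0.0 here.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Illustrative data: 10 fraud cases followed by 9,990 legitimate payments.
y_true = [1] * 10 + [0] * 9_990

# Model 1 never flags anything.
never_flags = [0] * 10_000

# Model 2 catches 8 of the 10 fraud cases and raises 12 false alarms.
useful_model = [1] * 8 + [0] * 2 + [1] * 12 + [0] * 9_978

print(classification_metrics(y_true, never_flags))
# accuracy 0.999, precision 0.0, recall 0.0
print(classification_metrics(y_true, useful_model))
# accuracy 0.9986, precision 0.4, recall 0.8
```

The model that does nothing has the higher accuracy, which is why accuracy alone says so little here.
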
"We use NLP to monitor all customer communications for fraud signals." Monitoring all customer communications raises significant RIPA and UK GDPR issues that aren't resolved by calling it fraud prevention. The question is whether the specific architecture of the monitoring — what is captured, what is retained, under what authorisation — complies with the relevant legal framework. "We use NLP" is not an answer to that question.

"Our real-time model updates prevent concept drift." Real-time model updates on production transaction data can introduce instability and adversarial vulnerability that offline periodic retraining doesn't. Some fraud detection vendors conflate "frequent updates" with "better detection." The relevant question is whether the validation framework for new model versions is rigorous enough to prevent a bad update from increasing false positives across the customer base.

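A validation gate of the kind that question implies can be sketched in a few lines. The example below assumes a held-out labelled set and two scoring functions (the current model and a candidate); `evaluate`, `should_promote`, the tolerance value and the toy scoring rules are all hypothetical. It blocks promotion when the candidate raises the false positive rate beyond a tolerance or loses recall.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    false_positive_rate: float
    recall: float


def evaluate(score_fn, examples, threshold):
    """Score held-out (features, label) pairs; label 1 = confirmed fraud."""
    tp = fp = fn = tn = 0
    for features, label in examples:
        flagged = score_fn(features) >= threshold
        if flagged and label == 1:
            tp += 1
        elif flagged:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return EvalResult(fpr, recall)


def should_promote(current, candidate, max_fpr_increase=0.001):
    """Allow promotion only if false positives stay within tolerance
    and recall does not regress. The tolerance is an assumed placeholder."""
    return (
        candidate.false_positive_rate
        <= current.false_positive_rate + max_fpr_increase
        and candidate.recall >= current.recall
    )


# Toy held-out set: the only feature is the payment amount.
holdout = [(100, 0), (200, 0), (300, 0), (400, 1), (900, 1), (950, 1)]
current = evaluate(lambda amount: amount / 1000, holdout, threshold=0.5)
noisy = evaluate(lambda amount: amount / 500, holdout, threshold=0.5)
better = evaluate(lambda amount: amount / 700, holdout, threshold=0.5)

print(should_promote(current, noisy))   # prints False
print(should_promote(current, better))  # prints True
```

The noisy candidate improves recall but also flags a legitimate payment, so the gate rejects it. A real framework would add statistical significance checks and segment-level comparisons across the customer base.
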
What We Think the Near-Term Progress Looks Like

The AI tooling in fraud ops that will move meaningfully in the next two years is not in the transaction-scoring layer — that's a relatively mature market. The meaningful progress is in pre-transaction signal.

Conversation-layer detection — whether through honeybot engagement, communication platform API partnerships, or indirect signals derived from customer interaction patterns — is the frontier. The technical challenges are real: legal compliance, data minimisation, precision calibration, adversarial robustness against scammers who adapt to detection. But these are solvable engineering problems, and the signal value justifies the investment.

Cross-PSP fingerprint sharing infrastructure — standardised scammer fingerprint schemas, near-real-time sharing mechanisms, and the legal governance frameworks to make them compliant — is the second area. This is partly a technical problem and partly an industry coordination problem. Both are tractable.

The third area is explainability tooling for fraud ops analysts. Current ML systems produce risk scores. Good fraud ops teams can work with scores, but they work better with enriched context: why is this transfer flagged, what features contributed most to the score, what is the specific campaign fingerprint match if there is one. Making fraud AI explainable isn't just a regulatory preference — it makes the human-in-the-loop more effective and reduces both false positive costs and missed detection rates.
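
To show what "which features contributed most" can look like in the simplest case, the sketch below assumes a logistic-regression style scorer, where each feature's contribution to the log-odds is just its weight times its value. The weights, bias and feature names are invented for illustration; non-linear models need attribution methods that this example does not cover.

```python
import math


def explain_score(weights, bias, features):
    """Return (risk_score, contributions) for a linear log-odds model.

    `contributions` is a list of (feature_name, log_odds_contribution)
    pairs, sorted by absolute size so the strongest drivers come first.
    """
    contributions = {
        name: weights[name] * value for name, value in features.items()
    }
    logit = bias + sum(contributions.values())
    score = 1.0 / (1.0 + math.exp(-logit))
    ranked = sorted(
        contributions.items(), key=lambda item: abs(item[1]), reverse=True
    )
    return score, ranked


# Hypothetical model parameters and one flagged transfer.
weights = {"new_payee": 2.0, "amount_zscore": 0.5, "account_age_years": -0.3}
bias = -3.0
transfer = {"new_payee": 1, "amount_zscore": 3.0, "account_age_years": 4}

score, ranked = explain_score(weights, bias, transfer)
print(f"risk score: {score:.2f}")
for name, contribution in ranked:
    direction = "raises" if contribution > 0 else "lowers"
    print(f"{name} {direction} the log-odds by {abs(contribution):.2f}")
```

An analyst reading this output sees that the new payee and the unusual amount push the score up, while the account's age pulls it down, which is more actionable than the score alone.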

Our own engineering effort is going squarely into the first two. The third is a consequence of doing the first two well enough to have something worth explaining.