Wednesday, April 10, 2024

Credit Card Fraud Identification Using Machine Learning on Graphs

Addressing financial fraud is crucial to safeguarding individuals and economies. Forbes reports that in 2022 Americans lost a staggering $8.8 billion to consumer fraud, an increase of over 30% from the previous year according to the Federal Trade Commission (FTC). In the corporate sector, a study by the Association of Certified Fraud Examiners, which reviewed cases from January 2010 to December 2011, found that the average organization loses 5% of its annual revenue to fraudulent activities. Applied to the estimated 2011 Gross World Product, this figure translates to a potential total fraud loss of more than $3.5 trillion.

Fraudulent financial activity includes the following: credit card fraud, money laundering, identity theft, embezzlement, insider trading, terrorism financing, APP fraud, insurance fraud, as well as tax evasion.

In this blog post, we focus on credit card fraud, which is the most common type of fraud according to a 2022 report by the FTC, and we demonstrate the advantages of leveraging graph analytics to automatically identify fraudulent transactions in a set of financial transactions using machine learning (ML). Our work is inspired by a publicly available article, which discusses the same problem and demonstrates different applications of graph analytics when modeling and solving it. We go one step further and experimentally investigate the effect of graph analytics on the effectiveness of the ML solution. As described in detail below, our findings reveal that, for the problem at hand, employing graph analytics results in an increase in precision and F1 score while incurring only a small drop in recall. Our solution is built on top of RelationalAI's AI coprocessor, a powerful relational knowledge graph engine for graph analytics.

Financial Transactions Dataset

To make the results easy to reproduce, we used the public Credit Card Transactions Fraud Detection Dataset from Kaggle. The dataset contains simulated credit card transactions between customers and merchants for the period between 01-Jan-2019 and 31-Dec-2020. Each transaction is described by the transaction timestamp, the transaction amount, the customer and merchant descriptions, their locations at the time of the transaction, the category of the transaction, and the customer's gender and job description. The dataset also contains a binary label indicating whether the transaction was fraudulent (1) or legitimate (0). A small sample of the dataset is shown below:

time	client	merchant	category	amount	job	gender	latitude	longitude	is_fraud
2019-01-01 00:00:18	7031861896520	fraud_Rippi, Kub and Mann	misc_net	4.97	Psychologist, counseling	M	36.0113	-82.0483	0
2019-01-01 00:00:44	630423337322	fraud_Helle, Gutmann and Zieme	grocery_pos	107.23	Special educational needs teacher	F	49.159	-118.186	1
2019-01-01 00:00:51	38859492057661	fraud_Lind, Buckridge	entertainment	220.11	Nature conservation officer	M	43.1507	-112.151	0

The dataset is already split into a train set (used to train the ML model) and a test set (used to evaluate the performance of the trained model). The statistics are summarized in the table below:

The ratio of legitimate to fraudulent transactions in the training set is 171 / 1 (99.42% of transactions are legitimate, 0.58% are fraudulent), while in the test set the ratio is 258 / 1 (99.61% of transactions are legitimate, 0.39% are fraudulent). Those ratios indicate that the dataset is highly imbalanced, with the vast majority of transactions being legitimate. This is a common pattern for fraud datasets, and it requires special treatment for any ML model to produce reasonable predictions. In our solution we found experimentally that random undersampling of the majority class with a ratio of 0.85 / 0.15 works best in mitigating the effect of the imbalance.
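The post does not show how the undersampling was implemented, so the following is only a minimal sketch in plain Python. It assumes that "0.85 / 0.15" means the resampled training set should contain roughly 85% legitimate and 15% fraudulent transactions. The function name `undersample_majority` and the `id` field are hypothetical; `is_fraud` follows the sample table above.

```python
import random

def undersample_majority(rows, label_key="is_fraud", majority_share=0.85, seed=0):
    """Keep every fraud row and a random subset of the legitimate rows.

    ASSUMPTION: majority_share is the fraction of legitimate rows in the
    resampled set (0.85 legitimate / 0.15 fraudulent).
    """
    fraud = [r for r in rows if r[label_key] == 1]
    legit = [r for r in rows if r[label_key] == 0]
    # Number of legitimate rows needed so that they make up majority_share.
    target = round(len(fraud) * majority_share / (1.0 - majority_share))
    target = min(target, len(legit))
    rng = random.Random(seed)
    sample = rng.sample(legit, target) + fraud
    rng.shuffle(sample)
    return sample

# Toy data with the 171:1 imbalance reported for the training set.
rows = [{"id": i, "is_fraud": 0} for i in range(1710)]
rows += [{"id": 1710 + i, "is_fraud": 1} for i in range(10)]

balanced = undersample_majority(rows)
n_fraud = sum(r["is_fraud"] for r in balanced)
print(len(balanced), n_fraud)
```

Passing a different `seed` draws a different random sample of the majority class, which is one way to repeat an experiment over several samples.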

Problem Definition

We model the fraudulent transaction identification problem with the help of graph structures: a graph G is a pair G = (V, E), where V is a set whose elements are called nodes and E is a set of pairs of nodes, whose elements are called edges. Additional attributes can be assigned to each node. When modeling real-world problems with graphs, the graph nodes represent real-world entities and the edges describe relationships between those entities. The type of a node describes the real-world entity that the node represents, while the type of an edge describes the relationship between the entities it connects. In the financial transaction example, our entities are customers, merchants and financial transactions, and their relationships are the different interactions between those entities. This can be represented with the graph structure shown in Figure 1.

Figure 1. Modeling the fraudulent transaction identification problem using a graph

In this figure we have the following nodes and edges:

Customer nodes: nodes representing the customer entities. There is one node for each unique customer in the dataset. The type of those nodes is ‘customer’. Each customer node is also paired with the attributes that further describe each customer (e.g., customer ID, name, gender, location).
Merchant nodes: nodes representing merchants. There is one node for each unique merchant in the dataset. The type of those nodes is ‘merchant’. Each merchant node is also paired with the attributes that further describe each merchant (e.g., merchant ID, name, location).
Transaction nodes: nodes representing transactions. There is one node for each unique transaction in the dataset. The type of those nodes is ‘transaction’. Each transaction node is also paired with the attributes that further describe each transaction (e.g., transaction ID, amount, category). An important transaction node attribute is the label, which denotes whether the relevant transaction is fraudulent or legitimate. The value for the label attribute is only given for those transactions whose label is known, otherwise it is left blank.
Edges between customers and transactions: For each transaction node there is an edge between that node and the customer node representing the customer that performed that transaction. The type of those edges is ‘performs_transaction’.
Edges between merchants and transactions: For each transaction node there is an edge between that node and the merchant node representing the merchant that participated in that transaction. The type of those edges is ‘in_transaction’.
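The node and edge types listed above can be sketched with plain Python dictionaries and tuples. This is an illustration only, not the representation used by the RelationalAI engine. The transaction `id` field and the toy values are hypothetical, and a label of `None` stands for a transaction whose label is unknown.

```python
# Three toy transactions; "client", "merchant" and "is_fraud" follow the
# sample table in this post, while "id" is a hypothetical transaction key.
transactions = [
    {"id": "t1", "client": "c1", "merchant": "m1", "is_fraud": 0},
    {"id": "t2", "client": "c1", "merchant": "m2", "is_fraud": 1},
    {"id": "t3", "client": "c2", "merchant": "m2", "is_fraud": None},
]

def build_graph(transactions):
    """Return (nodes, edges) for the customer/merchant/transaction graph."""
    nodes = {}  # (type, id) -> attribute dict
    edges = []  # (source node, target node, edge type) triples
    for t in transactions:
        customer = ("customer", t["client"])
        merchant = ("merchant", t["merchant"])
        txn = ("transaction", t["id"])
        # One node per unique customer and merchant.
        nodes.setdefault(customer, {"type": "customer"})
        nodes.setdefault(merchant, {"type": "merchant"})
        # One node per transaction, carrying the (possibly unknown) label.
        nodes[txn] = {"type": "transaction", "label": t["is_fraud"]}
        edges.append((customer, txn, "performs_transaction"))
        edges.append((merchant, txn, "in_transaction"))
    return nodes, edges

nodes, edges = build_graph(transactions)

# Transaction nodes without a label are the ones to be predicted.
unlabeled = [
    node for node, attrs in nodes.items()
    if attrs["type"] == "transaction" and attrs["label"] is None
]
print(len(nodes), len(edges), unlabeled)
```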

Using the above graph structure, we can map the automatic identification of fraudulent transactions to a node label prediction problem, where we have to predict the label of each node of type ‘transaction’ whose label is not present. In ML terms this is a binary node classification problem, where we have to classify a node of type ‘transaction’ as fraudulent or legitimate.

RelationalAI’s Solution

To tackle the aforementioned problem we have chosen the XGBoost ML model, which has shown consistent state-of-the-art prediction performance for a multitude of classification problems.

The input to the XGBoost model is a set of features (i.e., numerical attributes): in our case, those features are the metrics that (a) describe a financial transaction and (b) give some indication whether a transaction is fraudulent (for example the origin of the transaction may be related to this fact while a transaction ID less so). For each transaction whose label we want to predict, we invoke the model by instantiating each input feature with its relevant value. The result of the model invocation is the numerical value of the prediction: in our case this is a binary (0/1) value describing whether the transaction that is described by the input features is fraudulent (1) or legitimate (0).

The choice of features is important - strong indicators can greatly improve the solution. In what follows we discuss how we chose the features for our problem, a process widely known as feature engineering.

Tabular Feature Engineering

As previously described, each transaction is paired with a set of attributes that describe it. A subset of those attributes - ideally the ones that have predictive power for the problem at hand - can be used (verbatim or with simple mathematical transformations) as input to the XGBoost model. The process is largely empirical: during this stage, data scientists find creative ways (based on statistical analysis, intuition, experience and feedback from domain experts) to concatenate, extract or otherwise synthesize additional features from the original dataset fields. The resulting set is often described as tabular features. For the fraudulent transaction identification task, we engineered the following tabular features:

Client age, derived from the client date of birth
Distance between client and merchant, derived from the client and merchant (latitude, longitude) coordinates
Transaction day of the week, derived from the transaction timestamp
Transaction time of the day, derived from the transaction timestamp
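As a rough sketch of how these four features could be derived with the Python standard library: the field names `dob`, `lat`, `lon`, `merch_lat` and `merch_lon` are assumptions, as are the date of birth and merchant coordinates in the example row. The post does not say which distance measure was used, so the great-circle (haversine) distance here is also an assumption.

```python
import math
from datetime import date, datetime

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def tabular_features(txn):
    """Derive age, distance, day of week and hour of day for one transaction."""
    ts = datetime.strptime(txn["time"], "%Y-%m-%d %H:%M:%S")
    dob = date.fromisoformat(txn["dob"])
    # Age in whole years at the time of the transaction.
    age = ts.year - dob.year - ((ts.month, ts.day) < (dob.month, dob.day))
    return {
        "age": age,
        "distance_km": haversine_km(txn["lat"], txn["lon"], txn["merch_lat"], txn["merch_lon"]),
        "day_of_week": ts.weekday(),  # 0 = Monday ... 6 = Sunday
        "hour_of_day": ts.hour,
    }

example = {
    "time": "2019-01-01 00:00:18",         # timestamp of the first sample row
    "dob": "1988-03-09",                   # hypothetical date of birth
    "lat": 36.0113, "lon": -82.0483,       # client location, first sample row
    "merch_lat": 36.5, "merch_lon": -82.0,  # hypothetical merchant location
}
features = tabular_features(example)
print(features)
```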

Graph Feature Engineering

The graph-based modeling of the fraudulent transaction identification task described above creates opportunities for additional features based on the graph structure of the data. Those features represent real-world concepts and metrics related to the (long- or short-range) interactions between the different entities that the graph structure exposes. For example, if the graph lets us identify customers that are involved in abnormally many transactions, those customers have an elevated likelihood of being involved in suspicious activity and of participating in fraudulent transactions. We refer to those as graph features from now on.

To compute these graph features, we employed the RelationalAI (RAI) AI coprocessor for Snowflake. It provides a rich interface of graph analytics functions, which allows for the computation of graph features and the extraction of insights from graphs. Leveraging RAI on Snowflake, we computed and used the following features:

Node degree: the number of edges incident to a node
Degree centrality: the fraction of nodes connected to a node
Eigenvector centrality: the centrality of a node in a graph based on the centrality of its neighbors
PageRank centrality: the centrality of a node in a graph based on the recursive centrality of all nodes that link to it

All the above features rank nodes based on their importance within the graph structure. Not knowing which metric is the most predictive for the task, it is logical to compute various combinations of those features and subsequently choose the most beneficial. This process is known as feature selection.
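To make the definitions above concrete, here is a small plain-Python sketch of node degree, degree centrality and eigenvector centrality on a toy graph. It is not the RAI interface, whose function names are not shown in this post. All node ids are hypothetical, and PageRank is left out for brevity. Because the customer/merchant/transaction graph is bipartite, the sketch runs power iteration on A + I rather than A. This shift is an implementation choice made here: it leaves the eigenvectors unchanged and keeps the iteration from oscillating.

```python
import math
from collections import defaultdict

# Toy graph: customers c*, merchants m*, transactions t* (hypothetical ids).
edges = [
    ("c1", "t1"), ("m1", "t1"),
    ("c1", "t2"), ("m2", "t2"),
    ("c2", "t3"), ("m2", "t3"),
]

def build_adjacency(edges):
    """Undirected adjacency sets keyed by node id."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)

def node_degree(adj):
    """Number of edges incident to each node."""
    return {node: len(nbrs) for node, nbrs in adj.items()}

def degree_centrality(adj):
    """Fraction of the other nodes that each node is connected to."""
    n = len(adj)
    return {node: len(nbrs) / (n - 1) for node, nbrs in adj.items()}

def eigenvector_centrality(adj, max_iter=1000, tol=1e-10):
    """Power iteration on (A + I), normalized to unit Euclidean length."""
    x = {node: 1.0 for node in adj}
    for _ in range(max_iter):
        # (A + I) x: a node's own score plus the scores of its neighbors.
        new = {node: x[node] + sum(x[nbr] for nbr in adj[node]) for node in adj}
        norm = math.sqrt(sum(v * v for v in new.values()))
        new = {node: v / norm for node, v in new.items()}
        if sum(abs(new[node] - x[node]) for node in adj) < tol:
            return new
        x = new
    raise RuntimeError("power iteration did not converge")

adj = build_adjacency(edges)
degree = node_degree(adj)
deg_c = degree_centrality(adj)
eig = eigenvector_centrality(adj)
print(degree["c1"], round(deg_c["c1"], 3), round(eig["t2"], 3))
```

In this toy graph the nodes form a single chain, so the node in the middle of the chain (`t2`) receives the highest eigenvector centrality and the two end nodes the lowest.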

Feature Selection

Through feature analysis we found that the following graph features improved the performance of the model:

Client eigenvector centrality
Merchant eigenvector centrality

As an example of the feature selection process, we describe how the client eigenvector centrality was chosen: we started by inspecting the client eigenvector centrality distributions for the legitimate and fraudulent classes (shown in Figure 2 below).

Figure 2. Distribution of eigenvector centralities for the legitimate and fraudulent transactions

Inspecting the figure makes it clear that the client eigenvector centrality distributions of the two classes have significantly different shapes. This is a strong indication that the metric behaves very differently for fraudulent and legitimate transactions and can thus be used to differentiate the two classes.
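A numerical counterpart to this visual inspection is to compare summary statistics of a candidate feature per class. The sketch below does this with the standard library. The centrality values are made up for illustration and are not taken from the dataset, and the helper name `summarize_by_class` is hypothetical.

```python
from statistics import mean, median

# (client eigenvector centrality, is_fraud) pairs; the values are invented.
samples = [
    (0.02, 0), (0.03, 0), (0.025, 0), (0.04, 0),
    (0.10, 1), (0.12, 1), (0.09, 1),
]

def summarize_by_class(pairs):
    """Group feature values by label and report simple summary statistics."""
    groups = {0: [], 1: []}
    for value, label in pairs:
        groups[label].append(value)
    return {
        label: {"n": len(values), "mean": mean(values), "median": median(values)}
        for label, values in groups.items()
    }

summary = summarize_by_class(samples)
for label, stats in summary.items():
    print(label, stats)
```

A large gap between the per-class statistics suggests, but does not prove, that the feature helps separate the classes. The effect on model performance still has to be measured.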

Results

To experimentally validate the proposed solution and at the same time inspect the effect of graph features we compared the performance of the XGBoost model when the graph features are present and when they are absent from the input. For each of the two cases we report the results for the subset of features that provides the best performance. We repeated each experiment 5 times, taking a different random sample of the majority class each time. The results are shown in Figure 3 below:

Figure 3. Prediction performance with and without graph features

We observe that when adding the graph features, we get an absolute 4.4% improvement in precision, a small hit in recall (2.2%) and an increase of 4.3% in F1 score. This is a significant improvement, given the imbalanced nature of the data. Note that we report performance metrics on the fraud class only, since this is the class that we are primarily interested in.
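For reference, the fraud-class metrics reported here follow the standard definitions of precision, recall and F1, which can be computed as in the sketch below. The labels and predictions are toy values, not our experimental results.

```python
def fraud_metrics(y_true, y_pred):
    """Precision, recall and F1 for the positive (fraud = 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: 4 fraudulent and 6 legitimate transactions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

precision, recall, f1 = fraud_metrics(y_true, y_pred)
print(precision, recall, f1)
```

Precision penalizes legitimate transactions flagged as fraud, while recall penalizes fraudulent transactions that were missed. F1 is the harmonic mean of the two.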

Conclusion

Graphs can provide a powerful tool for data scientists to combat financial fraud. In this post, we discussed the problem of automatically identifying fraudulent financial transactions and described our solution, built on top of RelationalAI's AI coprocessor, a powerful graph analytics engine. As part of our solution we (a) mitigated the large class imbalance through undersampling and (b) leveraged graph features to further improve our solution. In particular, we demonstrated that by employing the client and merchant eigenvector centrality we were able to boost the prediction performance by 4.4% in precision and 4.3% in F1 score. Our work highlights the value of graphs for financial fraud detection, which could result in substantial cost savings.

About the Authors

Zografoula Vagena joined RelationalAI in 2017. After graduating with a PhD in Computer Science from the University of California, Riverside, she held different positions both in research (IBM Research, Microsoft Research, University of Southern Denmark, Rice University, Université Paris Cité) and industry (Concentra Consulting, Logicblox Inc, Infor Inc) and is currently part of RelationalAI as a principal data scientist. Her work spans all aspects of data management and analysis (data management systems, data science, optimization).

Spiros Politis joined RelationalAI as a senior data scientist in 2022. He holds a Master’s degree in Computer Science from the University of Bristol, U.K. and a Master’s degree in Data Science from Athens University of Economics and Business. His 25-year career includes positions in IT Consulting (Information Systems Impact Ltd., Agilis), Telecommunications (OTE group of companies), and Machine Learning / Engineering (Aisera).