Credit Card Fraud Identification Using Machine Learning on Graphs · RelationalAI
Addressing financial fraud is crucial to safeguarding individuals and economies.
To illustrate its severity, Forbes reports that in 2022 Americans lost a
staggering $8.8 billion to fraudulent activities in the consumer domain, an
increase of over 30% from the previous year according to the Federal Trade
Commission (FTC). In the corporate sector, a study conducted by the Association
of Certified Fraud Examiners, reviewing cases from January 2010 to December
2011, found that the average organization loses 5% of its annual revenue to
fraudulent activities. Applied to the estimated 2011 Gross World Product, this
figure translates to a potential total fraud loss of more than $3.5 trillion.
Fraudulent financial activity includes credit card fraud, money laundering,
identity theft, embezzlement, insider trading, terrorism financing, authorized
push payment (APP) fraud, insurance fraud, and tax evasion.
In this blog post, we focus on credit card fraud, which is the most common type
of fraud according to a 2022 report by the FTC, and we demonstrate the
advantages of leveraging graph analytics to automatically identify fraudulent
transactions in a set of financial transactions using machine learning (ML). Our
work is inspired by a publicly available article, which discusses the same
problem and demonstrates different applications of graph analytics for modeling
and solving it. We go one step further and experimentally investigate the effect
of graph analytics on the effectiveness of the ML solution. As described in
detail below, our findings reveal that, for the problem at hand, employing graph
analytics increases precision and F1 score while incurring only a small drop in
recall. Our solution is built on top of RelationalAI's AI coprocessor, a
powerful relational knowledge graph engine for graph analytics.
Financial Transactions Dataset
To make the results easy to reproduce, we used the public Credit Card
Transactions Fraud Detection Dataset from Kaggle. The dataset contains
information about simulated credit card transactions between customers and
merchants for the period between 01-Jan-2019 and 31-Dec-2020. Each transaction
is described by the transaction timestamp, the transaction amount, the customer
and merchant descriptions, their locations at the time of the transaction, the
category of the transaction, and the customer's gender and job description. The
dataset also contains a binary label indicating whether the transaction was
fraudulent (1) or legitimate (0). A small sample of the dataset is shown below:
time                | client         | merchant                       | category      | amount | job                               | gender | latitude | longitude | is_fraud
2019-01-01 00:00:18 | 7031861896520  | fraud_Rippi, Kub and Mann      | misc_net      | 4.97   | Psychologist, counseling          | M      | 36.0113  | -82.0483  | 0
2019-01-01 00:00:44 | 630423337322   | fraud_Helle, Gutmann and Zieme | grocery_pos   | 107.23 | Special educational needs teacher | F      | 49.159   | -118.186  | 1
2019-01-01 00:00:51 | 38859492057661 | fraud_Lind, Buckridge          | entertainment | 220.11 | Nature conservation officer       | M      | 43.1507  | -112.151  | 0
The dataset is already split into a training set, used to fit the ML model, and
a test set, used to evaluate the model's performance. The statistics are
summarized in the table below:
The ratio of legitimate to fraudulent transactions in the training set is 171:1
(99.42% of transactions are legitimate, 0.58% are fraudulent), while in the test
set it is 258:1 (99.61% of transactions are legitimate, 0.39% are fraudulent).
These ratios indicate that the dataset is highly imbalanced, with the vast
majority of transactions being legitimate. This is a common pattern for fraud
datasets, and it requires special treatment for any ML model to produce
reasonable predictions. In our solution, we found experimentally that random
undersampling of the majority class with a ratio of 0.85/0.15 works best in
mitigating the effect of the imbalance.
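To make the undersampling step concrete, here is a minimal sketch in plain Python. It assumes that the 0.85/0.15 ratio refers to the shares of legitimate and fraudulent transactions after sampling, which the post does not state explicitly. The function name `undersample_majority` and its parameters are hypothetical and are not part of our actual pipeline.

```python
import random

def undersample_majority(rows, label_key="is_fraud", majority_share=0.85, seed=0):
    """Randomly drop legitimate rows until they form `majority_share` of the sample.

    Assumption: 0.85/0.15 is read as legitimate/fraudulent shares after sampling.
    """
    fraud = [r for r in rows if r[label_key] == 1]
    legit = [r for r in rows if r[label_key] == 0]
    # Solve legit / (legit + fraud) == majority_share for the number of legit rows.
    target = round(len(fraud) * majority_share / (1.0 - majority_share))
    target = min(target, len(legit))  # cannot keep more rows than exist
    rng = random.Random(seed)
    sample = rng.sample(legit, target) + fraud  # all fraud rows are kept
    rng.shuffle(sample)
    return sample

# Toy data with roughly the 171:1 imbalance of the training set.
rows = [{"is_fraud": 0} for _ in range(17100)] + [{"is_fraud": 1} for _ in range(100)]
balanced = undersample_majority(rows)
n_fraud = sum(r["is_fraud"] for r in balanced)
legit_share = (len(balanced) - n_fraud) / len(balanced)
print(f"rows kept: {len(balanced)}, fraud rows: {n_fraud}, legit share: {legit_share:.3f}")
```

Every fraudulent row is kept and only the legitimate class is thinned, so the minority class loses no information. Changing `seed` gives a different random sample of the majority class.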
Problem Definition
We model the fraudulent transaction identification problem with the help of
graph structures: a graph G is a pair G = (V, E), where V is a set whose
elements are called nodes and E is a set of pairs of nodes, whose elements are
called edges. Additional attributes can be assigned to each node. When modeling
real-world problems with graphs, the nodes represent real-world entities and the
edges describe relationships between those entities. The type of a node
describes the real-world entity that the node represents, while the type of an
edge describes the relationship between the entities it connects. In the
financial transaction example, our entities are customers, merchants and
financial transactions, and their relationships are the different interactions
between those entities. This can be represented by the graph structure shown in
Figure 1.
Figure 1. Modeling the fraudulent transaction identification problem using a graph
In this figure we have the following nodes and edges:
Customer nodes: nodes representing the customer entities. There is one
node for each unique customer in the dataset. The type of those nodes is
‘customer’. Each customer node is also paired with the attributes
that further describe each customer (e.g., customer ID, name, gender,
location).
Merchant nodes: nodes representing merchants. There is one node for each
unique merchant in the dataset. The type of those nodes is
‘merchant’. Each merchant node is also paired with the attributes
that further describe each merchant (e.g., merchant ID, name, location).
Transaction nodes: nodes representing transactions. There is one node for
each unique transaction in the dataset. The type of those nodes is
‘transaction’. Each transaction node is also paired with the attributes that
further describe each transaction (e.g., transaction ID, amount, category). An
important transaction node attribute is the label, which denotes
whether the relevant transaction is fraudulent or legitimate. The value for
the label attribute is only given for those transactions whose label is known,
otherwise it is left blank.
Edges between customers and transactions: For each transaction node there
is an edge between that node and the customer node representing the customer
that performed that transaction. The type of those edges is
‘performs_transaction’.
Edges between merchants and transactions: For each transaction node there
is an edge between that node and the merchant node representing the merchant
that participated in that transaction. The type of those edges is
‘in_transaction’.
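The node and edge types listed above can be sketched with plain Python
dictionaries. This is only an illustration of the data model, not the
RelationalAI representation; the function build_graph, the row field names and
the toy rows are hypothetical.

```python
def build_graph(rows):
    """Build typed nodes and typed edges from transaction rows."""
    nodes = {}  # node id -> {"type": ..., plus attributes}
    edges = []  # (source id, target id, edge type)
    for row in rows:
        customer = ("customer", row["customer_id"])
        merchant = ("merchant", row["merchant_id"])
        transaction = ("transaction", row["transaction_id"])
        # One node per unique customer and merchant.
        nodes.setdefault(customer, {"type": "customer"})
        nodes.setdefault(merchant, {"type": "merchant"})
        # One node per transaction; the label stays None when it is unknown.
        nodes[transaction] = {
            "type": "transaction",
            "amount": row["amount"],
            "label": row.get("label"),
        }
        edges.append((customer, transaction, "performs_transaction"))
        edges.append((merchant, transaction, "in_transaction"))
    return nodes, edges


# Made-up rows for illustration; t3 has no known label.
rows = [
    {"transaction_id": "t1", "customer_id": "c1", "merchant_id": "m1",
     "amount": 12.5, "label": 0},
    {"transaction_id": "t2", "customer_id": "c1", "merchant_id": "m2",
     "amount": 830.0, "label": 1},
    {"transaction_id": "t3", "customer_id": "c2", "merchant_id": "m1",
     "amount": 47.9},
]
nodes, edges = build_graph(rows)
unlabeled = [n for n, attrs in nodes.items()
             if attrs["type"] == "transaction" and attrs["label"] is None]
print(len(nodes), len(edges), unlabeled)
```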
Using the above graph structure, we can map the automatic identification of
fraudulent transactions to a node label prediction problem, where we have to
predict the label of each node of type ‘transaction’ for which the label is not
present. In ML terms this is a binary node classification problem, where we
have to classify a node of type ‘transaction’ as fraudulent or legitimate.
RelationalAI’s Solution
To tackle the aforementioned problem we have chosen the XGBoost ML model, which
has shown consistent state-of-the-art prediction performance for a multitude of
classification problems.
The input to the XGBoost model is a set of features (i.e., numerical
attributes): in our case, those features are the metrics that (a) describe a
financial transaction and (b) give some indication of whether a transaction is
fraudulent (for example, the origin of a transaction may be indicative of
fraud, while its transaction ID is much less so). For each transaction whose
label we want to predict, we invoke the model by instantiating each input
feature with its relevant value. The result of the model invocation is the
numerical value of the prediction: in our case this is a binary (0/1) value
describing whether the transaction that is described by the input features is
fraudulent (1) or legitimate (0).
The choice of features is important - strong indicators can greatly improve the
solution. In what follows we discuss how we chose the features for our problem,
a process widely known as feature engineering.
Tabular Feature Engineering
As previously described, each transaction is paired with a set of attributes
that describe it. A subset of those attributes - ideally the ones that have
predictive power for the problem at hand - can be employed (verbatim or with
simple mathematical transformations) as input to the XGBoost model. The process
is largely empirical: during this stage, data scientists employ creative ways
(based on statistical analysis, intuition/experience and feedback from domain
experts) to concatenate, extract or otherwise synthesize additional features
from the original dataset fields. The resulting set is often described as
tabular features. For the fraudulent transaction identification task, we have
engineered the following tabular features:
Client age, derived from the client date of birth
Distance between client and merchant, derived from the client and merchant
(latitude, longitude) coordinates
Transaction day of the week, derived from the transaction timestamp
Transaction time of the day, derived from the transaction timestamp
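The four tabular features above could be derived along the following lines.
This is a minimal sketch under stated assumptions: the great-circle (haversine)
formula is one possible way to turn coordinates into a distance and is not
necessarily the one we used, and the function names and sample values are made
up for illustration.

```python
import math
from datetime import date, datetime


def age_in_years(date_of_birth, on_date):
    """Whole years between the date of birth and the transaction date."""
    before_birthday = (on_date.month, on_date.day) < (
        date_of_birth.month, date_of_birth.day)
    return on_date.year - date_of_birth.year - int(before_birthday)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km (assumes a spherical Earth, r = 6371 km)."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


# Made-up sample transaction.
timestamp = datetime(2020, 6, 21, 12, 14, 25)
features = {
    "client_age": age_in_years(date(1988, 3, 9), timestamp.date()),
    "distance_km": haversine_km(36.08, -81.18, 36.43, -81.90),
    "day_of_week": timestamp.weekday(),  # 0 = Monday ... 6 = Sunday
    "hour_of_day": timestamp.hour,
}
print(features)
```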
Graph Feature Engineering
The modeling of the fraudulent transaction identification task using graphs that
we described above creates opportunities for the creation of additional
features, based on the graph structure of the data. Those features represent
real-world concepts/metrics related to the (long or short range) interaction of
the different entities that are exposed by the graph structure. For example, if
the graph lets us identify customers that are involved in abnormally many
transactions, those customers are more likely to be involved in some suspicious
activity and, in turn, to participate in fraudulent transactions. We will be
referring to those as graph features from now on.
To compute these graph features, we employed the RelationalAI (RAI) AI
coprocessor to Snowflake. It provides a rich interface for graph analytics
functions, which allows for the computation of graph features and the
extraction of insights from graphs. Leveraging RAI on Snowflake, we have
computed and used the following features:
Node degree: the number of edges incident to a node
Degree centrality: the fraction of nodes connected to a node
Eigenvector centrality: the centrality of a node in a graph based on the
centrality of its neighbors
PageRank centrality: the centrality of a node in a graph based on the
recursive centrality of all nodes that link to it
All the above features rank nodes based on their importance within the graph
structure. Not knowing which metric is the most predictive for the task, it is
logical to compute various combinations of those features and subsequently
choose the most beneficial. This process is known as feature selection.
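To make the four centrality metrics listed above concrete, here is a small
plain-Python sketch. It is a stand-in for illustration only and does not use or
reproduce the RAI graph analytics interface; the function names, the iteration
counts, the damping factor of 0.85 and the toy edge list are all assumptions.
Edge direction is ignored.

```python
from collections import defaultdict


def adjacency(edges):
    """Undirected adjacency sets built from (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def degree_centrality(adj):
    """Fraction of the other nodes that each node is connected to."""
    n = len(adj)
    return {u: len(nbrs) / (n - 1) for u, nbrs in adj.items()}


def eigenvector_centrality(adj, iterations=200):
    """Power iteration on A + I; the shift avoids oscillation on the
    bipartite customer/merchant-transaction graph."""
    x = {u: 1.0 for u in adj}
    for _ in range(iterations):
        nxt = {u: x[u] + sum(x[v] for v in adj[u]) for u in adj}
        norm = sum(val * val for val in nxt.values()) ** 0.5
        x = {u: val / norm for u, val in nxt.items()}
    return x


def pagerank(adj, damping=0.85, iterations=100):
    """PageRank by repeated redistribution of rank along edges."""
    n = len(adj)
    rank = {u: 1.0 / n for u in adj}
    for _ in range(iterations):
        rank = {u: (1 - damping) / n
                + damping * sum(rank[v] / len(adj[v]) for v in adj[u])
                for u in adj}
    return rank


# Toy graph: customers c1, c2; merchants m1, m2; transactions t1..t3.
edges = [("c1", "t1"), ("c1", "t2"), ("c2", "t3"),
         ("m1", "t1"), ("m1", "t3"), ("m2", "t2")]
adj = adjacency(edges)
degree = {u: len(nbrs) for u, nbrs in adj.items()}
dc = degree_centrality(adj)
eig = eigenvector_centrality(adj)
pr = pagerank(adj)
print(degree["c1"], round(dc["c1"], 3), eig["c1"] > eig["c2"])
```

In this toy graph customer c1 takes part in two transactions and c2 in one, so
c1 receives the higher degree and eigenvector centrality.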
Feature Selection
Through feature analysis we found that the following graph features improved the
performance of the model:
Client eigenvector centrality
Merchant eigenvector centrality
As an example of the feature selection process, we describe how the client
eigenvector centrality was chosen: we started by inspecting the client
eigenvector centrality distributions for the legitimate and fraudulent classes
(shown in Figure 2 below).
Figure 2. Distribution of eigenvector centralities for the legitimate and fraudulent transactions
Inspecting the figure shows that the client eigenvector centrality
distributions of the two classes have clearly different shapes. This is a
strong indication that the metric behaves very differently for fraudulent and
legitimate transactions and can thus be used to differentiate the two classes.
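The comparison above was done visually. For readers who prefer a number, one
possible way to quantify how far apart two such distributions are is the
two-sample Kolmogorov-Smirnov statistic, i.e. the largest gap between the two
empirical cumulative distributions. This is a technique swapped in here for
illustration, not part of our analysis, and the sample values below are made
up.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the empirical CDFs of two samples.

    0.0 means the empirical distributions coincide; 1.0 means the samples
    do not overlap at all.
    """
    a, b = sorted(sample_a), sorted(sample_b)
    points = sorted(set(a) | set(b))
    return max(
        abs(bisect_right(a, p) / len(a) - bisect_right(b, p) / len(b))
        for p in points
    )


# Made-up centrality values for the two classes, for illustration only.
legitimate = [0.010, 0.012, 0.015, 0.018, 0.020, 0.022]
fraudulent = [0.019, 0.030, 0.034, 0.041]
gap = ks_statistic(legitimate, fraudulent)
print(round(gap, 3))
```

A feature whose per-class samples yield a large gap is a candidate for
inclusion; the final choice should still be validated on model performance.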
Results
To experimentally validate the proposed solution and at the same time inspect
the effect of graph features we compared the performance of the XGBoost model
when the graph features are present and when they are absent from the input. For
each of the two cases we report the results for the subset of features that
provides the best performance. We repeated each experiment 5 times, taking a
different random sample of the majority class each time. The results are shown
in Figure 3 below:
Figure 3. Prediction performance with and without graph features
We observe that adding the graph features yields an absolute improvement of
4.4% in precision, a small drop in recall (2.2%) and an increase of 4.3% in F1
score. This is a significant improvement, given the imbalanced nature of the
data. Note that we report performance metrics on the fraud class only, since
this is the class that we are primarily interested in.
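For reference, precision, recall and F1 score on the fraud class (label 1) can
be computed from true and predicted labels as sketched below. The function name
and the toy labels are hypothetical and are not taken from our experiments.

```python
def fraud_class_metrics(y_true, y_pred):
    """Precision, recall and F1 score for the positive (fraud = 1) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # fraud caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # fraud missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy labels: 3 fraudulent and 5 legitimate transactions.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall, f1 = fraud_class_metrics(y_true, y_pred)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Because legitimate transactions dominate the data, accuracy over both classes
would look high even for a model that never flags fraud, which is why the
fraud-class metrics are the more informative ones.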
Conclusion
Graphs can provide a powerful tool for data scientists to combat financial
fraud. In this post, we discussed the problem of automatic identification of
fraudulent financial transactions and we described our solution, built on top of
RelationalAI's AI coprocessor, a powerful graph analytics engine. As part of our
solution we (a) mitigated the large class imbalance through subsampling and (b)
leveraged graph features to further improve our solution. In particular, we
demonstrated that by employing the client and merchant eigenvector
centrality we were able to boost the prediction performance by 4.4% in
precision and 4.3% in F1 score. Our work highlights the value of graphs for
financial fraud detection, which could result in substantial cost savings.
About the Authors
Zografoula Vagena joined RelationalAI in 2017. After graduating with a PhD
in Computer Science from the University of California, Riverside, she held different
positions both in research (IBM Research, Microsoft Research, University of Southern
Denmark, Rice University, Université Paris Cité) and industry (Concentra Consulting,
Logicblox Inc, Infor Inc) and is currently part of RelationalAI as a principal data
scientist. Her work spans all aspects of data management and analysis (data management
systems, data science, optimization).
Spiros Politis joined RelationalAI as a senior data scientist in 2022. He
holds a Master’s degree in Computer Science from the University of Bristol, U.K.
and a Master’s degree in Data Science from Athens University of Economics and Business.
His 25-year career includes positions in IT Consulting (Information Systems Impact
Ltd., Agilis), Telecommunications (OTE group of companies), and Machine Learning
/ Engineering (Aisera).