Browser Session Analytics: The Key to Fraud Detection


This talk shows how a fraud detection model was developed from the browsing-session data of online banking users. Tools such as PySpark and Spark ML were used in this initiative due to the large amount of data involved.


The model was able to identify a grouping of characteristics that covered 10% of the total sessions, of which 88% were deemed fraudulent. This allows analysts to spend more of their time on higher-risk cases.



Transcription


Hello, I'm Javier Alcaide. I work as a data scientist at Blue Tab Solutions, designing and developing machine learning solutions. At Blue Tab we are experts in advanced analytics and big data, which allows us to help our clients with this kind of project. Over the last few years, financial fraud has grown dramatically, and this trend has worsened with the pandemic. At the beginning of the year, one of our clients in the financial sector asked us to improve the way they detect financial fraud in their online banking applications. To solve this problem, they provided us with a dataset from Adobe Omniture containing around 80 million records of online banking app sessions, each one with 45 fields of information, along with a dataset containing the frauds detected by their fraud team in recent months. We tackled the problem on our client's big data platform, and due to the size of the datasets, we decided to use Spark for the processing and analysis of the data.

Our approach follows a well-known data mining methodology, CRISP-DM. This process divides the solution into five major phases.

The first one is business understanding. The purpose of this phase is to align the objectives of the project with the business objectives. We focused on understanding the client's expectations and the project goals, and with this knowledge of the problem we designed a preliminary plan to achieve the objectives.

The second phase is data understanding. We consider this the most important phase of the methodology. Here, the goal is to get to know the data: its structure, its distribution, and its quality. We started with a univariate analysis of the dataset's columns against the target. Our conclusions from this analysis were crucial in deciding which variables would be included in the training of the model. In this phase we discovered, for example, that in 70% of the fraudulent sessions the mobile cash page was accessed from the web application; that 90% of the sessions opened from one particular device, the UMI Plus, were fraudulent, which covered around 15% of the frauds; and that in around 75% of the fraudulent sessions the operating system used was Windows 8.1. The extraction of these insights is the differential value that a data scientist can offer in the creation of models. With this acquired knowledge, and by selecting the best features, we were able to create much more accurate models for the detection of fraudulent transactions.

The third phase is data preparation. Once the variables are selected, it is time to prepare the dataset to train the different models. It is typically necessary to cleanse the data, ensuring that null values and outliers are identified. This, combined with mathematical transformations such as exponential or logarithmic functions, can improve the dispersion of the distributions, which helps the model train better. This cleansing and these transformations resulted in a new dataset with more than 200 features. We used the Pearson correlation matrix to group the features into correlated families, from which we could choose the best one for the model.

The fourth phase is modeling and validation. Once the training dataset was constructed, we used algorithms from the Spark ML libraries, specifically decision trees, the random forest classifier, and the gradient boosting classifier, to create our models.
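As an illustration of the univariate analysis against the target described in the data understanding phase, here is a minimal PySpark sketch. The column names, label, and path are hypothetical assumptions, not the client's actual schema.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("fraud-univariate-analysis").getOrCreate()

# Hypothetical sessions dataset, already joined with the fraud team's labels.
sessions = spark.read.parquet("/data/omniture/sessions_labeled")

# Fraud rate and volume per device model: which devices concentrate fraud?
device_stats = (
    sessions.groupBy("device_model")
    .agg(
        F.count("*").alias("sessions"),
        F.sum(F.col("is_fraud").cast("int")).alias("frauds"),
        F.avg(F.col("is_fraud").cast("double")).alias("fraud_rate"),
    )
    .orderBy(F.desc("fraud_rate"))
)
device_stats.show(20, truncate=False)
```

The same aggregation can be repeated column by column (page visited, operating system, and so on) to surface patterns like the ones mentioned in the talk.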
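For the data preparation step, this is a sketch of grouping features into correlated families with a Pearson correlation matrix, reusing the hypothetical `sessions` DataFrame from the previous sketch. The feature names and the 0.8 threshold are illustrative assumptions.

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation

# Hypothetical numeric feature columns.
feature_cols = ["n_pages_visited", "session_duration_s", "transfer_amount", "n_logins_24h"]
assembled = VectorAssembler(inputCols=feature_cols, outputCol="features").transform(sessions)

# Pearson correlation matrix over the assembled feature vector.
corr_matrix = Correlation.corr(assembled, "features", method="pearson").head()[0].toArray()

# Greedily group features whose absolute pairwise correlation exceeds 0.8,
# so only one representative per family needs to be kept for training.
threshold = 0.8
families, assigned = [], set()
for i, name in enumerate(feature_cols):
    if name in assigned:
        continue
    family = [name]
    assigned.add(name)
    for j in range(i + 1, len(feature_cols)):
        other = feature_cols[j]
        if other not in assigned and abs(corr_matrix[i][j]) > threshold:
            family.append(other)
            assigned.add(other)
    families.append(family)

print(families)
```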
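And for the modeling phase, a minimal Spark ML training sketch with the gradient boosted trees classifier the talk mentions, again reusing the hypothetical `sessions` DataFrame. Column names, the example logarithmic transformation, and the hyperparameters are assumptions for illustration only.

```python
from pyspark.sql import functions as F
from pyspark.ml import Pipeline
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.feature import VectorAssembler

# Example of a log transform to reduce skew, plus a numeric label column.
prepared = (
    sessions.withColumn("log_amount", F.log1p("transfer_amount"))
            .withColumn("label", F.col("is_fraud").cast("double"))
)

feature_cols = ["n_pages_visited", "session_duration_s", "log_amount", "n_logins_24h"]
train, test = prepared.randomSplit([0.8, 0.2], seed=42)

assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
gbt = GBTClassifier(labelCol="label", featuresCol="features", maxDepth=5, maxIter=100)

pipeline = Pipeline(stages=[assembler, gbt])
model = pipeline.fit(train)
predictions = model.transform(test)
```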
For the validation, we decided to use the area under the ROC curve as the metric, because the target was not balanced in the dataset, which means that metrics such as accuracy cannot be used.

In the deployment phase, the last one, we used our client's big data platform, based on HDFS and Spark, to deploy the model. It runs once a day on the previous day's data, which amounts to around 6 million records. Since the model is designed and developed using Spark, it can be deployed on any platform, cloud or on-premise, capable of running Spark applications.

After the validation of the model, we found that the GBT classifier yielded the best result, with a score of 0.94 for the area under the curve. The model was able to identify a grouping of sessions which covered 10% of the total sessions and included 90% of the frauds. This allows analysts to spend more of their time on higher-risk cases.

In conclusion, in order to have more accurate models, it is important to use the full population of the data, which would be impossible without big data tools such as PySpark. These results are based on the prior study of the variables and the insights obtained during the analysis. On the other hand, this kind of model becomes outdated quite fast, so it is necessary to retrain it regularly, usually every two months. The next step would be to run this model in real time, so the client can act swiftly when fraud is detected, for example by asking for a second authentication factor or blocking the transaction if the model predicts a fraudulent session.

Thank you very much, and any questions are welcome.
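A sketch of the validation metric described in the transcript, area under the ROC curve via Spark ML's BinaryClassificationEvaluator, followed by persisting the pipeline for a daily batch job. It assumes the `model` and `predictions` objects from the training sketch above; the save path is hypothetical.

```python
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Area under the ROC curve, suited to the imbalanced fraud target.
evaluator = BinaryClassificationEvaluator(
    labelCol="label",
    rawPredictionCol="rawPrediction",
    metricName="areaUnderROC",
)
auc = evaluator.evaluate(predictions)
print(f"Area under ROC curve: {auc:.3f}")

# Persist the trained pipeline so the daily scoring job can reload it.
model.write().overwrite().save("/models/fraud_gbt")
```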
02 Jul, 2021
