Hello, I'm Javier Alcaide. I work as a data scientist at Blue Tab Solutions, designing and developing machine learning solutions. At Blue Tab we are experts in advanced analytics and big data, which allows us to help our clients with this kind of project. Over the last few years, financial fraud has grown dramatically, and this trend has worsened with the pandemic. At the beginning of the year, one of our clients in the financial sector asked us to improve the way they detected financial fraud in their online banking applications. To solve this problem, they provided us with a dataset from Adobe Omniture containing around 80 million records of online banking app sessions, each with 45 fields of information, along with a dataset of the frauds their fraud team had detected in recent months.
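To give a rough idea of how such a labeled dataset could be assembled, here is a minimal PySpark sketch that joins the session extract with the confirmed-fraud list; the paths and column names (session_id, is_fraud) are illustrative assumptions, not the client's actual schema.

```python
# Minimal sketch of building a labeled training set from the two extracts.
# Paths and column names (session_id, is_fraud) are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("fraud-labeling").getOrCreate()

# Adobe Omniture clickstream sessions: ~80M rows, 45 fields each.
sessions = spark.read.parquet("/data/omniture/sessions")

# Confirmed frauds reported by the fraud team in recent months.
frauds = spark.read.parquet("/data/fraud/confirmed").select("session_id")

# Left join and derive a binary target: 1 if the session was reported as fraud.
labeled = (
    sessions.join(frauds.withColumn("is_fraud", F.lit(1)), on="session_id", how="left")
            .fillna({"is_fraud": 0})
)
labeled.groupBy("is_fraud").count().show()  # quick check of the class imbalance
labeled.write.mode("overwrite").parquet("/data/fraud/labeled_sessions")
```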
We tackled the problem on our client's big data platform, and given the size of the datasets, we decided to use Spark for the processing and analysis of the data. Our approach follows a well-known data mining methodology, CRISP-DM, which divides the solution into five major phases.
The first phase is business understanding. The purpose of this phase is to align the project objectives with the client's business objectives. We focused on understanding the client's expectations and the goals of the project, and with that knowledge of the problem we drew up a preliminary plan to achieve those objectives.
The second phase is data understanding. We consider this the most important phase of the methodology. Its goal is to get to know the data: its structure, its distribution, and its quality. We started with a univariate analysis of the dataset's columns against the target, and our conclusions from this analysis were crucial in deciding which variables would be included in the training of the model. In this phase we discovered, for example, that in 70% of the fraudulent sessions the mobile cash page was accessed from the web application; that 90% of the sessions opened from one particular device, the UMI Plus, were fraudulent, which covered around 15% of the frauds; and that in around 75% of the fraudulent sessions the operating system used was Windows 8.1. Extracting this kind of insight is the differential value that a data scientist can offer in the creation of models. With this acquired knowledge, and by selecting the best features, we were able to build much more accurate models for detecting fraudulent transactions.
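The kind of univariate analysis behind those figures can be expressed compactly in PySpark. The sketch below computes, for a hypothetical categorical column such as device_type, the fraud rate within each value and the share of all frauds that the value covers; the path and column names are assumptions.

```python
# Sketch of a univariate analysis of one categorical feature against the fraud target.
# The path and column names (device_type, is_fraud) are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("univariate-analysis").getOrCreate()
labeled = spark.read.parquet("/data/fraud/labeled_sessions")  # labeled set from the earlier sketch

total_frauds = labeled.filter(F.col("is_fraud") == 1).count()

device_stats = (
    labeled.groupBy("device_type")
           .agg(F.count("*").alias("sessions"),
                F.sum("is_fraud").alias("frauds"))
           .withColumn("fraud_rate", F.col("frauds") / F.col("sessions"))       # share of sessions that are fraud
           .withColumn("fraud_coverage", F.col("frauds") / F.lit(total_frauds)) # share of all frauds covered
           .orderBy(F.desc("fraud_rate"))
)
device_stats.show(20, truncate=False)
```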
The third phase is data preparation. Once the variables are selected, it is time to prepare the dataset used to train the different models. It is typically necessary to cleanse the data, making sure null values and outliers are identified. This, combined with mathematical transformations such as exponential or logarithmic functions, can improve the dispersion of the distributions, which helps the models train better. The whole cleansing and transformation process resulted in a new dataset with more than 200 features. We used the Pearson correlation matrix to group the features into correlated families, from which we could choose the best one for the model.
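As a minimal illustration of those transformation and correlation steps, the sketch below applies a logarithmic transform to a skewed variable and computes a Pearson correlation matrix over a few candidate features; the path, feature names, and feature list are assumptions.

```python
# Sketch: log-transform a skewed numeric variable and compute the Pearson correlation
# matrix over candidate features, to group them into correlated families.
# The path and feature names are illustrative assumptions.
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("data-preparation").getOrCreate()
labeled = spark.read.parquet("/data/fraud/labeled_sessions")

# Logarithmic transform to reduce the skew of a long-tailed variable.
prepared = labeled.withColumn("log_session_length", F.log1p("session_length"))

feature_cols = ["log_session_length", "page_views", "login_attempts"]  # illustrative subset
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features", handleInvalid="skip")
vectorized = assembler.transform(prepared).select("features")

# Pearson correlation matrix of the assembled feature vector.
corr = Correlation.corr(vectorized, "features", method="pearson").head()[0]
print(corr.toArray())
```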
The fourth phase is modeling and validation. Once the training dataset was constructed, we used algorithms from the Spark ML library, specifically the decision tree, random forest, and gradient-boosted tree classifiers, to create our models. For the validation, we decided to use the area under the ROC curve as the metric, because the target was not balanced in the dataset, which means that metrics such as accuracy cannot be used.
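As a hedged sketch of this modeling and validation step, the code below trains a Spark ML gradient-boosted tree classifier and scores it with the area under the ROC curve; the feature columns, paths, and the 80/20 split are assumptions.

```python
# Sketch of training and validating a gradient-boosted tree classifier with Spark ML.
# The feature columns, paths, and the 80/20 split are illustrative assumptions.
from pyspark.ml import Pipeline
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fraud-modeling").getOrCreate()
labeled = spark.read.parquet("/data/fraud/labeled_sessions")

train, test = labeled.randomSplit([0.8, 0.2], seed=42)

assembler = VectorAssembler(
    inputCols=["session_length", "page_views", "login_attempts"],  # illustrative subset
    outputCol="features", handleInvalid="skip",
)
gbt = GBTClassifier(featuresCol="features", labelCol="is_fraud", maxIter=50)
model = Pipeline(stages=[assembler, gbt]).fit(train)

# Area under the ROC curve handles the strong class imbalance of the target.
evaluator = BinaryClassificationEvaluator(
    labelCol="is_fraud", rawPredictionCol="rawPrediction", metricName="areaUnderROC"
)
print("AUC:", evaluator.evaluate(model.transform(test)))

# Persist the fitted pipeline so a daily batch job can load it.
model.write().overwrite().save("hdfs:///models/fraud_gbt")
```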
In the deployment phase, the last one, we used our client's big data platform, based on HDFS and Spark, to deploy the model. It runs once a day on the previous day's data, which amounts to around 6 million records. Since the model was designed and developed with Spark, it can be deployed on any platform, cloud or on-premise, capable of running Spark applications.
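A daily batch scoring job on such a platform could look roughly like the following; the model path, data paths, and the session_date column are assumptions.

```python
# Sketch of the daily batch scoring job: load the persisted pipeline, score the
# previous day's sessions, and write the flagged sessions back to HDFS.
# Paths and the session_date column are illustrative assumptions.
import datetime

from pyspark.ml import PipelineModel
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily-fraud-scoring").getOrCreate()

yesterday = (datetime.date.today() - datetime.timedelta(days=1)).isoformat()

sessions = (
    spark.read.parquet("/data/omniture/sessions")
         .filter(F.col("session_date") == yesterday)  # roughly 6M records per day
)

model = PipelineModel.load("hdfs:///models/fraud_gbt")
scored = model.transform(sessions)

# Keep only the sessions the model flags as likely fraud for the analysts to review.
(scored.filter(F.col("prediction") == 1.0)
       .write.mode("overwrite")
       .parquet(f"/data/fraud/scored/{yesterday}"))
```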
After validating the models, we found that the GBT classifier yielded the best result, with an area under the ROC curve of 0.94. The model was able to identify a group of sessions covering 10% of the total, which contained 90% of the frauds. This allows the analysts to spend more of their time on the higher-risk cases. In conclusion, in order to build more accurate models it is important to use the full population of the data, which would be impossible without big data tools such as PySpark. These results are built on the prior study of the variables and the insights obtained during the analysis. On the other hand, this kind of model becomes outdated quite fast, so it needs to be retrained regularly, usually every two months. The next step would be to run this model in real time, so the client can act swiftly when fraud is detected, for example by requesting a second authentication factor or blocking the transaction when the model predicts a fraudulent session.
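Purely as a hypothetical sketch of that real-time next step, the same persisted pipeline could be applied with Spark Structured Streaming; the Kafka broker, topic, schema, and model path below are assumptions and not part of the project described above.

```python
# Hypothetical sketch of real-time scoring with Spark Structured Streaming.
# Broker, topic, schema, and model path are illustrative assumptions only.
from pyspark.ml import PipelineModel
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

spark = SparkSession.builder.appName("realtime-fraud-scoring").getOrCreate()

schema = StructType([
    StructField("session_id", StringType()),
    StructField("device_type", StringType()),
    StructField("page_views", IntegerType()),
])

events = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("subscribe", "banking-sessions")
         .load()
         .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
         .select("e.*")
)

model = PipelineModel.load("hdfs:///models/fraud_gbt")
alerts = model.transform(events).filter(F.col("prediction") == 1.0)

# Each flagged session could trigger a step-up authentication or a block downstream.
query = alerts.writeStream.format("console").outputMode("append").start()
query.awaitTermination()
```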
Thank you very much, and any questions are welcome.