Fraud Data Analyst
Started on Jan. 6, 2023
BNP Paribas Personal Finance is the leader in personal financing in France and in Europe through its consumer credit activity. A wholly owned subsidiary of the BNP Paribas Group, BNP Paribas Personal Finance has more than 20,000 employees and operates in some thirty countries. Under brands such as Cetelem, Cofinoga and Findomestic, BNP Paribas Personal Finance provides its clients with a complete range of consumer credit products, available in stores and car dealerships or directly through customer relationship centers and the company's local websites.
BNP Paribas Personal Finance has developed an active strategy of support for retailers, carmakers and dealers, web merchants, and various financial institutions (banking and insurance), based on its experience of the consumer credit market and its ability to offer services adapted to the activity and commercial strategy of its business partners. It is also a key player in responsible credit and budgetary awareness.
BNPP Personal Finance is, by nature, exposed to Credit Risk, and relies heavily on quantitative models to manage it. Within BNP Paribas Personal Finance, the Central Risk Department is responsible for the relevance of the risk rating models used across all local entities and for maintaining a high level of expertise by integrating new statistical techniques into new modelling environments.
The Credit Process Optimization team is part of the RISK department of BNPP PF, within Risk Personal Finance Global Credit Decision-making Policies. We contribute to the rationalization and optimization of risk decision processes through an analytical approach. We support local risk teams in improving the efficiency of credit processes, including the fraud component, and in striking the best balance between profitability, customer journey and risk profile.
Fraud is a major problem for merchants. Criminals use a wide variety of methods to attack organizations across systems, channels, processes and products. The development of fraud detection methods is thus of crucial importance. Fraud detection is a challenging problem because fraudsters make their best efforts to make their behaviour look legitimate. Another difficulty is that the number of legitimate records is far greater than the number of fraudulent cases.
The aim of our challenge is to find the best way to process and analyse basket data from one of our retailer partners in order to detect fraud cases.
Using these basket data, fraudulent customers should be detected, to be then refused in the future.
This dataset contains a list of financed purchases, in which there is information about the contents of the baskets.
Each line of the dataset has 147 columns, of which 144 are grouped into 6 categories: item, cash_price, make, model, goods_code and Nbr_of_prod_purchas.
For each of these categories there are 24 instances that will either be filled with relevant information when an item exists in the basket, or null when it does not. For example, if an application has 3 items in the basket there will be information in the columns item1 to item3, cash_price1 to cash_price3, make1 to make3, model1 to model3, goods_code1 to goods_code3 and Nbr_of_prod_purchas1 to Nbr_of_prod_purchas3; but the remaining columns of these categories will be null (e.g. item4 to item24).
The variable Nb_of_items corresponds to the number of distinct items in the basket, or to the sum of the Nbr_of_prod_purchas values in cases where one of those instances is greater than 1.
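The sparse slot layout described above can be handled with simple column selections. The sketch below builds a toy single-row basket (the column names follow the dataset description; the values are made up) and counts both the filled item slots and the total number of products:

```python
import numpy as np
import pandas as pd

# Toy single-row basket with 3 of the 24 item slots filled (column names
# follow the dataset description; the values are hypothetical).
row = {f"item{i}": None for i in range(1, 25)}
row.update({f"Nbr_of_prod_purchas{i}": np.nan for i in range(1, 25)})
row.update({"item1": "Computer", "item2": "Phone", "item3": "Headphones",
            "Nbr_of_prod_purchas1": 1, "Nbr_of_prod_purchas2": 2,
            "Nbr_of_prod_purchas3": 1})
df = pd.DataFrame([row])

item_cols = [f"item{i}" for i in range(1, 25)]
qty_cols = [f"Nbr_of_prod_purchas{i}" for i in range(1, 25)]

# Number of filled item slots (distinct items) per application:
n_slots = df[item_cols].notna().sum(axis=1)
# Total number of products purchased (sum of quantities, NaNs ignored):
n_products = df[qty_cols].sum(axis=1)
print(int(n_slots.iloc[0]), int(n_products.iloc[0]))  # 3 4
```

Here the basket has 3 distinct items but 4 products in total, since item 2 was purchased twice.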
Finally, the fraud_flag variable indicates if that application is fraudulent or not.
| Variable | Description | Example |
|---|---|---|
| ID (Num) | Unique identifier | 1 |
| item1 to item24 (Char) | Goods category for item 1 to item 24 | Computer |
| cash_price1 to cash_price24 (Num) | Cash price for item 1 to item 24 | 850 |
| make1 to make24 (Char) | Manufacturer for item 1 to item 24 | Apple |
| model1 to model24 (Char) | Model description for item 1 to item 24 | Apple Iphone XX |
| goods_code1 to goods_code24 (Char) | Retailer's code for item 1 to item 24 | 2378284364 |
| Nbr_of_prod_purchas1 to Nbr_of_prod_purchas24 (Num) | Number of products for item 1 to item 24 | 2 |
| Nb_of_items (Num) | Total number of items | 7 |
| Variable | Description |
|---|---|
| ID (Num) | Unique identifier |
| fraud_flag (Num) | Fraud = 1, Not fraud = 0 |
Shape: 115 988 observations, 147 columns.
Distribution of Y:
Fraud (Y=1) : 1 681 observations
No Fraud (Y=0) : 114 307 observations
The fraud rate for the whole population is around 1.4%.
The sampling method applied is simple random sampling without replacement: 80% of the initial dataset was used to create the training sample and 20% the test sample.
Shape: 92 790 observations, 147 columns.
Distribution of Y_train:
Fraud (Y=1) : 1 319 observations
No Fraud (Y=0) : 91 471 observations
Shape: 23 198 observations, 147 columns.
Distribution of Y_test:
Fraud (Y=1) : 362 observations
No Fraud (Y=0) : 22 836 observations
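An 80/20 simple random split like the one above can be reproduced with scikit-learn. The data below is a synthetic stand-in (the real files require registration) that only mimics the challenge's shape and ~1.4% positive rate:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the challenge data: 115 988 rows, ~1.4% positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(115_988, 5))
y = (rng.random(115_988) < 0.0145).astype(int)

# 80/20 simple random split without replacement, as in the challenge.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)  # (92790, 5) (23198, 5)
```

The resulting sample sizes match the shapes stated above (92,790 training and 23,198 test observations).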
The objective is to identify fraudulent operations within a population by predicting a fraud risk/probability. Hence, the metric that will be used is the area under the precision-recall curve, also denoted PR-AUC.
The precision-recall curve is obtained by plotting the precision (TP / (TP + FP)) on the y-axis and the recall (TP / (TP + FN)) on the x-axis for all values of the probability threshold between 0 and 1.
This metric is very useful for properly evaluating a model’s performance on the minority class in severely imbalanced classification problems.
The higher the PR-AUC, the better the model is at correctly detecting the minority class.
In this challenge, the PR-AUC will be calculated as the weighted mean of precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight:

AP = sum_n (R_n - R_{n-1}) * P_n

where P_n and R_n are the precision and recall at the nth threshold.
N.B. This implementation corresponds to the average_precision_score metric of sklearn.
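The sketch below checks this equivalence on a tiny made-up example (hypothetical labels and scores): it computes `average_precision_score` directly, then recomputes the same weighted mean of precisions by hand from the precision-recall curve:

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

# Tiny illustrative example: 3 frauds among 8 applications, with
# hypothetical predicted fraud probabilities.
y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1])
y_score = np.array([0.1, 0.5, 0.9, 0.2, 0.6, 0.05, 0.7, 0.3])

ap = average_precision_score(y_true, y_score)

# Same value computed by hand: AP = sum_n (R_n - R_{n-1}) * P_n.
# precision_recall_curve returns recall in decreasing order, hence the sign.
precision, recall, _ = precision_recall_curve(y_true, y_score)
ap_manual = -np.sum(np.diff(recall) * precision[:-1])
print(round(ap, 4), round(ap_manual, 4))  # both 0.7556
```

Both computations give the same PR-AUC, as the N.B. above states.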
Hence, your submission file should have the following format:
| Variable | Description |
|---|---|
| ID (Num) | Unique identifier |
| fraud_flag (Num) | Probability estimate (positive float between 0 and 1) for the minority class (fraud). The higher, the riskier. |
You can use the Y_test_random and Y_test_benchmark .csv files as examples to verify the exact format and size of a valid submission file for this challenge.
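As a minimal illustration of the expected format, the sketch below builds a submission with one probability per test ID and writes it to CSV. The IDs and scores here are placeholders; real IDs come from the test file:

```python
import numpy as np
import pandas as pd

# Hypothetical minimal submission: one fraud probability per test ID.
# Integer IDs 1..23198 are placeholders for the real test IDs.
ids = np.arange(1, 23_199)
probs = np.random.default_rng(0).random(len(ids))  # placeholder scores

submission = pd.DataFrame({"ID": ids, "fraud_flag": probs})
assert submission["fraud_flag"].between(0, 1).all()
submission.to_csv("submission.csv", index=False)
print(submission.shape)  # (23198, 2)
```

The file has exactly one row per test observation and the two columns described in the table above.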
This first benchmark is a naive baseline model that randomly predicts probabilities between 0 and 1.
This second benchmark is based on our current solution where several pre-processing steps are applied, and a fine-tuned Machine Learning model is used to predict fraud risk.