Pladifes is a research project hosted at the Institut Louis Bachelier. Impactfull provides sustainable finance consulting services as well as raw ESG data from corporates.
Started on Jan. 6, 2023
As a growing number of investors and individuals engage with environmental and social concerns, corporate extra financial characteristics (also referred to as ESG, for Environment, Social and Governance) have become increasingly important.
Impactfull is a company providing sustainable finance consulting as well as extra financial data based on verified raw data sourced from corporates. Its data comprises 30+ indicators (within the environmental, social and governance categories) and is extracted from sustainability reports.
Pladifes is a research project hosted at the Institut Louis Bachelier, a nonprofit association promoting research in economics and finance. It is an EquipEx+ (code: ANR-21-ESRE-0036), financed by the ANR, that aims to facilitate access to extra financial data for research purposes.
This challenge results from a collaboration between the two aforementioned parties, as an opportunity for both to gain visibility and to allow students to work on (hopefully) interesting extra financial data.
The Pladifes team is also planning to include the best submission of the challenge in its database, allowing researchers to use it for academic project purposes.
The goal of the challenge is to predict the missing values for 15 corporate extra financial indicators (up to 96% missing values). These indicators are available over three years (2018, 2019, 2020) and come from sustainability disclosures.
In both input sets (X_train and X_test), additional missing values have been artificially introduced relative to the output sets (y_train, y_test). These additional missing values are used to measure model performance by comparing the imputed values with the true hidden ones. Otherwise, input and output files have exactly the same number of rows and columns.
The objective is thus to train a missing value imputation model on the train data and to use it on the test data to fill the holes. The most accurate model wins the challenge!
The input data contains 15 extra financial indicators on 10 000 companies over up to three years (2018, 2019, 2020).
Each line is defined by a unique "ID" and corresponds to a certain firm (defined by "company_id") and a given year (defined by "year"). The data has been selected so that no indicator has more than 96% missing values. Companies are anonymized and are split between the train set and the test set such that a given company appears in only one of them.
The first line of each input file contains the header, and the columns are separated by commas. Overall file sizes are under 10 MB.
The output files (y_train, y_test) contain the original data, and the input files (X_train, X_test) contain the same data with additional missing values. For submission, participants take the test input file (X_test) and return a version with no missing values remaining (see the random submission example). This version is then compared with the output test file.
The number of masked values per column is at least 100 and at most 15% of the available non-missing values (for both X_train and X_test).
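The relationship between input and output files can be illustrated with a small pandas sketch. The two frames below are made-up stand-ins for the challenge files, and the variable names are illustrative; the only point is that a cell counts as artificially masked when it is missing in the input but present in the output.

```python
import numpy as np
import pandas as pd

# Toy stand-in for an output file (values are invented for illustration).
y = pd.DataFrame({
    "scope_1": [10.0, np.nan, 30.0, 40.0],
    "revenue": [1.0, 2.0, np.nan, 4.0],
})

# Toy stand-in for the matching input file: same shape, extra holes.
X = y.copy()
X.loc[0, "scope_1"] = np.nan  # known in y, hidden in X
X.loc[3, "revenue"] = np.nan  # known in y, hidden in X

# A cell is artificially masked if it is missing in X but present in y.
# Cells missing in both files cannot be scored, since no true value exists.
masked = X.isna() & y.notna()

print(masked.sum())  # number of artificially masked cells per column
```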
Below is a description of the columns of the datasets (common to all files).
Column names | Description | Missing value (%) |
---|---|---|
anonimized_id | The index (unique) | 0% |
company_id | The company id (unique) | 0% |
year | The year (2018, 2019, 2020) | 0% |
region | The region (5 options) | 0% |
headquarters_country | The headquarter country (99 options) | 0% |
industry | The primary industry sector (153 options) | 0% |
market cap | The "size of the company" (i.e. stock value x number of stocks, in $) | 0% |
employees | The number of employees of the company | 16% |
revenue | The annual revenue of the company (in $) | 2% |
scope_1 | The scope 1 (direct) GHG emission of the company (in tCO2e) | 66% |
scope_2 | The scope 2 (indirect, owned) GHG emission of the company (in tCO2e) | 66% |
scope_3 | The scope 3 (indirect, not owned) GHG emission of the company (in tCO2e) | 69% |
waste_production | The annual amount of waste produced (in T) | 84% |
waste_recycling | The annual amount of waste recycled (in T) | 89% |
water_consumption | The annual amount of water consumed (in T) | 72% |
water_withdrawal | The annual amount of water withdrawn (in T) | 74% |
energy_consumption | The annual energy consumption (in kWh) | 82% |
hours_of_training | The number of hours of training for all employees for one year (in H) | 87% |
gender_pay_gap | Mean men's annual earnings above women's annual earnings (in %) | 96% |
independant_board members_percentage | The percentage of independent members in the board of the company (in %) | 75% |
legal_costs_paid for_controversies | The annual amount of the legal costs paid for controversies (in $) | 59% |
ceo_compensation | The annual CEO compensation (in $) | 92% |
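All features being continuous, the Normalized Root Mean Square Error (NRMSE) was chosen as the metric for this challenge. The Root Mean Square Error (RMSE) is normalized by the difference between the maximum and the minimum of the actual values:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(\hat{y}_i - y_i\right)^2} \qquad (1)$$
$$\mathrm{NRMSE} = \frac{\mathrm{RMSE}}{y_{\max} - y_{\min}} \qquad (2)$$
where $y_i$ are the actual values, $\hat{y}_i$ the imputed values, and $n$ the number of masked values for the feature.
NRMSEs are computed for each feature with missing values and averaged over the $F$ features (here $F = 15$) to get the final score:
$$\mathrm{Score} = \frac{1}{F} \sum_{j=1}^{F} \mathrm{NRMSE}_j \qquad (3)$$
Hence, the score represents the average error across all features, each normalized by its own range. The goal is thus to minimize this quantity.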
The error between imputed and actual values on the test data is measured by the output NRMSE, computed through the following steps, iterated over all 15 columns with missing values:
We filter the dataframe to keep only the rows whose values were initially masked in the input data.
If relevant, we log scale the data.
We compute the NRMSE (Eq. (2)) for this column.
Once all NRMSEs are computed, we average them to get the final result (Eq. (3)). If the parameter verbose is set to True, the NRMSE per column is also printed. This makes it possible to track the best and worst predicted indicators.
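The scoring steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the organisers' scoring code: the function name `nrmse_score`, the `log_columns` argument and the use of `np.log1p` for log scaling are hypothetical, since the text does not say which columns are log scaled or how.

```python
import numpy as np
import pandas as pd

def nrmse_score(X, y_true, y_pred, columns, log_columns=(), verbose=False):
    """Average range-normalized RMSE over the artificially masked cells.

    Assumes each scored column has at least two distinct actual values
    among its masked cells, so that the range is non-zero.
    """
    scores = {}
    for col in columns:
        # Step 1: keep only rows masked in the input but known in the output.
        mask = X[col].isna() & y_true[col].notna()
        actual = y_true.loc[mask, col].to_numpy(dtype=float)
        imputed = y_pred.loc[mask, col].to_numpy(dtype=float)
        # Step 2: optional log scaling (log1p is an assumption of this sketch).
        if col in log_columns:
            actual, imputed = np.log1p(actual), np.log1p(imputed)
        # Step 3: RMSE divided by the range of the actual values.
        rmse = np.sqrt(np.mean((imputed - actual) ** 2))
        scores[col] = rmse / (actual.max() - actual.min())
        if verbose:
            print(f"{col}: {scores[col]:.4f}")
    # Final score: plain average of the per-column NRMSEs.
    return float(np.mean(list(scores.values())))

# Toy data (invented values) to exercise the function.
y_true = pd.DataFrame({"a": [0.0, 10.0, 20.0, 5.0], "b": [1.0, 2.0, 3.0, 4.0]})
X = y_true.copy()
X.loc[[0, 2], "a"] = np.nan
X.loc[[1, 3], "b"] = np.nan
y_pred = X.copy()
y_pred.loc[[0, 2], "a"] = [2.0, 18.0]  # each prediction is off by 2
y_pred.loc[[1, 3], "b"] = [2.0, 4.0]   # perfect predictions

score = nrmse_score(X, y_true, y_pred, columns=["a", "b"], verbose=True)
```

In this toy case, column `a` has an RMSE of 2 over a range of 20 and column `b` is predicted exactly, so the two per-column values average to 0.05.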
MICE (Multiple Imputation by Chained Equations) is one of the most widely used algorithms for missing value imputation problems. It is available in the commonly used scikit-learn Python machine learning package, via the IterativeImputer class.
MICE is an iterative imputation method. It is a process where each feature is modeled as a function of the other features, e.g. a regression problem where missing values are predicted. Each feature is imputed sequentially, one after the other, allowing prior imputed values to be used as part of a model in predicting subsequent features.
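To make the chained-equation idea concrete, here is a minimal NumPy sketch of the loop described above. It is not scikit-learn's implementation: the function name `chained_imputation`, the use of plain least squares regression, the mean initialization and the fixed number of rounds are all simplifying assumptions made for illustration.

```python
import numpy as np

def chained_imputation(data, n_rounds=10):
    """Fill NaNs column by column with linear regressions on the other columns."""
    data = np.asarray(data, dtype=float)
    missing = np.isnan(data)
    # Initial guess: replace each hole with the mean of its column.
    filled = np.where(missing, np.nanmean(data, axis=0), data)
    for _ in range(n_rounds):
        for j in range(data.shape[1]):
            if not missing[:, j].any():
                continue  # nothing to impute in this column
            # Predictors: an intercept plus every other (currently filled) column.
            others = np.delete(filled, j, axis=1)
            design = np.column_stack([np.ones(len(others)), others])
            observed = ~missing[:, j]
            # Fit on the rows where column j is observed ...
            coef, *_ = np.linalg.lstsq(design[observed], filled[observed, j], rcond=None)
            # ... and overwrite the holes of column j with the predictions,
            # so that later columns can use these imputed values.
            filled[missing[:, j], j] = design[missing[:, j]] @ coef
    return filled

# Toy data with exact linear relations between the columns.
rng = np.random.default_rng(0)
a = rng.normal(size=40)
toy = np.column_stack([a, 2 * a + 1, -a])
toy[[0, 5], 1] = np.nan
toy[[3, 8], 2] = np.nan

filled = chained_imputation(toy)
```

Because the toy columns are exact linear functions of each other, the regressions recover the hidden values; on real data the imputed values are only estimates.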
For the benchmark, we take the sklearn IterativeImputer class, fit it on the train data (on y_train, to get the maximum amount of values) and use it to transform X_test such that no missing values remain.
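The benchmark procedure just described might look like the sketch below. The data is synthetic (the helper `toy_frame` and its values are invented for illustration), only three of the column names are reused, and the imputer is left at its defaults apart from `random_state`. Note that scikit-learn still marks IterativeImputer as experimental, so it has to be enabled explicitly before import.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required)
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

def toy_frame(n_rows):
    """Build correlated synthetic columns, then knock out about 30% of the cells."""
    base = rng.normal(size=n_rows)
    frame = pd.DataFrame({
        "revenue": base + rng.normal(scale=0.1, size=n_rows),
        "scope_1": 2 * base + rng.normal(scale=0.1, size=n_rows),
        "scope_2": -base + rng.normal(scale=0.1, size=n_rows),
    })
    return frame.mask(rng.random(frame.shape) < 0.3)

y_train = toy_frame(200)  # stand-in for the train output file
X_test = toy_frame(50)    # stand-in for the test input file

# Fit on y_train (the version with the most observed values) ...
imputer = IterativeImputer(random_state=0)
imputer.fit(y_train)

# ... then fill every hole of X_test; observed values are left unchanged.
X_test_filled = pd.DataFrame(
    imputer.transform(X_test), columns=X_test.columns, index=X_test.index
)
```

The resulting frame has the same shape as X_test and no missing values, which is the form expected for a submission.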
Here are the NRMSE scores on train and test for the most "classical" missing value imputation methods:
Method | NRMSE on train | NRMSE on the full test |
---|---|---|
Mean imputation | 0.531747 | 0.521362 |
Median imputation | 0.532504 | 0.521616 |
IterativeImputer (no tuning) | 0.410721 | 0.501585 |
IterativeImputer (no tuning) + median imputation for outliers | 0.227358 | 0.236061 |
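For reference, the two simplest baselines in the table can be sketched with pandas alone. The frames below contain invented values, and computing the fill statistics on y_train is an assumption of this sketch; the text does not state which file was used for the mean and median baselines.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the challenge files (values are invented).
y_train = pd.DataFrame({
    "scope_1": [1.0, 2.0, np.nan, 9.0],
    "revenue": [10.0, np.nan, 30.0, 50.0],
})
X_test = pd.DataFrame({
    "scope_1": [np.nan, 5.0],
    "revenue": [20.0, np.nan],
})

# Column statistics are computed on the train data and ignore NaNs by default.
mean_filled = X_test.fillna(y_train.mean())
median_filled = X_test.fillna(y_train.median())

print(mean_filled)
print(median_filled)
```

Both baselines fill a whole column with a single constant, which helps explain why they score similarly in the table.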
The final benchmark uses the default sklearn IterativeImputer, followed by a median imputation for all values treated as outliers.