# Challenge Data

### ESG indicators missing value estimation by Pladifes

#### Description

##### Dates

Started on Jan. 6, 2023

##### Challenge context

As a growing number of investors and individuals engage with environmental and social concerns, corporate extra financial characteristics (also referred to as ESG, for Environment, Social and Governance) have become increasingly important.

Impactfull is a company providing sustainable finance consulting as well as extra financial data based on verified raw data sourced from corporates. Their data is composed of 30+ indicators (within environmental, social and governance categories) and is extracted from sustainability reports.

Pladifes is a research project hosted at the Institut Louis Bachelier, a nonprofit association promoting research in economics and finance. It is an EquipEx+ (code: ANR-21-ESRE-0036), financed by the ANR and aiming at facilitating access to extra financial data for research purposes.

This challenge results from a collaboration between the two aforementioned parties, as an opportunity for both to gain visibility and to allow students to work on (hopefully) interesting extra financial data.

The Pladifes team is also planning to include the best submission of the challenge in its database, allowing researchers to use it for academic project purposes.

##### Challenge goals

The goal of the challenge is to predict the missing values for 15 corporate extra financial indicators (up to 96% missing values). These indicators are available over three years (2018, 2019, 2020) and come from sustainability disclosures.

On both input sets (X_train and X_test), some missing values are artificially added compared to the output sets (y_train, y_test). These additional missing values are used to compute model performance by comparing the imputed values with the true hidden ones. Otherwise, input and output files have exactly the same number of rows and columns.

The objective is thus to train a missing value imputation model on the train data and to use it on the test data to fill the holes. The most accurate model wins the challenge!

##### Data description

The input data contains 15 extra financial indicators on 10 000 companies over up to three years (2018, 2019, 2020).

Each line is defined by a unique “ID” and corresponds to a certain firm (defined by “company_id”) and a given year (defined by “year”). The data has been selected so that there are no more than 96% missing values for a given indicator. Companies are anonymized and are split into a train set and a test set such that a given company can only be found in one of the two.

The first line of the input file contains the header, and the columns are separated by commas. Overall file sizes are under 10 MB.

The output files (y_train, y_test) contain the original data, and the input files (X_train, X_test) contain the same data with additional missing values. For submission, participants need to take the test input file (X_test) and return a version with no missing values remaining (see the random submission example). This is then compared to the output test file.
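As an illustration of what a valid submission looks like, here is a minimal sketch that fills every hole with a column median learned on the train data. The two small frames are hypothetical stand-ins for the real files, and their values are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the train data and the test input (X_test).
train_toy = pd.DataFrame({"employees": [100.0, 200.0, np.nan, 400.0]})
test_toy = pd.DataFrame({"employees": [np.nan, 250.0, np.nan]})

# Learn one median per column on train (NaN values are skipped by default).
medians = train_toy.median(numeric_only=True)

# Fill every hole of the test input; observed values are left untouched.
submission = test_toy.fillna(medians)
```

The resulting frame has the same shape as the test input and contains no missing values, which is the only structural requirement stated above.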

The number of masked values per column is at least 100 and at most 15% of the available non-missing values (for both X_train and X_test).
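The artificially masked cells can be located by comparing an input file with its output file: a cell is masked when it is missing in the input but present in the output. The sketch below shows this on a toy pair of frames; the frames and their values are assumptions made for the example, not challenge data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for an output file (original data, with its own missing values).
y_toy = pd.DataFrame({"scope_1": [10.0, np.nan, 30.0, 40.0],
                      "scope_2": [1.0, 2.0, np.nan, 4.0]})

# Toy stand-in for the matching input file: same data plus extra holes.
x_toy = y_toy.copy()
x_toy.loc[0, "scope_1"] = np.nan  # artificially masked
x_toy.loc[3, "scope_2"] = np.nan  # artificially masked

# A cell is masked when it is missing in the input but known in the output.
masked = x_toy.isna() & y_toy.notna()
masked_per_column = masked.sum()
```

Cells that are missing in both files are not counted, since no true value exists to compare against.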

## Column descriptions

Below is a description of the columns of the datasets (common to all files).

| Column name | Description | Missing values (%) |
| --- | --- | --- |
| anonimized_id | The index (unique) | 0% |
| company_id | The company id (unique) | 0% |
| year | The year (2018, 2019, 2020) | 0% |
| region | The region (5 options) | 0% |
| headquarters_country | The headquarter country (99 options) | 0% |
| industry | The primary industry sector (153 options) | 0% |
| market cap | The "size of the company" (i.e. stock value x number of stocks, in $) | 0% |
| employees | The number of employees of the company | 16% |
| revenue | The annual revenue of the company (in $) | 02% |
| scope_1 | The scope 1 (direct) GHG emission of the company (in T/CO2e) | 66% |
| scope_2 | The scope 2 (indirect, owned) GHG emission of the company (in T/CO2e) | 66% |
| scope_3 | The scope 3 (indirect, not owned) GHG emission of the company (in T/CO2e) | 69% |
| waste_production | The annual amount of waste produced (in T) | 84% |
| waste_recycling | The annual amount of waste recycled (in T) | 89% |
| water_consumption | The annual amount of water consumed (in T) | 72% |
| water_withdrawal | The annual amount of water withdrawn (in T) | 74% |
| energy_consumption | The annual energy consumption (in kWh) | 82% |
| hours_of_training | The number of hours of training for all employees for one year (in H) | 87% |
| gender_pay_gap | Mean men's annual earnings above women's annual earnings (in %) | 96% |
| independant_board members_percentage | The percentage of independent members in the board of the company (in %) | 75% |
| legal_costs_paid for_controversies | The annual amount of the legal costs paid for controversies (in $) | 59% |
| ceo_compensation | The annual CEO compensation (in $) | 92% |

## Metric description

### a) NRMSE

All features being continuous, it was decided to use the Normalized Root Mean Square Error (NRMSE) as the metric for this challenge. Here, we choose the difference between maximum and minimum of the actual values as the denominator to normalize the Root Mean Square Error (RMSE):

$(1) \, RMSE = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \hat X_i)^2}{N}}$

$(2) \, NRMSE = \frac{RMSE}{X_{max} - X_{min}}$

NRMSEs are computed for each feature with missing values and averaged to get the final score. Hence, the score represents the average error across all features, each normalized by its own range. The goal is thus to minimize this quantity.
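For a single feature, the definition above can be sketched in a few lines of NumPy. The function name `nrmse` is ours, and the sketch assumes that the range is taken over the same actual values that enter the RMSE; the organizers' exact implementation may differ.

```python
import numpy as np

def nrmse(actual, predicted):
    """RMSE divided by the range (max - min) of the actual values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    # Root of the mean squared difference between true and imputed values.
    rmse = np.sqrt(np.mean((actual - predicted) ** 2))
    # Assumption: the range is computed on the values passed in.
    return rmse / (actual.max() - actual.min())

# Errors of 1, 0 and 1 on values spanning a range of 10.
example = nrmse([0.0, 5.0, 10.0], [1.0, 5.0, 9.0])
```

Here the RMSE is $\sqrt{2/3}$ and the range is 10, so `example` is about 0.082. Note that the ratio is undefined when all actual values are equal.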

### b) Whole metric calculation pipeline

The error between imputed and actual values on the test data is measured by the output NRMSE, computed through the following steps, iterated over all 15 columns with missing values:

1. We filter the dataframe to keep only the rows whose values were initially masked in the input data.

2. If relevant, we log-scale the data.

3. We compute the NRMSE (Eq (2)) for this column.

Once all NRMSEs are computed, we average them to get the final result (Eq. (3)). If the parameter verbose is set to True, the NRMSE per column is also printed. This allows tracking the best and worst predicted indicators.

$(3) \, score\_final = mean_{columns}(NRMSE)$
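The three steps and the final average can be sketched as follows. This is not the organizers' scoring code: the function name `challenge_score`, its arguments, the `log_columns` parameter and the use of `log1p` as the log scaling are all assumptions, since the text does not say which columns are log-scaled or how.

```python
import numpy as np
import pandas as pd

def challenge_score(x_input, y_true, y_pred, columns, log_columns=(), verbose=False):
    """Average per-column NRMSE over the cells masked in x_input."""
    scores = {}
    for col in columns:
        # Step 1: keep rows missing in the input but known in the output.
        rows = x_input[col].isna() & y_true[col].notna()
        actual = y_true.loc[rows, col].to_numpy(dtype=float)
        predicted = y_pred.loc[rows, col].to_numpy(dtype=float)
        # Step 2 (assumption): log1p scaling for the columns in log_columns.
        if col in log_columns:
            actual, predicted = np.log1p(actual), np.log1p(predicted)
        # Step 3: RMSE normalized by the range of the actual values.
        rmse = np.sqrt(np.mean((actual - predicted) ** 2))
        scores[col] = rmse / (actual.max() - actual.min())
        if verbose:
            print(f"{col}: {scores[col]:.6f}")
    # Final score: plain mean of the per-column NRMSEs.
    return float(np.mean(list(scores.values())))

# Toy data: two columns, two masked cells each.
y_toy = pd.DataFrame({"a": [0.0, 10.0, 20.0, 30.0], "b": [1.0, 2.0, 3.0, 4.0]})
x_toy = y_toy.copy()
x_toy.loc[[0, 3], "a"] = np.nan
x_toy.loc[[1, 2], "b"] = np.nan
pred_toy = pd.DataFrame({"a": [3.0, 10.0, 20.0, 27.0], "b": [1.0, 2.5, 2.5, 4.0]})
toy_score = challenge_score(x_toy, y_toy, pred_toy, columns=["a", "b"])
```

On the toy data, column `a` scores 3 / 30 = 0.1 and column `b` scores 0.5 / 1 = 0.5, so `toy_score` is 0.3. The sketch assumes every scored column has at least two distinct masked values, which the minimum of 100 masked values per column makes likely on the real data.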

## Benchmark

### a) Benchmark algorithm: MICE

MICE (Multiple Imputation by Chained Equations) is one of the most widely used algorithms in industry for missing value imputation problems. It is accessible in the commonly used scikit-learn Python machine learning package, via the IterativeImputer class.

MICE is an iterative imputation method. Each feature is modeled as a function of the other features, i.e. as a regression problem in which the missing values are predicted. Features are imputed sequentially, one after the other, so that previously imputed values can be used by the model that predicts subsequent features.

For the benchmark, we take the sklearn IterativeImputer class, fit it on the train data (on y_train, to get the maximum number of values) and use it to transform X_test so that no missing values remain.
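The fit-then-transform pattern described above can be sketched as follows. The data is synthetic and purely numeric; the helper `make_toy` and the variable names are assumptions for the example. On the real files, the categorical columns (region, headquarters_country, industry) would have to be encoded or set aside first, and the text does not say how the benchmark handles them.

```python
import numpy as np
# IterativeImputer is experimental in scikit-learn and needs this explicit opt-in.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

def make_toy(n_rows):
    """Three correlated numeric features with about 20% of cells missing."""
    base = rng.normal(size=(n_rows, 1))
    noise = rng.normal(scale=0.1, size=(n_rows, 3))
    data = base * np.array([1.0, 2.0, -1.0]) + noise
    data[rng.random(data.shape) < 0.2] = np.nan
    return data

y_train_toy = make_toy(200)  # stands in for y_train
x_test_toy = make_toy(50)    # stands in for X_test

# Default settings, as in the "no tuning" benchmark rows.
imputer = IterativeImputer(random_state=0)
imputer.fit(y_train_toy)
x_test_filled = imputer.transform(x_test_toy)
```

The transformed array keeps the observed test values as they are and replaces only the missing cells.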

### b) Benchmark metrics

Here are the NRMSE scores on train and test for the most "classical" missing value imputation methods:

| Method | NRMSE on train | NRMSE on the full test |
| --- | --- | --- |
| Mean imputation | 0.531747 | 0.521362 |
| Median imputation | 0.532504 | 0.521616 |
| IterativeImputer (no tuning) | 0.410721 | 0.501585 |
| IterativeImputer (no tuning) + median imputation for outliers | 0.227358 | 0.236061 |

The final benchmark is obtained using the sklearn default IterativeImputer, followed by a median imputation for all values such that $value < X_{min\_on\_train}$ or $value > X_{max\_on\_train}$.
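The outlier step can be sketched as a post-processing function applied to the imputed frame. The function name `replace_out_of_range` is hypothetical, and using the train median as the replacement value is an assumption: the text does not say on which data the median is computed.

```python
import numpy as np
import pandas as pd

def replace_out_of_range(imputed, train, columns):
    """Replace values outside the train [min, max] range with the train median."""
    result = imputed.copy()
    for col in columns:
        # pandas skips NaN by default when computing min, max and median.
        low, high = train[col].min(), train[col].max()
        outside = (result[col] < low) | (result[col] > high)
        # Assumption: the replacement is the median observed on train.
        result.loc[outside, col] = train[col].median()
    return result

# Toy data: the train range is [1, 100] and the train median is 3.
train_toy = pd.DataFrame({"scope_1": [1.0, 2.0, 3.0, 4.0, 100.0, np.nan]})
imputed_toy = pd.DataFrame({"scope_1": [-5.0, 2.0, 50.0, 500.0]})
fixed_toy = replace_out_of_range(imputed_toy, train_toy, ["scope_1"])
```

On the toy data, the values -5 and 500 fall outside the train range and are replaced by 3, while 2 and 50 are kept. The large gap between the last two benchmark rows suggests that a small number of extreme imputed values dominates the range-normalized error.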

#### Files

Files are accessible once you are logged in and registered for the challenge.

#### The challenge provider

Pladifes is a research project hosted at the Institut Louis Bachelier. Impactfull provides sustainable finance consulting services as well as raw ESG data from corporates.