Welcome to the Challenge Data website!

Each year, we organize machine learning challenges from data provided by public services, companies and laboratories (see the general documentation and FAQ). Seasons begin in January; the challenges are introduced as part of Stéphane Mallat's lectures at the Collège de France.
A prize ceremony for the best participants of the preceding season will be held in February at the Collège de France (03/02/2022).

For participants

Guide to create an account, choose your challenges and submit solutions.

For professors

Guide to create a course project from selected challenges and to follow students' progress.

For Challenge providers

If you are interested in organizing a challenge with us, do not hesitate to contact us!
Each year, we organize a call for projects during the summer. Projects are selected and beta-tested between September and December, and launched in January. Relevant information can be found in the providers' guide to submitting a challenge for the next season.

About us

Challenge Data is managed by the Data team (ENS Paris), in partnership with the Collège de France, and the DataLab at Institut Louis Bachelier.
It is supported by the CFM chair and the PRAIRIE Institute.



🕹 Challenges of the 2022 season 🕹


Can you predict the tide?

Participants will have to forecast the sea surges in two western European coastal cities.

We place ourselves in a forecasting setup: knowing the surge values and the sea-level pressure field over the last five days, we want to predict the surge values over the next five days. It is therefore a time series prediction problem. The signals we consider are:

  • the surge, which is a function of time;
  • the sea-level pressure, which is a function of time, latitude and longitude.

The score $\ell(\hat{y}, y)$ we use to measure the quality of the prediction $\hat{y}$ compared to the true values $y$ is a weighted version of the mean squared error (MSE). The weights depend linearly on the forecast time, with a larger weight for the first forecast time and a smaller weight for the last. The scores for the two cities are computed independently, and the final loss is their sum:

import numpy as np

def surge_prediction_metric(y_true, y_pred):
    # Linearly decreasing weights: 1 for the first forecast time, 0.1 for the last.
    w = np.linspace(1, 0.1, 10)[np.newaxis]
    surge1_cols = [
        'surge1_t0', 'surge1_t1', 'surge1_t2', 'surge1_t3', 'surge1_t4',
        'surge1_t5', 'surge1_t6', 'surge1_t7', 'surge1_t8', 'surge1_t9' ]
    surge2_cols = [
        'surge2_t0', 'surge2_t1', 'surge2_t2', 'surge2_t3', 'surge2_t4',
        'surge2_t5', 'surge2_t6', 'surge2_t7', 'surge2_t8', 'surge2_t9' ]
    # Weighted MSE for each city, then sum over the two cities.
    surge1_score = (w * (y_true[surge1_cols].values - y_pred[surge1_cols].values)**2).mean()
    surge2_score = (w * (y_true[surge2_cols].values - y_pred[surge2_cols].values)**2).mean()

    return surge1_score + surge2_score

Since the surge values are normalised (zero mean and unit standard deviation), $1 - \ell$ can be seen as a percentage of explained variance. With a trivial zero prediction of all values, the score is $\ell \approx 1$, meaning that we explain 0 % of the variance. A score bigger than one is hence worse than the zero prediction and can be considered "bad".
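As a sanity check, here is a minimal sketch of that zero baseline using the metric above (the ground-truth values are replaced by illustrative random data, so the exact number will differ from the official score):

import numpy as np
import pandas as pd

# Illustrative stand-in for the normalised ground-truth surges of the test set.
rng = np.random.default_rng(0)
cols = [f'surge{k}_t{i}' for k in (1, 2) for i in range(10)]
y_true = pd.DataFrame(rng.standard_normal((1000, 20)), columns=cols)

# Trivial baseline: predict zero everywhere.
y_zero = pd.DataFrame(np.zeros_like(y_true.values), columns=cols)

print(surge_prediction_metric(y_true, y_zero))  # expected to be of order 1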

Prediction of missing Bid-Ask spread values

The goal of this challenge is to recover missing values in financial time series covering 300 futures contracts.

The financial time series in question contain “daily bid-ask spreads”, which are simply a daily average of the “bid-ask spreads” observed throughout the day. The bid-ask spread is the difference between the highest price a buyer is ready to pay for the futures contract (highest bid price) and the lowest price a seller will accept (lowest ask price) for a given asset.

A large price difference (large spread) between bid and ask reflects the fact that only a few participants are ready to sell or to buy (as they are not making efforts to give a price closer to what the other side would like). Conversely, a small bid-ask spread reflects a liquid market where participants are more willing to trade. Anticipating the average bid-ask spread of the next trading day is thus important for knowing how much one can expect to trade during the day.

In this challenge, we removed the daily spread for some days of a 10-year history, for each of about 300 futures contracts.

The goal of the challenge is to predict these missing values using other features of the time series.
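As a purely illustrative starting point (the file name and column names below are assumptions, not the official ones), a naive baseline could interpolate each contract's spread series over time:

import pandas as pd

# Hypothetical layout: one row per (contract, day), with NaN where the spread was removed.
df = pd.read_csv('spreads_train.csv')  # assumed columns: contract_id, day, spread

# Naive baseline: interpolate missing daily spreads within each contract,
# falling back to the contract's median spread for leading/trailing gaps.
df = df.sort_values(['contract_id', 'day'])
df['spread_filled'] = (
    df.groupby('contract_id')['spread']
      .transform(lambda s: s.interpolate().fillna(s.median()))
)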

Participants can ask questions, find answers and share their findings on the official CFM Data Challenge forum.

Data Centric Movie Reviews Sentiment Classification

The goal of the challenge is to predict the sentiment (positive or negative) of movie reviews. The interest of the challenge lies in the training pipeline being kept fixed: you won't be able to choose the model, nor will you have to create complex ensembles of models that add no real value. Instead, you will have to select, augment, clean, generate, source, etc. a training dataset, starting from a given dataset. In fact, you will be allowed to feed anything to the model. To allow you to iterate quickly on your experiments, we provide you with the training script, which uses a rather simple model, fastText. Your submissions (i.e., the training dataset) will be used to train the model on our servers, and the model will then be tested on a hidden test set of movie reviews. We reveal a small fraction of this test set (10 texts) to give a sense of the test data distribution.
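For local experimentation, a minimal sketch of the same idea with the fastText Python package might look like the following (file names, labels and hyperparameters are illustrative assumptions; the official training script on the challenge page is the reference):

import fasttext

# Each line of the training file follows the fastText supervised format, e.g.
# "__label__positive I loved this movie ..." -- this file is what you curate.
model = fasttext.train_supervised(
    input='train.txt',
    epoch=5,
    lr=0.5,
    wordNgrams=2,
)

# Quick local check on a held-out file written in the same format.
n, precision, recall = model.test('valid.txt')
print(n, precision, recall)

print(model.predict('a delightful and moving film'))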

Regional Climate Forecast 2022

How accurately can we predict regional temperature anomalies based on past and neighbouring climate observations?

Semantic segmentation of industrial facility point cloud

The goal of this challenge is to perform a semantic segmentation of a 3D point cloud.

The point cloud of one EDF industrial facility digital mock-up is composed of 45 billion 3D points. The reconstruction work consists of fitting 90,000 geometric primitives to the point cloud. To perform this task, operators have to manually segment the part of the point cloud corresponding to a piece of equipment and then fit the suitable geometric primitive. This manual segmentation is the most tedious step of the overall production scheme. Therefore, EDF R&D is studying solutions to perform it automatically.

Figure: as-built digital mock-up creation with CAD reconstruction from a point cloud.

Because EDF industrial facilities are sensitive and rarely accessible or available for experiments, our team works with the EDF Lab Saclay boiling room. The digital mock-up of this test environment has been produced with the same methodology as the other industrial facilities.

For the ENS challenge, EDF provides a dataset with a cloud of 2.1 billion points acquired in an industrial environment, the boiling room of EDF Lab Saclay whose design is sufficiently close to an industrial building for this segmentation task. Each point of the cloud has been manually given a ground truth label.

The project purpose is a semantic segmentation task on a 3D point cloud. It consists in training a machine learning model $f$ to automatically segment the point cloud $x=(x_i)_{1 \leq i \leq N}$ into different classes $y=(y_i)_{1 \leq i \leq N}$, where $N$ is the point cloud size. The model infers a class label $y_i = f(x_i)$ for each point $x_i$.

To assess the results, we compute the weighted F1-score over all $C$ classes (sklearn.metrics.f1_score). It is defined by:

$$F_1 := \sum_{i=0}^{C-1} w_i \frac{P_i \times R_i}{P_i + R_i},$$

where $P_i$ and $R_i$ are respectively the point-wise precision and recall of class $i$, and $w_i$ is the inverse of the number of true instances for class $i$.
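For reference, a minimal call to the sklearn function named above could look like this (the labels are random stand-ins; the exact weighting and arguments should be checked against the official scoring script):

import numpy as np
from sklearn.metrics import f1_score

# Illustrative per-point ground-truth and predicted class labels.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 5, size=100_000)   # hypothetical C = 5 classes
y_pred = rng.integers(0, 5, size=100_000)

# sklearn's 'weighted' average is the usual reading of a weighted F1-score.
print(f1_score(y_true, y_pred, average='weighted'))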

Solar forecasting using Copernicus radiation images

The aim of this challenge is to propose machine learning and deep learning approaches on sequences of images to provide better short-term forecasts of future images of the SSI on a horizontal plane, denoted GHI (Global Horizontal Irradiance), for time horizons ranging from 15 minutes to 1 hour, with a time resolution of 15 min and a spatial resolution of 3 km.

More precisely, we are interested in a square region of interest (RoI) of size 51 x 51 pixels (approx. 150 km). Assuming a maximum cloud speed of 10+ m/s, and considering solar forecasting up to 1 hour ahead, the observation region (OR) encompassing the RoI has a size of 81 x 81 pixels (approx. 240 km).

At a given time $t$, between one hour after sunrise and one hour before sunset, and given the sequence of the 4 previous $\text{GHI}$ images on the OR (one every 15 min), the solar forecasting task is to predict the $\text{GHI}$ images on the RoI for the next time steps, ranging from the next 15 min up to the next hour with a time step of 15 min. The forecast of $\text{GHI}$ for the location $p$, made at time $t$ for the future time $t + \Delta t$, is denoted $\widehat{\text{GHI}}(p,\ t + \Delta t \mid t)$.

The learning phase is done on one year of data and the test phase is done on a separate year.

In this challenge, we will only consider the cloud effects on $\text{GHI}$, assuming that the concomitant and collocated $\text{GHI}$ under clear-sky conditions (with no cloud) is perfectly known and denoted $\text{GHI}_{\text{cls}}$.
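Since $\text{GHI}_{\text{cls}}$ is assumed to be perfectly known, a common preprocessing step (a sketch under that assumption, not part of the official pipeline) is to work with the clear-sky index, i.e. the ratio of the observed GHI to its clear-sky value, which isolates the cloud effects the challenge focuses on:

import numpy as np

def clear_sky_index(ghi, ghi_cls, eps=1e-3):
    # Ratio of observed GHI to clear-sky GHI; values near 1 mean clear sky.
    return ghi / np.maximum(ghi_cls, eps)

# Hypothetical shapes: 4 past images over the 81 x 81 pixel observation region.
past_ghi = np.random.rand(4, 81, 81) * 800.0   # W/m^2, illustrative values
past_ghi_cls = np.full((4, 81, 81), 800.0)     # assumed known clear-sky GHI
k = clear_sky_index(past_ghi, past_ghi_cls)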

Contextual information of interest includes the corresponding solar zenith angle (SZA) $\theta_{S}$ and solar azimuth angle (SAA) $\alpha_{S}$.

Do not hesitate to refer to the full Copernicus documentation in the supplementary files, as several technical aspects of the challenge are further explained and detailed.

Learning factors for stock market returns prediction

The goal of this challenge is to design/learn factors for stock return prediction using the exotic parameter space introduced in the context section.

Participants will be able to use a three-year data history of $50$ stocks from the same stock market (training data set) to provide the model parameters $(A,\beta)$ as outputs. Then the predictive model associated with these parameters will be tested on predicting the returns of $50$ other stocks over the same three-year time period (testing data set).

We allow $D=250$ days for the time depth and $F=10$ for the number of factors.

Metric. More precisely, we assess the quality of the predictive model with parameters $(A,\beta)$ as follows. Let $\tilde R_t\in\mathbb R^{50}$ be the returns of the $50$ stocks of the testing data set over the three-year period ($t=0,\ldots,753$) and let $\tilde S_{t} = \tilde S_{t}(A,\beta)$ be the participants' predictor for $\tilde R_{t}$. The metric to maximize is defined by

$$\mathrm{Metric}(A,\beta) := \frac{1}{504}\sum_{t=250}^{753} \frac{\langle \tilde S_{t}, \tilde R_{t}\rangle}{\|\tilde S_{t}\|\,\|\tilde R_{t}\|}$$

if $|\langle A_i,A_j\rangle-\delta_{ij}|\leq 10^{-6}$ for all $i,j$, and $\mathrm{Metric}(A,\beta):=-1$ otherwise.

By construction, the metric takes its values in $[-1,1]$ and equals $-1$ as soon as there exists a pair $(i,j)$ that violates the orthonormality condition beyond this tolerance.
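A minimal numpy sketch of this metric (assuming the predictions $\tilde S_t$ have already been computed from $(A,\beta)$ and are stored, like the returns, as a 754 x 50 array) could read:

import numpy as np

def metric(A, S, R, tol=1e-6):
    # A: (250, 10) factor matrix; S, R: (754, 50) predictions and returns.
    # Returns -1 if the columns of A are not orthonormal up to `tol`.
    gram = A.T @ A
    if np.max(np.abs(gram - np.eye(A.shape[1]))) > tol:
        return -1.0
    # Average cosine similarity between predictions and returns over t = 250..753.
    S_, R_ = S[250:754], R[250:754]
    cos = np.sum(S_ * R_, axis=1) / (np.linalg.norm(S_, axis=1) * np.linalg.norm(R_, axis=1))
    return cos.mean()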

Output structure. The output expected from the participants is a vector in which the model parameters $A=[A_1,\ldots,A_{10}]\in\mathbb R^{250\times 10}$ and $\beta\in\mathbb R^{10}$ are stacked as follows:

$$\text{Output} = \left[\begin{matrix} A_1 \\ \vdots \\ A_{10} \\ \beta \end{matrix}\right]\in\mathbb R^{2510}$$
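For illustration, assuming `A` is stored as a 250 x 10 numpy array whose columns are $A_1,\ldots,A_{10}$ and `beta` as a length-10 vector, the expected output vector can be built as follows:

import numpy as np

A = np.linalg.qr(np.random.randn(250, 10))[0]  # illustrative matrix with orthonormal columns
beta = np.random.randn(10)

# Stack the columns A_1, ..., A_10 on top of each other, then append beta.
output = np.concatenate([A.flatten(order='F'), beta])
assert output.shape == (2510,)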

Return Forecasting of Cryptocurrency Clusters

The goal of the challenge is to predict the returns vs. bitcoin of clusters of cryptoassets.

At Napoleon, we are interested in detecting which assets are likely to move together in the same direction, i.e. assets whose returns (absolute price changes) are statistically positively correlated. Such assets are grouped into "clusters", which can be seen as the crypto equivalent of equity sectors or industries. The knowledge of such clusters can then be used to optimize portfolios, build advanced trading strategies (long/short absolute, market neutral), evaluate systematic risk, etc. In order to build new trading strategies, it can be helpful to know whether a given sector/cluster will outperform the market, represented here by bitcoin. For this reason, given a cluster $\mathcal{C} = \{A_1, ..., A_n\}$ composed of $n$ assets $A_i$, this challenge aims at predicting the return relative to bitcoin of an equally weighted portfolio composed of $\{A_1, ..., A_n\}$ over the next hour, given the series of returns over the last 23 hours for the assets in the cluster, as well as some metadata.
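For intuition, here is a sketch of the target quantity (the file name, column layout and return convention below are assumptions, not the official ones):

import pandas as pd

# Hypothetical hourly returns vs. bitcoin, one column per asset of the cluster.
returns = pd.read_csv('cluster_returns.csv', index_col='timestamp')

# Equally weighted portfolio of the cluster: simple average of the asset returns.
portfolio_return = returns.mean(axis=1)

# Inputs / target for one sample: the last 23 hourly returns, then the next hour.
X = returns.iloc[-24:-1]          # last 23 hourly returns of each asset
y = portfolio_return.iloc[-1]     # next-hour equally weighted return (target)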

Bankers and markets

Your goal is to understand how financial markets react when central bankers deliver official speeches.

We do not provide the speeches themselves (otherwise participants would quickly find out the date and the market moves!), but a transformed version: the speeches were processed by a predefined BERT-style transformer, and this gives the input of the problem. The output is the mean price evolution of a collection of 39 different time series; these time series correspond to 13 different markets measured at 3 different time scales.

We have computed the difference between the closing prices of these 13 markets at 3 different maturities and the price of these markets at the close on the date of the speech. We are not interested in very short-term effects (between the beginning of the speech and the close of the same day) or in leakage effects (trading occurring because of information leaked before the beginning of the speech). A few tests have given us an indication that, if a speech has an effect on the markets, it appears within the 2 weeks following the date of the speech: we have therefore chosen a 1-day lag, a 1-week lag and a 2-week lag to measure the possible effects on the markets.
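Schematically, for one market and one speech date, the targets are built as follows (a sketch with hypothetical file and variable names, not the exact production code):

import pandas as pd

# Hypothetical daily closing prices for one market, indexed by (sorted) date.
close = pd.read_csv('market_close.csv', index_col='date', parse_dates=True)['close']

speech_date = pd.Timestamp('2019-06-18')   # illustrative speech date
base = close.asof(speech_date)             # closing price on the day of the speech

lags = {'1d': pd.Timedelta(days=1),
        '1w': pd.Timedelta(weeks=1),
        '2w': pd.Timedelta(weeks=2)}
targets = {name: close.asof(speech_date + lag) - base for name, lag in lags.items()}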

As expected, at first sight it was difficult to distinguish an effect. We have therefore developed a technique to boost the response of the transformer using numerical NLP techniques. We deliver here the result of this boosting. It is not miraculous, and the small number of points in the dataset is a real handicap.

The 13 markets are the following:

  1. VIX: Index for the volatility of US stocks.
  2. V2X: Index for the volatility of European stocks.
  3. EURUSD: Exchange rate euro - US dollar.
  4. EUROUSDV1M: Volatility of at-the-money 1-month options on EURUSD.
  5. SPX: Index of US stocks.
  6. SX5E: Index of euro-zone stocks.
  7. SRVIX: Swap Rate Volatility Index, an interest-rate volatility index.
  8. CVIX: Crypto Volatility Index, a crypto-currency volatility index.
  9. MOVE: Developed by Merrill Lynch, it measures fear within the bond market.
  10. USGG2YR: US bonds, 2 years.
  11. USGG10YR: US bonds, 10 years.
  12. GDBR2YR: German bonds, 2 years.
  13. GDBR10YR: German bonds, 10 years.

Predicting odor compound concentrations

Can you predict the concentration of Sulfur dioxide (SO2) at one location from a network of sensors?

Using measurement data from the ATMO Normandie sensor network, weather data, and land-use data from Copernicus Corine Land Cover (CLC), the goal is to perform multivariate time series forecasting: predict the hourly SO2 concentration in μg/m³ for the next 12 hours at the Le Havre MAS station from the last 48 hours of data.
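One common way to frame this (a sketch; the file name and column names below are assumptions) is to build sliding windows of 48 past hours as inputs and the 12 following hours of SO2 as targets:

import numpy as np
import pandas as pd

# Hypothetical hourly table: sensor and weather features plus the SO2 target column.
df = pd.read_csv('atmo_normandie_hourly.csv', index_col='datetime', parse_dates=True)

HIST, HORIZON = 48, 12
features = df.values                      # all columns as input signals
so2 = df['SO2_le_havre_mas'].values       # hypothetical name of the target column

X, y = [], []
for t in range(HIST, len(df) - HORIZON):
    X.append(features[t - HIST:t])        # last 48 hours of all signals
    y.append(so2[t:t + HORIZON])          # next 12 hours of SO2 (μg/m³)
X, y = np.array(X), np.array(y)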

Real estate price prediction

The project is a regression task that deals with real estate price estimation. Estimating housing real estate prices is quite a common topic, with an extensive literature on estimating prices from usual data such as location, surface, land, number of bedrooms, age of the building, etc. These approaches are usually sufficient to estimate the price range but lack precision. However, few have investigated whether adding photos of the asset would bring complementary information, enabling a more precise price estimation.

The objective is thus to model French housing real estate prices based on the usual hierarchical tabular data plus a few photos (between 1 and 6) for each asset, and to see whether this allows better performance than a model trained without the photos.

We will value the interpretability of the results, in order to get a better understanding of which features are valuable.

What do you see in the stock market data?

The goal of the challenge is to train machine learning models to look for anomalies within stock market data.

It is relatively easy to design one algorithm seeking one specific kind of event, and then to implement as many algorithms as there are types of atypical events. However, it would also be beneficial to be able to detect any type of atypical event with a single model, which could learn to recognize common features between these different events.

From a sample of market data (time series of price and volume data), the challenger is invited to predict the existence of an atypical event on an hourly basis and per financial instrument. Any kind of approach can be experimented with, but the AMF is particularly interested in computer vision techniques applied to reconstructed time series plots, if the challenger thinks this is relevant.
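If one follows that computer vision angle, a purely illustrative sketch of turning a price/volume slice into a small image (so that standard image models can be applied) could be:

import matplotlib
matplotlib.use('Agg')                      # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np

def series_to_image(prices, volumes, path):
    # Render one slice of price and volume data as a small axis-free image.
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(2.24, 2.24), dpi=100)
    ax1.plot(prices, linewidth=0.8)
    ax2.bar(np.arange(len(volumes)), volumes, width=1.0)
    for ax in (ax1, ax2):
        ax.axis('off')
    fig.savefig(path, bbox_inches='tight')
    plt.close(fig)

# Illustrative random walk and volumes standing in for one slice of market data.
series_to_image(np.cumsum(np.random.randn(360)), np.random.rand(360), 'sample.png')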

Learning biological properties of molecules from their structure

The goal is to discover the biological properties of new chemical compounds using already existing experimental data.

The current costs of bringing a new medical drug to the market are huge, reaching 2.0 billion US dollars and 10-15 years of continuous research. A desire to eliminate many of these unnecessary costs has accelerated the emergence and acceptance of the science of Cheminformatics. Based on the concept that "similar chemicals have similar properties", one would take existing experimental data Y and build statistical correlative models to create a map between structures of chemical compounds and the observed Y values. Thus, the property Y of novel chemical compounds would not have to be measured. Instead, one would simply draw a structure of a completely new molecule on the computer screen and submit it to the correlative model to predict it.

Computers cannot perceive chemical structures (atoms plus interatomic connectivity) the way human chemists do. A translation of chemical structures into terms understandable by computers is thus necessary. Sophisticated algorithms exist that take molecular connectivity tables and, sometimes, 3D atomic coordinates to generate molecular descriptors, i.e. numeric variables that describe molecular structures. Our software is able to calculate the same set of N molecular descriptors per compound, where N is on the order of several hundred. Collecting the properly aligned vectors of descriptors for M chemical compounds, each with a known observed Y value, forms an M × N training matrix X.

Since raw values of different molecular descriptors are calculated on different scales, normalization to a common scale is required prior to modeling (e.g., the -1 to 1 scale). Not all the descriptors provide meaningful input into a successful Y = f(X) model. Therefore, choosing the “right” descriptors for modeling (a.k.a. feature selection) is the first critical step in model building. Once the suitable subset of N columns is chosen, the corresponding reduced training matrix along with the M-dimensional vector Y is submitted to a model training algorithm (e.g., machine learning). Model performance is evaluated on an independent, external test set of T compounds encoded with the same set of N descriptors.
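A minimal sklearn sketch of that pipeline (descriptor scaling to the -1 to 1 range, feature selection, model fitting, evaluation on an external test set), with random stand-ins in place of the real descriptor matrix and property values:

import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Illustrative stand-ins for the M x N descriptor matrix X and property vector Y.
rng = np.random.default_rng(0)
X_train, Y_train = rng.normal(size=(500, 300)), rng.normal(size=500)
X_test, Y_test = rng.normal(size=(100, 300)), rng.normal(size=100)

model = make_pipeline(
    MinMaxScaler(feature_range=(-1, 1)),  # normalise descriptors to a common scale
    SelectKBest(f_regression, k=50),      # crude stand-in for feature selection
    Ridge(alpha=1.0),                     # any regressor could be used here
)
model.fit(X_train, Y_train)
print(r2_score(Y_test, model.predict(X_test)))  # performance on the external test set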