NEW! A forum is now available to discuss the challenges. Do not hesitate to use it: https://les-mathematiques.net/vanilla/index.php?p=/categories/challengedata
Each year, we organize machine learning challenges from data provided by public services, companies and laboratories: general documentation and FAQ. Seasons begin in January; the challenges are introduced in the context of Stéphane Mallat's lectures at the Collège de France.
A prize ceremony for the best participants of the preceding season will be held in February at the Collège de France (03/02/2022).
Guide to create an account, choose your challenges and submit solutions.
Guide to create a course project from selected challenges and to follow student progress.
If you are interested in organizing a challenge with us, do not hesitate to contact us!
Each year, we organize a call for projects during the summer. Projects are selected and beta-tested between September and December, and launched in January. Relevant information can be found in the providers guide.
Challenge Data is managed by the Data team (ENS Paris), in partnership with the Collège de France and the DataLab at Institut Louis Bachelier.
It is supported by the CFM chair, the PRAIRIE Institute and IDRIS from CNRS.
Can you detect small particles from a noisy 3D scan?
Many applications rely on tracking spheres in three dimensions. The most difficult aspect of detecting colloids, however, is how densely packed they are. Most methods struggle to detect the majority of the particles, and touching spheres cause a higher incidence of false detections.
This packing is defined as the volume fraction (ϕ, volfrac for short) and represents the volume of the particles divided by the total volume of the region of interest. Below you can see various real images of colloids at different volume fractions.
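For monodisperse spheres, the volume fraction follows directly from its definition above. A minimal sketch (the particle count, radius and canvas size below are illustrative, not taken from the challenge data):

```python
import numpy as np

def volume_fraction(n_particles, radius, canvas_size):
    """Volume fraction phi: total particle volume / volume of the region of interest."""
    particle_volume = n_particles * (4.0 / 3.0) * np.pi * radius**3
    region_volume = np.prod(canvas_size)
    return particle_volume / region_volume

# e.g. 1000 spheres of radius 5 px in a 128x128x128 volume
phi = volume_fraction(1000, 5.0, (128, 128, 128))
print(round(phi, 3))  # ~0.25, a moderately dense packing
```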
Moreover, it is impossible to create manually labelled data for this project, since there are thousands of colloids per 3D volume and, more importantly, manual labelling would be too subjective. For this reason we have created a simulated training dataset. To reduce photobleaching of the tiny particles during imaging, we need to reduce the confocal laser power. This results in a low contrast-to-noise ratio (CNR), defined as (see the Wikipedia article for more detail):

CNR = (μf − μb) / σ

where μb and μf are the background and foreground mean brightnesses, and σ is the standard deviation of the Gaussian noise.
It also contributes to a high signal-to-noise ratio (SNR), which is simply:

SNR = μf / σ
In addition to the high SNR and low CNR, this problem has further unique challenges:
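A minimal sketch of how CNR and SNR could be measured on a volume, assuming a boolean foreground mask and a known noise level (none of these variable names come from the challenge files):

```python
import numpy as np

def cnr_snr(image, foreground_mask, sigma):
    """Contrast-to-noise and signal-to-noise ratios of a 3D volume.

    image           : 3D array of intensities
    foreground_mask : boolean array, True inside particles
    sigma           : standard deviation of the Gaussian noise
    """
    mu_f = image[foreground_mask].mean()   # foreground mean brightness
    mu_b = image[~foreground_mask].mean()  # background mean brightness
    cnr = (mu_f - mu_b) / sigma
    snr = mu_f / sigma
    return cnr, snr

# toy volume: background around 100, foreground 30 brighter, noise sigma 20
rng = np.random.default_rng(0)
img = rng.normal(100.0, 20.0, size=(32, 32, 32))
mask = np.zeros((32, 32, 32), dtype=bool)
mask[8:16, 8:16, 8:16] = True
img[mask] += 30.0
cnr, snr = cnr_snr(img, mask, 20.0)
print(cnr, snr)  # CNR ≈ 1.5, SNR ≈ 6.5
```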
The test samples (x_test) will be larger than the training samples (128×128×128).
The figure below shows some of the steps of the simulation:
The simulation steps are very simple:
Our metric is average precision (AP). This is similar to a ROC analysis and captures precision (the fraction of detections that are correct), recall (the fraction of all particles that are detected), and the distance of the detections from the ground truth, controlled by the threshold. Almost all available AP implementations are in 2D, but the supplementary files include a script, custom_metric.py, that measures it in 3D for spheres. Precision is key and is the most crucial factor for good detections; however, the higher the recall, the larger the sample that can be used for downstream analysis.
This example shows how to read x, y, and metadata and analyses precision, recall, and average precision.
from custom_metric import average_precision, get_precision_recall, plot_pr
from read import read_x, read_y
import pandas as pd
from matplotlib import pyplot as plt

index = 0
x = read_x(x_path, index)
y = read_y(y_path, index)
metadata = pd.read_csv(m_path, index_col=0)
metadata = metadata.iloc[index].to_dict()

# Find the precision and recall
prec, rec = get_precision_recall(y, y, diameter=metadata['r']*2, threshold=0.5)
print(prec, rec)  # Should be 1, 1

# Find precision and recall of half the positions as prediction
prec, rec = get_precision_recall(y, y[:len(y)//2], diameter=metadata['r']*2, threshold=0.5)
print(prec, rec)  # Should be 1, ~0.5

# Find the average precision
# Note: the canvas size has to be provided to remove predictions around the
# borders, which usually cause errors. The first return value is what matters (ap).
ap, precisions, recalls = average_precision(y, y[:len(y)//2], diameter=metadata['r']*2, canvas_size=x.shape)

# The metric also provides the precisions and recalls to visualise the performance
fig = plot_pr(ap, precisions, recalls, name='prediction', tag='x-', color='red')
plt.show()
Precision tells us the fraction of predictions that are true, and recall tells us the fraction of all particles that we detected.
In physics, precision is key; it is therefore the most important aspect of the detection. You will find that the TrackPy benchmark enjoys very high precision, usually above 99%. However, this comes at the cost of lower recall: it usually detects only 30% of the particles.
The precisions and recalls are measured over different error thresholds from 0 to 1 in 0.1 increments. The AP across these thresholds provides a good overall metric for the performance of the predictor (higher is better). Informally, the goal is to push recall as high as possible without letting precision drop below 95%.
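As an informal illustration of how the per-threshold precisions combine into a single number (the provided custom_metric.py remains the authoritative implementation), AP can be viewed as the mean precision over the sampled thresholds; the precision values below are made up:

```python
import numpy as np

def average_precision_sketch(precisions):
    """Illustrative only: mean precision over the sampled error thresholds.
    See custom_metric.py for the actual 3D-sphere implementation."""
    return float(np.mean(precisions))

# hypothetical precisions at thresholds 0.0, 0.1, ..., 1.0
precisions = [1.0, 1.0, 0.99, 0.98, 0.97, 0.96, 0.95, 0.93, 0.90, 0.85, 0.80]
print(round(average_precision_sketch(precisions), 3))  # 0.939
```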
Here the goal is to segment structures using their shape, but without exhaustive annotations. The training data is composed of two types of images:
1) CT images with anatomical segmentation masks of individual structures
2) Raw CT images, without any segmented structures
The test set is made of new images with their corresponding segmented structures, and the metric measures the capacity to correctly segment and separate the different structures on an image.
Note: The segmented structures do not cover the entire image; some pixels, as seen in the image above, are not part of identifiable structures. They are thus considered part of the background.
The aim of our challenge is to find the best way to process and analyse basket data from one of our retailer partners in order to detect fraud cases.
Using these basket data, fraudulent customers should be detected so that they can be refused in the future.
The challenge proposed by Owkin is a weakly-supervised binary classification problem. Weak supervision is crucial in digital pathology due to the extremely large dimensions of whole-slide images (WSIs), which cannot be processed as is. To use standard machine learning algorithms one needs, for each slide, to extract smaller images (called tiles) of size 224x224 pixels (approx 112 µm²). Since a slide is given a single binary annotation (presence or absence of mutation) and is mapped to a bag of tiles, one must learn a function that maps multiple items to a single global label. This framework is known as multiple-instance learning (MIL). More precisely, if one of the pooled tiles exhibits a mutation pattern, presence of mutation is predicted while if none of the tiles exhibit the pattern, absence of mutation is predicted. This approach alleviates the burden of obtaining locally annotated tiles, which can be costly or impractical for pathologists.
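The max-pooling MIL rule above (predict presence of mutation if any tile exhibits the pattern) can be sketched in a few lines; the tile scores below are hypothetical outputs of some tile-level model, not part of the challenge data:

```python
import numpy as np

def slide_prediction(tile_scores, threshold=0.5):
    """Max-pooling MIL: the slide is predicted positive (mutation present)
    if its most suspicious tile exceeds the decision threshold."""
    slide_score = float(np.max(tile_scores))  # pool the bag of tiles into one score
    return slide_score, int(slide_score > threshold)

# hypothetical tile-level mutation scores for one slide (bag of 4 tiles)
scores = np.array([0.10, 0.05, 0.92, 0.20])
print(slide_prediction(scores))  # (0.92, 1): one tile fires, so mutation is predicted
```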
In this challenge, we aim to predict whether a patient has a mutation of the gene PIK3CA, directly from a slide. For computational purposes, we kept a total of 1,000 tiles per WSI. Each tile was selected such that there is tissue in it.
Here we display an example of whole slide image with 1,000 tiles highlighted in black.
Figure 1: Example of a whole slide image with the 1,000 tiles selected during preprocessing highlighted in black
Some of those tiles are displayed below. The coordinates are indicated in parenthesis for each tile.
Figure 2: Example of 224x224 pixels tiles extracted at a 20x magnification with their (x, y)-coordinates
The aim is to model the electricity price from weather, energy (commodities) and commercial data for two European countries - France and Germany. Let us stress that the problem here is to explain the electricity price with simultaneous variables and thus this is not a prediction problem.
More precisely, the goal of this challenge is to learn a model that outputs, from these explanatory variables, a good estimation of the daily price variation of electricity futures contracts in France and Germany. These contracts allow you to receive (or to deliver) a given amount of electricity at a price specified by the contract, delivered at a specified time in the future (the contract's maturity). Thus, futures contracts are financial instruments that give you some expected value of the future price of electricity under current market conditions; here, we focus on short-term maturity contracts (24h). Let us stress that electricity futures trading is a dynamic market in Europe.
Regarding the explanatory variables, the participants are provided with daily data for each country covering quantitative weather measurements (temperature, rain, wind), energy production (commodity price changes), and electricity use (consumption, exchanges between the two countries, import/export with the rest of Europe).
The score function (metric) used is Spearman's correlation between the participant's output and the actual daily price changes over the test data sample.
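A minimal sketch of the metric, assuming no ties in the series: Spearman's correlation is the Pearson correlation of the ranks (a production implementation such as scipy.stats.spearmanr also handles ties). The price-change values are made up:

```python
import numpy as np

def spearman(y_true, y_pred):
    """Spearman correlation = Pearson correlation of the ranks.
    Minimal version assuming no ties in either series."""
    rank_true = np.argsort(np.argsort(y_true))
    rank_pred = np.argsort(np.argsort(y_pred))
    return np.corrcoef(rank_true, rank_pred)[0, 1]

y_true = np.array([0.3, -1.2, 0.8, 0.1, -0.5])  # hypothetical daily price changes
y_pred = np.array([0.2, -1.0, 0.9, 0.0, -0.4])  # same ordering as y_true
print(spearman(y_true, y_pred))  # ≈ 1.0: the ranks agree exactly
```

Note that only the ordering of the predictions matters: rescaling or shifting them leaves the score unchanged.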
Feel free to visit our dedicated forum and our LinkedIn page for more information about the challenge and QRT.
What if misleading correlations are present in the training dataset?
human_age is an image classification benchmark with a distribution change in the unlabeled data: we classify old and young people. Text is also superimposed on the images, reading either "old" or "young". In the labeled training dataset, the text always matches the face; in the unlabeled (test) dataset, the text matches the image in only 50% of the cases, which creates an ambiguity.
We thus have 4 types of images:
Age young, text young (AYTY),
Age old, text old (AOTO),
Age young, text old (AYTO),
Age old, text young (AOTY).
Types 1 and 2 (AYTY, AOTO) appear in both datasets; types 3 and 4 (AYTO, AOTY) appear only in the unlabeled dataset.
To resolve this ambiguity, participants can submit solutions to the leaderboard multiple times to test different hypotheses (challengers may consider solutions that require two or more submissions to the leaderboard).
We use the accuracy on the unlabeled set of human_age as our metric.
At the end of the challenge, the different teams will have to send their code to email@example.com with the heading "Challenge Data - Submission". The awarding of the prize will be conditional on this submission and on compliance with the following rules:
Participants are not allowed to label images by hand.
Participants are not allowed to use other datasets. They are only allowed to use the datasets provided.
Participants are not allowed to use arbitrary pre-trained models. Only ImageNet pre-trained models are allowed.
The aim of this challenge is to give SNCF-Transilien the tools to provide accurate train occupancy rate forecasts, and thus to deliver precise real-time crowding information (RTCI) to its passengers through digital services.
When a traveler is waiting for a train k at station s, the train may be several stations away, but we want to give travelers the most reliable information possible about how busy the train will be when they board it. To do this, we need to predict, for each train, the actual occupancy rates at the next stations. Here, we propose a simpler task: predicting the occupancy rate at the next station only.
The goal of this challenge is to predict future precipitation rate fields (estimated using radar echoes) given the past precipitation rate fields.
The precipitation rate is the volume of water falling per unit area; the classic unit is the mm (1 mm = 1 litre/m²).
The dataset includes a wide variety of precipitation scenarios and will not be representative of natural frequencies of events.
The quality of the predictions will be judged according to the Mean Squared Logarithmic Error metric.
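The Mean Squared Logarithmic Error can be sketched as follows (the example values are hypothetical precipitation rates in mm, not challenge data):

```python
import numpy as np

def msle(y_true, y_pred):
    """Mean Squared Logarithmic Error: log1p compresses large values,
    so the metric penalises relative rather than absolute errors."""
    return float(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))

y_true = np.array([0.0, 0.5, 2.0, 10.0])  # hypothetical rates (mm)
y_pred = np.array([0.0, 0.4, 2.5, 8.0])
print(msle(y_true, y_pred))
```

The log1p (log(1 + x)) form keeps zero-rain pixels well defined and makes a 2 mm error on a downpour cost far less than a 2 mm error on a drizzle.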
The goal of the challenge is to predict the missing values for 15 corporate extra financial indicators (up to 96% missing values). These indicators are available over three years (2018, 2019, 2020) and come from sustainability disclosures.
On both input sets (X_train and X_test), some missing values are artificially added relative to the output sets (y_train, y_test). These additional missing values are used to compute model performance by comparing the imputed values with the true, hidden ones. Apart from that, input and output files have exactly the same number of rows and columns.
The objective is thus to train a missing-value imputation model on the train data and to use it on the test data to fill the holes. The most accurate model will win the challenge!
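As a minimal illustration of the task (a baseline, certainly not the expected winning approach), a column-mean imputer could look like this; the toy matrix is hypothetical:

```python
import numpy as np

def impute_column_means(X):
    """Baseline imputer: fill each indicator's NaNs with that column's mean."""
    X_filled = X.copy()
    col_means = np.nanmean(X, axis=0)          # per-indicator means, ignoring NaNs
    rows, cols = np.where(np.isnan(X_filled))  # positions of the holes
    X_filled[rows, cols] = col_means[cols]
    return X_filled

# toy matrix: rows = companies, columns = indicators (hypothetical values)
X = np.array([[1.0, np.nan, 3.0],
              [2.0, 4.0,    np.nan],
              [3.0, 6.0,    9.0]])
print(impute_column_means(X))
```

A real submission would replace the column mean with a model exploiting correlations between indicators and across the three years.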
The goal of the challenge is to rebuild the Table of Contents (ToC) of financial annual reports, based on the text blocks of the document and their metadata (position, font, font-size, etc.). This task may be split into two sub-tasks: first, determining whether a block of text is a title (binary classification of text blocks); then, determining the level of the title in the annotation (level 0 corresponds to the title of the document, level 1 to a section, level 2 to a sub-section, etc.). The screenshot below shows an example of the labelling tool developed by the AMF to annotate every text block of a financial annual report.
Figure 1: Annotation of titles in the document
It should be noted that financial annual reports have been required to be produced in a "machine-readable" format (XHTML) since the ESEF (European Single Electronic Format) regulation entered into force in 2022. However, in practice, documents are not "machine-readable", except for the IFRS consolidated financial statements, for which ESEF defines specific "tags". This means that it is not possible to identify the different sections of a document from the XHTML tags alone.
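As an illustration of the two sub-tasks, a hypothetical rule-based baseline over the block metadata might look like this (the feature names, thresholds and font-size-to-level mapping are all assumptions for the sketch, not part of the provided data):

```python
BODY_FONT_SIZE = 10.0                         # assumed body-text font size
SIZE_TO_LEVEL = {16.0: 0, 14.0: 1, 12.0: 2}   # assumed font-size -> title level map

def is_title(block):
    """Sub-task 1 (binary): short blocks in a large or bold font are title candidates."""
    big_or_bold = block["font_size"] > BODY_FONT_SIZE or block.get("bold", False)
    return big_or_bold and len(block["text"].split()) < 15

def title_level(block):
    """Sub-task 2: map the font size to a ToC level (0 = document title)."""
    return SIZE_TO_LEVEL.get(block["font_size"], 2)

block = {"text": "1. Risk factors", "font_size": 14.0, "bold": True}
print(is_title(block), title_level(block))  # True 1
```

A learned classifier over the same metadata (plus position on the page) would replace these hand-written rules.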
The challenge goal is to determine whether the audio file contains biosonars (dolphin clicks) or noise transients (reef, shrimp noises...).
The goal is to estimate the main direction of the market during the last two hours of the trading session, given the preceding history of the day.
To avoid suffering from the usual market noise, we consider only 3 states:
Reducing greenhouse gas emissions of heating and cooling systems of large buildings
Optimizing the control of the thermal systems of these buildings
Significant CO2 savings are already achieved by Accenta:
Explore new ways to improve these algorithms based on reinforcement learning methods