Challenges of the 2023 season

Organized by the DataTeam @ENS, the Data Lab @ILB and IDRIS team @CNRS

Colloids: How small can we go?

Can you detect small particles from a noisy 3D scan?

Problem definition

Many applications need to track spheres in three dimensions. The most difficult aspect of detecting colloids is how densely packed they are: most methods struggle to detect a majority of the particles, and touching spheres lead to a higher incidence of false detections.

This packing density is quantified by the volume fraction ($\phi$, or volfrac for short), which is the volume of the particles divided by the total volume of the region of interest. Below you can see real images of colloids at different volume fractions.
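As a rough illustration of this definition, the sketch below computes the volume fraction for identical spheres in a box. The function name and arguments are hypothetical, and it assumes that the spheres do not overlap and lie entirely inside the region of interest.

```python
import math

def volume_fraction(n_particles, radius, box_shape):
    """Total sphere volume divided by the volume of the region of interest.

    Assumes identical, non-overlapping spheres fully inside the box.
    """
    sphere_volume = 4.0 / 3.0 * math.pi * radius ** 3
    box_volume = math.prod(box_shape)
    return n_particles * sphere_volume / box_volume

# Hypothetical example: 300 spheres of radius 4 voxels in a 64x64x64 volume
print(volume_fraction(300, 4, (64, 64, 64)))
```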


Moreover, it is impossible to create manually labelled data for this project, since there are thousands of colloids per 3D volume and, more importantly, manual labelling would be too subjective. We have therefore created a simulated training dataset. To reduce photobleaching of the tiny particles during imaging, we need to reduce the confocal laser power. This results in a low contrast-to-noise ratio (CNR), defined as (see the Wikipedia article for more detail):

$$\mathrm{CNR} = \dfrac{b_{\mu} - f_{\mu}}{\sigma}$$

where $b_{\mu}$ and $f_{\mu}$ are the background and foreground mean brightnesses, and $\sigma$ is the standard deviation of the Gaussian noise.

It also contributes to a high signal-to-noise ratio (SNR), which is simply:

$$\mathrm{SNR} = \dfrac{f_{\mu}}{\sigma}$$
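The two ratios can be evaluated on a labelled volume as in the sketch below. It follows the formulas exactly as written above, so the CNR comes out negative when the particles are brighter than the background. The function name, the boolean foreground mask and the toy brightness values are assumptions made for illustration.

```python
import numpy as np

def cnr_snr(image, foreground_mask, noise_sigma):
    """Return (CNR, SNR) using the definitions given in the text."""
    b_mu = image[~foreground_mask].mean()  # background mean brightness
    f_mu = image[foreground_mask].mean()   # foreground mean brightness
    cnr = (b_mu - f_mu) / noise_sigma
    snr = f_mu / noise_sigma
    return cnr, snr

# Toy volume: background at 0.2 with a brighter cube at 0.8 (hypothetical values)
image = np.full((8, 8, 8), 0.2)
mask = np.zeros((8, 8, 8), dtype=bool)
mask[2:5, 2:5, 2:5] = True
image[mask] = 0.8
print(cnr_snr(image, mask, noise_sigma=0.1))
```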

In addition to the high SNR and low CNR, this problem has further unique challenges:

  • Relatively small training sample size ($n = 1400$)
  • Relatively large 3D input size ($64 \times 64 \times 64$)
  • The final prediction (x_test) will be for larger sizes than the training samples ($128 \times 128 \times 128$)

The figure below shows some of the steps of the simulation:

The simulation steps are very simple:

  1. Background is drawn
  2. Foreground (spheres) is drawn
  3. The image is blurred
  4. The image is convolved with the point spread function
  5. The previous image is zoomed
  6. Noise is added
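The six steps above can be sketched with NumPy alone. This is not the organisers' simulator: the Gaussian approximation of the point spread function, the interpretation of the zoom as a factor-2 block average, and every parameter value are assumptions chosen for illustration.

```python
import numpy as np

def gaussian_blur(vol, sigma):
    """Gaussian blur via FFT (periodic boundaries); sigma is a scalar or one value per axis."""
    sigma = np.broadcast_to(np.asarray(sigma, dtype=float), (vol.ndim,))
    grids = np.meshgrid(*[np.fft.fftfreq(n) for n in vol.shape], indexing="ij")
    # The Fourier transform of a Gaussian is a Gaussian: exp(-2 pi^2 sigma^2 f^2)
    transfer = np.exp(-2 * np.pi ** 2 * sum((s * g) ** 2 for s, g in zip(sigma, grids)))
    return np.real(np.fft.ifftn(np.fft.fftn(vol) * transfer))

def simulate(shape=(32, 32, 32), n_spheres=20, radius=3.0,
             bg=0.2, fg=0.8, noise_sigma=0.1, seed=0):
    rng = np.random.default_rng(seed)
    vol = np.full(shape, bg, dtype=float)                # 1. background
    zz, yy, xx = np.indices(shape)
    centers = rng.uniform(radius, np.array(shape) - radius, size=(n_spheres, 3))
    for cz, cy, cx in centers:                           # 2. foreground (spheres may overlap here)
        inside = (zz - cz) ** 2 + (yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2
        vol[inside] = fg
    vol = gaussian_blur(vol, 1.0)                        # 3. blur
    vol = gaussian_blur(vol, (2.0, 1.0, 1.0))            # 4. PSF, assumed Gaussian and elongated along z
    d, h, w = vol.shape                                  # 5. zoom, assumed to be a 2x block average
    vol = vol.reshape(d // 2, 2, h // 2, 2, w // 2, 2).mean(axis=(1, 3, 5))
    vol = vol + rng.normal(0.0, noise_sigma, vol.shape)  # 6. additive Gaussian noise
    return vol, centers / 2.0                            # centres rescaled to the zoomed grid

volume, positions = simulate()
print(volume.shape, positions.shape)
```

The reshape in step 5 requires even dimensions; a real pipeline would more likely use an interpolating zoom.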


Our metric is average precision (AP). It is similar to ROC and combines precision (the fraction of detections that are correct), recall (the fraction of all particles that are detected), and the distance of the detections from the ground truth, controlled by a threshold. Almost all available AP implementations are in 2D, but the supplementary files include a script that measures it in 3D for spheres. Precision is the most crucial factor for good detections; however, the higher the recall, the bigger the sample that can be used for downstream analysis.

This example shows how to read x, y, and the metadata, and how to compute precision, recall, and average precision.

from custom_metric import average_precision, get_precision_recall, plot_pr
from read import read_x, read_y
import pandas as pd
from matplotlib import pyplot as plt

# x_path, y_path and m_path must point to the provided data files
index = 0
x = read_x(x_path, index)
y = read_y(y_path, index)
metadata = pd.read_csv(m_path, index_col=0)
metadata = metadata.iloc[index].to_dict()

# Find the precision and recall 
prec, rec = get_precision_recall(y, y, diameter=metadata['r']*2, threshold=0.5)
print(prec, rec) # Should be 1,1
# Find precision and recall of half the positions as prediction
prec, rec = get_precision_recall(y, y[:len(y)//2], diameter=metadata['r']*2, threshold=0.5)
print(prec, rec) # Should be 1,~0.5

# Find the average precision
# Note the canvas size has to be provided to remove predictions around the borders, which usually cause errors
# The first value is what matters (ap)
ap, precisions, recalls = average_precision(y, y[:len(y)//2], diameter=metadata['r']*2, canvas_size=x.shape)

# the metric also provides the precisions and recalls to visualise the performance
fig = plot_pr(ap, precisions, recalls, name='prediction', tag='x-', color='red')

Precision tells us the fraction of predictions that are true, and recall tells us the fraction of all particles that we detected.
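To make the two quantities concrete, here is a minimal sketch of one way to compute them for 3D positions. It is not the implementation in custom_metric: it assumes that a prediction counts as correct when it lies within threshold × diameter of a still-unmatched true particle, and it pairs points greedily by increasing distance.

```python
import numpy as np

def precision_recall(y_true, y_pred, diameter, threshold=0.5):
    """Greedy one-to-one matching of predicted to true centres (illustrative only)."""
    y_true = np.asarray(y_true, dtype=float).reshape(-1, 3)
    y_pred = np.asarray(y_pred, dtype=float).reshape(-1, 3)
    if len(y_true) == 0 or len(y_pred) == 0:
        return 0.0, 0.0
    # Pairwise distances, shape (n_pred, n_true)
    dists = np.linalg.norm(y_pred[:, None, :] - y_true[None, :, :], axis=-1)
    max_dist = threshold * diameter  # assumed meaning of the threshold
    matched_pred, matched_true = set(), set()
    for flat in np.argsort(dists, axis=None):  # closest pairs first
        p, t = np.unravel_index(flat, dists.shape)
        if dists[p, t] > max_dist:
            break  # all remaining pairs are even further apart
        if p in matched_pred or t in matched_true:
            continue
        matched_pred.add(p)
        matched_true.add(t)
    true_positives = len(matched_pred)
    return true_positives / len(y_pred), true_positives / len(y_true)

# Four well-separated particles; predicting only half of them perfectly
truth = np.array([[0, 0, 0], [10, 0, 0], [0, 10, 0], [0, 0, 10]])
print(precision_recall(truth, truth[:2], diameter=2))
```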

In physics, precision is key, so it is the most important aspect of the detection. You'll find that the benchmark, TrackPy, enjoys very high precision, usually above 99%. However, this comes at the cost of lower recall: it usually detects only 30% of the particles.

The precisions and recalls are measured over error thresholds from 0 to 1 in increments of 0.1. The AP across these thresholds provides a good overall metric for the performance of the predictor (higher is better). Informally, the goal is to push recall as high as possible without letting precision drop below 95%.

Few-shots to learn anatomic structures in radiology

Here the goal is to segment structures using their shape, but without exhaustive annotations. The training data is composed of two types of images:

1) CT images with anatomical segmentations masks of individual structures

  • They act as the ground-truth definition of what anatomical structures are.
  • However, they are not intended to be representative of all possible structures and their diversity, but can still be used as training material.
  • This makes the problem a mix of zero-shot learning (some structures in the test set are not found in the train set) and few-shot learning (some structures are common to the train and test sets, but with limited examples).

2) Raw CT images, without any segmented structures

  • They can be used as additional training material, in an unsupervised setting.

The test set is made of new images with their corresponding segmented structures, and the metric measures the capacity to correctly segment and separate the different structures on an image.

Note: The segmented structures do not cover the entire image; some pixels are not part of identifiable structures, as we see in the image above. They are thus considered part of the background.

How to detect a niche of fraudsters?

The aim of our challenge is to find the best way to process and analyse basket data from one of our retailer partners in order to detect fraud cases.

Using these basket data, fraudulent customers should be detected so that they can be refused in the future.

Detecting PIK3CA mutation in breast cancer

The challenge proposed by Owkin is a weakly-supervised binary classification problem. Weak supervision is crucial in digital pathology due to the extremely large dimensions of whole-slide images (WSIs), which cannot be processed as is. To use standard machine learning algorithms one needs, for each slide, to extract smaller images (called tiles) of size 224x224 pixels (approx 112 µm²). Since a slide is given a single binary annotation (presence or absence of mutation) and is mapped to a bag of tiles, one must learn a function that maps multiple items to a single global label. This framework is known as multiple-instance learning (MIL). More precisely, if one of the pooled tiles exhibits a mutation pattern, presence of mutation is predicted while if none of the tiles exhibit the pattern, absence of mutation is predicted. This approach alleviates the burden of obtaining locally annotated tiles, which can be costly or impractical for pathologists.
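The decision rule described above (a slide is positive as soon as one tile is positive) corresponds to max pooling over tile-level scores. The sketch below illustrates only that pooling step; the tile scores, the function name and the 0.5 cut-off are hypothetical, and a real model would learn the tile scorer.

```python
import numpy as np

def slide_prediction(tile_scores, cutoff=0.5):
    """Max-pool tile scores into one slide-level score and a binary label."""
    tile_scores = np.asarray(tile_scores, dtype=float)
    slide_score = float(tile_scores.max())  # the most suspicious tile decides
    return slide_score, slide_score >= cutoff

# Hypothetical bag of tile scores for one slide
print(slide_prediction([0.05, 0.10, 0.92, 0.20]))
```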

In this challenge, we aim to predict whether a patient has a mutation of the gene PIK3CA, directly from a slide. For computational purposes, we kept a total of 1,000 tiles per WSI. Each tile was selected such that there is tissue in it.

Here we display an example of a whole-slide image with the 1,000 tiles highlighted in black.

Figure 1: Example of a whole slide image with the 1,000 tiles selected during preprocessing highlighted in black

Some of those tiles are displayed below. The coordinates are indicated in parentheses for each tile.

Figure 2: Examples of 224x224-pixel tiles extracted at 20x magnification, with their (x, y)-coordinates

Can you explain the price of electricity?

The aim is to model the electricity price from weather, energy (commodities) and commercial data for two European countries - France and Germany. Let us stress that the problem here is to explain the electricity price with simultaneous variables and thus this is not a prediction problem.

More precisely, the goal of this challenge is to learn a model that outputs, from these explanatory variables, a good estimate of the daily price variation of electricity futures contracts in France and Germany. These contracts allow you to receive (or deliver) a given amount of electricity at a price specified by the contract, delivered at a specified time in the future (the contract's maturity). Thus, futures contracts are financial instruments that give an expected value of the future price of electricity under current market conditions; here, we focus on short-term maturity contracts (24h). Let us stress that electricity futures exchange is a dynamic market in Europe.

Regarding the explanatory variables, participants are provided with daily data for each country, comprising quantitative weather measurements (temperature, rain, wind), energy production (commodity price changes), and electricity use (consumption, exchanges between the two countries, import-export with the rest of Europe).

The score function (metric) is Spearman's correlation between the participant's output and the actual daily price changes over the test data set.
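Spearman's correlation is the Pearson correlation computed on ranks, so only the ordering of the outputs matters. The sketch below is a NumPy-only illustration with average ranks for ties; the function names are hypothetical, and the challenge platform's own scoring code may differ in details.

```python
import numpy as np

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman(y_true, y_pred):
    """Pearson correlation of the two rank vectors."""
    return float(np.corrcoef(average_ranks(y_true), average_ranks(y_pred))[0, 1])

# Any strictly increasing transform of the target keeps a perfect score
print(spearman([1.0, 2.0, 3.0, 4.0], [1.0, 4.0, 9.0, 16.0]))
```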

Feel free to visit our dedicated forum and our LinkedIn page for more information about the challenge and QRT.

Robustness to distribution changes and ambiguity

What if misleading correlations are present in the training dataset?

human_age is an image classification benchmark with a distribution change in the unlabeled data: we classify old and young people. Text is also superimposed on the images: either the word "old" or the word "young". In the training dataset, which is labeled, the text always matches the face. But in the unlabeled (test) dataset, the text matches the face in only 50% of the cases, which creates an ambiguity.

We thus have 4 types of images:

  1. Age young, text young (AYTY),

  2. Age old, text old (AOTO),

  3. Age young, text old (AYTO),

  4. Age old, text young (AOTY).

Types 1 and 2 appear in both datasets, types 3 and 4 appear only in the unlabeled dataset.

To resolve this ambiguity, participants can submit solutions to the leaderboard multiple times, testing different hypotheses (challengers may consider solutions that require two or more submissions to the leaderboard).

We use the accuracy on the unlabeled set of human_age as our metric.
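The toy example below shows why the superimposed text is a misleading correlation: a classifier that simply reads the text is perfect on the two training types but only reaches 50% accuracy once the four types are mixed. The sample counts and the shortcut "classifier" are hypothetical.

```python
def make_samples(age, text, count):
    return [{"age": age, "text": text} for _ in range(count)]

# Types 1 and 2 only, as in the labeled training set
train_like = make_samples("young", "young", 50) + make_samples("old", "old", 50)

# All four types, with the text matching the face in half of the cases
unlabeled = (
    make_samples("young", "young", 25)   # AYTY
    + make_samples("old", "old", 25)     # AOTO
    + make_samples("young", "old", 25)   # AYTO
    + make_samples("old", "young", 25)   # AOTY
)

def text_shortcut(sample):
    """A 'classifier' that ignores the face and reads the superimposed text."""
    return sample["text"]

def accuracy(predict, samples):
    return sum(predict(s) == s["age"] for s in samples) / len(samples)

print(accuracy(text_shortcut, train_like), accuracy(text_shortcut, unlabeled))
```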

Challenge rules

At the end of the challenge, the different teams will have to send their code to with the heading: "Challenge Data - Submission". The awarding of the prize will be conditional on this submission and on compliance with the following rules:

  • Participants are not allowed to label images by hand.

  • Participants are not allowed to use other datasets. They are only allowed to use the datasets provided.

  • Participants are not allowed to use arbitrary pre-trained models. Only ImageNet pre-trained models are allowed.

Real-time train crowding forecasting

The aim of this challenge is to give SNCF-Transilien the tools to provide accurate train occupancy rate forecasts, and thus deliver precise real-time crowding information (RTCI) to its passengers through digital services.

When a traveler is waiting for a k-train at station s, the train may be several stations ahead, but we want to give the travelers the most consistent information possible about how busy the train will be when they board it. To do this, we need to predict for each train the actual occupancy rates at the next stations. Here, we propose to predict, more simply, the occupancy rate at the next station only.

Precipitation Nowcasting

The goal of this challenge is to predict future precipitation rate fields (estimated using radar echoes) given the past precipitation rate fields.
The precipitation rate is the volume of water falling per unit area; the classic unit is the mm (litre/m²).
The dataset includes a wide variety of precipitation scenarios and is not representative of the natural frequencies of events.

The quality of the predictions will be judged according to the Mean Squared Logarithmic Error metric.
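For reference, the usual definition of this metric is $\mathrm{MSLE} = \frac{1}{n}\sum_i \left(\log(1 + y_i) - \log(1 + \hat{y}_i)\right)^2$. The sketch below implements that common definition with NumPy; the function name is hypothetical, and the challenge's exact scoring script may differ (for instance in how fields are averaged).

```python
import numpy as np

def mean_squared_log_error(y_true, y_pred):
    """Mean of squared differences of log(1 + value); inputs must be >= 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if (y_true < 0).any() or (y_pred < 0).any():
        raise ValueError("precipitation rates must be non-negative")
    return float(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))

# Hypothetical 2x2 rain-rate fields (mm)
observed = np.array([[0.0, 1.0], [5.0, 20.0]])
predicted = np.array([[0.0, 2.0], [4.0, 10.0]])
print(mean_squared_log_error(observed, predicted))
```

Because of the logarithm, the metric penalises relative errors more than absolute ones, so a miss on light rain counts comparably to a proportionally similar miss on heavy rain.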

ESG indicators missing value estimation

The goal of the challenge is to predict the missing values for 15 corporate extra-financial indicators (up to 96% missing values). These indicators are available over three years (2018, 2019, 2020) and come from sustainability disclosures.

On both input sets (X_train and X_test), some missing values are artificially added compared to the output sets (y_train, y_test). These additional missing values are used to compute the model performance by comparing the imputed values with the true hidden ones. Otherwise, the input and output files have exactly the same number of rows and columns.

The objective is thus to train a missing-value imputation model on the train data and use it on the test data to fill the holes. The most accurate model will win the challenge!
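The evaluation principle can be sketched as follows: fit a simple imputer on the train inputs, fill the test inputs, and score only the entries that are missing in the input but known in the output. The column-mean imputer, the mean absolute error and the toy arrays are assumptions for illustration; the text does not state the challenge's actual metric.

```python
import numpy as np

def mean_impute(x_train, x_test):
    """Fill NaNs in x_test with column means estimated on x_train."""
    col_means = np.nanmean(np.asarray(x_train, dtype=float), axis=0)
    filled = np.array(x_test, dtype=float)  # copy, leaves x_test untouched
    rows, cols = np.where(np.isnan(filled))
    filled[rows, cols] = col_means[cols]
    return filled

def hidden_value_error(x, y, y_hat):
    """Mean absolute error on entries hidden in x but present in y."""
    hidden = np.isnan(x) & ~np.isnan(y)
    return float(np.mean(np.abs(y_hat[hidden] - y[hidden])))

# Toy data: two indicators, NaN marks a missing value
x_train = np.array([[1.0, 10.0], [3.0, np.nan], [np.nan, 30.0]])
x_test = np.array([[np.nan, 25.0], [4.0, np.nan]])
y_test = np.array([[3.0, 25.0], [4.0, 18.0]])

imputed = mean_impute(x_train, x_test)
print(hidden_value_error(x_test, y_test, imputed))
```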

Can you retrieve the Table of Content?

The goal of the challenge is to rebuild the Table of Contents (ToC) of financial annual reports, based on the text blocks of the document and their metadata (position, font, font size, etc.). This task may be split into two sub-tasks: first, determining whether a block of text is a title (binary classification of text blocks), and then determining the level of the title in the annotation (level 0 corresponds to the title of the document, level 1 to a section, level 2 to a sub-section, etc.). The screenshot below shows an example of the labelling tool developed by the AMF to annotate every text block from a financial annual report.

Figure 1: Annotation of titles in the document
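Once both sub-tasks are solved, the ToC follows from the ordered list of (level, title) pairs. The sketch below shows that last assembly step only; it assumes that non-title blocks have already been filtered out, and the block contents are invented for the example.

```python
def build_toc(titled_blocks):
    """Nest (level, title) pairs, given in reading order, into a tree."""
    root = {"title": None, "level": -1, "children": []}
    stack = [root]  # path from the root to the most recent title
    for level, title in titled_blocks:
        node = {"title": title, "level": level, "children": []}
        # Climb back up until the top of the stack is a strictly higher-level title
        while stack[-1]["level"] >= level:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root["children"]

# Hypothetical output of the two classification steps
blocks = [
    (0, "Annual report"),
    (1, "Risk factors"),
    (2, "Market risk"),
    (1, "Governance"),
]
toc = build_toc(blocks)
print([child["title"] for child in toc[0]["children"]])
```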

It should be noted that financial annual reports have been supposed to be produced in a “machine-readable” format (XHTML) since the ESEF (European Single Electronic Format) regulation entered into force in 2022. In practice, however, documents are not “machine-readable”, except for IFRS consolidated financial statements, for which ESEF defines specific “tags”. This means that it is not possible to identify the different sections of a document from the XHTML tags.

Biosonar - Odontoceti Click Detection

The goal of the challenge is to determine whether the audio file contains biosonar (dolphin clicks) or noise transients (reef, shrimp noises...).

Prediction of daily stock end-of-day movements on the US market

The goal is to estimate the main direction of the price during the last two hours of the trading session, given the preceding history of the day.

To avoid suffering from the usual market noise, we only consider 3 states:

  • a clear decrease in price;
  • a small move in either direction;
  • a clear increase in price.
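A three-state target of this kind can be obtained by thresholding the return over the final two hours, as sketched below. The threshold value and the function name are hypothetical: the text does not say how the organisers separate "clear" from "small" moves.

```python
def movement_class(price_start, price_end, threshold=0.0025):
    """Return -1 (clear decrease), 0 (small move) or +1 (clear increase).

    The 0.25% threshold is an arbitrary illustration, not the challenge's value.
    """
    relative_change = price_end / price_start - 1.0
    if relative_change < -threshold:
        return -1
    if relative_change > threshold:
        return 1
    return 0

# Hypothetical prices at the start and end of the last two hours
print(movement_class(100.0, 99.0), movement_class(100.0, 100.1), movement_class(100.0, 101.0))
```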

Reinforcement Learning for low carbon buildings


Reducing greenhouse gas emissions of heating and cooling systems of large buildings


Optimizing the control of the thermal systems of these buildings

Significant CO2 savings are already achieved by Accenta:

  • by introducing heat storage systems that allow heat to be produced, stored and consumed at the best times;
  • by optimizing the management of heat stocks using predictive algorithms that adapt to the specific conditions of each building.


Explore new ways to improve these algorithms based on reinforcement learning methods