Welcome to the Challenge Data website of ENS and Collège de France
We organize data science challenges based on data provided by public services, companies and laboratories: general documentation and FAQ. The prize ceremony takes place in February at the Collège de France.
For participants
Guide to create an account, choose your challenges and submit solutions.
For professors
Guide to create a course project from selected challenges and to follow students' progress.
The goal of this challenge is to build a model to automatically detect sleep apnea events from PSG data.
You will have access to samples from 44 nights recorded with a polysomnography and scored for apnea events by a consensus of human experts.
For each of the 44 nights, 200 non-overlapping windows are sampled, together with the associated labels (binary segmentation masks).
Each of these windows contains 90 seconds of 8 physiological signals sampled at 100 Hz:
Abdominal belt: Abdominal contraction
Airflow: Respiratory Airflow from the subject
PPG (Photoplethysmogram): Cardiac activity
Thoracic belt: Thoracic contraction
Snoring indicator
SPO2: O2 saturation of the blood
C4-A1: EEG derivation
O2-A1: EEG derivation
The segmentation mask is sampled at 1Hz and contains 90 labels (0 = No event, 1 = Apnea event). Both examples can be reproduced using visualization.py provided in the supplementary files.
The 8 PSG signals with the associated segmentation mask. The apnea event is visible as the abdominal belt, thoracic belt and airflow amplitudes drop noticeably below baseline.
The SPO2 drops after the event.
The 8 PSG signals with the associated segmentation mask. Two short apnea events are visible with the associated breathing disruptions. The SPO2 drop during the second event is likely a consequence of the first.
We want to assess whether the events detected by the algorithm are in agreement with those detected by the sleep experts.
Metric
As we seek to evaluate event-wise agreement between the model and the scorers, the metric cannot be computed directly on the segmentation mask.
First, events are extracted from the binary mask with the following rule:
An apnea event is a sequence of consecutive 1s in the binary mask.
For each apnea event in a window, we extract the start and end indices to produce a list of events. This list can be empty if no events are found.
The same processing is applied to the ground-truth masks to extract the ground-truth events.
In order to assess the agreements between the ground-truth and estimated events, the F1-score is computed.
Two events match if their IoU (intersection over union or Jaccard Index) is above 0.3.
Hence a detected event is a True Positive if it matches a ground-truth event, and a False Positive otherwise.
On the other hand, a ground-truth event without a matching detected event is a False Negative.
TP, FP, FN are summed over all the windows to compute the F1-score.
The detailed implementation can be found in the metrics file.
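As an illustration only, here is a minimal sketch of this event-based F1 computation; it is not the official metrics file, and the way overlapping candidates are matched may differ from the reference implementation.

# Sketch of the event-based F1 score: events are runs of consecutive 1s,
# and a predicted event matches a ground-truth event when their IoU >= 0.3.
import numpy as np

def extract_events(mask):
    """Return (start, end) index pairs of consecutive-1 runs (end exclusive)."""
    padded = np.concatenate(([0], mask, [0]))
    diff = np.diff(padded)
    starts = np.where(diff == 1)[0]
    ends = np.where(diff == -1)[0]
    return list(zip(starts, ends))

def iou(a, b):
    """Intersection over union of two (start, end) intervals."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def event_f1(true_masks, pred_masks, iou_threshold=0.3):
    tp = fp = fn = 0
    for t_mask, p_mask in zip(true_masks, pred_masks):
        true_events = extract_events(np.asarray(t_mask))
        pred_events = extract_events(np.asarray(p_mask))
        matched = set()
        for pe in pred_events:
            hit = next((i for i, te in enumerate(true_events)
                        if i not in matched and iou(pe, te) >= iou_threshold), None)
            if hit is None:
                fp += 1
            else:
                matched.add(hit)
                tp += 1
        fn += len(true_events) - len(matched)
    # Convention for the degenerate all-empty case is an assumption here.
    return 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0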
The goal of this data challenge is to predict the "colour" of a product, given its image, title, and description. A product can be of multiple colours, making it a multi-label classification problem.
For example, in the Rakuten Ichiba catalogue, a product with the Japanese title タイトリスト プレーヤーズ ローラートラベルカバー (Titleist Players Roller Travel Cover) is associated with an image and sometimes with an additional description. The colour of this product is annotated as Red and Black. There are other products with different titles, images, possible descriptions, and associated colour attribute tags. Given this information on the products, as in the example above, this challenge proposes to build a multi-label classifier that assigns each product its corresponding colour attributes.
Metric
The metric used in this challenge to rank the participants is the weighted-F1 score.
The Scikit-Learn package has an F1-score implementation (link) that can be used for this challenge with its average parameter set to "weighted".
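For instance, a minimal sketch of this metric on multi-hot encoded labels; the array shapes and colour encoding below are assumptions, not the challenge's exact data format.

# Weighted F1 on multi-label colour annotations, rows = products, columns = colours.
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([[1, 0, 1],    # e.g. a product tagged Red and Black
                   [0, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0]])

score = f1_score(y_true, y_pred, average="weighted", zero_division=0)
print(score)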
This data challenge aims at introducing a new statistical model to predict and analyze air quality in large buildings using observations stored in the Oze-Energies database. Physics-based approaches, which build air quality simulation tools to reproduce complex building behaviours, are widespread in the most complex situations. The main drawbacks of such software for simulating the behaviour of transient systems are:
(i) the significant computational time required to run such models, as they integrate many noisy sources and a huge number of parameters and essentially require massive thermodynamics computations;
(ii) the fact that they often only output a single-point estimate at each time, without providing any uncertainty measure to assess the confidence of their predictions.
In order to analyze and predict future air quality, so as to alert and correct building management systems and ensure comfort and satisfactory sanitary conditions, this challenge aims at solving issue (ii), i.e. at designing models which take into account the uncertainty in the exogenous data describing external weather conditions and the occupation of the building. This will make it possible to provide confidence intervals on the air quality predictions, here on the humidity of the air inside the building.
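As an illustration only, one possible way to produce such prediction intervals is quantile regression; the sketch below uses synthetic data and hypothetical features, not the actual Oze-Energies observations or the challenge's reference model.

# Quantile regression giving a 90% prediction interval on indoor humidity.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                          # e.g. outdoor temperature, outdoor humidity, occupancy
y = 40 + 5 * X[:, 0] + rng.normal(scale=2, size=500)   # synthetic indoor humidity (%)

lower = GradientBoostingRegressor(loss="quantile", alpha=0.05).fit(X, y)
upper = GradientBoostingRegressor(loss="quantile", alpha=0.95).fit(X, y)

X_new = rng.normal(size=(5, 3))
print(np.c_[lower.predict(X_new), upper.predict(X_new)])   # interval bounds per sample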
The goal of the challenge is to classify traders into three categories: HFT, non-HFT and MIX.
According to the AMF in-house expert-based classification, which relies on the AMF's knowledge of market players, market players are divided into three categories: HFT, MIX and non-HFT.
From a set of behavioural variables based on order and transaction data, the challenger is invited to predict the category to which a given participant belongs.
The proposed classification algorithm will then be applied to other data sources for which market players are currently not well known by the AMF.
The objective of this challenge is to design a model capable of predicting the usage of some EV charging stations in Paris, more specifically the times when they are available, actively charging a car, plugged, offline or down.
To estimate the validity of the predictions, we propose to use two different measures: the coefficient of determination (R²), which measures the skill of the mean prediction, and the reliability, which measures the accuracy of the spread of the prediction.
The mathematical details are available in the associated file.
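As a reminder, the R² part of the evaluation can be computed with scikit-learn; the reliability measure is defined in the associated file and is not reproduced here, and the example values below are made up.

# Coefficient of determination on hypothetical observed vs. predicted station usage.
from sklearn.metrics import r2_score

y_true = [0.2, 0.5, 0.1, 0.7]
y_pred = [0.25, 0.4, 0.15, 0.6]
print(r2_score(y_true, y_pred))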
This challenge tackles the problem of land cover modeling. The goal is to predict the proportion of each land cover class for an input satellite image in which every pixel is assigned a land cover label.
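For intuition, a minimal sketch of the target quantity: class proportions derived from a per-pixel label mask. The class indices and mask values below are made up, not taken from the dataset.

# Class proportions from a per-pixel land cover label mask.
import numpy as np

label_mask = np.array([[0, 0, 1],
                       [2, 1, 1],
                       [2, 2, 2]])            # one land cover class id per pixel

n_classes = 3
proportions = np.bincount(label_mask.ravel(), minlength=n_classes) / label_mask.size
print(proportions)                            # [0.222..., 0.333..., 0.444...]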
In the input 2D wellbore data, the formation boundaries are represented by sinusoids capturing the azimuth* and amplitude* of the dip. Detecting and segmenting the sinusoids manually is a tedious, time-consuming task that can take experts up to several hours for one well. Therefore, we aim to leverage the power of machine learning and deep learning to automatically detect and segment those dips (sinusoids), not only to save time but also to improve segmentation performance.
In this project, we suggest developing a deep learning approach to segment the dips given an input electromagnetic image map. The size of the model matters, so it will be an aspect to consider in this data challenge.
* amplitude: the magnitude of the sinusoid
* azimuth: the horizontal position of a given point, counted from the left side of the image block. It is generally expressed as an angle (knowing that the entire width of the image represents 360° around the wellbore)
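For intuition, a minimal sketch of how a dip appears as a sinusoid on the unrolled image; the pixel conventions, depth units and sign convention below are assumptions, not the dataset's exact format.

# A formation boundary traced as a sinusoid parameterised by azimuth and amplitude.
import numpy as np

width = 360                                  # image width covering 0-360 degrees
depth_of_dip = 1250.0                        # hypothetical mean row of the boundary
amplitude = 12.0                             # magnitude of the sinusoid, in rows
azimuth_deg = 140.0                          # horizontal position of the deepest point

columns = np.arange(width)
angles = columns / width * 360.0             # column index -> azimuth angle
trace = depth_of_dip + amplitude * np.cos(np.deg2rad(angles - azimuth_deg))
# trace[c] gives the row of the boundary at column c; a segmentation mask could
# be built by marking pixels close to this curve.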
More precisely, our thesaurus comprises a few hundred tags (e.g. blues-rock, electric-guitar, happy), grouped into classes (Genres, Instruments or Moods) and partitioned into categories (genres-orchestral, instruments-brass, mood-dramatic, etc.).
Each audio track of our database may be tagged with one or more labels of each class, so the auto-tagging process is a multi-label classification problem;
we can train neural networks to learn from audio features and generate numerical predictions that minimise the binary cross-entropy with respect to the one-hot encoded labelling of the dataset.
On the other hand, to display the tagging on our front-end we require a discrete, tag-wise labelling, so a further interpretation step is needed to convert the predictions into decisions, and we can then use more suitable metrics to evaluate the quality of the tagging.
We want the participants of the challenge to optimise this decision problem, leveraging all the information available from the ground truth and the global predictions to design a selection algorithm producing the most consistent labelling.
In other words, build a multi-label classifier that receives as input the predictions generated by our neural networks for all tags and their categories.
Our suggested benchmark is a column-wise thresholding (see the sketch below), so this strategy uses neither the categorical predictions nor the possible correlations between tags.
For example, a more row-oriented approach (for each track, selecting a tag based on its prediction value relative to the predictions for the other tags) or a hierarchical strategy (deciding on categories first, then choosing tags among the selected categories) may improve the final classification.
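A minimal sketch of the column-wise thresholding benchmark; the threshold values and matrix shapes are illustrative, and the official benchmark may tune each per-tag threshold differently (e.g. on a validation set).

# Column-wise thresholding: one threshold per tag, applied to the prediction matrix.
import numpy as np

predictions = np.random.rand(1000, 50)            # (n_tracks, n_tags) network outputs in [0, 1]
thresholds = np.full(predictions.shape[1], 0.5)   # one threshold per tag (column)

decisions = (predictions >= thresholds).astype(int)   # 1 = tag selected for the track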
If we find an illiquid asset to be untradeable, then the signal of this asset should not result in a trading position. To counteract this difficulty, an alternative would be to project the signals from illiquid assets onto liquid ones.
To do so, the proposed challenge aims at determining the link at a given time $t$ between the returns of illiquid and liquid assets. The one-day return of a stock $j$ at time $t$ with price $P_t^j$ (adjusted for dividends and stock splits) is defined as:
$$R_t^j = \frac{P_t^j}{P_{t-1}^j} - 1.$$
Let $Y_t = (R_t^1, \ldots, R_t^L)$ be the returns of $L$ liquid assets and $X_t = (R_t^1, \ldots, R_t^N)$ be the returns of $N$ illiquid assets at a given time $t$. The objective of this challenge is to determine a mapping function $\eta : \mathbb{R}^N \to \mathbb{R}^L$ that would link the returns of the $N$ illiquid assets to the returns of the $L$ liquid assets such that $Y_t = \eta(X_t)$.
Since predictive signals can be seen as estimated returns, the signals generated by QRT on the $N$ illiquid assets, denoted $\hat{X}_t$, can be mapped to projected signals $\hat{Y}_t$ on the $L$ liquid instruments such that $\hat{Y}_t = \eta(\hat{X}_t)$. However, since $\eta$ is purely theoretical, the mapping must rely on approximations. Therefore, the idea is to estimate a model $\hat{\eta}$ that predicts the returns of $L = 100$ liquid instruments, using the returns of $N = 100$ illiquid instruments, given historical data.
The model $\hat{\eta}$ can then be seen as a multi-output prediction of $L$ returns, or as the combination of $L$ models $\hat{\eta}^j$, for $j = 1, \ldots, L$, that would individually predict the return of each liquid instrument $j$.
For simplicity and practical reasons, we chose to turn this challenge into a classification problem: in practice, we are more interested in being right about the trend than about the exact value. Thus, instead of predicting the returns of the liquid assets, the estimated model $\hat{\eta}$ shall predict the signs of those returns.
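As an illustration only, here is one possible baseline, not the challenge's reference model: predicting the sign of each liquid asset's return from the illiquid signals with one logistic regression per liquid instrument. The data below is synthetic and the shapes are hypothetical.

# Multi-output sign prediction: one classifier per liquid instrument.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 100))                     # signals on N = 100 illiquid assets
W = rng.normal(size=(100, 100))
Y = np.sign(X @ W + rng.normal(size=(2000, 100)))    # signs of L = 100 liquid returns

model = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
Y_hat = model.predict(X)                             # entries in {-1, 1}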
The metric used to judge the quality of the predictions is a custom weighted accuracy defined by:
$$f(y, \hat{y}) = \frac{1}{\|y\|_1} \sum_{i=1}^{n} |y_i| \times \mathbf{1}_{\hat{y}_i = \mathrm{sign}(y_i)},$$
where $\mathbf{1}_{\hat{y}_i = \mathrm{sign}(y_i)}$ is equal to 1 if the $i$-th prediction $\hat{y}_i \in \{-1, 1\}$ has the same sign as the $i$-th true value $y_i$. This metric gives more importance to the correct classification of high-value returns. Indeed, it can be more important to be right on a 7% move than on a 0.5% move.