Challenge Data

Prediction of daily stock movements on the US market
by CFM



Competitive challenge
Economic sciences
Time series
10MB to 1GB
Intermediary level


Started on Jan. 1, 2019

Challenge context

Predicting stock returns is key to successful investing. Returns are very close to random, which makes the task very hard, yet they do exhibit some predictability. Modern machine learning techniques, combined with cleverly engineered features derived from the data, might push past the limits of more classic approaches to predicting stock returns.

A video presentation (in French) of the challenge is also available (first presentation of the series).

Challenge goals

The goal of this challenge is to predict the sign of the end-of-day return (a return being the price change over some time interval) for about 700 stocks over about 700 days. The returns are residual returns: part of the overall market movement has been removed from the raw stock returns. They are also relative, in that they were produced by first calculating price changes in percent. A value of 0 therefore means that the price behaved as could be expected from the overall market movement.

The input data contains returns over successive 5-minute periods, from the start of the trading day at 9:30 AM to 3:25 PM. The target is the sign of the return over the last period of the trading day (3:30 PM-4:00 PM); the training set, however, contains the return itself.

The metric is the accuracy of the predicted sign of the returns. Predictions can take any value, but each one is interpreted as a kind of probability that the return is positive: a value larger than 0.5 means that a positive return is predicted; otherwise a negative return is predicted.
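The scoring rule described above can be sketched in a few lines of Python. This is an illustrative reading of the rule, not the organizers' scoring code: the function name `sign_accuracy` is hypothetical, and the treatment of a return of exactly 0 (counted here as not positive) is an assumption, since the description does not specify it.

```python
def sign_accuracy(predictions, returns, threshold=0.5):
    """Fraction of rows where the predicted sign matches the realized sign.

    A prediction strictly above `threshold` is read as "positive return";
    anything else is read as "negative return".
    """
    if len(predictions) != len(returns):
        raise ValueError("predictions and returns must have the same length")
    # Assumption: a realized return of exactly 0 counts as not positive.
    hits = sum((p > threshold) == (r > 0) for p, r in zip(predictions, returns))
    return hits / len(predictions)


# Made-up values, for illustration only.
scores = [0.9, 0.2, 0.51, 0.5]
realized = [0.004, -0.001, -0.002, 0.003]
print(sign_accuracy(scores, realized))
```

Note that a prediction of exactly 0.5 is treated as a negative prediction, following the "larger than 0.5" wording.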

For additional information and questions (to CFM or to the community of participants), please head to the online forum dedicated to CFM Challenges.

Data description

The columns in the input data contain the following quantities:

  • ID: unique row identifier that matches between inputs and targets. Each ID corresponds to a unique pair (stock, date).
  • eqt_code: "equity code", i.e. unique stock identifier (the stocks are anonymized).
  • date: unique date identifier (the dates are anonymized).
  • 9:30:00-15:20:00: one column per 5-minute bin, named in the form HH:MM:00, where HH is the hour and MM the minutes of the start of the bin. Each column contains the return between the indicated time and 5 minutes later.
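To make the layout of the return columns concrete, the sketch below enumerates the bin labels implied by the description. The helper name `intraday_bin_labels` is hypothetical, and whether the actual files zero-pad the hour ("09:30:00" rather than "9:30:00") is not stated, so the `zero_pad` switch is an assumption to check against the real column names.

```python
from datetime import datetime, timedelta


def intraday_bin_labels(start="09:30", end="15:20", step_minutes=5, zero_pad=False):
    """List the start times of the 5-minute bins, formatted as HH:MM:00."""
    current = datetime.strptime(start, "%H:%M")
    last = datetime.strptime(end, "%H:%M")
    labels = []
    while current <= last:
        label = current.strftime("%H:%M:00")
        # Assumption: the description writes "9:30:00", so drop one leading zero
        # unless zero-padding is requested.
        if not zero_pad and label.startswith("0"):
            label = label[1:]
        labels.append(label)
        current += timedelta(minutes=step_minutes)
    return labels


labels = intraday_bin_labels()
print(len(labels), labels[0], labels[-1])
```

Under this reading there are 71 return columns per row, the last one covering 3:20 PM to 3:25 PM.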

The columns of the target are as follows:

  • ID: corresponds to the ID of the (stock, date) pair for which the target is given.
  • end_of_day_return: this represents the return of the stock from 3:30 PM to 4:00 PM. The goal is to predict the sign of this return.
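Since the training targets hold the return itself while the task is to predict its sign, a first step is usually to derive a binary label and align it with the inputs through ID. The sketch below uses plain dictionaries, as `csv.DictReader` would produce; the helper names and the toy rows are hypothetical, and mapping a return of exactly 0 to the negative class is an assumption.

```python
def binary_targets(target_rows):
    """Map each ID to 1 if end_of_day_return is positive, else 0."""
    # Assumption: a return of exactly 0 is labeled 0 (not positive).
    return {
        row["ID"]: int(float(row["end_of_day_return"]) > 0)
        for row in target_rows
    }


def align(input_rows, labels):
    """Pair each input row with its label, skipping IDs without a target."""
    return [(row, labels[row["ID"]]) for row in input_rows if row["ID"] in labels]


# Made-up rows, for illustration only; real rows carry all the 5-minute columns.
inputs = [
    {"ID": "0", "eqt_code": "A", "date": "1", "9:30:00": "0.12"},
    {"ID": "1", "eqt_code": "B", "date": "1", "9:30:00": "-0.05"},
]
targets = [
    {"ID": "1", "end_of_day_return": "-0.031"},
    {"ID": "0", "end_of_day_return": "0.008"},
]

labels_by_id = binary_targets(targets)
pairs = align(inputs, labels_by_id)
print([label for _, label in pairs])
```

Joining on ID rather than relying on row order keeps inputs and targets matched even when the two files are sorted differently.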

Benchmark description

The benchmark is a LightGBM boosted trees model, with the following parameters:

  • subsample_for_bin: 200000
  • boosting_type: gbdt
  • silent: True
  • min_child_samples: 20
  • min_child_weight: 0.001
  • min_split_gain: 0.0
  • random_state: None
  • objective: None
  • class_weight: None
  • max_depth: -1
  • num_leaves: 31
  • reg_alpha: 0.0
  • subsample_freq: 1
  • learning_rate: 0.05
  • n_estimators: 500
  • colsample_bytree: 0.8
  • subsample: 0.9
  • reg_lambda: 0.0

This model correctly predicts the sign of the returns in typically 51.8% of cases.
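The listed parameters can be collected into a configuration as sketched below. Several points are assumptions rather than facts from the description: that the benchmark is a classifier built with LightGBM's scikit-learn wrapper `LGBMClassifier`, that `silent` can be left out (newer LightGBM versions appear to control verbosity through `verbose` instead), and the helper name `make_benchmark_model`. The feature set and preprocessing used by the benchmark are not described, so this reproduces the hyperparameters only.

```python
# Hyperparameters as listed in the benchmark description ("silent" omitted, see above).
BENCHMARK_PARAMS = {
    "boosting_type": "gbdt",
    "num_leaves": 31,
    "max_depth": -1,
    "learning_rate": 0.05,
    "n_estimators": 500,
    "subsample_for_bin": 200000,
    "objective": None,
    "class_weight": None,
    "min_split_gain": 0.0,
    "min_child_weight": 0.001,
    "min_child_samples": 20,
    "subsample": 0.9,
    "subsample_freq": 1,
    "colsample_bytree": 0.8,
    "reg_alpha": 0.0,
    "reg_lambda": 0.0,
    "random_state": None,
}


def make_benchmark_model():
    """Build a classifier with the listed parameters (requires lightgbm)."""
    # Imported lazily so the configuration can be inspected without lightgbm installed.
    from lightgbm import LGBMClassifier

    return LGBMClassifier(**BENCHMARK_PARAMS)
```

With `random_state` left at None, and row and column subsampling enabled, repeated runs may give slightly different scores, which is consistent with the accuracy being quoted as typical rather than exact.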



The challenge provider


Founded in 1991, Capital Fund Management (CFM) is a successful alternative asset manager and a pioneer in the field of quantitative trading applied to capital markets across the globe. Our methodology relies on statistically robust analysis of terabytes of data to inform and enable our asset allocation, trading decisions and automated order execution. Our people’s diversity and dedication contribute to CFM’s unique culture of research, innovation and achievement. We are a Great Place to Work company and we offer a collaborative and informal work environment, attractive offices and facilities.