Challenge Data

Stock trading: prediction of auction volumes
by CFM

Started on Jan. 4, 2021

Challenge context

In many stock exchanges, at the end of a trading day, an auction takes place for each stock. Each stock is then exchanged at a single price, based on the interest that market participants show in the auction.

It is advantageous to do some trading during this auction instead of during the preceding continuous trading, as trading costs are usually lower.

In addition, some market participants (day traders, market makers…) prefer to not hold stocks overnight, because events might affect the stock price between the close of the market and the price at the open on the next day (elections, company announcements, etc.), which may result in a loss. For these market participants, the auction is the last chance to limit such a risk, as it is an opportunity to get rid of their remaining stocks and not hold any overnight.

For market participants that instead want to hold a specific number of stocks by the end of the day (asset managers…), the auction is their last chance to reach this target. This is in principle important, because they have optimized this number of stocks to hold. For example, if they predict that the price of a stock should rise, then it is in their interest to buy as much of the stock as possible, within the limits set by how much they can invest and by the financial risk that they are ready to take.

Market participants may thus want to estimate the expected number of stocks available during the auction, as this allows them to gauge how many they can hope to buy or sell during this final, financially advantageous trading opportunity.

Participants can ask questions, find answers and share their findings on the official CFM Data Challenge forum.

Challenge goals

The goal of this year's challenge is to predict the volume (total value of stock exchanged) available for auction, for 900 stocks over about 350 days.

Data description

Input data

The prediction of the auction volume for a given stock on a given day can be made based on the following 126 input columns:

  • pid: a Product ID, that represents a stock.
  • day: day of the data sample, as an integer. The ordering is chronological, with day 0 coming before day 1, etc.
  • abs_retn (n from 0 to 60): absolute value of the stock return (relative price change, as a percentage) over period n, measured between the last known price (typically the price at the beginning of period n) and the price at the end of period n. The 61 periods cover a good part of the trading day, do not overlap, and have the same duration. Return n=0 comes before return n=1, etc.
  • rel_voln (n from 0 to 60): like abs_retn, but represents the stock volume traded during period n, as a fraction of the total volume traded over all 61 periods (thus, for a given stock and day, the values sum to 1). The periods are the same as for the returns.
  • LS and NLV: two quantities associated with the trades of the day for the stock in question. Their nature is kept undisclosed for this challenge.
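As a quick sanity check on the column count above, the layout can be sketched as follows. This is only an illustration: the exact spelling of the column names (for instance "abs_ret0" rather than "abs_ret_0") is an assumption to verify against the actual files, and the helper names are hypothetical.

```python
import math

N_PERIODS = 61  # n runs from 0 to 60


def expected_columns(n_periods=N_PERIODS):
    """Build the list of input column names (naming pattern is an assumption)."""
    cols = ["pid", "day"]
    cols += [f"abs_ret{n}" for n in range(n_periods)]
    cols += [f"rel_vol{n}" for n in range(n_periods)]
    cols += ["LS", "NLV"]
    return cols


def rel_vol_sum(row, n_periods=N_PERIODS):
    """Sum the rel_vol values of one row (a dict); return None if any value is missing."""
    values = [row.get(f"rel_vol{n}") for n in range(n_periods)]
    if any(v is None or math.isnan(v) for v in values):
        return None
    return sum(values)


print(len(expected_columns()))  # 2 + 61 + 61 + 2 columns
```

For a row without missing values, rel_vol_sum should return a value close to 1, since the relative volumes are fractions of the same daily total.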

Output data

The output data contains, for a given stock and a given day, the natural logarithm of the ratio between the auction volume (= total value of traded stocks) and the total volume traded over the 61 given periods. Thus, if the auction volume represents 10 % of the volume traded over all the periods of a day, the target is log 0.10 = -2.30…
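The target definition can be illustrated with a small sketch. The function names and the example volumes are hypothetical; only the formula (natural logarithm of the auction volume divided by the total volume over the periods) comes from the description above.

```python
import math


def auction_target(auction_volume, period_volumes):
    """Natural log of the auction volume as a fraction of the total period volume."""
    total = sum(period_volumes)
    return math.log(auction_volume / total)


def auction_fraction(target):
    """Invert the target: recover the auction volume as a fraction of the period volume."""
    return math.exp(target)


# Made-up example: 61 periods of equal volume, auction worth 10 % of their total.
periods = [100.0] * 61
target = auction_target(610.0, periods)
print(round(target, 2))
```

Applying auction_fraction to a predicted target gives back the auction volume as a fraction of the volume traded over the periods, which can be easier to interpret than the logarithm.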

Training and test data

The 900 stocks found in the training and test data are the same: it is therefore in principle possible to devise predictions that are customized for each stock.

The training data contains information on about 800 different days, while the test data requires auction volume predictions for about 350 days.

Furthermore, the test inputs correspond to days that come after those of the training data. A challenge is that auction volumes can evolve over time (for instance by becoming relatively larger and larger), but we only see what the past (training) auction volumes look like.
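Because the test days come after the training days, one way to mimic this setting locally is to hold out the most recent training days for validation. The sketch below assumes rows are available as dictionaries with a "day" key; the function name and the number of held-out days are hypothetical choices, not part of the challenge.

```python
def split_by_day(rows, n_validation_days):
    """Hold out the last n_validation_days distinct days as a validation set."""
    if n_validation_days <= 0:
        raise ValueError("n_validation_days must be positive")
    days = sorted({row["day"] for row in rows})
    held_out = set(days[-n_validation_days:])
    train = [row for row in rows if row["day"] not in held_out]
    valid = [row for row in rows if row["day"] in held_out]
    return train, valid


# Made-up example: 2 stocks observed over 10 days, last 3 days held out.
rows = [{"pid": pid, "day": day} for day in range(10) for pid in (0, 1)]
train, valid = split_by_day(rows, n_validation_days=3)
```

Every validation day is then later than every training day, so a drift in auction volumes over time shows up in the validation error rather than being hidden by a random split.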

Benchmark description

The benchmark is built by first imputing all the missing values with zeros, and then performing a linear fit on all the columns, stock Product ID (pid) excepted.

The difference between the predictions of this linear model and the known (training) targets is then predicted by gradient tree boosting (with LightGBM) on all the columns (hence including the stock IDs). The boosted trees are trained with early stopping, and with a cross-validation that evaluates predictions on future data after training on (mostly) past data (with sklearn.model_selection.TimeSeriesSplit). All the other settings are the default ones.




The challenge provider


Founded in 1991, Capital Fund Management (CFM) is a successful alternative asset manager and a pioneer in the field of quantitative trading applied to capital markets across the globe. Our methodology relies on statistically robust analysis of terabytes of data to inform and enable our asset allocation, trading decisions and automated order execution. Our people’s diversity and dedication contribute to CFM’s unique culture of research, innovation and achievement. We are a Great Place to Work company and we offer a collaborative and informal work environment, attractive offices and facilities.