Quantitative investment strategies rely on the analysis of historical data to predict the trend of a stock in the near future. However, the extremely low signal-to-noise ratio makes this a very challenging problem. Extracting weak signals from the enormous amount of data available in the market is a key goal for Qube RT. To that end, machine learning techniques can be used to make better trading decisions through deep analysis of thousands of different data sources. In a financial world in constant motion, it is extremely difficult to detect the patterns that make a stock move up or down. This challenge illustrates the problem of stock return prediction.
The proposed challenge aims at predicting the return of a stock in the US market using historical data over a recent period of 20 days. The one-day return of a stock $j$ on day $t$ with price $P_j^t$ (adjusted for dividends and stock splits) is given by:
$$R_j^t = \frac{P_j^t}{P_j^{t-1}} - 1$$
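As a minimal sketch of this definition, assuming the standard simple return $P_j^t / P_j^{t-1} - 1$ on adjusted prices, the snippet below computes one-day returns for a single stock. The helper name `one_day_returns` and the price values are illustrative assumptions, not part of the challenge material.

```python
from typing import List


def one_day_returns(prices: List[float]) -> List[float]:
    """Simple one-day returns P[t] / P[t-1] - 1 for t = 1 .. len(prices) - 1.

    `prices` is assumed to hold adjusted prices of one stock in chronological order.
    """
    if any(p <= 0 for p in prices):
        raise ValueError("prices must be strictly positive")
    return [prices[t] / prices[t - 1] - 1.0 for t in range(1, len(prices))]


# Made-up adjusted closing prices for three consecutive days.
adjusted_prices = [100.0, 102.0, 99.96]
returns = one_day_returns(adjusted_prices)  # roughly +2% then -2%
```

Note that the challenge data already contain residual returns, so this computation is only needed to understand the definition, not to build the features.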
In this challenge, we consider the residual stock return, which corresponds to the return of a stock with the market impact removed. Historical data are composed of residual stock returns and relative volumes, sampled each day over the last 20 business days (approximately one month). The relative volume $\tilde{V}_j^t$ at time $t$ of a stock $j$ among the $n$ stocks is defined in terms of $V_j^t$, the volume at time $t$ of stock $j$. We also give additional information about each stock, such as its industry and sector.
The metric considered is the accuracy of the predicted residual stock return sign.
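The metric can be sketched as follows. This is an illustrative implementation under two assumptions: the sign is encoded as a boolean (True for a strictly positive residual return), and a return of exactly zero counts as non-positive. The function name `sign_accuracy` and the sample values are hypothetical.

```python
from typing import Sequence


def sign_accuracy(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """Fraction of rows where the predicted sign equals the realized sign."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one observation is required")
    hits = sum(t == p for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


# Made-up residual returns; only their signs matter for the metric.
actual_returns = [0.012, -0.004, 0.007, -0.020]
predicted_returns = [0.003, 0.001, 0.010, -0.005]

y_true = [r > 0 for r in actual_returns]      # assumption: zero counts as non-positive
y_pred = [r > 0 for r in predicted_returns]
score = sign_accuracy(y_true, y_pred)         # 3 of 4 signs match
```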
Three datasets are provided as CSV files: training inputs, training outputs, and test inputs.
Input datasets comprise 47 columns: the first column, ID, contains unique row identifiers, while the other 46 descriptive features are:
• DATE: an index of the date (the dates are randomized and anonymized so there is no continuity or link between any dates),
• STOCK: an index of the stock,
• INDUSTRY: an index of the stock industry domain (e.g., aeronautic, IT, oil company),
• INDUSTRY_GROUP: an index of the industry group,
• SUB_INDUSTRY: a lower level index of the industry,
• SECTOR: an index of the work sector,
• RET_1 to RET_20: the historical residual returns over the last 20 days (i.e., RET_1 is the return of the previous day and so on),
• VOLUME_1 to VOLUME_20: the historical relative volumes traded over the last 20 days (i.e., VOLUME_1 is the relative volume of the previous day and so on).
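The column layout described above can be checked with a small helper before any modelling. The sketch below builds the 47 expected names and reports which ones are absent from a header; the function names are hypothetical, and the column order inside the actual files is not assumed.

```python
from typing import Iterable, List


def expected_input_columns(n_days: int = 20) -> List[str]:
    """ID plus the 46 descriptive features listed in the challenge description."""
    meta = ["ID", "DATE", "STOCK", "INDUSTRY", "INDUSTRY_GROUP", "SUB_INDUSTRY", "SECTOR"]
    returns = [f"RET_{k}" for k in range(1, n_days + 1)]
    volumes = [f"VOLUME_{k}" for k in range(1, n_days + 1)]
    return meta + returns + volumes


def missing_columns(header: Iterable[str]) -> List[str]:
    """Expected columns that do not appear in `header` (order is ignored)."""
    present = set(header)
    return [name for name in expected_input_columns() if name not in present]


columns = expected_input_columns()
# Simulate a header that lacks one return column.
incomplete_header = [name for name in columns if name != "RET_3"]
absent = missing_columns(incomplete_header)
```

In practice the header would come from the first row of the input CSV file, for example via `csv.reader`.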
Output datasets contain only two columns:
• ID: the unique row identifier (corresponding to the input identifiers),
• RET: the binary target, i.e., the sign of the residual stock return at time t.
The solution files submitted by participants must follow this output dataset format (i.e., contain only two columns, ID and RET, where the ID values correspond to those of the input test data). An example submission file containing random predictions is provided.
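A submission in this two-column format could be produced along the following lines. This is a sketch only: the function name `write_submission`, the ID values, and the True/False encoding of RET are assumptions, so the provided example submission file remains the reference for the exact encoding.

```python
import csv
import io
import random
from typing import Sequence, TextIO


def write_submission(ids: Sequence[int], predictions: Sequence[bool], out: TextIO) -> None:
    """Write a two-column CSV (ID, RET) to an open text stream."""
    if len(ids) != len(predictions):
        raise ValueError("ids and predictions must have the same length")
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["ID", "RET"])
    for row_id, pred in zip(ids, predictions):
        # Assumption: RET is written as True/False.
        writer.writerow([row_id, bool(pred)])


rng = random.Random(0)                      # fixed seed for reproducibility
test_ids = [101, 102, 103]                  # hypothetical identifiers from the test inputs
random_predictions = [rng.random() < 0.5 for _ in test_ids]

buffer = io.StringIO()                      # stands in for a file opened with newline=""
write_submission(test_ids, random_predictions, buffer)
submission_text = buffer.getvalue()
```

Writing to a stream rather than a path keeps the helper easy to test; passing a file opened with `open(path, "w", newline="")` works the same way.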
418,595 observations (i.e., rows) are available in the training datasets, while 198,429 observations are used for the test datasets.
We propose a simple baseline using a random forest fitted with 500 trees and a maximum depth of 8. Only the 5 most recent stock returns and relative volumes are used, along with STOCK and an additional feature representing the mean of RET_1 conditional on DATE and SECTOR. Missing values are filled with 0.
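The feature preparation for such a baseline might look like the sketch below, written with the standard library only. It is not the benchmark notebook's code: the grouping keys used here, DATE and SECTOR, are an assumption chosen from the listed input columns, the names `add_group_mean`, `fill_missing`, and `RET_1_GROUP_MEAN` are hypothetical, and computing the group mean before filling missing values is also an assumption.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def add_group_mean(rows: List[Dict], value_col: str = "RET_1",
                   group_cols: Sequence[str] = ("DATE", "SECTOR"),
                   new_col: str = "RET_1_GROUP_MEAN") -> None:
    """Add, in place, the mean of `value_col` within each group of `group_cols`.

    Missing values (None) are ignored when averaging; a group with no
    observed value gets None.
    """
    sums: Dict[tuple, float] = defaultdict(float)
    counts: Dict[tuple, int] = defaultdict(int)
    for row in rows:
        value = row.get(value_col)
        if value is not None:
            key = tuple(row[c] for c in group_cols)
            sums[key] += value
            counts[key] += 1
    for row in rows:
        key = tuple(row[c] for c in group_cols)
        row[new_col] = sums[key] / counts[key] if counts[key] else None


def fill_missing(rows: List[Dict], fill_value: float = 0.0) -> None:
    """Replace every missing value (None) with `fill_value`, in place."""
    for row in rows:
        for col, value in row.items():
            if value is None:
                row[col] = fill_value


# Made-up rows restricted to the columns needed for the illustration.
rows = [
    {"DATE": 0, "SECTOR": 3, "RET_1": 0.01},
    {"DATE": 0, "SECTOR": 3, "RET_1": 0.03},
    {"DATE": 0, "SECTOR": 7, "RET_1": None},
    {"DATE": 1, "SECTOR": 3, "RET_1": -0.02},
]
add_group_mean(rows)
fill_missing(rows)
```

The resulting columns, together with RET_1 to RET_5 and VOLUME_1 to VOLUME_5, could then be passed to a random forest classifier, for instance scikit-learn's `RandomForestClassifier` with `n_estimators=500` and `max_depth=8`.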
The public score of this benchmark is 51.31%. A notebook explaining the generation of the benchmark is available in the supplementary files.
The challenge provider
Qube Research & Technologies Limited is a quantitative and systematic investment manager employing around 150 people, with offices in Hong Kong, London, Mumbai and Paris. We are a technology-driven firm implementing a scientific approach to financial investment. QRT’s market presence is global and extends across the largest liquid electronic venues. The combination of data, research, technology and trading expertise has shaped our DNA and is at the heart of our innovation and development dynamic. The firm acts as an investment manager managing an open-ended Fund used for the management of third-party capital.