# Challenge Data

### Stock Return Prediction by QRT

#### Description

Competitive challenge
Economic sciences
Finance
Classification
Time series
10MB to 1GB
Intermediary level

##### Dates

Started on Jan. 6, 2020

##### Challenge context

Quantitative investment strategies rely on the analysis of historical data to predict the trend of a stock in the near future. However, the extremely low signal-to-noise ratio makes this a very challenging problem. Extracting weak signals from the enormous amount of data available in the market is a key goal for Qube RT. To do so, machine learning techniques can be used to make better trading decisions through deep analysis of thousands of different data sources. In a financial world in constant movement, it is extremely difficult to detect patterns that make a stock move up or down. This challenge is an illustration of stock return prediction.

Feel free to visit and register on our dedicated forum at https://challengedata.qube-rt.com/ for more information about the challenge, the data and QRT.

You can watch a video discussing the challenge here.

##### Challenge goals

The proposed challenge aims at predicting the return of a stock in the US market using historical data over a recent period of 20 days. The one-day return of a stock $j$ on day $t$ with price $P_j^t$ (adjusted for dividends and stock splits) is given by:

$R_j^t = \frac{P_j^t}{P_j^{t-1}} - 1$
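As an illustration of the formula above, here is a minimal sketch that computes one-day returns from a series of adjusted prices. The function name `one_day_returns` and the price values are made up for the example; the challenge data itself already contains (residual) returns rather than prices.

```python
import numpy as np


def one_day_returns(prices):
    """Compute R^t = P^t / P^(t-1) - 1 for every day after the first one.

    `prices` is a 1-D sequence of adjusted prices for a single stock,
    ordered from the oldest day to the most recent day.
    """
    prices = np.asarray(prices, dtype=float)
    return prices[1:] / prices[:-1] - 1.0


# Made-up adjusted prices for three consecutive days (illustration only).
adjusted_prices = [100.0, 102.0, 99.96]
returns = one_day_returns(adjusted_prices)  # a gain of 2% followed by a loss of 2%
```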

In this challenge, we consider the residual stock return, which corresponds to the return of a stock without the market impact. Historical data are composed of residual stock returns and relative volumes, sampled each day during the last 20 business days (approximately one month). The relative volume $\mathcal V_j^t$ at time $t$ of a stock $j$ among the $n$ stocks is defined by:

$\begin{aligned} \bar{V}_j^t &= \frac{V_j^t}{\mathrm{median}(\{ V_j^{t-1}, \dots, V_j^{t-20}\})} \\ \mathcal V_j^t &= \bar{V}_j^t - \frac{1}{n} \sum_{i=1}^n \bar{V}_i^t \end{aligned}$

where $V_j^t$ is the volume at time $t$ of a stock $j$. We also give additional information about each stock such as its industry and sector.
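The two-step definition of the relative volume can be sketched as follows. This is only an illustration under stated assumptions: the raw volumes are taken to be stored in an array of shape (days, stocks), and the function name `relative_volume` and the sample values are hypothetical. The provided datasets already contain relative volumes, so this computation is not needed to take part in the challenge.

```python
import numpy as np


def relative_volume(volumes, t, window=20):
    """Relative volume of every stock on day `t`.

    `volumes` is an array of shape (n_days, n_stocks) of raw traded volumes
    (an assumed layout, chosen for this example).
    """
    volumes = np.asarray(volumes, dtype=float)
    if t < window:
        raise ValueError("need `window` days of history before day t")
    past = volumes[t - window:t]                    # days t-20, ..., t-1
    scaled = volumes[t] / np.median(past, axis=0)   # volume over its trailing median
    return scaled - scaled.mean()                   # remove the cross-sectional mean


# Made-up volumes: 21 days of history for 5 stocks.
rng = np.random.default_rng(0)
volumes = rng.uniform(1e5, 1e6, size=(21, 5))
rel_vol = relative_volume(volumes, t=20)
```

By construction, the relative volumes of the $n$ stocks sum to zero on any given day.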

The metric considered is the accuracy of the predicted residual stock return sign.
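A minimal sketch of this metric is given below. It assumes that the sign is encoded as a boolean (True for a positive residual return); this encoding and the function name `sign_accuracy` are assumptions made for the example.

```python
import numpy as np


def sign_accuracy(y_true, y_pred):
    """Fraction of rows whose predicted sign matches the true sign."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    return float(np.mean(y_true == y_pred))


# Made-up labels: the first and third predictions are correct.
true_signs = [True, False, True, True]
predicted_signs = [True, True, True, False]
score = sign_accuracy(true_signs, predicted_signs)
```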

##### Data description

Three datasets are provided as CSV files: training inputs, training outputs, and test inputs.

Input datasets comprise 47 columns: the first ID column contains unique row identifiers while the other 46 descriptive features correspond to:

• β’ DATE: an index of the date (the dates are randomized and anonymized so there is no continuity or link between any dates),
• β’ STOCK: an index of the stock,
• β’ INDUSTRY: an index of the stock industry domain (e.g., aeronautic, IT, oil company),
• β’ INDUSTRY_GROUP: an index of the group industry,
• β’ SUB_INDUSTRY: a lower level index of the industry,
• β’ SECTOR: an index of the work sector,
• β’ RET_1 to RET_20: the historical residual returns among the last 20 days (i.e., RET_1 is the return of the previous day and so on),
• β’ VOLUME_1 to VOLUME_20: the historical relative volume traded among the last 20 days (i.e., VOLUME_1 is the relative volume of the previous day and so on),

Output datasets are only composed of 2 columns:

• β’ ID: the unique row identifier (corresponding to the input identifiers)
and the binary target:
• β’ RET: the sign of the residual stock return at time $t$

The solution files submitted by participants shall follow this output dataset format (i.e., contain only two columns, ID and RET, where the ID values correspond to the input test data). An example submission file containing random predictions is provided.
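The submission format can be sketched as follows with pandas. The ID values, the boolean encoding of RET and the helper name `make_submission` are assumptions made for the illustration; the example submission file provided with the challenge remains the reference for the exact format.

```python
import numpy as np
import pandas as pd


def make_submission(test_ids, predictions):
    """Build a two-column (ID, RET) table from test identifiers and predicted signs."""
    test_ids = np.asarray(test_ids)
    predictions = np.asarray(predictions, dtype=bool)
    if test_ids.shape != predictions.shape:
        raise ValueError("one prediction is required per test ID")
    return pd.DataFrame({"ID": test_ids, "RET": predictions})


# Made-up test IDs with random predictions (illustration only).
rng = np.random.default_rng(0)
test_ids = np.arange(5)
random_predictions = rng.integers(0, 2, size=test_ids.size).astype(bool)

submission = make_submission(test_ids, random_predictions)
csv_text = submission.to_csv(index=False)  # pass a file path instead to write to disk
```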

418595 observations (i.e. lines) are available for the training datasets while 198429 observations are used for the test datasets.

##### Benchmark description

We propose a simple baseline using a Random Forest fitted with 500 trees and a maximum depth of 8. Only the 5 previous stock returns and relative volumes are used, along with the STOCK and an additional feature representing the mean of RET_1 conditional on SECTOR and DATE. The missing values are filled with 0.

The public score of this benchmark is 51.31%. A notebook explaining the generation of the benchmark is available in the supplementary files.
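For readers who want a starting point, a baseline along these lines can be sketched with scikit-learn. This is not the official benchmark notebook and is not claimed to reproduce its score. The grouping keys used for the conditional mean of RET_1 (DATE and SECTOR), the name of the extra feature, the random seed and the synthetic demo data are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

N_LAGS = 5
EXTRA_FEATURE = "RET_1_SECTOR_DATE_mean"  # hypothetical feature name
FEATURES = (
    [f"RET_{k}" for k in range(1, N_LAGS + 1)]
    + [f"VOLUME_{k}" for k in range(1, N_LAGS + 1)]
    + ["STOCK", EXTRA_FEATURE]
)


def build_features(x):
    """Add the conditional mean of RET_1, select the features, fill missing values with 0."""
    x = x.copy()
    # Mean of RET_1 within each (SECTOR, DATE) group: assumed grouping keys.
    x[EXTRA_FEATURE] = x.groupby(["SECTOR", "DATE"])["RET_1"].transform("mean")
    return x[FEATURES].fillna(0)


def fit_baseline(x_train, y_train, n_estimators=500, max_depth=8, random_state=0):
    """Fit a Random Forest with the hyperparameters quoted in the text."""
    model = RandomForestClassifier(
        n_estimators=n_estimators, max_depth=max_depth, random_state=random_state
    )
    model.fit(build_features(x_train), y_train)
    return model


# Synthetic stand-in for the training data (illustration only).
rng = np.random.default_rng(0)
n_rows = 200
demo = pd.DataFrame(rng.normal(size=(n_rows, 2 * N_LAGS)), columns=FEATURES[: 2 * N_LAGS])
demo["STOCK"] = rng.integers(0, 10, size=n_rows)
demo["SECTOR"] = rng.integers(0, 3, size=n_rows)
demo["DATE"] = rng.integers(0, 5, size=n_rows)
demo.loc[0, "RET_2"] = np.nan  # a missing value, filled with 0 by build_features
target = rng.integers(0, 2, size=n_rows).astype(bool)

# Fewer trees than the 500 quoted above, only to keep the demo fast.
model = fit_baseline(demo, target, n_estimators=10)
predictions = model.predict(build_features(demo))
```

On the real data, `fit_baseline` would be called on the training inputs and outputs with its default hyperparameters, and `build_features` would then be applied to the test inputs before predicting.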

#### Files

Files are accessible once you are logged in and registered for the challenge.

#### The challenge provider

Qube Research & Technologies Group is a quantitative and systematic investment manager employing around 300 people with offices in Hong Kong, London, Mumbai, Paris and Singapore. We are a technology-driven firm implementing a scientific approach to financial investment. QRT’s market presence is global and expands across the largest liquid electronic venues. The combination of data, research, technology and trading expertise has shaped our DNA and is at the heart of our innovation and development dynamic. The firm acts as an investment manager managing open-ended funds used for management of third-party capital.

#### Congratulations to the winners of the challenge

1 Romain Poncet
2 -
3 Agnès François, Jessy Idez

You can find the whole list of winners of the season here