Market surveillance
Started on Jan. 5, 2022
To ensure that markets function properly, the AMF has developed a battery of algorithms that aim to detect atypical events among the tens of millions of market data points collected daily. There are various kinds of atypical events: some are easy to recognize, such as sudden variations in prices or volumes, or periods of high volatility with oscillating prices, but others are more sophisticated.
In order to focus the challenge on detecting the occurrence of an atypical event, rather than on detecting one particular type of event, the challenge context deliberately does not describe all the different kinds of events.
The goal of the challenge is to train machine learning models to look for anomalies within stock market data.
It is relatively easy to design one algorithm that seeks one specific kind of event, and then to implement as many algorithms as there are types of atypical events. However, it would also be beneficial to detect any type of atypical event with a single model that learns to recognize the features these different events have in common.
From a sample of market data (time series of prices and volumes), the challenger is invited to predict the existence of an atypical event on an hourly basis for each financial instrument. Any kind of approach can be tried, but the AMF is particularly interested in computer vision techniques applied to reconstructed time series plots, if the challenger thinks they are relevant.
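Since predictions are expected per hour and per instrument, a natural first step is to bucket the raw rows into hourly slots. The sketch below is only an illustration: the tuple layout, the ISIN values and the four aggregates are assumptions chosen for the example, not the actual file schema or the features used by the organizers.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical tick rows: (ISIN, timestamp, trade price, traded quantity).
ticks = [
    ("FR0000000001", "2021-03-01 09:05:12", 10.00, 100),
    ("FR0000000001", "2021-03-01 09:40:03", 10.20, 50),
    ("FR0000000001", "2021-03-01 10:15:47", 10.10, 200),
    ("FR0000000002", "2021-03-01 09:30:00", 55.00, 10),
]

def hourly_features(rows):
    """Group ticks by (instrument, hour) and compute simple aggregates."""
    buckets = defaultdict(list)
    for isin, ts, price, qty in rows:
        # Truncate the timestamp to the start of its hour.
        hour = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").replace(minute=0, second=0)
        buckets[(isin, hour)].append((price, qty))

    features = {}
    for key, trades in buckets.items():
        prices = [p for p, _ in trades]
        features[key] = {
            "n_trades": len(trades),
            "volume": sum(q for _, q in trades),
            "price_range": max(prices) - min(prices),
            "last_price": prices[-1],  # rows are assumed to be in time order
        }
    return features

features = hourly_features(ticks)
```

Each key of `features` is one (instrument, hour) slot, which is the granularity at which a prediction is requested.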
The dataset is stored in Excel files:
The label file (y_train) indicates, on an hourly basis and by financial instrument, when an event has been detected. When EVENT is set to TRUE, an event has been detected; when it is set to FALSE, no particular event was found in the slot. An event can occur within a short time period (~15 minutes): in this case, the label file does not specify exactly when in the hour the event takes place. If several events occur in the same hour, the label file does not distinguish them and treats the hour as containing a single event. When the label file indicates multiple events on the same day for the same financial instrument, the challenger should consider these events to be interlinked.
The train and test data each contain the equivalent of 1 month of data (train data precede test data) for 97 financial products. They contain the main information needed to gain a broad understanding of the market dynamics. The data correspond to all the orders and trades linked to the financial products within the time period. To gain a better understanding of the order book, here are several resources:
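To make the hourly granularity concrete, here is a small sketch of how individual event times collapse into one boolean per slot. How the AMF actually builds the labels is not described; the event list, the ISIN and the helper names `to_hour` and `hourly_labels` are hypothetical.

```python
from datetime import datetime

# Hypothetical detected events: (ISIN, event timestamp).
events = [
    ("FR0000000001", "2021-03-01 09:05:00"),
    ("FR0000000001", "2021-03-01 09:50:00"),  # same hour as the previous event
    ("FR0000000001", "2021-03-01 14:20:00"),
]

def to_hour(ts):
    """Truncate a timestamp string to the start of its hour."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").replace(minute=0, second=0)

def hourly_labels(events, slots):
    """Return {(isin, hour): bool}; several events in one hour count once."""
    hit = {(isin, to_hour(ts)) for isin, ts in events}
    return {slot: slot in hit for slot in slots}

# Hypothetical trading-day slots from 09:00 to 17:00.
slots = [("FR0000000001", datetime(2021, 3, 1, h)) for h in range(9, 18)]
labels = hourly_labels(events, slots)
```

The two events at 09:05 and 09:50 yield a single TRUE for the 09:00 slot, so the label carries neither the number of events nor their position within the hour.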
Train and test data exhibit in their rows the same time series for a given trading date t on a certain financial instrument a:
The following fields provide aggregated information about the existing interests of buyers and sellers (the best selling/buying prices and the available quantity immediately tradable):
Since a transaction is a match between one buyer and one seller, a trade is split into two rows in the data (one for the buy order and one for the matching sell order). The two rows must share the same transaction price #10, traded quantity #11, timestamp #2 and sequence number #3.
Intervals between trade timestamps show periods when buyers and sellers do not agree on a transaction price, but the evolution of the best ask price #12 and the best bid price #14 also provides information about market trends.
When there is no data for a certain financial instrument on a specific date in x_train_file or x_test_file, it means that there were no transactions. In this case, you should find only EVENT set to FALSE in y_train_file or y_test_file.
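The pairing rule can be checked with a simple grouping on the four shared fields. This is a minimal sketch: the dictionary keys (`timestamp`, `seq`, `side`, `trade_price`, `trade_qty`) are hypothetical stand-ins for fields #2, #3, #10 and #11, and it assumes that these four values together identify a single trade.

```python
from collections import defaultdict

# Hypothetical rows; key names are illustrative, not the real column names.
rows = [
    {"timestamp": "09:05:12.001", "seq": 17, "side": "BUY", "trade_price": 10.0, "trade_qty": 100},
    {"timestamp": "09:05:12.001", "seq": 17, "side": "SELL", "trade_price": 10.0, "trade_qty": 100},
    {"timestamp": "09:06:00.500", "seq": 18, "side": "BUY", "trade_price": 10.1, "trade_qty": 40},
]

def pair_trades(rows):
    """Group rows sharing timestamp, sequence number, price and quantity."""
    groups = defaultdict(list)
    for r in rows:
        key = (r["timestamp"], r["seq"], r["trade_price"], r["trade_qty"])
        groups[key].append(r)

    paired, unmatched = [], []
    for key, members in groups.items():
        sides = sorted(m["side"] for m in members)
        if sides == ["BUY", "SELL"]:
            paired.append(key)         # one buy row and one sell row: a full trade
        else:
            unmatched.extend(members)  # missing or duplicated counterpart
    return paired, unmatched

paired, unmatched = pair_trades(rows)
```

Counting each key in `paired` once avoids double-counting traded volume, and rows left in `unmatched` are worth inspecting before computing features.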
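One way to follow the market between trades is to track the mid-price and the bid-ask spread derived from the best bid and best ask. The sketch below assumes a hypothetical list of quote snapshots; the values and the helper name `mid_and_spread` are illustrative only.

```python
# Hypothetical quote snapshots: (timestamp, best bid price, best ask price).
quotes = [
    ("09:00:00", 9.98, 10.02),
    ("09:10:00", 10.00, 10.06),
    ("09:20:00", 10.05, 10.07),
]

def mid_and_spread(quotes):
    """Return (timestamp, mid-price, spread) for each snapshot."""
    return [(ts, (bid + ask) / 2, ask - bid) for ts, bid, ask in quotes]

series = mid_and_spread(quotes)

# Drift of the mid-price over the window, even if no trade occurred.
mid_change = series[-1][1] - series[0][1]
```

Here the mid-price drifts upward while the spread narrows, which is information available even when no transaction takes place.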
The current benchmark is a naive model that always predicts FALSE, whatever the timestamp or the ISIN. It scores well only because the classes are heavily imbalanced.
Several classical models run on aggregated features were tested and did not outperform this naive solution, which is not business relevant at all, since suspicious movements on the market have to be identified.
Example of a classical model: a basic random forest with 4 aggregated features computed on an hourly basis by financial instrument:
To achieve better performance, the challenger must pay close attention to overfitting and label imbalance, alongside the use of more complex models.
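The following sketch illustrates why an always-FALSE model can look good while being useless. The 2% event rate is a made-up figure for the example; the real class ratio and the official scoring metric are not stated here.

```python
def accuracy(y_true, y_pred):
    """Share of slots where the prediction matches the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred):
    """Share of real events that the model flags (0.0 if there are none)."""
    positives = sum(y_true)
    if positives == 0:
        return 0.0
    true_positives = sum(t and p for t, p in zip(y_true, y_pred))
    return true_positives / positives

# Hypothetical labels: 2 event slots out of 100.
y_true = [False] * 98 + [True] * 2
# Naive benchmark: always predict FALSE.
y_pred = [False] * len(y_true)

naive_accuracy = accuracy(y_true, y_pred)
naive_recall = recall(y_true, y_pred)
```

With these made-up labels the naive model is right on 98% of the slots yet detects none of the events, which is why accuracy alone says little in this setting.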
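One common way to account for label imbalance is to weight each class inversely to its frequency during training. The sketch below uses the heuristic n_samples / (n_classes * class_count), the same formula as scikit-learn's `class_weight="balanced"` option; the labels are hypothetical and this is one possible approach, not a method prescribed by the challenge.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Hypothetical labels: 2 event slots out of 100.
y_train = [False] * 98 + [True] * 2
weights = balanced_class_weights(y_train)

# Per-sample weights that can be passed to a learner supporting them.
sample_weights = [weights[y] for y in y_train]
```

With these labels each TRUE slot weighs 25.0 and each FALSE slot about 0.51, so both classes contribute equally in total. Because train data precede test data, holding out the most recent part of the training period for validation is a reasonable way to watch for overfitting.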