Market surveillance

by authority

Started on Jan. 4, 2021

Financial markets are made up of various types of market players, each with specific interests and hence heterogeneous behaviours. For regulators, it is important to know the different types of market players in order to better understand how their behaviour has an impact on the market. Since the emergence of High-Frequency Trading (HFT) [1] more than a decade ago, financial sector authorities as well as academics have widely studied the impact and influence on markets of these market players, who invest in powerful low-latency infrastructure to transact a large number of orders in fractions of a second [2].

HFTs, like other market players, submit orders to an electronic trading mechanism called a Limit Order Book (LOB). An order is a request to buy or sell a given quantity of an asset at a specified price, which allows buyers to be matched with sellers at a mutually agreed price. Market players can modify the price or the quantity of their orders as long as they remain in the LOB; they can also cancel them (modifications and cancellations are tagged as "events").

This challenge focuses on equity markets. Since a given stock can be traded on multiple trading venues (market fragmentation), and each trading venue has its own distinct LOB, participants can choose the venue to which they send an order. Thanks to their speed advantage, HFTs, who seek direct market gains from small price variations, often apply arbitrage strategies [3] across several trading venues. In addition, HFTs generally send more events into the LOB than other participants, or are at least able to send two events with the shortest possible time delta between them.

[1] High Frequency trading definition

[2] AMF paper on HFTs behavior on Euronext Paris

[3] Arbitrage definition

The goal of the challenge is to classify traders into three categories: HFT, MIX and non-HFT.

The AMF's in-house, expert-based classification, which draws on the AMF's knowledge of market players, divides them into three categories: HFT, MIX and non-HFT.

From a set of behavioural variables based on order and transaction data, the challenger is invited to predict the category to which a given participant belongs.

The proposed classification algorithm will then be applied to other data sources for which market players are currently not well known by the AMF.

Each market player $i$ (i.e. participant) is represented by a matrix $X_i$, each row $r_{a,t}$ of which holds that market player's behavioural variables computed for a given stock $a$ and a given trading date $t$. Since not all market players are active every day or across the whole scope of assets, the number of rows of $X_i$ varies from participant to participant. The columns of $X_i$ contain the features (detailed below).

The objective is to find the function $f$ such that $f(X_i) = y$, where $y \in \{\text{HFT}, \text{MIX}, \text{non-HFT}\}$; in other words, $y$ is the market player's category.
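The representation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `(Isin, Date, Trader, features...)` row layout, the helper name `build_matrices` and all the values are hypothetical, not taken from the challenge files.

```python
from collections import defaultdict

# Hypothetical rows: one row per (stock, trading date, trader), followed by
# feature values. All values are made up for illustration.
rows = [
    ("Isin_1", "Date_1", "Trader_1", 5.0, 2.3, 10.0),
    ("Isin_2", "Date_1", "Trader_1", 4.0, 2.0, 8.0),
    ("Isin_1", "Date_1", "Trader_2", 1.2, 1.1, 30.0),
]


def build_matrices(rows):
    """Group feature rows by trader: X[i] is the list of rows r_{a,t} of trader i."""
    matrices = defaultdict(list)
    for _isin, _date, trader, *features in rows:
        matrices[trader].append(features)
    return dict(matrices)


X = build_matrices(rows)
# The number of rows differs from one trader to another,
# while the number of columns (features) is the same for all.
for trader, matrix in X.items():
    print(trader, len(matrix), "rows x", len(matrix[0]), "features")
```

Any classifier $f$ therefore has to cope with a variable number of rows per participant, for instance by predicting on each row and then aggregating, or by summarising each matrix into a fixed-length vector first.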

Both the y_train and y_test files contain the market players' identification codes ($Trader_1, Trader_2,$ etc.) and the category they belong to ($Trader_1 = HFT, Trader_2 = HFT, Trader_3 = MIX$). Participants in the $MIX$ category are those who sometimes, but not systematically, use HFT algorithms.

The x_train and x_test files each contain the equivalent of 1 month of data (x_train data precede x_test data). The scope of market players is roughly the same in the y_train and y_test data, but the market players' identification codes have been changed, so that it is not possible to find who, for example, $Trader_1$ of the train (x and y) files is in the test (x and y) files.

Each row of the x_train and x_test data contains the same 35 features, computed for a given market player $i$ on a given stock (identified by its ISIN) and a specific trading date $t$:

- *#1* number of trading venues on which the market player trades;

- from all trading venues, statistics over the number of trades observed per second:
*#2* the mean, and *#3* the max

- statistics over the observed time delta between two trades on the trading venue TV_1[1]:
*#4* min, *#5* median, *#6* min

- statistics over the observed time delta between two trades occurring on trading venue TV_1 and then on trading venue TV_2:
*#7* min, *#8* median, *#9* min

- statistics over the observed time delta between two trades occurring on trading venue TV_1 and then on trading venue TV_3:
*#10* min, *#11* median, *#12* min

- statistics over the observed time delta between two trades occurring on trading venue TV_1 and then on trading venue TV_4:
*#13* min, *#14* median, *#15* min

- *#16* from all trading venues, number of seconds during the trading day where at least one trade of the market player i is observed

- on trading venue TV_1, three ratios between the number of all types of events[2] sent to the LOB and:
*#17* the number of trades (OTR), *#18* the number of cancellation-type events (OCR), *#19* the number of modification-type events (OMR)

- on trading venue TV_1, statistics over the observed time delta between two all-type events sent:
*#20* min, *#21* mean, *#22* 10th percentile, *#23* 1st quartile, *#24* median, *#25* 3rd quartile, *#26* 90th percentile, *#27* max

- on trading venue TV_1, statistics over the observed lifetime of cancelled orders:
*#28* min, *#29* mean, *#30* 10th percentile, *#31* 1st quartile, *#32* median, *#33* 3rd quartile, *#34* 90th percentile, *#35* max
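To make the time-delta statistics of the list above more concrete, here is a minimal sketch of how such a summary could be computed from raw event timestamps. It rests on several assumptions: the function name `delta_stats`, the timestamp unit and the "inclusive" percentile interpolation are illustrative choices, since the description does not state how the AMF computed these values.

```python
import statistics


def delta_stats(timestamps):
    """Summary statistics of the time deltas between consecutive events.

    `timestamps` are event times in an arbitrary unit (assumption).
    """
    ts = sorted(timestamps)
    deltas = [b - a for a, b in zip(ts, ts[1:])]
    if len(deltas) < 2:
        raise ValueError("need at least three timestamps")
    # Percentile conventions differ between tools; "inclusive" is an assumption.
    deciles = statistics.quantiles(deltas, n=10, method="inclusive")
    quartiles = statistics.quantiles(deltas, n=4, method="inclusive")
    return {
        "min": min(deltas),
        "mean": statistics.fmean(deltas),
        "p10": deciles[0],
        "q1": quartiles[0],
        "median": statistics.median(deltas),
        "q3": quartiles[2],
        "p90": deciles[-1],
        "max": max(deltas),
    }


# Made-up timestamps; the deltas are 1, 2, 3 and 4.
stats = delta_stats([0.0, 1.0, 3.0, 6.0, 10.0])
print(stats)
```

The same helper would apply to order lifetimes by passing the lifetimes directly instead of differencing timestamps.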

The features above are not detailed in the same order as in the challenge files.

For example, based on the columns' order in the train files, "Isin_1, Date_1, Trader_1, 5, 2.3, 10, …" means that for $Isin_1$ on $Date_1$, we observed that $Trader_1$ has an OTR equal to 5, an OCR equal to 2.3 and an OMR equal to 10.
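As a rough illustration of how the three ratios relate to event counts, the sketch below divides the total number of events (transactions plus new-order, modification and cancellation messages) by the number of trades, cancellations and modifications respectively. The function name `event_ratios`, the handling of a zero denominator and the counts themselves are assumptions chosen so that the output matches the example values above; they are not taken from the challenge data.

```python
def event_ratios(n_new, n_modify, n_cancel, n_trades):
    """Return OTR, OCR and OMR for one (trader, stock, date) on one venue."""
    # Events include both the transactions and the messages sent to the LOB.
    total_events = n_new + n_modify + n_cancel + n_trades

    def ratio(denominator):
        # Returning NaN for an empty denominator is an assumption.
        return total_events / denominator if denominator else float("nan")

    return {
        "OTR": ratio(n_trades),   # events per trade
        "OCR": ratio(n_cancel),   # events per cancellation
        "OMR": ratio(n_modify),   # events per modification
    }


# Hypothetical counts: 230 events in total.
ratios = event_ratios(n_new=61, n_modify=23, n_cancel=100, n_trades=46)
print(ratios)
```

With these made-up counts, the ratios come out at 5, 2.3 and 10.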

Finally, the link between test and train data can be established thanks to the market player’s identification code.

[1] TV_1 is the trading venue with the highest volume traded

[2] Events include both the transactions and the messages that market players can send to the LOB: new order, order modification or order cancellation.

A basic random forest, combined with threshold-based rules on the share of each market player's rows predicted in each category, gave us a micro-averaged F1-score of ~90%. For this "naive" model, we considered that a market player is:

- an HFT participant if at least 85% of its rows are predicted as HFT,
- a MIX participant if at least 50% of its rows are predicted as MIX.

Otherwise, the model considers the market player to be non-HFT.
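The aggregation rule of this benchmark can be sketched as follows. Only the 85% and 50% thresholds come from the description above; the function name `classify_trader`, the label strings and the sample predictions are hypothetical, and the row-level predictions are assumed to come from an already trained classifier.

```python
from collections import Counter


def classify_trader(row_predictions, hft_share=0.85, mix_share=0.50):
    """Turn the row-level predictions of one trader into a single category."""
    if not row_predictions:
        raise ValueError("a trader needs at least one predicted row")
    counts = Counter(row_predictions)
    n_rows = len(row_predictions)
    if counts["HFT"] / n_rows >= hft_share:
        return "HFT"
    if counts["MIX"] / n_rows >= mix_share:
        return "MIX"
    return "non-HFT"


# Made-up row-level predictions for three traders.
predictions = {
    "Trader_1": ["HFT"] * 9 + ["MIX"],            # 90% HFT
    "Trader_2": ["MIX"] * 5 + ["non-HFT"] * 5,    # 50% MIX
    "Trader_3": ["HFT"] * 8 + ["MIX"] * 2,        # 80% HFT, 20% MIX
}
categories = {t: classify_trader(p) for t, p in predictions.items()}
print(categories)
```

The two conditions cannot hold at once (85% plus 50% exceeds 100% of the rows), so the order in which they are tested does not change the result.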
