Challenge Data

Return Forecasting of Cryptocurrency Clusters
by Napoleon X

Started on Jan. 5, 2022

Challenge context

Napoleon Crypto/NaPoleonX is a company specialised in designing quantitative systematic investment solutions, i.e. investment solutions based on algorithms. Napoleon Asset Management ('NAM'), a subsidiary of Napoleon Crypto, is a regulated asset management company that designs quantitative investment solutions with cryptocurrencies as their investment universe. As part of its research into new investment solutions, NAM regularly considers new, disruptive machine learning and deep learning algorithms.

Challenge goals

The goal of the challenge is to predict the returns vs. bitcoin of clusters of cryptoassets.

At Napoleon, we are interested in detecting which assets are likely to move together in the same direction, i.e. assets whose returns (absolute price changes) are statistically positively correlated. Such assets are grouped into "clusters", which can be seen as the crypto equivalent of equity sectors or industries. Knowledge of such clusters can then be used to optimize portfolios, build advanced trading strategies (long/short absolute, market neutral), evaluate systematic risk, etc. In order to build new trading strategies, it can be helpful to know whether a given sector/cluster will outperform the market, represented here by bitcoin. For this reason, given a cluster $\mathcal{C} = \{A_1, ..., A_n\}$ composed of $n$ assets $A_i$, this challenge aims at predicting the return relative to bitcoin of an equally weighted portfolio composed of $\{A_1, ..., A_n\}$ in the next hour, given the series of returns over the previous 23 hours for the assets in the cluster, as well as some metadata.

Data description

It is well known that, for clusters to remain meaningful, they have to be updated on a regular basis, as they evolve over time and change with market types (bull/bear, range, ...). As a result, the clustering is not time consistent, i.e. the relation between asset $i$ and asset $j$ is likely to change over time. Clusters are typically updated every week; however, each clustering is considered valid for three weeks in this challenge. As a consequence, we build 21 samples from each cluster, corresponding to 21 days.

Each sample is thus a pair $\{cluster, day\}$, where each cluster $C$ is composed of $n$ assets $A_1$, ..., $A_n$. It is worth noting that the number of assets $n$ is not shared across clusters: it changes from cluster to cluster. For each asset $A_i$ and each sample day, the sequence of hourly returns for the first 23 hours of the day is provided. In addition, two other quantities ($md$ and $bc$) are given, but their nature is kept undisclosed.

From these input data, the goal is to predict, for each sample $\{cluster, day\}$, the mean return of the assets in the cluster relative to bitcoin during the last hour of the day. The hourly returns provided are relative to bitcoin's performance, since asset prices are assumed to be quoted in bitcoin. Consequently, a positive return means that an asset has outperformed bitcoin, and a negative return that it has underperformed.
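As a compact restatement of the target, using the notation of the benchmark description and writing $ret_{24}(A_i)$ for the return of asset $A_i$ relative to bitcoin over the last hour of the day (a hypothetical symbol introduced here for illustration, not a column of the input files), the quantity to predict for a sample can be sketched as:

$$y = \frac{1}{n} \sum_{i=1}^{n} ret_{24}(A_i)$$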

Dates corresponding to the construction of a cluster or sample are not provided; it is therefore impossible to determine whether two clusters were built on the same date. Assets are also anonymised: given two clusters $C_1$ and $C_2$, the cryptocurrency labelled $A_1$ in $C_1$ does not necessarily match the cryptocurrency labelled $A_1$ in $C_2$, and $A_1$ of $C_1$ need not even be present in $C_2$.

Input Data: Input datasets comprise 29 columns, detailed below. Each line is indexed by a unique ID, which corresponds to a cluster (defined by "cluster"), a cluster sample day (defined by "day"), and a cryptocurrency (defined by "asset"). $ret_1, ..., ret_{23}$ are the returns relative to bitcoin for the first 23 hours of the day, arranged in chronological order. $md$ and $bc$ are the two undisclosed quantities.


  • id
  • cluster
  • day
  • asset
  • md
  • bc
  • ret_1
  • ...
  • ret_23

The input data contain a certain number of NaN values corresponding to missing data whose filling method is left to the discretion of participants.
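Since the filling method is left open, here is one possible starting point, given as a minimal sketch rather than a recommended method. It assumes (our choice, not the challenge provider's) that a missing hourly return is first replaced by the mean of the same hour over the other assets of the same {cluster, day} sample, and by 0 (i.e. "same performance as bitcoin") when the whole hour is missing. The function name and the toy data are made up for illustration.

```python
import numpy as np
import pandas as pd

RET_COLS = [f"ret_{i}" for i in range(1, 24)]

def fill_missing_returns(df: pd.DataFrame) -> pd.DataFrame:
    """Fill NaN returns with the same-hour mean of the sample, then with 0."""
    out = df.copy()
    # Mean of each hourly column within each {cluster, day} sample (NaNs are skipped).
    sample_means = out.groupby(["cluster", "day"])[RET_COLS].transform("mean")
    out[RET_COLS] = out[RET_COLS].fillna(sample_means)
    # If every asset of the sample misses that hour, fall back to 0 (assumption).
    out[RET_COLS] = out[RET_COLS].fillna(0.0)
    return out

# Toy data (made up): one sample with two assets.
toy = pd.DataFrame({"cluster": [0, 0], "day": [0, 0], "asset": [0, 1]})
for col in RET_COLS:
    toy[col] = [0.01, 0.03]
toy.loc[0, "ret_1"] = np.nan        # filled with the other asset's value
toy.loc[[0, 1], "ret_2"] = np.nan   # whole hour missing, filled with 0
filled = fill_missing_returns(toy)
```

Other choices (dropping incomplete assets, model-based imputation) may work better; this sketch only shows the mechanics.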

Additional Data: each asset can be seen as a node in a graph that represents its cluster. For this reason, we also provide in the supplementary files a binary adjacency matrix $\mathbf{A} \in \{0, 1\}^{n \times n}$ for each cluster. Each entry $A_{ij}$ of this matrix labels the edge between node $i$ and node $j$: $A_{ij}$ indicates whether the co-movement relation between assets $i$ and $j$ is highly statistically significant ($A_{ij}=1$) or not ($A_{ij}=0$). All adjacency matrices are stored in the same pickle file: adjacency_matrices.pkl. This file can be opened with the Python package pickle:

import pickle

with open("adjacency_matrices.pkl", "rb") as file:
    adj = pickle.load(file)

adj is a dictionary whose keys are the indices of the clusters.
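To illustrate how such a matrix might be turned into simple graph features, the sketch below computes the degree of each asset and the edge density of a cluster. The type of the values stored in adj is not specified here, so the code assumes they can be converted to a square NumPy array; the toy dictionary and the helper names are made up for illustration.

```python
import numpy as np

# Toy stand-in for `adj` (made up). With the real file, the type and layout
# of each value should be checked after loading.
adj_toy = {
    0: np.array([[0, 1, 1],
                 [1, 0, 0],
                 [1, 0, 0]]),
}

def node_degrees(matrix) -> np.ndarray:
    """Number of significant co-movement links of each asset (self-links excluded)."""
    a = np.asarray(matrix)
    return a.sum(axis=1) - np.diag(a)

def edge_density(matrix) -> float:
    """Share of off-diagonal entries equal to 1."""
    a = np.asarray(matrix)
    n = a.shape[0]
    if n < 2:
        return 0.0
    off_diagonal_ones = a.sum() - np.trace(a)
    return float(off_diagonal_ones) / (n * (n - 1))

degrees = node_degrees(adj_toy[0])
density = edge_density(adj_toy[0])
```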

Output Data: Output datasets comprise only 2 columns:

  • sample_id: a unique sample identifier for a given $\{cluster, day\}$ pair
  • target: a float, the mean return of the cluster's assets relative to bitcoin during the last hour of the day

The 'sample_id' identifier of each sample is computed as follows: $\mathrm{sample\_id} = \mathrm{cluster} \times 21 + \mathrm{day}$
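As a small worked example of this identifier, the sketch below encodes a {cluster, day} pair and decodes it again. The decoding step assumes that day takes integer values from 0 to 20, which is not stated explicitly here and should be checked against the data; both function names are made up for illustration.

```python
from typing import Tuple

DAYS_PER_CLUSTER = 21

def make_sample_id(cluster: int, day: int) -> int:
    """sample_id = cluster * 21 + day, as defined above."""
    return cluster * DAYS_PER_CLUSTER + day

def split_sample_id(sample_id: int) -> Tuple[int, int]:
    """Recover (cluster, day); only valid if day lies in 0..20 (assumption)."""
    return divmod(sample_id, DAYS_PER_CLUSTER)

example_id = make_sample_id(3, 5)           # 3 * 21 + 5
example_pair = split_sample_id(example_id)
```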

Training and test data: Number of clusters: $2091$, number of days per cluster: $21$, total number of samples: $43627 \approx 2045 \times 21$ (some samples were dismissed because of the lack of sufficient data). The training data contains 30494 cluster samples (1464 clusters, $\approx 70\%$), while the test data contains 13133 samples (627 clusters, $\approx 30\%$). Test samples correspond to dates that come after those of the training data.

The solution files submitted by participants shall follow exactly the same output data format as described above, with 'sample_id' identifiers. An example submission file containing random predictions is provided.

The metric used to rank predictions submitted by participants is the Root Mean Squared Error (RMSE).
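For local validation, the RMSE can be computed with a few lines of NumPy. The sketch below uses the standard definition (square root of the mean squared difference between targets and predictions); the function name is ours, and the platform's own scoring code is not shown here.

```python
import numpy as np

def rmse(y_true, y_pred) -> float:
    """Root Mean Squared Error between two equally sized sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy example: squared errors are 9 and 16, so the mean squared error is 12.5.
score = rmse([0.0, 0.0], [3.0, 4.0])
```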

Benchmark description

For each cluster $C = \{A_1, ..., A_n\}$, we compute for each asset $A_i$ the average of its past returns $ret_1, ..., ret_{23}$. Next, we compute the average over assets in the cluster of these averages. Mathematically, the benchmark can be described as follows:

$$\hat{y} = \frac{1}{23 \times n} \sum_{i=1}^n \sum_{j=1}^{23} ret_j(A_i)$$

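import pandas as pd

# Load the test inputs: one row per (cluster, day, asset)
input_test = pd.read_csv("public/x_test.csv", index_col=0)
# Rebuild the sample identifier of each {cluster, day} pair
input_test["sample_id"] = 21 * input_test["cluster"] + input_test["day"]

ret_cols = ["ret_" + str(i) for i in range(1, 24)]
y_benchmark = pd.DataFrame(index=input_test.groupby("sample_id").mean().index)
# Average each hourly return over the assets of a sample, then over the 23 hours
y_benchmark["target"] = input_test.loc[:, ["sample_id"] + ret_cols].groupby("sample_id").mean().mean(axis=1)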
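When no values are missing, the code above matches the formula exactly. With NaN values, it averages the per-hour means, which weights hours with fewer observations differently from a pooled mean over all observed returns. The sketch below shows a pooled variant on made-up toy data; it is an illustrative alternative, not the official benchmark, and the function name is ours.

```python
import numpy as np
import pandas as pd

RET_COLS = [f"ret_{i}" for i in range(1, 24)]

def pooled_mean_benchmark(df: pd.DataFrame) -> pd.Series:
    """Mean of all observed returns of each sample, ignoring NaN values."""
    grouped = df.groupby("sample_id")[RET_COLS]
    total = grouped.sum().sum(axis=1)     # sum of observed returns (NaN skipped)
    count = grouped.count().sum(axis=1)   # number of observed returns
    return (total / count).rename("target")

# Toy data (made up): one sample, two assets, one missing value.
toy = pd.DataFrame({"sample_id": [0, 0]})
for col in RET_COLS:
    toy[col] = [0.02, 0.04]
toy.loc[1, "ret_1"] = np.nan

pooled = pooled_mean_benchmark(toy)
# Same aggregation as the benchmark code: mean of the per-hour means.
column_wise = toy.groupby("sample_id")[RET_COLS].mean().mean(axis=1)
```

On this toy sample the pooled mean is 1.34 / 45 while the mean of per-hour means is 0.68 / 23, so the two aggregations differ slightly once data are missing.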



The challenge provider


Napoleon Group is building the future of investing around three entities:

  • Napoleon AM has been granted an AIFM license and will be able to offer investment solutions for professional investors. Napoleon AM will hence propose crypto exposure solutions;
  • Napoleon Capital is a specialist in quantitative strategies issued from open-competition holding a financial advisor license CIF (Conseil en Investissement Financier/FSA equivalent) granted by ORIAS/AMF;
  • Napoleon Index is aiming to become registered for index publishing and administration under BMR regulation, blockchain based.

Napoleon Group has a French DNA, with an international mindset and a strong will to comply with the highest standard regulations. Napoleon Group is betting on regulation to address institutional investors’ needs.

AM industry evolution: The Group vision is to embrace 3 major developments that are reshaping the AM industry:

  • Quantitative finance is revolutionizing the financial industry through automation and Artificial Intelligence;
  • Blockchain simplifies many operational processes through increased speed, security, transparency and cost-efficiency; and
  • Tokenization is the real potential future of financial assets. Crypto assets are being adopted by institutional clients as a new asset class. This will lead to a world of programmable real and financial assets, allowing to trade in an ever more efficient market.