Challenge Data

Football : Can you guess the winner?
by QRT




Description


Competitive challenge
Sport
Classification
Tabular
More than 1GB
Intermediary level

Dates

Started on Jan. 10, 2024


Challenge context

Over the last two decades, professional sports around the world have shifted toward a data-driven approach to decision-making. Sports analytics are now part of live broadcasts, fantasy sports, and everyday discussions. This growth has been fueled by an exponential increase in available sports data.

Data science and machine learning are well suited to the growing field of sports analytics. They can be used by fantasy league players and professional gamblers alike to make better-informed decisions. Sports betting websites have become quite sophisticated in this area in the last few years. Models can also be used by managers of professional sports teams and recruiters to build rosters and strategically deploy players in a way that increases the team’s chance of winning.

Football has been at the heart of the sports analytics revolution. All types of statistics, both historical and real-time, are available. This challenge leverages Football data obtained from Sportmonks, a top-tier sports data provider widely used to enhance various online applications and websites. For additional details, feel free to explore sportmonks.com.

Feel free to visit and register on our dedicated forum at challengedata.qube-rt.com for more information about the challenge, the data and QRT.


Challenge goals

As this year’s QRT data challenge, we propose a match result prediction challenge. You will be provided with real historical data at the player, team and league level, and be asked to predict which team wins, or if there is a draw.

We have data for many leagues around the world and across different divisions. Your goal is to build rich predictive models that can work in any football league regardless of competitive level or geographical location.


Data description

We provide data at the team and player level for dozens of football leagues.

The data comes packed in two zip files, X_train.zip and X_test.zip, as well as two csv files, Y_train.csv and Y_train_supp.csv.

The zip files contain the input data, which is divided into 4 csv files. The data is separated into HOME and AWAY team statistics, aggregated at the team and player level. All statistics come from real historical matches. They summarize the last 5 games prior to the match being predicted, as well as the season to date.

The ID column links tables in X_train, with Y_train and Y_train_supp. The same holds true for the test data.
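
As a rough illustration of how the ID column ties the tables together, the sketch below joins home and away team features to the targets with pandas. The tiny in-memory frames, their values and the HOME_/AWAY_ prefixes are assumptions made for this example; the actual file and column names should be taken from the archives.

```python
import pandas as pd

# Toy stand-ins for the real tables (values and layout are illustrative only).
home = pd.DataFrame({"ID": [0, 1, 2], "TEAM_GOALS": [0.3, -1.2, 0.8]})
away = pd.DataFrame({"ID": [0, 1, 2], "TEAM_GOALS": [-0.5, 0.4, 0.1]})
y = pd.DataFrame({
    "ID": [0, 1, 2],
    "HOME_WINS": [1, 0, 0],
    "DRAW": [0, 0, 1],
    "AWAY_WINS": [0, 1, 0],
})

# Prefix feature columns so home and away statistics stay distinguishable
# (the prefixes are a convention chosen for this sketch).
home = home.set_index("ID").add_prefix("HOME_")
away = away.set_index("ID").add_prefix("AWAY_")

# Inner joins on the shared ID index give one row per match.
features = home.join(away, how="inner")
dataset = features.join(y.set_index("ID"), how="inner")
```

An inner join keeps only the IDs present in every table, which makes a mismatch between inputs and targets visible as a drop in row count.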

Input team data sets comprise the following 3 identifier columns:

  • ID, LEAGUE and TEAM_NAME (note that LEAGUE and TEAM_NAME are not included in the test data)

They also include the following 25 statistics, each aggregated by sum, average and standard deviation:

  • 'TEAM_ATTACKS'
  • 'TEAM_BALL_POSSESSION'
  • 'TEAM_BALL_SAFE'
  • 'TEAM_CORNERS'
  • 'TEAM_DANGEROUS_ATTACKS'
  • 'TEAM_FOULS'
  • 'TEAM_GAME_DRAW'
  • 'TEAM_GAME_LOST'
  • 'TEAM_GAME_WON'
  • 'TEAM_GOALS'
  • 'TEAM_INJURIES'
  • 'TEAM_OFFSIDES'
  • 'TEAM_PASSES'
  • 'TEAM_PENALTIES'
  • 'TEAM_REDCARDS'
  • 'TEAM_SAVES'
  • 'TEAM_SHOTS_INSIDEBOX'
  • 'TEAM_SHOTS_OFF_TARGET'
  • 'TEAM_SHOTS_ON_TARGET'
  • 'TEAM_SHOTS_OUTSIDEBOX'
  • 'TEAM_SHOTS_TOTAL'
  • 'TEAM_SUBSTITUTIONS'
  • 'TEAM_SUCCESSFUL_PASSES'
  • 'TEAM_SUCCESSFUL_PASSES_PERCENTAGE'
  • 'TEAM_YELLOWCARDS'

Input player data sets comprise the following 5 identifier columns:

  • ID, LEAGUE, TEAM_NAME, POSITION and PLAYER_NAME (note that LEAGUE, TEAM_NAME and PLAYER_NAME are not included in the test data)

They also include 52 statistics, each aggregated by sum, average and standard deviation. These are similar to the team statistics, though more fine-grained.
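
Because the player tables hold several rows per match ID, they usually need to be collapsed to one row per ID before being combined with the team tables. The following is a minimal sketch of that step with pandas; the column name PLAYER_GOALS and the choice of mean and max as summaries are hypothetical, picked only to show the pattern.

```python
import pandas as pd

# Toy player table: several players share the same match ID.
# PLAYER_GOALS is a hypothetical column name used for illustration.
players = pd.DataFrame({
    "ID": [0, 0, 0, 1, 1],
    "PLAYER_GOALS": [1.0, 0.0, 3.0, 0.5, 0.0],
})

# Collapse to one row per match with a couple of summary statistics.
per_match = (
    players.groupby("ID")["PLAYER_GOALS"]
    .agg(["mean", "max"])
    .add_prefix("PLAYER_GOALS_")
)
```

The resulting frame is indexed by ID, so it can be joined to team-level features in the same way as any other per-match table.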

Output data sets are composed of 4 columns:

  • ID: unique row identifier, corresponding to the input identifiers
  • HOME_WINS
  • DRAW
  • AWAY_WINS

The target score is the prediction accuracy for the vector [HOME_WINS, DRAW, AWAY_WINS], which takes one of three possible values: [1,0,0], [0,1,0] and [0,0,1].
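
Read literally, this metric is the fraction of matches whose predicted vector equals the true one. The sketch below computes it in plain Python; the function name and toy data are illustrative, and the platform's own scoring code is not shown in this description.

```python
def accuracy(y_true, y_pred):
    """Fraction of rows whose predicted one-hot vector equals the true one."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    hits = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return hits / len(y_true)


# Toy example: rows are [HOME_WINS, DRAW, AWAY_WINS].
y_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
y_pred = [[1, 0, 0], [1, 0, 0], [0, 0, 1], [0, 0, 1]]
score = accuracy(y_true, y_pred)  # first and third rows match
```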

All variables have been standardized, and team, player and league names have been removed from the test set. We have provided as much data in the train set as possible for your convenience. We expect you not to use any external data; doing so can lead to disqualification.

An example submission file containing random predictions is provided; see also the notebook in the supplementary material section for the benchmark output.

We have included alternative training targets in the Y_train_supp file, such as GOAL_DIFF_HOME_AWAY, the goal difference between the HOME and AWAY teams, in case you want to train richer models with different targets.
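
One possible way to use such a target is to predict the goal difference and then map it back to the three-way outcome. The sketch below assumes that a positive GOAL_DIFF_HOME_AWAY means the HOME team scored more; the function name and the draw_margin parameter are hypothetical additions for handling continuous predictions.

```python
def outcome_from_goal_diff(goal_diff, draw_margin=0.0):
    """Map a (predicted) goal difference to [HOME_WINS, DRAW, AWAY_WINS].

    Assumes goal_diff = home goals - away goals. Values within
    +/- draw_margin of zero are treated as a draw.
    """
    if goal_diff > draw_margin:
        return [1, 0, 0]
    if goal_diff < -draw_margin:
        return [0, 0, 1]
    return [0, 1, 0]


labels = [outcome_from_goal_diff(d) for d in (2, 0, -1)]
soft = outcome_from_goal_diff(0.2, draw_margin=0.5)  # small margin counts as a draw
```

With a regression model the raw prediction is almost never exactly zero, so some margin around zero is needed for DRAW to be predicted at all; its value would have to be tuned on held-out data.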

Disclaimer: The data provided is exclusively intended for use in this challenge, and any usage of this dataset for other purposes is strictly prohibited. General terms of service apply: terms of service.


Benchmark description

There are two benchmarks for this challenge. The first always predicts that the HOME team wins, which gives an accuracy of around 44% on the training set.
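
The first benchmark is simple enough to sketch in a few lines. The toy targets below are made up, so the resulting score is not the roughly 44% quoted for the real training set.

```python
# Toy targets: rows are [HOME_WINS, DRAW, AWAY_WINS] (made-up values).
y_train = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1], [1, 0, 0]]

# Always predict a home win, whatever the inputs are.
home_pred = [[1, 0, 0] for _ in y_train]

# Accuracy is the share of rows where the prediction matches the target.
baseline_accuracy = sum(t == p for t, p in zip(y_train, home_pred)) / len(y_train)
```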

The second benchmark, which is the one you will see on the leaderboard, uses only team data. It uses all numerical columns as features and trains a gradient-boosted tree model to predict whether the AWAY team wins or not. This gives you an accuracy of around 47.5% on the training set.
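
The general shape of such a model can be sketched as follows. This is not the benchmark's actual code: scikit-learn's GradientBoostingClassifier is used here as a stand-in, the features are synthetic, and the rule that turns the binary output into a three-way prediction (AWAY_WINS if predicted, HOME_WINS otherwise) is an assumption made for the example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for standardized team features (not challenge data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Synthetic binary target: 1 if the away team wins, 0 otherwise.
away_wins = (X[:, 0] + 0.5 * rng.normal(size=200) < -0.3).astype(int)

# Fit on the first 150 rows, predict on the remaining 50.
model = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=0)
model.fit(X[:150], away_wins[:150])
pred = model.predict(X[150:])

# Assumed convention for this sketch: fall back to a home win when the
# model does not predict an away win. DRAW is never predicted this way.
submission = [[0, 0, 1] if p == 1 else [1, 0, 0] for p in pred]
```

A binary model of this kind cannot output DRAW, which is one obvious place where a three-class model or the supplementary targets might improve on the benchmark.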

Feel free to tweak the existing benchmark, build your own models and use the supplementary files as you wish. Good luck with this challenge!


Files


Files are accessible when logged in and registered to the challenge


The challenge provider



Qube Research & Technologies Group is a quantitative and systematic investment manager employing around 300 people with offices in Hong Kong, London, Mumbai, Paris and Singapore. We are a technology-driven firm implementing a scientific approach to financial investment. QRT’s market presence is global and extends across the largest liquid electronic venues. The combination of data, research, technology and trading expertise has shaped our DNA and is at the heart of our innovation and development dynamic. The firm acts as an investment manager managing open-ended Funds used for management of third-party capital.