Challenge data

Description

Competitive challenge

Sport

Classification

Tabular

More than 1GB

Intermediary level

Dates

Started on Jan. 10, 2024

Challenge context

Over the last two decades, professional sports around the world have adapted towards a data-driven approach to their decision-making. Sports analytics are part of live broadcasts, fantasy sports, and every-day discussions. This growth has been fueled by an exponential evolution in sports data.

Data science and machine learning can be useful to tackle the growing field of sports analytics. It can be used by fantasy league players and professional gamblers alike to make better informed decisions. Sports betting websites have become quite sophisticated in this area in the last few years. Models can also be used by managers of professional sports teams and recruiters to build rosters and strategically deploy players in a way that increases the team’s chance of winning.

Football has been at the heart of the sports analytics revolution. All types of statistics, both historical and real-time, are available. This challenge leverages Football data obtained from Sportmonks, a top-tier sports data provider widely used to enhance various online applications and websites. For additional details, feel free to explore sportmonks.com.

Feel free to visit and register to our dedicated forum at challengedata.qube-rt.com for more information about the challenge, the data and QRT.

Challenge goals

As this year’s QRT data challenge, we propose a match result prediction challenge. You will be provided with real historical data at the player, team and league level, and be asked to predict which team wins, or if there is a draw.

We have data for many leagues around the world and at different divisions. Your goal is to build a rich predictive models that can work in any football league regardless of competitive level or geographical location.

Data description

We provide data at the team and player level for dozens of football leagues.

The data comes packed in two zip files, X_train.zip, and X_test.zip, as well as two csv files Y_train.csv, and Y_train_supp.csv

The zip files contain the input data, which is divided into 4 csv files. The data is separated into HOME and AWAY team statistics, which is aggregated at the team and player level. All statistics come from real historical matches. They are summaries of the last 5 games prior to the match, as well as season-to-date statistics of the game being predicted.

The ID column links tables in X_train, with Y_train and Y_train_supp. The same holds true for the test data.

Input team data sets comprise the following 3 identifier columns:

ID, LEAGUE and TEAM_NAME (note that LEAGUE and TEAM_NAME are not included in the test data)

The following 25 statistics, which are aggregated by sum, average and standard deviation.

'TEAM_ATTACKS'
'TEAM_BALL_POSSESSION'
'TEAM_BALL_SAFE'
'TEAM_CORNERS'
'TEAM_DANGEROUS_ATTACKS'
'TEAM_FOULS'
'TEAM_GAME_DRAW'
'TEAM_GAME_LOST'
'TEAM_GAME_WON'
'TEAM_GOALS'
'TEAM_INJURIES'
'TEAM_OFFSIDES'
'TEAM_PASSES'
'TEAM_PENALTIES'
'TEAM_REDCARDS'
'TEAM_SAVES'
'TEAM_SHOTS_INSIDEBOX'
'TEAM_SHOTS_OFF_TARGET'
'TEAM_SHOTS_ON_TARGET',
'TEAM_SHOTS_OUTSIDEBOX'
'TEAM_SHOTS_TOTAL'
'TEAM_SUBSTITUTIONS'
'TEAM_SUCCESSFUL_PASSES'
'TEAM_SUCCESSFUL_PASSES_PERCENTAGE'
'TEAM_YELLOWCARDS'

Input player data sets comprise the following 3 identifier columns:

ID, LEAGUE and TEAM_NAME, POSITION and PLAYER_NAME (note that LEAGUE, TEAM_NAME, and PLAYER_NAME are not included in the test data)

52 statistics, which are aggregated by sum, average and standard deviation. They are similar to the team statistics though more fine-grained.

Output data sets are composed of 4 columns:

ID: Unique row identifier - corresponding to the input identifiers,
HOME_WINS,
DRAW,
AWAY_WINS,

The target score is the accuracy of prediction for the vector [HOME_WINS, DRAW,AWAY_WINS], for which there are three possible choices, [1,0,0]. [0,1,0] and [0,0,1].

All variables have been standardized and team/players/and league names have been removed from the test set. We have provided as much data in the train set as possible for your convenience. We expect from you to not use any external data, which can lead to disqualification.

An example of submission file containing random predictions is provided - see also the notebook in the supplementary material section for the benchmark output.

We have included other alternate training targets, such as GOAL_DIFF_HOME_AWAY, which is the difference of goals between the HOME and AWAY team, in the Y_train_supp file in case you want to train richer models with different targets.

Disclaimer: The data provided is exclusively intended for use in this challenge, and any usage of this dataset for other purposes is strictly prohibited. General terms or service are applicable: terms of service.

Benchmark description

There are two benchmarks for this challenge, the first is always predicting that the HOME team wins. This gives you an accuracy of around 44% on the training set.

The second benchmark, which is the one you will see in the leaderboard, uses only team data. It uses all numerical columns as features, and trains a gradient boosting tree model to predict if the AWAY team wins or not. This gives you an accuracy of around 47.5% on the training set.

Feel free to tweak existing benchmark, build your own models and use supplementary files as you want. Good luck for this challenge !

Files

Files are accessible when logged in and registered to the challenge

The challenge provider

Qube Research & Technologies (QRT) is a global quantitative and systematic investment manager, operating in all asset classes across the world. Driven by a shared passion for data, research, technology and trading expertise, we strive to deliver high-quality returns for our investors. Established in 2016, QRT can rely on its 1,400+ employees across 11 offices in Europe, Middle East and Asia Pacific. QRT supports various coding initiatives as well as academic projects developing and promoting maths and science education.

PROVIDER WEBSITE

Challenge Data

Football : Can you guess the winner?
by QRT

Description

Dates

Challenge context

Challenge goals

Data description

Benchmark description

Files

The challenge provider

Challenge Data

Football : Can you guess the winner? by QRT

Description

Dates

Challenge context

Challenge goals

Data description

Benchmark description

Files

The challenge provider

Football : Can you guess the winner?
by QRT