Challenge data

Description

Competitive challenge

Industrial

Regression

Time series

Less than 10MB

Basic level

Dates

Started on Jan. 6, 2020

Challenge context

We are a technology-focused data and analytics firm using unconventional data to positively disrupt the world’s biggest industries. By harnessing the power of satellite imagery, natural language processing, machine learning and advanced mathematics, we deliver actionable intelligence on virtually any asset across the globe, in near real-time. This challenge is an opportunity to work on one of our datasets: using on-site measurements to predict asset production.

Challenge goals

The goal of this problem is to estimate the production of a group of industrial assets, based on daily measurements and capacity constraints.

Consider a number of industrial installations $A_i$ with nominal production capacities $C_i$. $J$ types of daily measurements ($J=5$), denoted $x^j_{i,w,l}$, are carried out on each asset $i$, for week $w$, weekday $l$ ($l = 1, \dots, 7$) and measure type $j$, in order to detect patterns in operations that affect production.

The assets $A_i$ are gathered into $K$ disjoint groups ( $K=2$ ). For each group, installations report the actual production levels $y_{k,w}$ at an aggregated level, on a weekly basis.

For each group $k$, the goal is to predict $\hat{y}_{k,w}$ as the sum of the productions of all assets in that group ($\hat{y}_{k,w} = \sum_{i \in k} \hat{C}_{i,w}$), under the constraint that each asset's production does not exceed its maximum capacity in any week, i.e.

$0 \leq \hat{C}_{i,w} \leq C^{max}_i \quad \forall w$

$\hat{C}_{i_0,w_0}$ is to be estimated as a function of the measures pertaining to asset $i_0$ and week $w_0$:

$\hat{C}_{i_0,w_0} = f(x^j_{i_0,w_0,l}), \quad \forall j \in \{1, \dots, J\}, \quad \forall l \in \{1, \dots, 7\}$
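The capacity constraint and the group-level sum can be illustrated with a minimal sketch. It assumes that per-asset estimates for one week are already available from some model $f$. The function name, the dictionaries and the numeric values are hypothetical and are not part of the challenge material.

```python
def group_production(asset_estimates, max_capacity, asset_group):
    """Clip each per-asset estimate to [0, C_max] and sum the results by group.

    asset_estimates: {asset_id: raw estimate for one week}
    max_capacity:    {asset_id: maximum capacity C_max}
    asset_group:     {asset_id: group id}
    """
    totals = {}
    for asset_id, estimate in asset_estimates.items():
        # Enforce 0 <= C_hat <= C_max for this asset.
        clipped = min(max(estimate, 0.0), max_capacity[asset_id])
        group = asset_group[asset_id]
        totals[group] = totals.get(group, 0.0) + clipped
    return totals


# Hypothetical values for a single week: three assets in two groups.
asset_estimates = {1: 120.0, 2: -5.0, 3: 40.0}
max_capacity = {1: 100.0, 2: 50.0, 3: 60.0}
asset_group = {1: 2, 2: 2, 3: 3}
weekly_totals = group_production(asset_estimates, max_capacity, asset_group)
```

Clipping after estimation is only one way to satisfy the constraint. A model whose output is bounded by construction would serve the same purpose.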

Notes:

The measurements can only be used to assess the production of the asset on which they were taken.
An asset usually cannot produce more than its nominal capacity, but production spikes can sometimes reach up to 120% of the nominal capacity.
The target is reported weekly while the measures are daily. Hence, the measures corresponding to the target for week $w$ are the measures of all the days in week $w$.
Other factors might impact productivity, such as economic conditions (e.g. demand-linked curtailment), but as a first approach, we assume such effects are not significant.
Some series might be missing for some assets; these are filled entirely with None values.
These measurements may correspond to incidents or maintenance, which impact the productivity of the assets, possibly with varying significance (e.g. large vs. small incidents, …).
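Because the target is weekly and the measures are daily, each (asset, week) pair has to be turned into a single feature vector at some point. The sketch below shows one possible way, assuming the daily values of each measure type are simply averaged and that a series made only of None values stays None. Averaging is an illustrative choice and not something the challenge prescribes. The function name and the tuple layout are hypothetical.

```python
from statistics import mean


def weekly_features(rows, n_types):
    """Average the daily values of each measure type for one asset and one week.

    rows: iterable of (measure_type, weekday, value); value may be None.
    Returns a list of length n_types. An entry stays None when the whole
    series for that measure type is missing.
    """
    by_type = {j: [] for j in range(1, n_types + 1)}
    for measure_type, _weekday, value in rows:
        if value is not None:
            by_type[measure_type].append(value)
    return [mean(values) if values else None for values in by_type.values()]


# Hypothetical asset-week: type 1 has two readings, type 2 is entirely missing.
rows = [(1, 1, 2.0), (1, 2, 4.0), (2, 1, None), (2, 2, None)]
features = weekly_features(rows, n_types=2)
```

Keeping None for a fully missing series lets a later step decide how to handle it, for example by dropping that measure type for the asset concerned.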

Data description

The inputs X are split into two files: train (6MB, 104 IDs) and test (2.4MB, 46 IDs). In each file, the data is represented as a tidy dataframe with the following columns:

empty: Does not matter
SAMPLE_ID: The id of the sample
GROUP_ID: categorical in {2, 3}, the number of the group to which the asset belongs
ASSET_ID: categorical, ranges from 1 to N (N=83), the id of the asset
MEASURE_TYPE: categorical, ranges from 1 to J (J=4)
MEASURE_VALUE: real-value, a measure on the asset on a given date
MEASURE_WEEK: week in which the measurement was acquired. Only available in the train set, for exploratory purposes. Not available in the test set, as it could be used to infer the current week based on future measurements.
MEASURE_WEEKDAY: categorical, ranges from 1 to 7, day of the week on which the measurement was taken. Weekdays are ordered starting from 1: Thursday, to 7: Wednesday
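As an illustration of the column layout above, the following sketch parses a few rows with Python's csv module. The two data rows are synthetic, the helper name is hypothetical, and the way missing values are serialised (the literal string "None" or an empty field) is an assumption to check against the actual files.

```python
import csv
import io

# Weekday codes as described above: 1 is Thursday and 7 is Wednesday.
WEEKDAY_NAMES = {1: "Thursday", 2: "Friday", 3: "Saturday", 4: "Sunday",
                 5: "Monday", 6: "Tuesday", 7: "Wednesday"}


def parse_rows(handle):
    """Read tidy measurement rows into a list of dictionaries."""
    records = []
    for row in csv.DictReader(handle):
        raw = row["MEASURE_VALUE"]
        # Assumption: missing values appear as "None" or as an empty field.
        value = None if raw in ("", "None") else float(raw)
        records.append({
            "sample_id": int(row["SAMPLE_ID"]),
            "group_id": int(row["GROUP_ID"]),
            "asset_id": int(row["ASSET_ID"]),
            "measure_type": int(row["MEASURE_TYPE"]),
            "value": value,
            # MEASURE_WEEK is absent from the test set, hence .get().
            "week": row.get("MEASURE_WEEK"),
            "weekday": WEEKDAY_NAMES[int(row["MEASURE_WEEKDAY"])],
        })
    return records


# Synthetic sample with the same columns, including the leading unnamed one.
sample = """,SAMPLE_ID,GROUP_ID,ASSET_ID,MEASURE_TYPE,MEASURE_VALUE,MEASURE_WEEK,MEASURE_WEEKDAY
0,0,2,1,1,0.5,10,1
1,0,2,1,1,None,10,7
"""
records = parse_rows(io.StringIO(sample))
```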

Targets

The targets are gathered into a separate file train_y.csv:

SAMPLE_ID: The id of the sample to predict
PRODUCTION_GROUP_{k}: the target for the group k.

Extra

An extra file, assets.csv, provides the nominal capacities of the assets:

ASSET_ID: The id of the asset
ASSET_NOMINAL_CAPACITY: Nominal production capacity for the asset. An asset usually cannot produce more than its nominal capacity, but production spikes can sometimes reach up to 120% of the nominal capacity.

Metrics

Mean Squared Error
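For reference, the mean squared error over $n$ predictions is $\frac{1}{n}\sum_{s}(y_s - \hat{y}_s)^2$. The sketch below computes it for a single target column. How the platform combines the errors of the two group columns is not stated here, so that part is left out. The function name and the numbers are hypothetical.

```python
def mean_squared_error(y_true, y_pred):
    """Return the mean of the squared differences between two sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one value is required")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical weekly productions for one group and the matching predictions.
y_true = [10.0, 20.0, 30.0]
y_pred = [12.0, 20.0, 27.0]
mse = mean_squared_error(y_true, y_pred)
```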

Submission format

Same format as train_y.csv. Keep the same column names and column order. The SAMPLE_ID values must be those appearing in the test file and must be sorted in increasing order.
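A submission writer could look like the sketch below. It assumes the target columns are named PRODUCTION_GROUP_2 and PRODUCTION_GROUP_3, which is inferred from the PRODUCTION_GROUP_{k} pattern and the GROUP_ID values {2, 3}. The exact names and order should be copied from train_y.csv. The function name and the sample predictions are hypothetical.

```python
import csv
import io


def write_submission(predictions, handle, group_ids=(2, 3)):
    """Write predictions as CSV rows sorted by increasing SAMPLE_ID.

    predictions: {sample_id: {group_id: predicted production}}
    """
    writer = csv.writer(handle, lineterminator="\n")
    # Assumed header; check it against the header of train_y.csv.
    writer.writerow(["SAMPLE_ID"] + [f"PRODUCTION_GROUP_{k}" for k in group_ids])
    for sample_id in sorted(predictions):
        writer.writerow([sample_id] + [predictions[sample_id][k] for k in group_ids])


# Hypothetical predictions, deliberately given out of order.
buffer = io.StringIO()
write_submission({105: {2: 1.5, 3: 2.0}, 104: {2: 0.0, 3: 3.25}}, buffer)
lines = buffer.getvalue().splitlines()
```

To produce a file, pass a handle opened with `open(path, "w", newline="")` instead of the in-memory buffer.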

Benchmark description

One could naively assume that the assets produce at maximum capacity when they are on, except for the rare times when they are off. Their production can then be inferred by running anomaly detection on each asset's measurements.

When the measures are far from the centroid (distance above a certain cut-off), an anomaly is detected, the asset is considered off and set to produce 0 for the period. Otherwise, the asset is set to produce its nominal capacity $C$ . Then, aggregated production can be derived from the nominal capacities of the active assets.
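The on/off rule described above can be sketched for a single asset as follows. The benchmark's actual feature construction, distance and cut-off are not specified, so this sketch assumes one feature vector per week, a centroid equal to the per-coordinate mean, a Euclidean distance, and a cut-off supplied by the caller. All names and values are hypothetical.

```python
import math


def centroid(vectors):
    """Per-coordinate mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def baseline_production(weekly_vectors, nominal_capacity, cutoff):
    """Return one estimate per week: 0 if the week is anomalous, else capacity."""
    center = centroid(weekly_vectors)
    return [
        0.0 if math.dist(vector, center) > cutoff else nominal_capacity
        for vector in weekly_vectors
    ]


# Hypothetical asset: three similar weeks and one clear outlier.
weekly_vectors = [[1.0, 1.0], [1.0, 1.2], [0.9, 1.0], [5.0, 5.0]]
estimates = baseline_production(weekly_vectors, nominal_capacity=100.0, cutoff=3.0)
```

Summing these per-asset estimates over the assets of a group gives the aggregated production for each week. Note that a plain mean is itself pulled towards outliers, so a more robust centre (such as a per-coordinate median) may be worth trying.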

This baseline approach relies on a crucial assumption about the simplicity of the asset. In reality, assets are not 100% on or off. A suggested improvement could be to try to infer a real-valued activity level from the measurements.
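One hypothetical way to obtain a real-valued activity level is to let production decrease linearly with the distance to the centroid instead of switching it off at the cut-off. The sketch below is only an illustration of that idea, not the provider's method. The linear mapping, the function names and the values are assumptions.

```python
def activity_level(distance, cutoff):
    """Map a distance to the centroid onto an activity level in [0, 1].

    The level is 1 at distance 0 and decreases linearly to 0 at the cut-off.
    """
    if cutoff <= 0:
        raise ValueError("cutoff must be positive")
    return min(1.0, max(0.0, 1.0 - distance / cutoff))


def soft_production(distance, cutoff, nominal_capacity):
    """Scale the nominal capacity by the inferred activity level."""
    return activity_level(distance, cutoff) * nominal_capacity


# Hypothetical distances for one asset with a nominal capacity of 100.
full = soft_production(0.0, cutoff=4.0, nominal_capacity=100.0)
partial = soft_production(1.0, cutoff=4.0, nominal_capacity=100.0)
off = soft_production(10.0, cutoff=4.0, nominal_capacity=100.0)
```

A supervised alternative would be to fit the mapping from measurements to activity level against the weekly group targets, at the cost of a more involved training procedure.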

Files

Files are accessible once you are logged in and registered for the challenge.

The challenge provider

We are a technology-focused data and analytics firm using unconventional data to positively disrupt the world’s biggest industries. By harnessing the power of satellite imagery, natural language processing, machine learning and advanced mathematics, we deliver actionable intelligence on virtually any asset across the globe, in near real-time. Our mission is to introduce unparalleled transparency to market players for better decision-making.

Congratulations to the winners of the challenge

1 Christophe Leroux
2 Eftychia Alexandri
3 Jean Yim

You can find the whole list of winners of the season here

Challenge Data

Asset production estimation
by Kayrros
