Challenge Data

Asset production estimation
by company



Description


Dates

From Jan. 6, 2020 to Dec. 18, 2020


Challenge context

We are a technology-focused data and analytics firm using unconventional data to positively disrupt the world’s biggest industries. By harnessing the power of satellite imagery, natural language processing, machine learning and advanced mathematics, we deliver actionable intelligence on virtually any asset across the globe, in near real-time. This challenge is an opportunity to work on one of our datasets: using on-site measurements to predict asset production.


Challenge goals

The goal of this problem is to estimate the production of a group of industrial assets, based on daily measurements and capacity constraints.

Consider a number of industrial installations $A_i$ with nominal production capacities $C_i$. $J$ daily measurements ($J=5$), denoted $x^j_{i,w,l}$, are carried out on each asset $i$, for the week $w$, the weekday $l$ ($l = 1, \dots, 7$) and the measure type $j$, in order to detect patterns in operations that affect production.

The assets $A_i$ are gathered into $K$ disjoint groups ($K=2$). For each group, installations report the actual production levels $y_{k,w}$ at an aggregated level, on a weekly basis.

For each group $k$, the goal is to predict $\hat{y}_{k,w}$ as the sum of the productions of all assets in that group ($\hat{y}_{k,w} = \sum_i \hat{C}_{i,w}$), under the constraint that an asset's production does not exceed its maximum capacity for each week, i.e.

$$0 \le \hat{C}_{i,w} \le C^{max}_i \quad \forall w$$

$\hat{C}_{i_0,w_0}$ is to be estimated as a function of the measures pertaining to asset $i_0$ and week $w_0$:

$$\hat{C}_{i_0,w_0} = f(x^j_{i_0,w_0,l}) \quad \forall j \in \{1, \dots, J\}, \quad \forall l \in \{1, \dots, 7\}$$
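The capacity constraint and the per-group sum can be sketched as follows. This is a minimal illustration under stated assumptions, not the organizers' code: the function names, the dictionary inputs and the toy numbers are hypothetical, and the value to use as the maximum capacity of each asset is left to the caller.

```python
def clip_to_capacity(raw_estimate, max_capacity):
    """Force a per-asset estimate into the interval [0, max_capacity]."""
    return min(max(raw_estimate, 0.0), max_capacity)


def group_production(raw_estimates, max_capacities, asset_groups):
    """Sum clipped per-asset estimates for one week into per-group totals.

    raw_estimates:  {asset_id: unconstrained estimate for the week}
    max_capacities: {asset_id: maximum capacity} (hypothetical input)
    asset_groups:   {asset_id: group_id}
    """
    totals = {}
    for asset_id, raw in raw_estimates.items():
        clipped = clip_to_capacity(raw, max_capacities[asset_id])
        group_id = asset_groups[asset_id]
        totals[group_id] = totals.get(group_id, 0.0) + clipped
    return totals


# Toy values, made up for illustration only.
estimates = {1: 12.0, 2: -3.0, 3: 7.5}
capacities = {1: 10.0, 2: 5.0, 3: 8.0}
groups = {1: 2, 2: 2, 3: 3}
weekly_totals = group_production(estimates, capacities, groups)
```

Clipping after estimation is only one way to honour the constraint; a model whose output is bounded by construction would serve equally well.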

Notes:

  • The measurements can only be used to assess the production of the asset on which they were taken.

  • An asset usually cannot produce more than its nominal capacity, but spikes in production can sometimes reach up to 120% of the nominal capacity.

  • The target is reported weekly while the measures are daily. Hence, the measures corresponding to the target for week $w$ are the measures of all the days in week $w$.

  • Other factors might impact productivity, such as economic conditions (e.g. demand-linked curtailment), but as a first approach, we assume such effects are not significant.

  • Some series may be missing for some assets; such series are filled entirely with None values.

  • These measurements may correspond to incidents or maintenance, which impact the productivity of the assets, possibly with varying significance (e.g. large vs. small incidents, …).
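As an illustration of the notes above, daily measures can be summarized into one feature per measure type for a given asset and week, while leaving entirely missing series as None. This is a sketch only: the helper name, the tuple layout and the choice of the mean as summary statistic are assumptions, not part of the challenge.

```python
def weekly_features(daily_rows, n_measure_types):
    """Average each measure type over the days of one asset-week.

    daily_rows: iterable of (measure_type, weekday, value) tuples, where
    value may be None. Returns a list indexed by measure_type - 1; an
    entry stays None when the whole series is missing.
    """
    sums = [0.0] * n_measure_types
    counts = [0] * n_measure_types
    for measure_type, _weekday, value in daily_rows:
        if value is None:
            continue  # skip missing days instead of treating them as zero
        sums[measure_type - 1] += value
        counts[measure_type - 1] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


# Toy asset-week: measure type 1 is observed, measure type 2 is all None.
rows = [(1, 1, 2.0), (1, 2, 4.0), (2, 1, None), (2, 2, None)]
features = weekly_features(rows, n_measure_types=2)
```

Keeping None for a missing series, rather than imputing a value, lets a downstream model decide how to treat assets that lack a given measure type.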


Data description

The inputs X are divided into two files: train (6 MB, 104 IDs) and test (2.4 MB, 46 IDs). In each file, the data is represented as a tidy dataframe with the following columns:

  • empty: can be ignored

  • SAMPLE_ID: The id of the sample

  • GROUP_ID: categorical in {2, 3}, the number of the group to which the asset belongs

  • ASSET_ID: categorical, ranges from 1 to N (N=83), the id of the asset

  • MEASURE_TYPE: categorical, ranges from 1 to J (J=4)

  • MEASURE_VALUE: real-valued, a measure on the asset on a given date

  • MEASURE_WEEK: week in which the measurement was acquired. Only available in the train set, for exploratory purposes. Not available in the test set, as it could be used to infer the current week based on future measurements.

  • MEASURE_WEEKDAY: categorical, ranges from 1 to 7, day of the week on which the measurement was taken. Weekdays are numbered from 1 (Thursday) to 7 (Wednesday).
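The tidy layout described above could be read with the standard library roughly as follows. This is a sketch under assumptions: the sample rows are invented, the leading unused column is omitted, and treating an empty cell or the string "None" as a missing value is a guess about how missing values are encoded in the files.

```python
import csv
import io

# Weekday codes as described above: 1 is Thursday, 7 is Wednesday.
WEEKDAY_NAMES = {1: "Thursday", 2: "Friday", 3: "Saturday", 4: "Sunday",
                 5: "Monday", 6: "Tuesday", 7: "Wednesday"}

# Made-up rows that follow the column names listed above.
SAMPLE_CSV = """SAMPLE_ID,GROUP_ID,ASSET_ID,MEASURE_TYPE,MEASURE_VALUE,MEASURE_WEEKDAY
0,2,1,1,0.5,1
0,2,1,1,None,2
0,3,7,2,1.25,7
"""


def parse_value(text):
    """Return a float, or None for an empty or 'None' cell (assumed encoding)."""
    return None if text in ("", "None") else float(text)


def load_measures(handle):
    """Group tidy rows as {(sample_id, asset_id): {(type, weekday): value}}."""
    table = {}
    for row in csv.DictReader(handle):
        key = (int(row["SAMPLE_ID"]), int(row["ASSET_ID"]))
        cell = (int(row["MEASURE_TYPE"]), int(row["MEASURE_WEEKDAY"]))
        table.setdefault(key, {})[cell] = parse_value(row["MEASURE_VALUE"])
    return table


measures = load_measures(io.StringIO(SAMPLE_CSV))
```

Passing an open file object instead of the `io.StringIO` wrapper would apply the same grouping to a real file.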

Targets

The targets are gathered into a separate file train_y.csv:

  • SAMPLE_ID: The id of the sample to predict

  • PRODUCTION_GROUP_{k}: the target for the group k.

Extra

An extra file, assets.csv, provides the nominal capacity of each asset:

  • ASSET_ID: The id of the asset

  • ASSET_NOMINAL_CAPACITY: Nominal production capacity for the asset. An asset usually cannot produce more than its nominal capacity, but spikes in production can sometimes reach up to 120% of the nominal capacity.

Metrics

Mean Squared Error
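For reference, the mean squared error over a flat list of targets can be computed as below. The function name and the toy numbers are illustrative; how the platform combines the two group columns into a single score is not stated here, so this sketch does not assume any particular weighting.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy values: squared errors are 4, 0 and 9, so the mean is 13 / 3.
score = mean_squared_error([10.0, 20.0, 30.0], [12.0, 20.0, 27.0])
```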

Submission format

Same as the train_y.csv format. Keep the same column names and order. The SAMPLE_ID values must be those appearing in the test file and must be sorted in increasing order.
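A submission in this format could be written along these lines. The sketch is hedged: the column names `PRODUCTION_GROUP_2` and `PRODUCTION_GROUP_3` are an assumption obtained by substituting the GROUP_ID values into the `PRODUCTION_GROUP_{k}` pattern, and the helper name and predictions are made up. In practice, the header should be copied from train_y.csv.

```python
import csv
import io


def write_submission(predictions, columns, handle):
    """Write one row per sample, sorted by increasing SAMPLE_ID.

    predictions: {sample_id: {column_name: predicted value}}
    columns:     target column names, in the order of the training targets
    """
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["SAMPLE_ID"] + list(columns))
    for sample_id in sorted(predictions):
        row = predictions[sample_id]
        writer.writerow([sample_id] + [row[name] for name in columns])


# Assumed column names; check them against the header of train_y.csv.
columns = ["PRODUCTION_GROUP_2", "PRODUCTION_GROUP_3"]
preds = {
    11: {"PRODUCTION_GROUP_2": 4.0, "PRODUCTION_GROUP_3": 6.0},
    3: {"PRODUCTION_GROUP_2": 1.5, "PRODUCTION_GROUP_3": 2.5},
}
buffer = io.StringIO()
write_submission(preds, columns, buffer)
submission_text = buffer.getvalue()
```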


Benchmark description

One could naively assume that the assets produce at maximum capacity when they are on, except for the rare times they are off. Their production can then be inferred by running anomaly detection on each asset's measurements.

When the measures are far from the centroid (distance above a certain cut-off), an anomaly is detected, the asset is considered off and set to produce 0 for the period. Otherwise, the asset is set to produce its nominal capacity $C$. Then, aggregated production can be derived from the nominal capacities of the active assets.
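The on/off rule described above might look like the following sketch. It is not the actual benchmark code: the use of Euclidean distance, a per-asset centroid computed from historical feature vectors, a single shared cut-off, and all names and numbers are assumptions made for illustration.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def is_off(vector, center, cutoff):
    """Flag an anomaly when the vector lies farther than cutoff from center."""
    return math.dist(vector, center) > cutoff


def baseline_group_production(week_vectors, centers, capacities, cutoff):
    """Sum the nominal capacities of the assets considered on this week."""
    total = 0.0
    for asset_id, vector in week_vectors.items():
        if not is_off(vector, centers[asset_id], cutoff):
            total += capacities[asset_id]
    return total


# Toy history and week, made up for illustration only.
history = {1: [[1.0, 1.0], [1.0, 3.0]], 2: [[0.0, 0.0], [2.0, 0.0]]}
centers = {asset_id: centroid(vs) for asset_id, vs in history.items()}
week = {1: [1.0, 2.5], 2: [6.0, 0.0]}
capacities = {1: 100.0, 2: 50.0}
total = baseline_group_production(week, centers, capacities, cutoff=2.0)
```

In this toy week, asset 1 sits 0.5 away from its centroid and counts as on, while asset 2 sits 5.0 away and counts as off.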

This baseline approach relies on a strong simplifying assumption: that each asset is either fully on or fully off. In reality, assets are not 100% on or off. A suggested improvement would be to infer a real-valued activity level from the measurements.
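One hypothetical way to turn the binary rule into a real-valued activity level is to let activity decrease linearly with the distance to the centroid. The mapping below is purely illustrative and is not suggested by the challenge provider; the function names, the linear form and the numbers are all assumptions.

```python
import math


def activity_level(vector, center, cutoff):
    """Map distance to the centroid linearly onto [0, 1].

    Returns 1.0 at the centroid and 0.0 at or beyond the cut-off.
    """
    distance = math.dist(vector, center)
    return max(0.0, 1.0 - distance / cutoff)


def soft_production(vector, center, cutoff, nominal_capacity):
    """Scale the nominal capacity by the estimated activity level."""
    return activity_level(vector, center, cutoff) * nominal_capacity


# Toy example: a vector halfway to the cut-off yields half the capacity.
half = soft_production([1.0, 3.0], [1.0, 2.0], cutoff=2.0, nominal_capacity=100.0)
```

Because the activity level stays within [0, 1], the resulting estimate never exceeds the nominal capacity passed in.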

 

Files


Files are accessible when logged in and registered to the challenge


The challenge provider



We are a technology-focused data and analytics firm using unconventional data to positively disrupt the world’s biggest industries. By harnessing the power of satellite imagery, natural language processing, machine learning and advanced mathematics, we deliver actionable intelligence on virtually any asset across the globe, in near real-time. Our mission is to introduce unparalleled transparency to market players for better decision-making.