# Challenge Data

### Asset production estimation by Kayrros

#### Description

##### Dates

Started on Jan. 6, 2020

##### Challenge context

We are a technology-focused data and analytics firm using unconventional data to positively disrupt the world’s biggest industries. By harnessing the power of satellite imagery, natural language processing, machine learning and advanced mathematics, we deliver actionable intelligence on virtually any asset across the globe, in near real-time. This challenge is an opportunity to work on one of our datasets: using on-site measurements to predict asset production.

##### Challenge goals

The goal of this problem is to estimate the production of a group of industrial assets, based on daily measurements and capacity constraints.

Consider a number of industrial installations $A_i$ with nominal production capacities $C_i$. $J$ daily measurements $x^j_{i,w,l}$ ($J=5$) are carried out on each asset $i$, for each week $w$, each weekday $l$ ($l = 1, \dots, 7$) and each measure type $j$, in order to detect patterns in operations that affect production.

The assets $A_i$ are gathered into $K$ disjoint groups ($K=2$). For each group, installations report the actual production levels $y_{k,w}$ at an aggregated level, on a weekly basis.

For each group $k$, the goal is to predict $\hat{y}_{k,w}$ as the sum of the productions of all assets $i$ in that group ($\hat{y}_{k,w} = \sum_i \hat{C}_{i,w}$), under the constraint that an asset's production does not exceed its maximum capacity in any week, i.e.

$0 \leq \hat{C}_{i,w} \leq C^{max}_i \quad \forall w$

$\hat{C}_{i_0,w_0}$ is to be estimated as a function of the measures pertaining to asset $i_0$ and week $w_0$:

$\hat{C}_{i_0,w_0} = f(x^j_{i_0,w_0,l})\quad \forall j \in \{1, \dots, J\}, \quad \forall l \in \{1, \dots, 7\}$
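The capacity constraint and the group sum above can be sketched in a few lines of Python. This is only an illustration under assumptions: `clip_to_capacity` and `group_production` are hypothetical helper names, the raw per-asset estimates are assumed to come from some model $f$ that is not shown, and how $C^{max}_i$ relates to the nominal capacity is left to the caller.

```python
def clip_to_capacity(estimate, max_capacity):
    """Project a raw estimate onto the interval [0, max_capacity]."""
    return min(max(estimate, 0.0), max_capacity)


def group_production(raw_estimates, max_capacities, groups):
    """Sum the clipped per-asset estimates within each group for one week.

    raw_estimates:  {asset_id: raw output of the model f for this week}
    max_capacities: {asset_id: maximum capacity of the asset}
    groups:         {asset_id: group_id}
    """
    totals = {}
    for asset_id, raw in raw_estimates.items():
        clipped = clip_to_capacity(raw, max_capacities[asset_id])
        group = groups[asset_id]
        totals[group] = totals.get(group, 0.0) + clipped
    return totals


# Illustrative numbers only (not taken from the challenge data).
raw = {1: 12.0, 2: -3.0, 3: 7.5}
c_max = {1: 10.0, 2: 5.0, 3: 8.0}
grp = {1: 2, 2: 2, 3: 3}
print(group_production(raw, c_max, grp))  # {2: 10.0, 3: 7.5}
```

Clipping after prediction is one simple way to enforce the constraint; a model whose output is bounded by construction would be an alternative.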

Notes:

• The measurements can only be used to assess the production of the asset on which they were taken.

• An asset usually cannot produce more than its nominal capacity, but spikes in production can sometimes reach up to 120% of the nominal capacity.

• The target is reported weekly while the measures are daily. Hence, the measures corresponding to the target for week $w$ are the measures of all the days in week $w$.

• Other factors might impact productivity, like economic conditions (e.g. demand-linked curtailment), but as a first approach, we assume such effects are not significant.

• Some series might be missing for some assets; these are filled entirely with None values.

• These measurements may correspond to incidents or maintenance, which impact the productivity of the assets, possibly with varying significance (e.g. large vs. small incidents, …).

##### Data description

The inputs X are divided into two files: train (6MB, 104 ID) and test (2.4MB, 46 ID). In each file the data is represented as a tidy dataframe with the following columns:

• empty: Does not matter

• SAMPLE_ID: The id of the sample

• GROUP_ID: categorical in {2, 3}, the number of the group to which the asset belongs

• ASSET_ID: categorical, ranges from 1 to N (N=83), the id of the asset

• MEASURE_TYPE: categorical, ranges from 1 to J (J=4)

• MEASURE_VALUE: real-value, a measure on the asset on a given date

• MEASURE_WEEK: week in which the measurement was acquired. Only available in the train set, for exploratory purposes. Not available in the test set, as it could be used to infer the current week based on future measurements.

• MEASURE_WEEKDAY: categorical, ranges from 1 to 7, day of the week on which the measurement was taken. Weekdays are numbered from 1 (Thursday) to 7 (Wednesday).
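To relate the tidy layout above to the notation $x^j_{i,w,l}$, the rows can be regrouped into one flat vector per (sample, asset) pair. The sketch below is hedged: `to_feature_vectors` is a hypothetical helper, it assumes at most one row per (SAMPLE_ID, ASSET_ID, MEASURE_TYPE, MEASURE_WEEKDAY) combination, and the number of measure types is passed as a parameter rather than hard-coded.

```python
from collections import defaultdict

N_WEEKDAYS = 7


def to_feature_vectors(rows, n_measure_types):
    """Map tidy rows to {(sample_id, asset_id): flat list of J * 7 values}.

    Cells with no matching row stay None, mirroring missing series.
    """
    size = n_measure_types * N_WEEKDAYS
    features = defaultdict(lambda: [None] * size)
    for row in rows:
        key = (row["SAMPLE_ID"], row["ASSET_ID"])
        j = int(row["MEASURE_TYPE"]) - 1     # 0-based measure type
        l = int(row["MEASURE_WEEKDAY"]) - 1  # 0-based weekday (0 = Thursday)
        features[key][j * N_WEEKDAYS + l] = row["MEASURE_VALUE"]
    return dict(features)


# Two made-up rows for a single sample and asset.
rows = [
    {"SAMPLE_ID": 0, "ASSET_ID": 1, "MEASURE_TYPE": 1,
     "MEASURE_WEEKDAY": 1, "MEASURE_VALUE": 0.5},
    {"SAMPLE_ID": 0, "ASSET_ID": 1, "MEASURE_TYPE": 2,
     "MEASURE_WEEKDAY": 7, "MEASURE_VALUE": 1.5},
]
vectors = to_feature_vectors(rows, n_measure_types=2)
```

Each vector can then be fed to a per-asset model; rows could equally be read with `csv.DictReader` or a dataframe library before this step.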

#### Targets

The targets are gathered into a separate file train_y.csv:

• SAMPLE_ID: The id of the sample to predict

• PRODUCTION_GROUP_{k}: the target for the group k.

#### Extra

An extra file, assets.csv, provides the nominal capacity of each asset:

• ASSET_ID: The id of the asset

• ASSET_NOMINAL_CAPACITY: Nominal production capacity for the asset. An asset usually cannot produce more than its nominal capacity, but spikes in production can sometimes reach up to 120% of the nominal capacity.

#### Metrics

Mean Squared Error
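For reference, the mean squared error between targets $y$ and predictions $\hat{y}$ over $n$ values is $\frac{1}{n}\sum (y - \hat{y})^2$. A minimal sketch follows; how the platform averages across samples and across the two group columns is not specified here, so the function simply takes two flat lists of equal length.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)


# Squared errors are 0, 1 and 4, so the result is 5 / 3.
score = mean_squared_error([1.0, 2.0, 4.0], [1.0, 3.0, 2.0])
```

Because errors are squared, a few badly mispredicted weeks can dominate the score.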

#### Submission format

Same format as train_y.csv. Keep the same column names and order. The SAMPLE_ID values must be those appearing in the test file and must be sorted in increasing order.
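A submission file respecting these rules could be written as sketched below. The helper name `write_submission` is hypothetical, and the header used in the example (`SAMPLE_ID`, `PRODUCTION_GROUP_2`, `PRODUCTION_GROUP_3`) is an assumption derived from the group numbers {2, 3}; in practice the header should be copied from the training target file.

```python
import csv
import io


def write_submission(predictions, columns, out):
    """Write predictions as CSV, sorted by increasing SAMPLE_ID.

    predictions: {sample_id: {target_column: value}}
    columns:     full header, with "SAMPLE_ID" first (assumed layout)
    out:         any writable text file object
    """
    writer = csv.DictWriter(out, fieldnames=columns)
    writer.writeheader()
    for sample_id in sorted(predictions):
        writer.writerow({"SAMPLE_ID": sample_id, **predictions[sample_id]})


# Made-up predictions, deliberately given out of order.
columns = ["SAMPLE_ID", "PRODUCTION_GROUP_2", "PRODUCTION_GROUP_3"]
preds = {
    105: {"PRODUCTION_GROUP_2": 1.0, "PRODUCTION_GROUP_3": 2.0},
    104: {"PRODUCTION_GROUP_2": 3.0, "PRODUCTION_GROUP_3": 4.0},
}
buffer = io.StringIO()
write_submission(preds, columns, buffer)
lines = buffer.getvalue().splitlines()
```

Writing to a real file only requires replacing the `io.StringIO` buffer with `open(path, "w", newline="")`.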

##### Benchmark description

One could naively assume that the assets are producing at maximum capacity when they are on, except for the rare times they are off. Their production can then be inferred by running anomaly detection on each asset's measurements.

When the measures are far from the centroid (distance above a certain cut-off), an anomaly is detected, the asset is considered off and set to produce 0 for the period. Otherwise, the asset is set to produce its nominal capacity $C$. Then, aggregated production can be derived from the nominal capacities of the active assets.

This baseline approach relies on a crucial assumption about the simplicity of the asset. In reality, assets are not 100% on or off. A suggested improvement could be to try to infer a real-valued activity level from the measurements.
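The on/off baseline described above can be sketched for a single asset as follows. Several details are assumptions made for illustration: the centroid is taken as the mean of the asset's weekly measurement vectors, the distance is Euclidean, the vectors contain no missing values, and the cut-off is picked by hand. The function names are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def baseline_production(weekly_vectors, nominal_capacity, cutoff):
    """Return one production estimate per week for a single asset.

    A week whose vector lies farther than `cutoff` from the centroid is
    flagged as an anomaly (asset off, production 0); otherwise the asset
    is assumed to run at its nominal capacity.
    """
    center = centroid(weekly_vectors)
    return [
        0.0 if math.dist(vector, center) > cutoff else nominal_capacity
        for vector in weekly_vectors
    ]


# Made-up measurements: three similar weeks and one outlier.
weeks = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [5.0, 5.0]]
estimates = baseline_production(weeks, nominal_capacity=10.0, cutoff=3.0)
```

Summing these per-asset estimates within each group gives the aggregated prediction. Replacing the hard 0 / nominal switch with a value that varies continuously with the distance is one way to move toward the real-valued activity level suggested above.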

#### Files

Files are accessible once you are logged in and registered for the challenge.

#### The challenge provider

We are a technology-focused data and analytics firm using unconventional data to positively disrupt the world’s biggest industries. By harnessing the power of satellite imagery, natural language processing, machine learning and advanced mathematics, we deliver actionable intelligence on virtually any asset across the globe, in near real-time. Our mission is to introduce unparalleled transparency to market players for better decision-making.