We are a technology-focused data and analytics firm using unconventional data to positively disrupt the world's biggest industries. By harnessing the power of satellite imagery, natural language processing, machine learning and advanced mathematics, we deliver actionable intelligence on virtually any asset across the globe, in near real-time. This challenge is an opportunity to work on one of our datasets: using on-site measurements to predict asset production.
Challenge goals
The goal of this problem is to estimate the production of a group of industrial assets, based on daily measurements and capacity constraints.
Considering a number of industrial installations $A_i$ with nominal production capacities $C_i$, $J$ daily measurements ($J=5$) $x_{i,w,l}^{j}$ are carried out on each asset $i$, for week $w$, weekday $l$ ($l = 1, \dots, 7$) and measure type $j$, in order to detect patterns in operations that affect production.
The assets $A_i$ are gathered into $K$ disjoint groups ($K=2$). For each group, installations report the actual production levels $y_{k,w}$ at an aggregated level, on a weekly basis.
For each group $k$, the goal is to predict $\hat{y}_{k,w}$ as the sum of the productions of all assets in that group ($\hat{y}_{k,w} = \sum_i \hat{C}_{i,w}$), under the constraint that an asset's production does not exceed its maximum capacity in any week, i.e.
$0 \le \hat{C}_{i,w} \le C_i^{\max} \quad \forall w$
$\hat{C}_{i_0,w_0}$ is to be estimated as a function of the measures pertaining to asset $i_0$ and week $w_0$:
The measures can only be used to assess the production of the asset on which they were taken.
An asset usually cannot produce more than its nominal capacity, but spikes in production can sometimes reach up to 120% of the nominal capacity.
The target is reported weekly while the measures are daily. Hence, the measures corresponding to the target for week w are the measures of all the days in week w.
Other factors might impact productivity, like economic conditions (i.e. demand-linked curtailment), but as a first approach, we assume such effects are not significant.
Some series may be missing for some assets; these are filled entirely with None values.
The measurements may correspond to incidents or maintenance, which impact the productivity of the assets, possibly with varying significance (e.g. large vs. small incidents, ...).
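The capacity constraint and the group-level sum described above can be sketched as follows. This is a minimal illustration, not the organisers' method: the function name, the `activity` input (a hypothetical per-asset activity estimate, one row per week) and the `max_ratio` parameter are assumptions. Setting `max_ratio=1.2` would allow the 120% spikes mentioned above; the default of 1.0 is an arbitrary choice.

```python
import numpy as np

def aggregate_group_production(activity, capacities, group_ids, max_ratio=1.0):
    """Sum capacity-constrained asset productions per group.

    activity:   (n_weeks, n_assets) hypothetical per-asset activity estimates
    capacities: (n_assets,) nominal capacities C_i
    group_ids:  (n_assets,) group label of each asset
    max_ratio:  assumed ratio C_i^max / C_i (1.0 here, 1.2 to allow spikes)
    """
    activity = np.asarray(activity, dtype=float)
    capacities = np.asarray(capacities, dtype=float)
    group_ids = np.asarray(group_ids)
    # Asset-level estimate, clipped to enforce 0 <= C_hat_{i,w} <= C_i^max.
    c_hat = np.clip(activity * capacities, 0.0, max_ratio * capacities)
    # Group-level prediction: sum of C_hat_{i,w} over the assets of group k.
    return {
        k: c_hat[:, group_ids == k].sum(axis=1)
        for k in np.unique(group_ids).tolist()
    }
```

The result maps each group label to one predicted value per week, so the constraint is applied per asset before any aggregation takes place.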
Data description
The inputs X are divided into two files: train (6 MB, 104 IDs) and test (2.4 MB, 46 IDs). In each file the data is represented as a tidy dataframe with the following columns:
empty: Does not matter
SAMPLE_ID: The id of the sample
GROUP_ID: categorical in {2, 3}, the number of the group to which the asset belongs
ASSET_ID: categorical, ranges from 1 to N (N=83), the id of the asset
MEASURE_TYPE: categorical, ranges from 1 to J (J=4)
MEASURE_VALUE: real-value, a measure on the asset on a given date
MEASURE_WEEK: week in which the measurement was acquired. Only available in the train set, for exploratory purposes. Not available in the test set, as it could be used to infer the current week based on future measurements.
MEASURE_WEEKDAY: categorical, ranges from 1 to 7, day of the week on which the measurement was taken. Weekdays are numbered from 1 (Thursday) to 7 (Wednesday).
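Since the inputs are in tidy (long) format, a model will usually need one feature row per sample. The sketch below shows one possible way to pivot the columns listed above with pandas. It assumes that each SAMPLE_ID covers one week of measurements, which the description suggests but does not state; the function name `to_wide` is hypothetical.

```python
import pandas as pd

def to_wide(X: pd.DataFrame) -> pd.DataFrame:
    """Return one row per SAMPLE_ID and one column per
    (ASSET_ID, MEASURE_TYPE, MEASURE_WEEKDAY) combination."""
    X = X.copy()
    # Coerce non-numeric entries (e.g. the string "None") to NaN.
    X["MEASURE_VALUE"] = pd.to_numeric(X["MEASURE_VALUE"], errors="coerce")
    wide = X.pivot_table(
        index="SAMPLE_ID",
        columns=["ASSET_ID", "MEASURE_TYPE", "MEASURE_WEEKDAY"],
        values="MEASURE_VALUE",
        aggfunc="mean",  # averages duplicates, should any exist
    )
    return wide.sort_index(axis=1)
```

MEASURE_WEEK is deliberately left out of the pivot, because it is not available in the test set.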
Targets
The targets are gathered into a separate file train_y.csv:
SAMPLE_ID: The id of the sample to predict
PRODUCTION_GROUP_{k}: the target for the group k.
Extra
An extra file, assets.csv, gives the nominal capacities of the assets:
ASSET_ID: The id of the asset
ASSET_NOMINAL_CAPACITY: Nominal production capacity for the asset. An asset usually cannot produce more than its nominal capacity, but spikes in production can sometimes reach up to 120% of the nominal capacity.
Metrics
Mean Squared Error
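For local validation, the metric can be reproduced with a few lines of NumPy. The description does not say how errors are combined across the group columns; the sketch below assumes a plain average over all samples and all target columns.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of squared differences over every sample and target column
    (the averaging across columns is an assumption)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))
```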
Submission format
Same as the train_y.csv format. Keep the same column names and order. The SAMPLE_ID values must be the ones appearing in the test file and must be sorted in increasing order.
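A submission with these properties could be assembled as sketched below. The helper name and its arguments are hypothetical; in practice `target_columns` would be read from the header of train_y.csv rather than typed by hand, so that names and order match exactly.

```python
import pandas as pd

def make_submission(sample_ids, predictions, target_columns):
    """Build a submission frame sorted by SAMPLE_ID.

    sample_ids:     SAMPLE_ID values taken from the test file
    predictions:    dict mapping each target column to values aligned
                    with sample_ids
    target_columns: target column names, in the order used by train_y.csv
    """
    sub = pd.DataFrame({"SAMPLE_ID": list(sample_ids)})
    for col in target_columns:
        sub[col] = list(predictions[col])
    # SAMPLE_ID must be in increasing order.
    return sub.sort_values("SAMPLE_ID").reset_index(drop=True)
```

The returned frame can then be written with `to_csv(..., index=False)` so that no extra index column is added.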
Benchmark description
One could naively assume that the assets produce at maximum capacity when they are on, except for the rare times they are off. Their production can then be inferred by running anomaly detection on each asset's measurements.
When the measures are far from the centroid (distance above a certain cut-off), an anomaly is detected, the asset is considered off and set to produce 0 for the period. Otherwise, the asset is set to produce its nominal capacity C.
Then, aggregated production can be derived from the nominal capacities of the active assets.
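The on/off rule of this benchmark can be sketched for a single asset as follows. The description does not specify the distance, how the centroid is computed, or the cut-off value; the sketch assumes Euclidean distance, the mean of the asset's feature vectors as centroid, and a hypothetical `cutoff` parameter. It also assumes the features contain no missing values.

```python
import numpy as np

def on_off_production(features, capacity, cutoff):
    """Benchmark-style production estimate for ONE asset.

    features: (n_weeks, n_features) measurements of the asset, no NaNs
    capacity: nominal capacity C of the asset
    cutoff:   hypothetical distance threshold above which the asset is off
    """
    features = np.asarray(features, dtype=float)
    centroid = features.mean(axis=0)  # assumption: mean as centroid
    distance = np.linalg.norm(features - centroid, axis=1)  # Euclidean
    # Far from the centroid: anomaly, asset considered off (production 0).
    return np.where(distance <= cutoff, float(capacity), 0.0)
```

Summing these per-asset estimates over the assets of a group gives the aggregated production for that group.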
This baseline relies on a crucial simplifying assumption about asset behaviour. In reality, assets are not simply fully on or fully off. A suggested improvement is to infer a real-valued activity level from the measurements.
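One possible way to obtain such a real-valued activity level, offered only as an illustration, is to replace the hard cut-off with a linear ramp on the distance to the centroid. The function and the two thresholds `d_on` and `d_off` are assumptions, not part of the benchmark.

```python
import numpy as np

def soft_activity(distance, d_on, d_off):
    """Map distances to an activity level in [0, 1].

    Activity is 1 for distance <= d_on, 0 for distance >= d_off,
    and decreases linearly in between. Requires d_off > d_on.
    """
    distance = np.asarray(distance, dtype=float)
    return np.clip((d_off - distance) / (d_off - d_on), 0.0, 1.0)
```

Multiplying this activity by the nominal capacity gives an asset-level estimate that stays between 0 and the nominal capacity, and the thresholds could be tuned against the weekly targets.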
Files
Files are accessible when logged in and registered to the challenge
The challenge provider
We are a technology-focused data and analytics firm using unconventional data to positively disrupt the world's biggest industries. By harnessing the power of satellite imagery, natural language processing, machine learning and advanced mathematics, we deliver actionable intelligence on virtually any asset across the globe, in near real-time. Our mission is to introduce unparalleled transparency to market players for better decision-making.