Challenge Data

Assessing uncertainty in air quality predictions
by Oze Energies


Description


Competitive challenge
Physics
Environment
Regression
Time series
10MB to 1GB
Advanced level

Dates

Started on Jan. 4, 2021


Challenge context

Created in 2006, Oze-Energies is an innovative company specialized in the instrumented energy optimization of existing commercial buildings. Thousands of communicating sensors, coupled with monitoring and energy-optimization software, measure and store a large volume of data (temperatures, consumption levels, programming, etc.) continuously and in real time. Using a few weeks of accumulated data and its statistical learning algorithms, Oze-Energies models the energy behavior of each building. Oze-Energies experts then identify and evaluate improvement actions that preserve comfort and require no construction work, acting partly on the settings of climate equipment (heating, ventilation and air conditioning) and partly on resizing energy contracts. These actions reduce the energy bill of owners and tenants by about 25% per year on average.

Contacts

Maurice Charbit (mch@oze-energies.com)

Max Cohen (max.cohen@telecom-sudparis.eu)

Sylvain Le Corff (sylvain.lecorff@gmail.com)


Challenge goals

This data challenge aims at introducing a new statistical model to predict and analyze air quality in large buildings using observations stored in the Oze-Energies database. Physics-based approaches, which build air quality simulation tools to reproduce complex building behaviors, are widespread in the most complex situations. The main drawbacks of such software for simulating the behavior of transient systems are:

  • the significant computational time required to run such models, as they integrate many noise sources and a huge number of parameters, and essentially require massive thermodynamics computations;
  • the fact that they often output only a single point estimate at each time step, without providing any uncertainty measure to assess the confidence of their predictions.

In order to analyze and predict future air quality, so as to alert and correct building management systems and thereby ensure comfort and satisfactory sanitary conditions, this challenge aims at solving the second issue above, i.e. at designing models which take into account the uncertainty in the exogenous data describing external weather conditions and the occupancy of the building. This makes it possible to provide confidence intervals on the air quality predictions, here on the humidity of the air inside the building.


Data description

The data are split into a training dataset and a test dataset, and each dataset contains input and output variables. Each sample in the training and test sets corresponds to one week of hourly observations; each column corresponds to a sensor value at a given hour during the week. The training input file contains 40 different weeks and the test input file contains 12 different weeks. The input file gathers building management system values (such as those of the air handling units) and several forecasts of the outside temperature and relative humidity for one week. One input $x_i$ is described as follows; a minimal loading sketch is given after the list.

  • Each sample is identified by a unique identification number Id.
  • Each sample is identified by a unique start time of the time series FirstDayOfWeek.
  • AHU_1_AIRFLOWRATE_i: for $0 \leqslant i \leqslant 167$, airflow rate of the first AHU (Air Handling Unit) (168 columns).
  • AHU_1_AIRFLOWTEMP_i: for $0 \leqslant i \leqslant 167$, airflow temperature of the first AHU (168 columns).
  • AHU_2_AIRFLOWRATE_i: for $0 \leqslant i \leqslant 167$, airflow rate of the second AHU (168 columns).
  • AHU_2_AIRFLOWTEMP_i: for $0 \leqslant i \leqslant 167$, airflow temperature of the second AHU (168 columns).
  • AHU_3_AIRFLOWRATE_i: for $0 \leqslant i \leqslant 167$, airflow rate of the third AHU (168 columns).
  • AHU_3_AIRFLOWTEMP_i: for $0 \leqslant i \leqslant 167$, airflow temperature of the third AHU (168 columns).
  • OCCUPATION_PROFILE_i: for $0 \leqslant i \leqslant 167$, building occupancy rate (168 columns).
  • FORECAST_EXT_RHUM_i_j: for $0 \leqslant i \leqslant 167$ and $0 \leqslant j \leqslant 167$, forecast of expert $i$ for the outside relative humidity at hour $j$ of the week ($168 \times 168$ columns).
  • FORECAST_EXT_TEMP_i_j: for $0 \leqslant i \leqslant 167$ and $0 \leqslant j \leqslant 167$, forecast of expert $i$ for the outside temperature at hour $j$ of the week ($168 \times 168$ columns).
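
To make this layout concrete, here is a minimal loading sketch in Python. The file name input_training.csv is an assumption for illustration; use the actual file distributed on the challenge page.

```python
import pandas as pd

# Hypothetical file name; replace with the file downloaded from the challenge page.
x_train = pd.read_csv("input_training.csv", index_col="Id")

# Hourly values of one sensor, as an array of shape (n_weeks, 168).
airflow_rate_1 = x_train[[f"AHU_1_AIRFLOWRATE_{i}" for i in range(168)]].to_numpy()

# Expert temperature forecasts, as an array of shape (n_weeks, 168 experts, 168 hours).
cols = [f"FORECAST_EXT_TEMP_{i}_{j}" for i in range(168) for j in range(168)]
forecast_temp = x_train[cols].to_numpy().reshape(-1, 168, 168)
```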

The output file contains the time series to be predicted hourly from the input. These correspond to predictions of the air quality inside the building and of the outside temperature and relative humidity. The output file is defined as follows: for each Id of the input dataset, the same Id of the output dataset contains the following quantities $y_i$, considered as part of the air quality index (AQI).

  • HISTORIC_EXT_TEMP_Inf_i: for $0 \leqslant i \leqslant 167$, lower bound for the outside temperature (168 columns).
  • HISTORIC_EXT_TEMP_Sup_i: for $0 \leqslant i \leqslant 167$, upper bound for the outside temperature (168 columns).
  • HISTORIC_EXT_RHUM_Inf_i: for $0 \leqslant i \leqslant 167$, lower bound for the outside relative humidity (168 columns).
  • HISTORIC_EXT_RHUM_Sup_i: for $0 \leqslant i \leqslant 167$, upper bound for the outside relative humidity (168 columns).
  • SENSOR_INT_TEMP_Inf_i: for $0 \leqslant i \leqslant 167$, lower bound for the inside temperature (168 columns).
  • SENSOR_INT_TEMP_Sup_i: for $0 \leqslant i \leqslant 167$, upper bound for the inside temperature (168 columns).
  • SENSOR_INT_RHUM_Inf_i: for $0 \leqslant i \leqslant 167$, lower bound for the inside relative humidity (168 columns).
  • SENSOR_INT_RHUM_Sup_i: for $0 \leqslant i \leqslant 167$, upper bound for the inside relative humidity (168 columns).

For any time step $1 \leqslant t \leqslant T$ (here $T = 168$) of any sample $i$ in the test set and each output $y_{t,i}$, we ask the model to provide a lower bound $\underline{y}_{t,i}$ and an upper bound $\overline{y}_{t,i}$ forming a 95% confidence interval. Predictions should be provided as follows: first all the lower bounds $\underline{y}_{t,i}$ (with the data in the same order as in the output file), then all the upper bounds $\overline{y}_{t,i}$. Each prediction therefore contains $4 \times 168 = 672$ lower bounds followed by $4 \times 168 = 672$ upper bounds.
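
As a sketch of this ordering for one test sample (the variable names and placeholder values are illustrative, and the four outputs are assumed to be stacked in the same order as in the output file):

```python
import numpy as np

# Predicted bounds for one test sample: one row per output variable, assumed to be
# in the output-file order (EXT_TEMP, EXT_RHUM, INT_TEMP, INT_RHUM), one column
# per hour of the week. Zeros and ones are placeholders.
lower = np.zeros((4, 168))
upper = np.ones((4, 168))

# One prediction row: 672 lower bounds followed by 672 upper bounds.
row = np.concatenate([lower.ravel(), upper.ravel()])
assert row.shape == (1344,)
```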


Benchmark description

The performance of the model is assessed by analyzing the predictions based on the input variables of the test file. We use the Prediction Interval Coverage Probability (PICP) and the Mean Prediction Interval Width (MPIW) of [Pearce et al., 2018], see equation (15) of https://arxiv.org/pdf/1802.07167.pdf with $\lambda = 1$ and $\alpha = 0.05$. For any time step $1 \leqslant t \leqslant T$ (here $T = 168$) of any sample $i$ in the test set and each output $y_{t,i}$, we ask the model to provide a lower bound $\underline{y}_{t,i}$ and an upper bound $\overline{y}_{t,i}$, and the PICP and the captured MPIW are computed as
$$n_i = \sum_{t=1}^{T} \mathbf{1}_{\underline{y}_{t,i} \leqslant y_{t,i} \leqslant \overline{y}_{t,i}}\,, \qquad \mathrm{picp}_i = \frac{n_i}{n} \qquad \text{and} \qquad \mathrm{mpiw}_i = \frac{1}{n_i} \sum_{t=1}^{T} \left(\overline{y}_{t,i} - \underline{y}_{t,i}\right) \mathbf{1}_{\underline{y}_{t,i} \leqslant y_{t,i} \leqslant \overline{y}_{t,i}}\,.$$
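
These formulas translate directly into numpy for one sample and one output series; in this sketch, y holds the true values, y_low and y_up the predicted bounds over the $T$ time steps, and $n$ is taken to be $T$:

```python
import numpy as np

def picp_mpiw(y, y_low, y_up):
    """PICP and captured MPIW for one output series of length T."""
    captured = (y_low <= y) & (y <= y_up)   # indicator that y falls inside the interval
    n_i = captured.sum()
    picp = n_i / len(y)                                      # coverage, with n taken as T
    mpiw = ((y_up - y_low) * captured).sum() / max(n_i, 1)   # mean width of capturing intervals
    return picp, mpiw
```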

The aim is then to minimize $\mathrm{mpiw}_i$ under the constraint $\mathrm{picp}_i \geqslant 0.95$. Therefore, following [Pearce et al., 2018], we consider the following penalized loss function for one sample $i$:
$$\ell_i = \mathrm{mpiw}_i + \frac{n}{\alpha(1-\alpha)} \max\left(0, (1-\alpha) - \mathrm{picp}_i\right)^2\,,$$
where $\alpha = 0.05$. The total loss is the mean of this loss over all samples.
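
Combined with the previous sketch, the per-sample loss can be written as follows (again taking $n = T = 168$ as an assumption):

```python
def penalized_loss(picp, mpiw, n=168, alpha=0.05):
    """Interval width plus a quadratic penalty when coverage drops below 1 - alpha."""
    return mpiw + n / (alpha * (1 - alpha)) * max(0.0, (1 - alpha) - picp) ** 2
```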

The benchmark was obtained using an LSTM network with dropout, implemented in Torch with the torch.nn.MSELoss() loss and the Adam optimizer with learning rate lr=1e-1, and with

  • epochs = 1000
  • nb_hidden = 8
  • nb_layers = 2
  • dropout_rate = 0.2

During the test phase, 1000 stochastic runs of the LSTM were used to produce 1000 samples for each data point to be predicted. These samples were then used to build a confidence interval and produce the lower and upper bounds.
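
The benchmark code itself is not distributed with this description, but the procedure can be sketched in PyTorch as follows; the feature and output dimensions, the placeholder input, and the data handling are illustrative assumptions, not the exact benchmark implementation.

```python
import torch

class DropoutLSTM(torch.nn.Module):
    def __init__(self, n_features, n_outputs, nb_hidden=8, nb_layers=2, dropout_rate=0.2):
        super().__init__()
        self.lstm = torch.nn.LSTM(n_features, nb_hidden, num_layers=nb_layers,
                                  dropout=dropout_rate, batch_first=True)
        self.head = torch.nn.Linear(nb_hidden, n_outputs)

    def forward(self, x):        # x: (batch, 168, n_features)
        h, _ = self.lstm(x)
        return self.head(h)      # (batch, 168, n_outputs)

model = DropoutLSTM(n_features=7, n_outputs=4)   # dimensions are illustrative
x = torch.randn(1, 168, 7)                       # placeholder input week

# Test phase: keep dropout active by staying in train mode and draw
# 1000 stochastic forward passes per data point.
model.train()
with torch.no_grad():
    samples = torch.stack([model(x) for _ in range(1000)])

# Empirical 95% interval from the Monte Carlo samples.
y_low = samples.quantile(0.025, dim=0)
y_up = samples.quantile(0.975, dim=0)
```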

