Professor

by Oze Energies

Started on Jan. 4, 2021

Created in 2006, Oze-Energies is an innovative company specializing in Instrumented Energy Optimization of existing commercial buildings. Thousands of communicating sensors, coupled with monitoring and energy optimization software, measure and store a large volume of data (temperatures, consumption, equipment programming, etc.) continuously and in real time. Using a few weeks of accumulated data and its statistical learning algorithms, Oze-Energies models the energy behavior of each building. Oze-Energies experts then identify and evaluate improvement actions that preserve comfort and require no construction work: first by adjusting the settings of climate-control equipment (heating, ventilation and air conditioning), and second by resizing energy contracts. These actions reduce the energy bill of owners and tenants by about 25% per year on average.

Maurice Charbit (mch@oze-energies.com)

Max Cohen (max.cohen@telecom-sudparis.eu)

Sylvain Le Corff (sylvain.lecorff@gmail.com)

This data challenge aims to introduce a new statistical model to predict and analyze air quality in large buildings using observations stored in the Oze-Energies database. Physics-based simulation tools are widely used to reproduce building behaviors, including in the most complex situations. The main drawbacks of such software for simulating the behavior of transient systems are:

- the significant computational time required to run such models as they integrate many noisy sources and a huge number of parameters and require essentially massive thermodynamics computations;
- the fact that they often solely output a single-point estimate at each time, without providing any uncertainty measures to assess their confidence about their predictions.

In order to analyze and predict future air quality, so as to alert and correct building management systems and ensure comfort and satisfactory sanitary conditions, this challenge addresses the second drawback above, i.e. it aims at designing models that take into account the uncertainty in the exogenous data describing external weather conditions and building occupancy. This makes it possible to provide confidence intervals on the air quality predictions, here on the humidity of the air inside the building.

The data are split into a training dataset and a test dataset, each containing input and output variables. Each sample in the training and test sets corresponds to one week of hourly observations, and each column corresponds to a sensor value at a given hour of the week. The training input file contains $40$ different weeks and the test input file contains $12$ different weeks. The input file gathers building management system values (such as those of the air handling units) and several forecasts of the outside temperature and relative humidity for one week. One input $x_i$ is described as follows.

- **Id**: unique identification number of the sample.
- FirstDayOfWeek: start time of the time series, unique to each sample.
- AHU_1_AIRFLOWRATE_i: for $0\leqslant i \leqslant 167$, airflow rate of the first AHU (Air Handling Unit) (168 columns).
- AHU_1_AIRFLOWTEMP_i: for $0\leqslant i \leqslant 167$, airflow temperature of the first AHU (168 columns).
- AHU_2_AIRFLOWRATE_i: for $0\leqslant i \leqslant 167$, airflow rate of the second AHU (168 columns).
- AHU_2_AIRFLOWTEMP_i: for $0\leqslant i \leqslant 167$, airflow temperature of the second AHU (168 columns).
- AHU_3_AIRFLOWRATE_i: for $0\leqslant i \leqslant 167$, airflow rate of the third AHU (168 columns).
- AHU_3_AIRFLOWTEMP_i: for $0\leqslant i \leqslant 167$, airflow temperature of the third AHU (168 columns).
- OCCUPATION_PROFILE_i: for $0\leqslant i \leqslant 167$, building occupancy rate (168 columns).
- FORECAST_EXT_RHUM_i_j: for $0\leqslant i \leqslant 167$ and $0\leqslant j \leqslant 167$, forecast by expert $i$ of the outside relative humidity at hour $j$ of the week; each expert thus provides 168 hourly forecasts (168 $\times$ 168 columns).
- FORECAST_EXT_TEMP_i_j: for $0\leqslant i \leqslant 167$ and $0\leqslant j \leqslant 167$, forecast by expert $i$ of the outside temperature at hour $j$ of the week; each expert thus provides 168 hourly forecasts (168 $\times$ 168 columns).
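To make the column layout concrete, here is a minimal sketch that rebuilds the expected input column names and reshapes the forecast columns of one sample into a 168 × 168 grid. It assumes that every family appends its index with an underscore (as in `AHU_1_AIRFLOWRATE_i`), that the columns appear in the order listed above, and that a sample is available as a plain mapping from column name to value. The helper names `input_columns` and `forecast_grid` are hypothetical.

```python
HOURS = 168  # one week of hourly observations

# Per-hour sensor families listed above (one column per hour).
HOURLY_FAMILIES = [
    "AHU_1_AIRFLOWRATE", "AHU_1_AIRFLOWTEMP",
    "AHU_2_AIRFLOWRATE", "AHU_2_AIRFLOWTEMP",
    "AHU_3_AIRFLOWRATE", "AHU_3_AIRFLOWTEMP",
    "OCCUPATION_PROFILE",
]
# Forecast families (one column per (expert i, hour j) pair).
FORECAST_FAMILIES = ["FORECAST_EXT_RHUM", "FORECAST_EXT_TEMP"]


def input_columns():
    """Return the assumed list of input column names, identifiers first."""
    cols = ["Id", "FirstDayOfWeek"]
    for fam in HOURLY_FAMILIES:
        cols += [f"{fam}_{i}" for i in range(HOURS)]
    for fam in FORECAST_FAMILIES:
        cols += [f"{fam}_{i}_{j}" for i in range(HOURS) for j in range(HOURS)]
    return cols


def forecast_grid(row, family):
    """Reshape one sample's forecasts into grid[i][j] (expert i, hour j)."""
    return [[row[f"{family}_{i}_{j}"] for j in range(HOURS)]
            for i in range(HOURS)]
```

Under these assumptions an input row has 2 + 7 × 168 + 2 × 168 × 168 columns, most of them forecasts, which is worth keeping in mind when choosing how to summarize the experts before feeding a model.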

The output file contains the time series to be predicted hourly from the input. These correspond to the air quality inside the building and to the outside temperature and relative humidity. The output file is defined as follows: for each **Id** of the input dataset, the same **Id** of the output dataset contains the following quantities $y_i$, considered as part of the air quality index (AQI).

- HISTORIC_EXT_TEMP_Inf_i: $0\leqslant i \leqslant 167$, lower bound for the outside temperature (168 columns).
- HISTORIC_EXT_TEMP_Sup_i: $0\leqslant i \leqslant 167$, upper bound for the outside temperature (168 columns).
- HISTORIC_EXT_RHUM_Inf_i: $0\leqslant i \leqslant 167$, lower bound for the outside relative humidity (168 columns).
- HISTORIC_EXT_RHUM_Sup_i: $0\leqslant i \leqslant 167$, upper bound for the outside relative humidity (168 columns).
- SENSOR_INT_TEMP_Inf_i: $0\leqslant i \leqslant 167$, lower bound for the inside temperature (168 columns).
- SENSOR_INT_TEMP_Sup_i: $0\leqslant i \leqslant 167$, upper bound for the inside temperature (168 columns).
- SENSOR_INT_RHUM_Inf_i: $0\leqslant i \leqslant 167$, lower bound for the inside relative humidity (168 columns).
- SENSOR_INT_RHUM_Sup_i: $0\leqslant i \leqslant 167$, upper bound for the inside relative humidity (168 columns).

For any time step $1\leqslant t \leqslant T$ (here $T=168$) of any sample $i$ in the test set and each output $y_{t,i}$, the model must provide a lower bound $\underline y_{t,i}$ and an upper bound $\overline y_{t,i}$ forming a 95% confidence interval. The prediction is laid out as follows: first all the lower bounds $\underline y_{t,i}$ (in the same order as in the output file), then all the upper bounds $\overline y_{t,i}$. Each prediction therefore consists of $4\times 168 = 672$ lower bounds followed by $4\times 168 = 672$ upper bounds.
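As an illustration of this layout, the sketch below assembles the 1344 values of one prediction from per-variable bounds. It assumes the four variables follow the order in which they are listed in the output-file description; the function name `flatten_prediction` and the dictionary-based inputs are hypothetical conveniences, not part of the challenge material.

```python
HOURS = 168
# Variable order assumed to follow the output-file description.
VARIABLES = ["HISTORIC_EXT_TEMP", "HISTORIC_EXT_RHUM",
             "SENSOR_INT_TEMP", "SENSOR_INT_RHUM"]


def flatten_prediction(lower, upper):
    """Concatenate all lower bounds, then all upper bounds, for one sample.

    `lower` and `upper` map each variable name to its 168 hourly bounds.
    """
    row = []
    for bounds in (lower, upper):
        for name in VARIABLES:
            values = list(bounds[name])
            if len(values) != HOURS:
                raise ValueError(
                    f"{name}: expected {HOURS} values, got {len(values)}")
            row.extend(values)
    return row
```

The length check is there because a single missing hour would silently shift every later bound into the wrong slot.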

The performance of the model is assessed on its predictions for the input variables of the test file. We use the Prediction Interval Coverage Probability (PICP) and the captured Mean Prediction Interval Width (MPIW) of [Pearce et al., 2018], see equation (15) in https://arxiv.org/pdf/1802.07167.pdf, with $\lambda = 1$ and $\alpha = 0.05$. Given the lower bounds $\underline y_{t,i}$ and upper bounds $\overline y_{t,i}$ provided by the model for each time step $1\leqslant t \leqslant T$ (here $T=168$) of each sample $i$ in the test set and each output $y_{t,i}$, the PICP and the captured MPIW are computed as

$$n_i = \sum_{t=1}^T 1_{\underline y_{t,i}\leqslant y_{t,i}\leqslant \overline y_{t,i}}\,,\quad \mathrm{picp}_i = \frac{n_i}{n}\quad\mathrm{and}\quad \mathrm{mpiw}_i = \frac{1}{n_i}\sum_{t=1}^T \left(\overline y_{t,i} - \underline y_{t,i}\right) 1_{\underline y_{t,i}\leqslant y_{t,i}\leqslant \overline y_{t,i}}\,,$$

where $n$ is the number of scored points, in the notation of [Pearce et al., 2018].

The aim is then to minimize $\mathrm{mpiw}_i$ under the constraint $\mathrm{picp}_i\geqslant 0.95$. Therefore, following [Pearce et al., 2018], we consider the following penalized loss function for one sample $i$:

$$\ell_i = \mathrm{mpiw}_i + \frac{n}{\alpha(1-\alpha)} \max\left(0,(1-\alpha)-\mathrm{picp}_i\right)^2\,,$$

where $\alpha= 0.05$. The total loss is the mean over all samples of this loss.
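The scoring rule can be sketched in a few lines of plain Python. This is an illustrative sketch, not the organizers' scoring code: the function names are hypothetical, $n$ is assumed to be the number of time steps of the scored series, and the captured MPIW is set to 0 when no observation is captured, a case the formula leaves undefined.

```python
def interval_loss(y, lower, upper, alpha=0.05):
    """Penalized interval loss for one series; returns (loss, picp, mpiw)."""
    n = len(y)  # assumption: n is the number of scored time steps
    captured = [lo <= obs <= up for obs, lo, up in zip(y, lower, upper)]
    n_captured = sum(captured)
    picp = n_captured / n
    # Captured MPIW: mean width over intervals that contain the observation.
    # Undefined when nothing is captured; 0.0 is a choice made for this sketch.
    if n_captured:
        mpiw = sum(up - lo
                   for lo, up, c in zip(lower, upper, captured) if c) / n_captured
    else:
        mpiw = 0.0
    penalty = n / (alpha * (1 - alpha)) * max(0.0, (1 - alpha) - picp) ** 2
    return mpiw + penalty, picp, mpiw


def total_loss(samples, alpha=0.05):
    """Mean of per-sample losses; `samples` is a list of (y, lower, upper)."""
    return sum(interval_loss(*s, alpha=alpha)[0] for s in samples) / len(samples)
```

For instance, with 20 observations, intervals of width 2 and 18 observations captured, the coverage is 0.9, so the penalty term is $20 \times 0.05^2 / (0.05 \times 0.95) \approx 1.05$ and the loss is about 3.05. Widening the intervals removes the penalty but increases the width term.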

The benchmark was obtained with an LSTM network with dropout, implemented in PyTorch and trained with the loss torch.nn.MSELoss() and the Adam optimizer with learning rate lr=1e-1, using the following settings:

- epochs = 1000
- nb_hidden = 8
- nb_layers = 2
- dropout_rate = 0.2

During the test phase, 1000 stochastic runs of the LSTM were used to produce 1000 samples for each data point to be predicted. These samples were then used to build a confidence interval and produce the lower and upper bounds.
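The last step, turning stochastic runs into bounds, might look like the sketch below. The description does not say how the interval was built from the samples, so the use of empirical 2.5% and 97.5% quantiles is an assumption, as are the helper names `quantile` and `bounds_from_samples`.

```python
def quantile(sorted_values, q):
    """Linearly interpolated quantile of an already sorted list, 0 <= q <= 1."""
    pos = q * (len(sorted_values) - 1)
    low = int(pos)
    high = min(low + 1, len(sorted_values) - 1)
    frac = pos - low
    return sorted_values[low] * (1 - frac) + sorted_values[high] * frac


def bounds_from_samples(samples, alpha=0.05):
    """Turn stochastic predictions into per-time-step lower and upper bounds.

    `samples[k][t]` is the prediction of run k at time step t.
    """
    lower, upper = [], []
    for t in range(len(samples[0])):
        draws = sorted(run[t] for run in samples)
        lower.append(quantile(draws, alpha / 2))      # assumed 2.5% quantile
        upper.append(quantile(draws, 1 - alpha / 2))  # assumed 97.5% quantile
    return lower, upper
```

An alternative under a Gaussian assumption would be the sample mean plus or minus 1.96 standard deviations; the empirical quantiles used here avoid that distributional assumption at the cost of needing enough runs to resolve the tails.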
