Challenge Data

Predicting odor compound concentrations by VEOLIA

Description

Dates

Started on Jan. 5, 2022

Challenge context

The Veolia group is the global leader in optimized resource management. With nearly 169 000 employees worldwide, the Group designs and provides water, waste and energy management solutions that contribute to the sustainable development of communities and industries. Through its three complementary business activities, Veolia helps to develop access to resources, preserve available resources, and replenish them.

Veolia's objective is to provide a technical and objective response to perceptions of odor nuisance around certain wastewater and waste treatment sites. SO2 is a colorless, poisonous gas with a pungent odor, and inhaling it is strongly irritating. It is released into the Earth's atmosphere by volcanoes and by many industrial processes.

A smart prediction of odor compound concentration can improve industrial processes to avoid causing odor nuisance around industrial sites.

Contacts:

• Anne-Sophie Guilbert
• Yannick Deleuze
• email: fr.veri.challenge-ens.int.groups@veolia.com

Challenge goals

Can you predict the concentration of sulfur dioxide (SO2) at one location from a network of sensors?

Using measurement data from the ATMO Normandie sensor network, weather data, and land use data from Copernicus Corine Land Cover (CLC), the goal is to perform multivariate time series forecasting: predict the hourly SO2 concentration in μg / m³ for the next 12 hours at the Le Havre MAS station from the last 48 hours of data.

Data description

The dataset contains hourly average concentrations, measured by the fixed sensor network, of the main regulated air pollutants in the Normandy region, including sulfur dioxide (SO2). All concentrations are given in μg / m³ (micrograms per cubic meter). It also contains hourly weather values such as surface temperature, wind speed, wind direction, relative humidity, atmospheric pressure, dew point, and precipitation rate. Finally, it contains the land cover class, which indicates how readily a pollutant plume is dispersed given the way the land is occupied.

The data cover one year of history. They are split into a training dataset and a test dataset, and each dataset contains input and output variables. Each sample in the training and test sets corresponds to $48$ hourly observations, and each column corresponds to a sensor value at a given hour. One input $x_i$ is described as follows:

• ID : row ID
• weekday-$i$ : $1 <= i <= 48$ , weekday (Monday = 1, ..., Sunday = 7) at previous $i$ hour
• hour-$i$ : $1 <= i <= 48$ , hour at previous $i$ hour
• SO2_HRI-$i$ : $1 <= i <= 48$ , SO2 measurement at the HRI station in micrograms per cubic meter at previous $i$ hour
• SO2_HVH-$i$ : $1 <= i <= 48$ , SO2 measurement at the HVH station in micrograms per cubic meter at previous $i$ hour
• SO2_STA-$i$ : $1 <= i <= 48$ , SO2 measurement at the STA station in micrograms per cubic meter at previous $i$ hour
• SO2_CAU-$i$ : $1 <= i <= 48$ , SO2 measurement at the CAU station in micrograms per cubic meter at previous $i$ hour
• SO2_GOR-$i$ : $1 <= i <= 48$ , SO2 measurement at the GOR station in micrograms per cubic meter at previous $i$ hour
• SO2_HAR-$i$ : $1 <= i <= 48$ , SO2 measurement at the HAR station in micrograms per cubic meter at previous $i$ hour
• x_wgs84_HRI-$i$ : $1 <= i <= 48$ , X coordinate of the station HRI in the World Geodetic System (WGS) format at previous $i$ hour
• x_wgs84_HVH-$i$ : $1 <= i <= 48$ , X coordinate of the station HVH in the World Geodetic System (WGS) format at previous $i$ hour
• x_wgs84_MAS-$i$ : $1 <= i <= 48$ , X coordinate of the station MAS in the World Geodetic System (WGS) format at previous $i$ hour
• x_wgs84_STA-$i$ : $1 <= i <= 48$ , X coordinate of the station STA in the World Geodetic System (WGS) format at previous $i$ hour
• x_wgs84_CAU-$i$ : $1 <= i <= 48$ , X coordinate of the station CAU in the World Geodetic System (WGS) format at previous $i$ hour
• x_wgs84_GOR-$i$ : $1 <= i <= 48$ , X coordinate of the station GOR in the World Geodetic System (WGS) format at previous $i$ hour
• x_wgs84_HAR-$i$ : $1 <= i <= 48$ , X coordinate of the station HAR in the World Geodetic System (WGS) format at previous $i$ hour
• y_wgs84_HRI-$i$ : $1 <= i <= 48$ , Y coordinate of the station HRI in the World Geodetic System (WGS) format at previous $i$ hour
• y_wgs84_HVH-$i$ : $1 <= i <= 48$ , Y coordinate of the station HVH in the World Geodetic System (WGS) format at previous $i$ hour
• y_wgs84_MAS-$i$ : $1 <= i <= 48$ , Y coordinate of the station MAS in the World Geodetic System (WGS) format at previous $i$ hour
• y_wgs84_STA-$i$ : $1 <= i <= 48$ , Y coordinate of the station STA in the World Geodetic System (WGS) format at previous $i$ hour
• y_wgs84_CAU-$i$ : $1 <= i <= 48$ , Y coordinate of the station CAU in the World Geodetic System (WGS) format at previous $i$ hour
• y_wgs84_GOR-$i$ : $1 <= i <= 48$ , Y coordinate of the station GOR in the World Geodetic System (WGS) format at previous $i$ hour
• y_wgs84_HAR-$i$ : $1 <= i <= 48$ , Y coordinate of the station HAR in the World Geodetic System (WGS) format at previous $i$ hour
• surfaceTemperatureCelsius-$i$ : $1 <= i <= 48$ , Temperature in degrees Celsius at previous $i$ hour
• surfaceDewpointTemperatureCelsius-$i$ : $1 <= i <= 48$ , Dew point temperature in degrees Celsius at previous $i$ hour
• relativeHumidityPercent-$i$ : $1 <= i <= 48$ , relative humidity in % at previous $i$ hour
• surfaceAirPressureKilopascals-$i$ : $1 <= i <= 48$ , Pressure in Kilopascals at previous $i$ hour
• windSpeedKph-$i$ : $1 <= i <= 48$ , Windspeed in kilometers per hour at previous $i$ hour
• windDirectionDegrees-$i$ : $1 <= i <= 48$ , Wind direction in degrees at previous $i$ hour. 0° is a wind blowing from the north.
• cloudCoveragePercent-$i$ : $1 <= i <= 48$ , Cloud coverage in % at previous $i$ hour
• precipitationPreviousHourCentimeters-$i$ : $1 <= i <= 48$ , Precipitation in centimeters at previous $i$ hour
• directNormalIrradianceWsqm-$i$ : $1 <= i <= 48$ , Direct normal solar irradiance in watts per square meter at previous $i$ hour
• downwardSolarRadiationWsqm-$i$ : $1 <= i <= 48$ , Downward solar irradiance in watts per square meter at previous $i$ hour
• diffuseHorizontalRadiationWsqm-$i$ : $1 <= i <= 48$ , Diffuse horizontal irradiance (amount of radiation received) in watts per square meter at previous $i$ hour
• windChillTemperatureCelsius-$i$ : $1 <= i <= 48$ , Wind chill temperature in degrees Celsius at previous $i$ hour
• apparentTemperatureCelsius-$i$ : $1 <= i <= 48$ , Apparent temperature in degrees Celsius at previous $i$ hour
• snowfallCentimeters-$i$ : $1 <= i <= 48$ , Snow fall in centimeters at previous $i$ hour
• surfaceWindGustsKph-$i$ : $1 <= i <= 48$ , Surface wind gust in kilometers per hour at previous $i$ hour
• land_cover_class_HVC-$i$ : $1 <= i <= 48$ , Land cover class around station HVC at previous $i$ hour
• land_cover_class_HAR-$i$ : $1 <= i <= 48$ , Land cover class around station HAR at previous $i$ hour
• land_cover_class_CAU-$i$ : $1 <= i <= 48$ , Land cover class around station CAU at previous $i$ hour
• land_cover_class_MAS-$i$ : $1 <= i <= 48$ , Land cover class around station MAS at previous $i$ hour
• land_cover_class_GOR-$i$ : $1 <= i <= 48$ , Land cover class around station GOR at previous $i$ hour
• land_cover_class_HRI-$i$ : $1 <= i <= 48$ , Land cover class around station HRI at previous $i$ hour
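
Since every feature above follows the same `<feature>-<lag>` naming pattern, the 48 column names of a feature can be generated rather than typed by hand. The sketch below assumes that pattern holds for every column; the helper name `lagged_columns` and the oldest-to-newest ordering are illustrative choices, not part of the challenge material.

```python
N_LAGS = 48  # each feature is given for the previous 1..48 hours


def lagged_columns(feature, n_lags=N_LAGS):
    """Column names of one feature, from the oldest lag to the most recent."""
    return [f"{feature}-{lag}" for lag in range(n_lags, 0, -1)]


so2_hri = lagged_columns("SO2_HRI")
print(so2_hri[0], so2_hri[-1], len(so2_hri))
```

Such a list can be used, for example, to select the 48 columns of one sensor from the loaded table in chronological order.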

The output file contains the $12$ hour time series to be predicted hourly from the input. These correspond to the predictions of the SO2 measured over time at the target station. The output file is defined as follows. For each ID of the input dataset, the same ID of the output dataset contains the following quantities $y_i$ :

• ID : row ID
• SO2_MAS+$i$ : $0 <= i <= 11$ , SO2 measurement at the MAS station at $i$ hour ahead in micrograms per cubic meter

The input test dataset has the following form, with 48 columns for each feature time series:

ID, feature1-48, ..., feature1-1, feature2-48,..., feature2-1, .... featureN-48, ..., featureN-1
1,...
2,...
3,...
...
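
As a rough illustration of this wide layout, one row can be regrouped into one chronological series per feature. The sketch below uses only the Python standard library and a made-up toy file with 3 lags instead of 48; the function name `row_to_series` and the sample values are assumptions for illustration, not challenge data.

```python
import csv
import io
from collections import defaultdict


def row_to_series(row):
    """Group one wide row into {feature: [values from oldest to newest]}."""
    lagged = defaultdict(dict)
    for column, value in row.items():
        if column == "ID":
            continue
        # Column names are assumed to look like "<feature>-<lag>".
        feature, lag = column.rsplit("-", 1)
        lagged[feature][int(lag)] = float(value)
    # The largest lag is the oldest observation, so sort lags in descending order.
    return {
        feature: [by_lag[lag] for lag in sorted(by_lag, reverse=True)]
        for feature, by_lag in lagged.items()
    }


# Toy file with 3 lags instead of 48, invented for this example only.
toy_csv = io.StringIO(
    "ID, SO2_HRI-3, SO2_HRI-2, SO2_HRI-1, windSpeedKph-3, windSpeedKph-2, windSpeedKph-1\n"
    "1, 2.0, 2.5, 3.0, 10.0, 12.0, 11.0\n"
)
rows = list(csv.DictReader(toy_csv, skipinitialspace=True))
series = row_to_series(rows[0])
print(series["SO2_HRI"])
```

Stacking these per-feature lists gives a (48, number of features) array per sample, which is the shape sequence models usually expect.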


The output test data will have the same ID correspondence, with each row containing the $12$ hours to predict:

ID, SO2_MAS+0, SO2_MAS+1, ..., SO2_MAS+11
1,...
2,...
3,...
...
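
A prediction file in this layout could be written along the following lines. This is a minimal sketch: the helper `write_predictions`, the in-memory buffer and the constant forecast values are hypothetical, and it assumes that a plain comma-separated file without spaces after the commas is accepted.

```python
import csv
import io

HORIZON = 12  # number of hours to predict per row
OUTPUT_COLUMNS = ["ID"] + [f"SO2_MAS+{i}" for i in range(HORIZON)]


def write_predictions(handle, predictions):
    """Write {row ID: list of 12 forecast values} as CSV to an open text handle."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(OUTPUT_COLUMNS)
    for row_id, values in predictions.items():
        if len(values) != HORIZON:
            raise ValueError(f"row {row_id}: expected {HORIZON} values, got {len(values)}")
        writer.writerow([row_id, *values])


buffer = io.StringIO()
write_predictions(buffer, {1: [0.5] * HORIZON, 2: [1.0] * HORIZON})
print(buffer.getvalue())
```

To produce an actual file, the same function can be called with a handle opened via `open(path, "w", newline="")`.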


Evaluation metric

The metric used is the MSE (Mean Squared Error).
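
For reference, the MSE is the average of the squared differences between true and predicted values. The sketch below assumes the average is taken over all rows and all 12 forecast hours together; the challenge text does not spell out the aggregation, so treat that as an assumption.

```python
def mean_squared_error(y_true, y_pred):
    """MSE over all rows and all forecast hours, each argument a list of rows."""
    squared_errors = [
        (t - p) ** 2
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    ]
    return sum(squared_errors) / len(squared_errors)


# Toy values with 2 forecast hours per row instead of 12.
y_true = [[1.0, 2.0], [3.0, 4.0]]
y_pred = [[1.0, 3.0], [3.0, 2.0]]
print(mean_squared_error(y_true, y_pred))  # (0 + 1 + 0 + 4) / 4 = 1.25
```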

Benchmark

The benchmark was obtained using an LSTM network with dropout, built with Keras, trained with the 'mse' loss and the Adam optimizer with learning rate lr=1e-3, and the following hyperparameters:

epochs = 100
batch_size = 512
nb_LSTM_layers = 1
nb_units = 30
dropout_rate = 0.2
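
One way these settings might translate into Keras code is sketched below. The exact benchmark architecture is not given, so several points are assumptions: dropout is applied as a separate `Dropout` layer after the LSTM, the output head is a `Dense` layer with 12 units (one per forecast hour), and the input is shaped as (48, number of features). The names `BENCHMARK_PARAMS` and `build_benchmark_model` are illustrative.

```python
# Hyperparameters copied from the benchmark description.
BENCHMARK_PARAMS = {
    "epochs": 100,
    "batch_size": 512,
    "nb_LSTM_layers": 1,
    "nb_units": 30,
    "dropout_rate": 0.2,
    "learning_rate": 1e-3,
}


def build_benchmark_model(n_features, n_lags=48, horizon=12, params=BENCHMARK_PARAMS):
    """Build and compile an LSTM regressor (layer layout is an assumption)."""
    # Imported lazily so the configuration can be inspected without TensorFlow.
    from tensorflow import keras

    model = keras.Sequential()
    model.add(keras.Input(shape=(n_lags, n_features)))
    n_layers = params["nb_LSTM_layers"]
    for layer_index in range(n_layers):
        is_last = layer_index == n_layers - 1
        # Only intermediate LSTM layers need to return the full sequence.
        model.add(keras.layers.LSTM(params["nb_units"], return_sequences=not is_last))
        model.add(keras.layers.Dropout(params["dropout_rate"]))
    model.add(keras.layers.Dense(horizon))  # one output per forecast hour
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=params["learning_rate"]),
        loss="mse",
    )
    return model


# Typical use, with x_train of shape (n_samples, 48, n_features)
# and y_train of shape (n_samples, 12):
# model = build_benchmark_model(n_features=x_train.shape[2])
# model.fit(x_train, y_train,
#           epochs=BENCHMARK_PARAMS["epochs"],
#           batch_size=BENCHMARK_PARAMS["batch_size"])
```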


Files

Files are accessible once you are logged in and registered for the challenge.

The challenge provider

Veolia Research and Innovation