Challenge Data

Prediction of energy losses on a power grid
by Enedis


This challenge doesn't accept new participants



Description


Competitive challenge
Physics
Environment
Industrial
Regression
Time series
Tabular
10MB to 1GB
Basic level

Dates

Started on Jan. 10, 2024


Challenge context

Enedis is a public service company in France, responsible for managing the distribution network of electricity across a significant portion of the metropolitan territory, serving approximately 39 million customers. It is entrusted with the development, operation, and modernization of an extensive electrical network, comprising 1.4 million kilometers of low and medium-voltage lines (230 and 20,000 volts), as well as the management of associated data.


Challenge goals

Forecasting Energy Losses on the Enedis Network.

Our problem is a time series forecasting challenge without history. It involves predicting loss series at Enedis network nodes (source substations) based on the characteristics of these nodes.

In this context, we perform electrical balance sheets for each of the 2,200 source substations on the Enedis network to gather loss data for each of these source substations and associated explanatory variables.

The proposed exercise involves forecasting monthly loss volumes per source substation based on their physical characteristics, consumption and production data, and temperature data. Analyzing the forecast models we construct will subsequently help us determine which factors to prioritize in reducing losses. Additionally, the forecasts will enable us to detect source substations with higher losses than expected, which will be locally examined by the business to understand the root causes and, if possible, address them.

One of the challenges in this exercise is that the data volume is relatively low. The network consists of 2,200 source substations, and we have monthly loss volumes for a 2-year historical period, totaling approximately 55,000 records (for the training and testing dataset). The training dataset will contain approximately 70% of the source substations. Furthermore, the data can be relatively noisy for some source substations since losses are not directly measured but are the result of complex electrical balance calculations, which themselves involve certain uncertainties.

More specifically, we obtain losses as the difference between what is injected into the network and what is consumed by the different types of clients supplied by Enedis. A portion of local consumption and production, especially at low voltage, is not measured but estimated, which introduces uncertainty.

Losses formula: Losses = Energy injected by RTE + Energy injected by ELDs - Energy rejected to RTE + Local production - Medium-voltage consumption - Low-voltage consumption - Energy withdrawn by ELDs.
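The balance above can be written as a small function. This is a minimal sketch: the function and parameter names are hypothetical (they are not column names from the dataset), the figures are made up, and all terms are assumed to be expressed in the same energy unit.

```python
def compute_losses(injected_by_rte, injected_by_elds, rejected_to_rte,
                   local_production, mv_consumption, lv_consumption,
                   withdrawn_by_elds):
    """Energy balance for one source substation over one period.

    All arguments are assumed to share one energy unit (for example kWh).
    """
    # Everything that enters the distribution network at this node.
    energy_in = injected_by_rte + injected_by_elds + local_production
    # Everything that leaves it through a measured or estimated outlet.
    energy_out = (rejected_to_rte + mv_consumption
                  + lv_consumption + withdrawn_by_elds)
    return energy_in - energy_out


# Made-up figures, for illustration only.
example_losses = compute_losses(
    injected_by_rte=10_000_000,
    injected_by_elds=200_000,
    rejected_to_rte=500_000,
    local_production=1_300_000,
    mv_consumption=4_000_000,
    lv_consumption=6_200_000,
    withdrawn_by_elds=150_000,
)
```

Because losses are obtained as a difference between large quantities, a small relative error in an estimated term (such as low-voltage consumption) can translate into a large relative error on the losses.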


Data description

The files x_train and x_test have the same structure and contain 36,312 and 15,552 rows respectively, with 30 columns that provide useful information to better understand losses. Here is a description of each column:

  1. Id_poste_source: Identifier of the source substation (anonymized) that distinguishes different data sources.

  2. Id_dr: Identifier of the Regional Direction (DR) associated with the data.

  3. Mois: Month in the format mm/yyyy, indicating the period to which the data pertains.

  4. Conso_totale: Total consumption per source substation and per month, expressed in kilowatt-hours (kWh).

  5. Prod_totale: Total production in kilowatt-hours (kWh) for the given period.

  6. Prop_conso_jour: The proportion of total consumption during daylight hours (between 8 AM and 8 PM), expressed as a percentage (%).

  7. Prop_prod_jour: The proportion of total production during daylight hours, expressed as a percentage (%).

  8. Conso_reseau_HTA: Total consumption on the High-Voltage Network (HTA) for the month.

  9. Conso_reseau_BT: Total consumption on the Low-Voltage Network (BT) for the given period.

  10. Conso_clients_RES: Consumption of residential customers for the month.

  11. Conso_clients_PRO: Consumption of professional customers for the given period.

  12. Conso_clients_ENT: Consumption of business customers for the month.

  13. Prod_reseau_BT: Production on the Low-Voltage Network (BT) for the given period.

  14. Prod_reseau_HTA: Production on the High-Voltage Network (HTA) for the month.

  15. Prod_filiere_eolien: Wind energy production for the given period.

  16. Prod_filiere_PV: Photovoltaic energy production for the month.

  17. Prod_filiere_autre: Energy production from other sources such as hydropower, cogeneration, etc.

  18. Prop_clts_logement_indiv: The proportion of clients residing in individual housing.

  19. Prop_clts_logement_collectif: The proportion of clients residing in collective housing.

  20. Taux_linky: The saturation (deployment) rate of Linky smart meters for the given period.

  21. Temperature: The average temperature of the nearest weather station, expressed in °C.

  22. Type_territoire: The type of area where the substation is located. We distinguish: "rural", "semi-urbain", "urbain" and "zone d'activités".

  23. Longueur_reseau_aerien_bt: Total length of overhead lines in the Low-Voltage network, expressed in meters (m).

  24. Longueur_reseau_souterrain_bt: Total length of underground lines in the Low-Voltage network, expressed in meters (m).

  25. Nb_postes_htabt: Number of High-voltage/Low-voltage transformer substations.

  26. Puissance_transfos: Total capacity of the High-voltage/Low-voltage transformers, expressed in kilovolt-amperes (kVA).

  27. Prop_hta_type_1: The proportion of "type 1" High-voltage network, expressed in %.

  28. Prop_hta_type_2: The proportion of "type 2" High-voltage network, expressed in %.

  29. Prop_hta_type_3: The proportion of "type 3" High-voltage network, expressed in %.

  30. Prop_hta_type_4: The proportion of "type 4" High-voltage network, expressed in %.

The file y_train contains the total losses for each unique identifier, i.e. the volume of total losses in kilowatt-hours (kWh) for the month.
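Since the training set covers only part of the source substations, a validation split that holds out whole substations is likely closer to the test conditions than a random row split. The sketch below illustrates this with the `Id_poste_source` and `Mois` columns described above. The helper names, the sample identifiers and the 30% hold-out fraction are assumptions made for illustration.

```python
import random
from datetime import datetime


def parse_mois(value):
    """Turn a 'mm/yyyy' string into a (year, month) tuple."""
    parsed = datetime.strptime(value, "%m/%Y")
    return parsed.year, parsed.month


def split_by_substation(rows, valid_fraction=0.3, seed=0):
    """Hold out whole substations so validation mimics unseen nodes."""
    ids = sorted({row["Id_poste_source"] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_valid = max(1, round(len(ids) * valid_fraction))
    valid_ids = set(ids[:n_valid])
    train = [r for r in rows if r["Id_poste_source"] not in valid_ids]
    valid = [r for r in rows if r["Id_poste_source"] in valid_ids]
    return train, valid


# Hypothetical rows: four substations, three months each.
rows = [
    {"Id_poste_source": ps, "Mois": f"{month:02d}/2022"}
    for ps in ("PS_A", "PS_B", "PS_C", "PS_D")
    for month in (1, 2, 3)
]
train_rows, valid_rows = split_by_substation(rows)
```

With this split, every month of a given substation lands on the same side, so the validation score is not inflated by rows from a substation the model has already seen.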


Benchmark description

The benchmark for this challenge is a classical regression with a Random Forest. Missing values were replaced by the average value of their column before fitting the model. The metric used to compute the score is the Mean Absolute Error (MAE).

The Python implementation of this model is provided in the supplementary files.
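The provider's implementation is not reproduced here. As an independent sketch of the same recipe (mean imputation, Random Forest regression, MAE scoring), the example below uses scikit-learn on synthetic numeric data. The data, the hyperparameters and the train/validation cut are assumptions, and categorical columns such as Type_territoire would need to be encoded before being passed to such a pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Synthetic stand-in for numeric feature columns and a loss target.
X = rng.normal(size=(300, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

# Inject about 5% missing values to exercise the imputation step.
X[rng.random(X.shape) < 0.05] = np.nan

X_train, X_valid = X[:200], X[200:]
y_train, y_valid = y[:200], y[200:]

# Mean imputation followed by a Random Forest, as in the benchmark description.
model = make_pipeline(
    SimpleImputer(strategy="mean"),
    RandomForestRegressor(n_estimators=100, random_state=0),
)
model.fit(X_train, y_train)

predictions = model.predict(X_valid)
mae = mean_absolute_error(y_valid, predictions)
print(f"Validation MAE: {mae:.3f}")
```

Wrapping the imputer and the model in one pipeline means the column means are learned on the training rows only and then reused on validation or test rows.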


Files


Files are accessible when logged in and registered to the challenge


The challenge provider



Data Scientist