Research on satellite-based surface solar irradiance forecasting
For this challenge, the submitted files can be large (more than 100 MB). Processing your submission might take a few minutes: be patient!
Started on Dec. 10, 2021
In the domains of solar energy and energy meteorology, there is a need for accurate intraday solar forecasting, hereinafter called short-term forecasting. Short-term forecasts allow better integration of photovoltaic (PV) systems by anticipating the variability of solar radiation in space and time. This is particularly important in electric systems with a high penetration of solar energy, where dispatching generation units to match electricity production and consumption at all times is especially challenging. This need holds not only for the integration of PV into large-scale electric grids but also for the special case of off-grid electricity supply systems.
Geostationary satellites, notably thanks to algorithms such as Heliosat, are a source of spatially and temporally resolved estimates of surface solar irradiance (SSI), the "fuel" of PV systems (unit: $W/m^2$). Within the framework of the Copernicus Atmosphere Monitoring Service (CAMS), the multispectral images acquired by Meteosat Second Generation (MSG) at longitude 0° are used to provide, in near real time and every 15 min, images of SSI and of SSI under clear-sky conditions at 3 km resolution. These services, CAMS Rad and CAMS McClear respectively, are operated and maintained by Transvalor Innovation SoDA (www.soda-pro.com), in collaboration with DLR, the German Aerospace Center. This source of SSI image time series is notably used for short-term (up to 2 hours) solar forecasting. State-of-the-art satellite-based short-term solar forecasting relies on cloud motion vectors (CMV) computed with optical flow or block-matching techniques.
The aim of this challenge is to propose machine learning and deep learning approaches that operate on sequences of images to provide better short-term forecasts of future SSI images on a horizontal plane, denoted GHI (Global Horizontal Irradiance), for time horizons ranging from 15 minutes to 1 hour, with a time resolution of 15 min and a spatial resolution of 3 km.
More precisely, we are interested in a square region of interest (RoI) of size 51 pixels x 51 pixels (approx. 150 km). Assuming a maximum cloud speed of $10+ m/s$ and considering solar forecasting up to 1 hour ahead, the observation region (OR) encompassing the RoI has a size of 81 pixels x 81 pixels (approx. 240 km).
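The relation between the two region sizes and the cloud-speed assumption can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes that the 3 km pixel size applies uniformly and that the OR margin is meant to cover the distance a cloud can travel during the longest forecast horizon. The variable names are ours, not part of the challenge material.

```python
# Geometry stated in the text: 3 km pixels, 51-pixel RoI, 81-pixel OR.
PIXEL_KM = 3.0
ROI_PIXELS = 51
OR_PIXELS = 81
HORIZON_S = 3600  # longest forecast horizon: 1 hour, in seconds

# Margin of the OR around the RoI on each side.
margin_pixels = (OR_PIXELS - ROI_PIXELS) // 2
margin_km = margin_pixels * PIXEL_KM

# Fastest cloud that, starting at the OR edge, just reaches the RoI within one hour.
max_speed_ms = margin_km * 1000.0 / HORIZON_S

print(margin_pixels, margin_km, max_speed_ms)  # 15 45.0 12.5
```

Under these assumptions the 15-pixel margin corresponds to 45 km, that is, clouds moving at up to 12.5 m/s, which is in line with the stated $10+ m/s$.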
At a given time $t$ , one hour after the sunrise and one hour before the sunset, considering the sequence of the 4 previous $\text{GHI}$ images on the OR every 15 min, the solar forecasting aims at predicting the $\text{GHI}$ images on the RoI for the next times ahead, ranging from the next 15 min up to the next hour with a time step of 15 min. This forecast of $\text{GHI}$ for the location $p$ , done at the time $t$ for the future time $t + \Delta t$ is noted: $\widehat{\text{GHI}}(p,\ t + \Delta t|t)$
The learning phase is done on one year of data and the test phase is done on a separate year.
In this challenge, we will only consider the cloud effects on $\text{GHI}$ , assuming that the concomitant and collocated $\text{GHI}$ under clear-sky condition (with no cloud) is perfectly known and noted $\text{GHI}_{cls}$ .
Contextual information of interest includes the corresponding solar zenith angle (SZA) $\theta_{S}$ and solar azimuth angle (SAA) $\alpha_{S}$.
Do not hesitate to refer to the full Copernicus documentation in the supplementary files, as several technical aspects of the challenge are further explained and detailed.
The training set contains 1845 samples, and the test set contains 1841 samples. Each sample represents a time $t$ at which we consider the previous images and wish to predict the next images.
Practically, the input $X$ is encoded in the numpy .npz format and consists of:
datetime: the time $t$ at which we consider the 4 previous $\text{GHI}$ images on the OR every 15 min. This vector of length $n_{samples}$ is of datetime type (YYYYMMDDHHMM).
GHI: a matrix of size ($n_{samples}$ , 81, 81, 4) with the sequence of the 4 previous 15-min $\text{GHI}$ images (of size (81,81)) on the OR for the times $t$ -45min, $t$ -30min, $t$ -15min, and $t$ .
CLS: a matrix of size ($n_{samples}$ , 81, 81, 8) with the sequence of the 4 previous and 4 next modelled 15-min clear-sky (i.e. with no clouds) $\text{GHI}$ images (of size (81,81)) on the OR, noted $\text{GHI}_{cls}$ , for the times $t$ -45min, $t$ -30min, $t$ -15min, $t$ , $t$ +15min, $t$ +30min, $t$ +45min, $t$ +60min.
SZA: a matrix of size ($n_{samples}$ , 81, 81, 8) with the sequence of the 4 previous and 4 next modelled 15-min SZA (of size (81,81)) on the OR for the times $t$ -45min, $t$ -30min, $t$ -15min, $t$ , $t$ +15min, $t$ +30min, $t$ +45min, $t$ +60min.
SAA: a matrix of size ($n_{samples}$ , 81, 81, 8) with the sequence of the 4 previous and 4 next modelled 15-min SAA (of size (81,81)) on the OR for the times $t$ -45min, $t$ -30min, $t$ -15min, $t$ , $t$ +15min, $t$ +30min, $t$ +45min, $t$ +60min.
Note that $n_{samples}$ = 1845 for the training set and $n_{samples}$ = 1841 for the testing set.
To load and read the contents of a .npz file, one can use the following:
# Import NumPy
import numpy as np
# Load the .npz file
X = np.load('filename.npz', allow_pickle=True)
# Display the names of the arrays stored in the .npz file
print(X.files)
# Access the contents of the .npz file
date = X['datetime']
GHI = X['GHI']
CLS = X['CLS']
SZA = X['SZA']
SAA = X['SAA']
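Before training, it can help to verify that a loaded file matches the array layout described above. The following is a minimal sketch, not part of the challenge material: `check_input_shapes` is a hypothetical helper, the file is a small synthetic stand-in written to a temporary directory, and storing `datetime` as YYYYMMDDHHMM strings is an assumption about the encoding.

```python
import os
import tempfile

import numpy as np

# Expected trailing dimensions per key, as listed in the input description.
EXPECTED_TRAILING_SHAPES = {
    "GHI": (81, 81, 4),
    "CLS": (81, 81, 8),
    "SZA": (81, 81, 8),
    "SAA": (81, 81, 8),
}


def check_input_shapes(X):
    """Return n_samples after checking every array shape (hypothetical helper)."""
    n_samples = len(X["datetime"])
    for key, trailing in EXPECTED_TRAILING_SHAPES.items():
        expected = (n_samples,) + trailing
        if X[key].shape != expected:
            raise ValueError(f"{key}: expected {expected}, got {X[key].shape}")
    return n_samples


# Synthetic stand-in for a challenge file: 3 samples filled with zeros.
n = 3
tmp_path = os.path.join(tempfile.mkdtemp(), "synthetic_X.npz")
np.savez(
    tmp_path,
    # Assumption: timestamps encoded as YYYYMMDDHHMM strings.
    datetime=np.array(["202101011200", "202101011215", "202101011230"]),
    GHI=np.zeros((n, 81, 81, 4), dtype=np.float32),
    CLS=np.zeros((n, 81, 81, 8), dtype=np.float32),
    SZA=np.zeros((n, 81, 81, 8), dtype=np.float32),
    SAA=np.zeros((n, 81, 81, 8), dtype=np.float32),
)

with np.load(tmp_path) as X:
    n_checked = check_input_shapes(X)
```

On the real files, the same check would be expected to return 1845 for the training set and 1841 for the test set.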
The output vector $y$ represents the sequence of the 4 next 15-min $\text{GHI}$ images on the RoI, corresponding to a matrix of size ($n_{samples}$ , 51,51,4), for the 4 future times $t$ +15min, $t$ +30min, $t$ +45min and $t$ +60min.
In this challenge, we provide the raw 2D format of the output vector $y$ , which is a dataframe of size ($n_{samples}$ , 51x51x4+1) = ($n_{samples}$ , 10405), where the first column of the dataframe is id_sequence (the ids of the considered time sequence $t$ ).
In order to transform the raw 2D output to a 4D matrix format (which will be useful for displaying the various images of the output vector $y$ ), it is necessary to:
First, remove the id_sequence column.
Second, use the following transformation:
y_4D = np.transpose(np.reshape(np.array(y_raw),(-1,4,51,51)), (0, 1, 3, 2))
In order to transform the 4D output to a 2D raw format (which is compulsory when submitting your model predictions), it is necessary to:
First, use the following transformation:
y_2D = np.transpose(y_4D, (0,1,3,2)).reshape(-1, 10404)
Second, transform the array to a dataframe.
Third, add an index column id_sequence.
These transformations are already implemented in transform_output_format.py (cf. supplementary files).
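The two conversions described above can be checked with a round trip on synthetic data. This is a sketch under stated assumptions, not the content of transform_output_format.py: the helper names `raw_to_4d` and `to_raw` are hypothetical, and the column labels of the synthetic dataframe (integers plus `id_sequence`) are an assumption about the raw file.

```python
import numpy as np
import pandas as pd


def raw_to_4d(y_raw_df):
    """Raw 2D dataframe -> 4D array (hypothetical helper)."""
    values = y_raw_df.drop(columns="id_sequence").to_numpy()
    return np.transpose(values.reshape(-1, 4, 51, 51), (0, 1, 3, 2))


def to_raw(y_4d, ids):
    """4D array -> raw 2D dataframe with an id_sequence column (hypothetical helper)."""
    flat = np.transpose(y_4d, (0, 1, 3, 2)).reshape(-1, 10404)
    df = pd.DataFrame(flat)
    df.insert(0, "id_sequence", ids)
    return df


# Synthetic raw output: 2 samples of random values plus the id column.
rng = np.random.default_rng(0)
n = 2
flat = rng.random((n, 10404))
y_raw = pd.DataFrame(flat)
y_raw.insert(0, "id_sequence", np.arange(n))

y_4d = raw_to_4d(y_raw)
y_back = to_raw(y_4d, y_raw["id_sequence"].to_numpy())

print(y_4d.shape, y_back.shape)  # (2, 4, 51, 51) (2, 10405)
```

Note that with the stated transformation the four forecast horizons end up on the second axis of the 4D array, giving a shape of ($n_{samples}$ , 4, 51, 51), and that applying the inverse conversion recovers the original values exactly.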
The OR and the RoI are concentric. With Python-style indexing:
RoI = OR[15:66,15:66]
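Applied to the 4D input arrays, the same slice selects the RoI for every sample and every time step. The sketch below uses synthetic data, and `crop_to_roi` is a hypothetical helper name.

```python
import numpy as np

ROI_START, ROI_STOP = 15, 66  # from RoI = OR[15:66, 15:66]


def crop_to_roi(or_images):
    """Crop (n_samples, 81, 81, n_times) arrays to the RoI (hypothetical helper)."""
    return or_images[:, ROI_START:ROI_STOP, ROI_START:ROI_STOP, :]


# Synthetic OR images with only the central pixel (index 40) set to 1.
ghi_or = np.zeros((2, 81, 81, 4))
ghi_or[:, 40, 40, :] = 1.0

ghi_roi = crop_to_roi(ghi_or)
print(ghi_roi.shape)  # (2, 51, 51, 4)
```

Because the regions are concentric, the OR centre pixel (index 40) becomes the RoI centre pixel (index 25) after cropping.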
Two simple forecasting methods are provided as benchmarks:
The persistence forecasting $P$ :
$\widehat{\text{GHI}}_{P}(p,\ t + \Delta t|t) = \text{GHI}(p,t)\left( \frac{\text{GHI}_{cls}(p,t + \Delta t)}{\text{GHI}_{cls}(p,t)} \right)$
This method of forecasting is used as a baseline, to compute the skill-score (SC).
The $CMV$ forecasting, which is based on state-of-the-art optical flow and CMV persistence.
We provide the persistence forecasting $P$ benchmark for the test set; the $CMV$ forecasting benchmark, also for the test set, will be added as supplementary data.
Candidates are free to choose either of these forecasting methods to benchmark their model.
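The persistence forecast $P$ defined above can be sketched directly on the input arrays. This is an illustrative implementation under assumptions, not the organisers' benchmark code: `persistence_forecast` is a hypothetical name, the time index 3 is taken to be $t$ in both GHI and CLS (as in the input description), the forecast is cropped to the RoI with OR[15:66,15:66], and pixels where the clear-sky GHI at $t$ is close to zero are set to a ratio of 0 (our choice, to avoid division by zero).

```python
import numpy as np


def persistence_forecast(ghi, cls, eps=1e-6):
    """Persistence forecast on the RoI (hypothetical sketch).

    ghi: (n_samples, 81, 81, 4) past GHI images, last index = time t.
    cls: (n_samples, 81, 81, 8) clear-sky GHI, index 3 = time t, 4..7 = future.
    Returns an array of shape (n_samples, 51, 51, 4).
    """
    roi = slice(15, 66)
    ghi_t = ghi[:, roi, roi, 3]        # GHI(p, t)
    cls_t = cls[:, roi, roi, 3]        # GHI_cls(p, t)
    cls_future = cls[:, roi, roi, 4:]  # GHI_cls(p, t + dt), dt = 15..60 min

    # Ratio GHI / GHI_cls at time t; set to 0 where clear-sky GHI is ~0 (assumption).
    ratio = np.divide(
        ghi_t, cls_t,
        out=np.zeros_like(ghi_t, dtype=float),
        where=cls_t > eps,
    )
    return ratio[..., np.newaxis] * cls_future


# Synthetic example: GHI is half of the clear-sky value at time t.
ghi = np.full((1, 81, 81, 4), 400.0)
cls = np.empty((1, 81, 81, 8))
cls[..., :4] = 800.0
cls[..., 4:] = [800.0, 700.0, 600.0, 500.0]

forecast = persistence_forecast(ghi, cls)
print(forecast.shape)     # (1, 51, 51, 4)
print(forecast[0, 0, 0])  # [400. 350. 300. 250.]
```

The result keeps the ratio between GHI and clear-sky GHI observed at $t$ and rescales it by the known future clear-sky values, which is what the formula expresses. The output follows the ($n_{samples}$ , 51, 51, 4) layout used to describe $y$ ; conversion to the submission format is not covered here.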