Master of Science Thesis
KTH School of Industrial Engineering and Management Energy Technology TRITA-ITM-EX 2018:608
Division of Heat and Power Technology SE-100 44 STOCKHOLM
Portfolio balancing strategy for the integration of renewable energy sources to the day ahead market
Supervisor Rafael Guedez
Commissioner Contact person
New government support schemes and new ways of marketing renewable production are pushing renewable energy sources (RES) towards greater integration into the wholesale day-ahead market. To manage the resulting balancing costs, predictive production models for solar and wind power have been developed. They forecast the production of a plant for the following day at hourly intervals, based on historical operational and meteorological data. They rely on three machine learning algorithms: Artificial Neural Networks (ANN), Support Vector Regression (SVR) and Random Forest (RF).
These models are evaluated on 20 solar farms and 2 wind farms using three criteria: the RMSE, the RMSEN and the R-square. They perform significantly better than the 'persistence method' or other naive methods. In most cases, the best results were obtained with the random forest algorithm, with an average RMSEN of 15% and an average R-square of 0.8. Using these models and assuming ideal operational conditions, the balancing costs are evaluated for each solar farm, which gives the lowest costs obtainable with these models. The average cost ranges from 1 to 1.4 € per MWh produced, depending on the power plant considered.
However, thanks to the 'portfolio benefit effect', combining the forecasting errors of multiple sites can greatly reduce this cost. Portfolio combination strategies can be developed by increasing the installed capacity and the number of sites within the portfolio and/or by diversifying the locations or the types of RES used. The savings reach up to 45% of the initial simple balancing costs.
New methods of government support and marketing of renewable production are pushing renewable energy sources (RES) to be more integrated into the wholesale day-ahead market. Predictive production models for solar and wind power have therefore been developed to manage the resulting balancing costs.
They aim to forecast the production of a plant for the following day at hourly intervals, based on historical operational and meteorological data. They are backed by three machine learning algorithms: Artificial Neural Networks (ANN), Support Vector Regression (SVR) and Random Forest (RF).
These models are evaluated on 20 solar farms and 2 wind farms through three criteria: the RMSE, the RMSEN and the R-square. They give significantly improved performance compared to the "persistence method" or other naive methods. In most cases, the best results were obtained with the Random Forest algorithm, with an average RMSEN of 15% and an average R-square of 0.8. Considering these models and ideal operational conditions, the balancing costs are evaluated for each solar farm, showing the lowest costs obtainable with these models. The average cost ranges from 1 to 1.4 € per MWh produced, depending on the power plant considered.
However, thanks to the "portfolio benefit effect", combining the forecasting errors of multiple sites can greatly reduce this cost. Portfolio combination strategies can be developed by increasing the installed capacity and the number of sites within the portfolio and/or by diversifying the locations or the types of RES used. The savings reach up to 45% of the initial simple balancing costs.
Abstract
1 Introduction and objectives
1.1 Introduction
1.2 Master thesis context
1.3 Problems definition
1.4 Thesis objectives
2 Integration of the RES in the French energy market
2.1 RES future development in France
2.2 French power market organization
2.3 Financial incentives for renewables in France
3 Physical approach of PV and wind power production
3.1 Photovoltaic power production
3.1.1 Photovoltaic power plant performance
3.1.2 Sun position
3.1.3 Predictors selected
3.2 Wind power production
3.2.1 Wind power conversion
3.2.2 Wind farm design
3.2.3 Predictors selected
4 Methodological approach
4.1 Data collection
4.2 Data exploration and preparation
4.3 Machine learning algorithms
4.3.1 Generalities
4.3.2 Support Vector Regression (SVR)
4.3.3 Artificial Neural Network (ANN)
4.3.4 Random Forest (RF)
4.4 Development of the predictive models
4.5 Performance criteria
5 Solar energy production prediction
5.1 Assumptions and boundaries
5.2 Available data description
5.3 Models evaluation
5.4 A deeper description of the results obtained with the best model
5.5 Discussion
6 Wind energy production prediction
6.1 Assumptions and boundaries
6.2 Available data description
6.3 Models evaluation
6.3.1 Simple algorithm models
6.3.2 Complex models
6.4 A deeper description of the best model
6.5 Discussion
7 Balancing costs estimate and balancing portfolio strategy
7.1 Methodology for balancing costs estimate
7.2 Balancing costs estimation
7.3 Portfolio balancing strategy
7.3.1 Portfolio benefit effect
7.3.2 Geographic diversification
7.3.3 Renewable energy sources diversification
8 Conclusion
Index of tables
1 Introduction and objectives
1.1 Introduction
In 2015, pushed by the European Commission, France passed the law on the energy transition and green growth. It introduced a legal framework for direct marketing and for the feed-in premium scheme applied to power production from renewable energy sources (RES). It has been a major turning point in the organisation of the French power market since its liberalization in the early 2000s, making renewable energy production sensitive to the price signal given by the wholesale market. Whereas electricity from RES was previously sold directly to EDF, the historical state-owned electric utility, it must now be sold and valued on the wholesale market through market brokers and traders.
Since then, competition between balance responsible entities has been open, and most of them see this new mechanism as a real opportunity to market renewable energy. The sector grows faster every year, and the French objectives for RES are ambitious: their production should represent 40% of the total electricity production in 2030, against 18.4% in 2017. Given these observations, decision makers are paying increasing attention to how best to integrate these new sources of energy.
However, electricity from RES differs from other types of energy when it comes to integrating it into an energy portfolio. Production from wind and solar power cannot be dispatched, is intermittent, and depends mainly on weather parameters that are still difficult to predict today.
Nevertheless, to offer power plant operators the best price for the electricity they produce, the balance responsible entities must limit the losses due to the imbalance penalties paid to the grid operator whenever the scheduled production differs from the actual production.
Thus, this master thesis studies the various impacts of RES on an energy portfolio and tries to identify the parameters that lead balance responsible entities to an optimal strategy for integrating RES into the wholesale market.
1.2 Master thesis context
This master thesis was carried out in collaboration with Solvay Energy Services (SES), and in particular its asset optimization team. SES is a global business unit of the Solvay group, one of the leaders in the global chemical industry. Its role is to supply the group and its industrial clients with energy products such as gas, power, coal or CO2 allowances. The asset optimization team optimises and promotes the operation of power generation assets owned by Solvay or by a third party. SES is a balance responsible entity on the French electricity market and has around 700 MW of installed power capacity in its balance portfolio (mainly cogeneration turbines and run-of-river hydro power plants).
Solvay Energy Services has recently signed contracts with renewable power plant operators to value their electricity on the wholesale market, either through the new feed-in premium mechanism or through Power Purchase Agreements (PPAs), which are contracts to buy electricity from a third party at a defined price for a defined period. Whereas the feed-in premium concerns new installations supported by public financial incentives, PPAs concern any type of power plant that does not receive public grants.
As a result, SES is now in charge of selling electricity from RES on the wholesale market and plays the role of aggregator by combining the production of several renewable power plants. Consequently, SES must develop strategies for selling electricity on both the day-ahead and the intraday markets.
1.3 Problems definition
As mentioned previously, the main task of a balance responsible entity that integrates RES is to manage the balancing costs resulting from the uncontrollable and intermittent production of wind and solar power. To do so, it should obtain the best possible production forecast of its assets for the next day and provide the most precise production schedule to the grid operator. In addition, to price the energy produced by a plant, it should estimate the cost due to the imbalances.
The questions that the thesis tries to answer are therefore: What are the best ways to predict the production of a wind farm and of a solar plant for the next day? What performance can be reached? Given these forecasting models, what are the balancing costs? How can a balance responsible entity reduce them?
1.4 Thesis objectives
Considering the problems raised in the previous section, the objectives of this master thesis are:
• To develop predictive models for wind and solar power production, for valuing this energy on the day-ahead market, based on historical data.
• To evaluate the performance of these models.
• To evaluate the balancing costs resulting from the use of these models.
• To define parameters of a strategy that best integrates RES into the wholesale market.
2 Integration of the RES in the French energy market
2.1 RES future development in France
Since the beginning of the 21st century, the share of electricity generated from renewable energy sources has boomed, led mainly by solar and wind power. These two types of generation are the most widespread in France after hydropower, as shown on Figure 1, which represents the shares of the four main renewable energy sources according to the TSO data (Solaire = Solar, Bioénergies = Bioenergy, Eolien = Wind power, Hydraulique = Hydro).
Figure 1: Share of the installed capacity of the four main renewable energy technologies in France.
The share of RES in electricity production will continue to grow, driven by the objectives set by the government: 27% in 2020 and 40% in 2030, against only 18.4% in 2017.
This rush to develop RES can be explained by a sharp fall in the levelized cost of energy (LCOE) of these technologies in all countries, particularly for PV cells, as shown on Figure 2. According to IRENA, this fall is expected to continue.
Figure 2: Global levelized cost of electricity from utility-scale renewable power generation technologies, 2010-2017
However, solar and wind power are must-run and intermittent technologies. They are not controllable, and it is still difficult to forecast their production precisely. These two drawbacks challenge grid operators all over the world, who are responsible for the physical balance of the grid. Indeed, most electrical grids were built to operate efficiently with electricity produced by controlled and scheduled generation technologies such as thermal or nuclear power plants.
2.2 French power market organization
The French power market was liberalized in the early 2000s, breaking the historical monopoly of EDF, the former state-owned electric utility company. This change brought a new organization and new ways of selling and purchasing power. Today, the French short-term power market is organized around the EPEX Spot exchange, which is the exchange for spot power trading in the biggest West European countries, such as Germany, France, Great Britain and the Netherlands.
The members of EPEX Spot are the entities that are balance responsible towards the transmission system operator (TSO), which is RTE in France. They are operators, such as Solvay Energy Services, that manage the balance of their portfolio between electricity withdrawal and electricity injection. They are also committed to paying the balancing costs to the TSO. Those costs represent the real cost for the TSO of maintaining the physical balance of the power grid (for instance, through reserve calls or flexibility mechanisms).
EPEX Spot manages two types of markets. The first is the over-the-counter (OTC) market, in which a counterparty trades directly with another one, buying or selling a certain amount of energy at a contracted price. The second is the spot exchange market, which is the main trading market. In France in 2016, the OTC market represented 19% of the exchanged volumes, whereas the spot market represented 81%. The spot market is divided in two parts: the day-ahead market, which accounted for 88% of the volumes exchanged at EPEX Spot in 2016, and the intraday market, which accounted for only 12%.
The day-ahead market is organized as an auction ending every day at 12:00 (noon) for the entire following day. During the trading period, the members bid, for each half hour, how much they want to buy or sell and at which price. At the end of the trading period, a price is determined for each half hour at the crossing point between the demand and supply curves, which usually corresponds to the highest marginal cost among the production units used.
Figure 3: Price setting from the offer and the demand curves 
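As a rough illustration of the price-setting principle described above, the following Python sketch clears a single delivery period by stacking supply offers in merit order against a fixed demand. The offers, the volumes and the price-inelastic demand are hypothetical assumptions made for this example; the actual EPEX Spot auction matches complete supply and demand curves and is not reproduced here.

```python
def clearing_price(offers, demand_mwh):
    """Return (price, accepted) for one delivery period.

    offers: list of (marginal_cost_eur_per_mwh, volume_mwh) tuples.
    demand_mwh: demand to cover, assumed price-inelastic (simplification).
    """
    if demand_mwh <= 0:
        raise ValueError("demand must be positive")
    accepted = []
    remaining = demand_mwh
    # Merit order: the cheapest units are called first.
    for cost, volume in sorted(offers):
        if remaining <= 0:
            break
        used = min(volume, remaining)
        accepted.append((cost, used))
        remaining -= used
    if remaining > 0:
        raise ValueError("offers do not cover the demand")
    # The price is set by the most expensive unit that is actually needed.
    price = accepted[-1][0]
    return price, accepted


# Hypothetical offers: (marginal cost in EUR/MWh, volume in MWh).
offers = [(45.0, 300.0), (0.0, 200.0), (30.0, 500.0), (80.0, 400.0)]
price, accepted = clearing_price(offers, demand_mwh=900.0)
print(price, accepted)
```

In this toy case the zero-marginal-cost offer, which could stand for a renewable plant, is always accepted first, and every accepted unit is paid the marginal cost of the last unit called.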
The intraday market is organized as continuous exchanges between counterparties. It opens every day at 15:00 on the day before delivery and runs until a few minutes before the delivery. An order is executed as soon as a matching order is registered at EPEX Spot.
After the delivery, the differences between the volumes contracted on the day-ahead and intraday markets and the volumes actually exchanged are settled with the TSO through the balancing costs. To do so, the TSO meters the total imbalance of the perimeter managed by the balance responsible entity. This amount corresponds to the total energy injection minus the total energy extraction. Depending on the sign of the imbalance and on the balance situation of the power system, the price of the imbalanced energy is defined by adding a fee to, or subtracting a fee from, the spot price set on the day-ahead market, as described in Table 1.
                              Imbalance of the power system
Imbalance of the portfolio    +                     -
+                             Price = Spot - fee    Price = Spot
-                             Price = Spot          Price = Spot + fee
Table 1: Imbalance costs setting depending on the imbalance of the power system
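The settlement rule of Table 1 can be sketched in Python as follows. This is a minimal sketch under a stated assumption: the fee is taken to apply only when the portfolio imbalance has the same sign as the imbalance of the power system, which is how the table is read here. The function names and the numerical values are hypothetical.

```python
def imbalance_price(spot, fee, portfolio_imbalance, system_imbalance):
    """Settlement price (EUR/MWh) applied to the portfolio imbalance.

    A positive imbalance means more injection than extraction.
    Assumption: the fee applies only when the portfolio imbalance has
    the same sign as the power system imbalance.
    """
    if portfolio_imbalance > 0 and system_imbalance > 0:
        return spot - fee   # surplus is bought back at a discount
    if portfolio_imbalance < 0 and system_imbalance < 0:
        return spot + fee   # deficit is charged at a premium
    return spot             # the portfolio helps the system: no fee


def imbalance_cashflow(spot, fee, portfolio_imbalance, system_imbalance):
    """Cash flow for the balance responsible entity (positive = received)."""
    price = imbalance_price(spot, fee, portfolio_imbalance, system_imbalance)
    return portfolio_imbalance * price


# Hypothetical half hour: 2 MWh short while the system is also short.
print(imbalance_cashflow(spot=40.0, fee=5.0,
                         portfolio_imbalance=-2.0, system_imbalance=-150.0))
```

Compared with settling the same 2 MWh at the spot price, the entity pays 2 x 5 = 10 € more, which is the balancing cost that a better forecast would have avoided.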
The development of renewables over the last fifteen years has pushed up the volumes exchanged on the short-term markets, since their production is highly volatile and often depends on meteorological parameters (at least for wind and solar power), whose forecasts are more precise just before the delivery.
2.3 Financial incentives for renewables in France
To encourage the development of renewables in France, many financial incentives have been created year after year. Today, most of the renewable units installed before 2016 benefit from a feed-in tariff incentive, called 'Obligation d'achat' (OA) contracts, or 'purchasing obligation contracts' in English. Since 2016, however, new RES units benefit from the 'Complément de Rémunération' (CR) scheme, or 'additional remuneration' or 'feed-in premium' in English.
• Obligation d’achat contracts (OA)
The Obligation d'achat incentive is granted to project developers designated by the CRE (Commission de Régulation de l'Energie, the independent administrative body in charge of regulating the energy sector in France) after a call for tenders in which each project asks for a specific feed-in tariff, which often represents the lowest price at which the developer considers the project profitable.
Each call for tenders targets a certain type of renewable energy source (wind, solar, biomass, hydro…) and is offered for a certain amount of installed capacity. The projects that best meet the requirements of the specifications are designated and can sign a contract with EDF OA, an EDF subsidiary which operates as a public utility. This contract sets the selling price of the electricity produced for a certain period (generally up to 15 years) and commits EDF OA to buy all the electricity produced by the power plant.
However, several drawbacks of the OA appeared. The first one is the monopoly of EDF OA, which was the only market player able to add those renewable power plants to its portfolio. This can be considered highly problematic in a free and liberalized market. The second is that the OA did not encourage renewable production to be sensitive to the market price, and particularly to negative prices, which are a sign of high tension on the grid.
• Complément de rémunération (CR)
The complément de rémunération incentive was introduced in 2016 by the CRE to address the criticisms of the OA. It is also a direct consequence of the EU directives on public incentives in the energy field, which foster incentives linked to the wholesale electricity market.
Like the OA, it is granted after a call for tenders. The CR guarantees a total fixed income (reference income) in euro per MWh produced by the power plant. This reference income is set by the call for tenders and often corresponds to the break-even point of the installation. However, the CR is not organized as the OA was. First, the electricity produced is sold directly on the market by the producer. Afterwards, if the revenue from the market is lower than the reference income, the state pays the difference to the producer (which is literally the additional remuneration, or complément de rémunération). Conversely, if the revenue from the market is higher, the producer has to pay the surplus back to the state. Finally, if the price is negative (which happened during only two hours in 2016 in France), no additional revenue is paid to the producer for these hours.
The calculation of the revenue from the market is based on a national monthly index M0 specific to each technology (PV, wind…). This index M0 is equal to the mean day-ahead spot price at which the electricity produced by this technology is sold on the market during the month. For instance, the M0 index for wind power tends to be lower than the one for PV power, since wind turbines also produce during the night, when prices are often lower, as shown in Table 2.
In €/MWh          January    February    March    April    May
M0 wind power     78.70      48.41       34.33    33.00    32.54
M0 solar power    84.62      53.12       35.06    33.50    33.67
Table 2: French M0 index in 2017 for wind and solar power published by the CRE.
Moreover, the calculation of the market revenue for a given producer does not consider when its electricity is actually produced during the month. If a wind farm produces more during peak hours, it will probably achieve a mean selling price higher than the M0, but it will receive the same additional revenue per unit of energy produced as any other wind farm.
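To make the role of the M0 index concrete, the sketch below computes a production-weighted mean price and the resulting income of two plants. It is a simplified reading of the mechanism described above, assuming that the premium per MWh is the reference income minus M0 and that it is not paid for production during negative-price hours. The function names, prices, production profiles and reference income are hypothetical, and other terms of the actual scheme are ignored.

```python
def m0_index(prices, national_production):
    """Production-weighted mean day-ahead price of one technology (EUR/MWh)."""
    total = sum(national_production)
    if total == 0:
        raise ValueError("no production in the period")
    weighted = sum(p * q for p, q in zip(prices, national_production))
    return weighted / total


def monthly_income(prices, plant_production, m0, reference_income):
    """Market revenue plus additional remuneration for one plant (EUR).

    Simplification: the premium per MWh is (reference_income - m0) and
    is not paid for production during negative-price hours.
    """
    market = sum(p * q for p, q in zip(prices, plant_production))
    eligible = sum(q for p, q in zip(prices, plant_production) if p >= 0)
    return market + (reference_income - m0) * eligible


# Three hypothetical periods (prices in EUR/MWh, production in MWh).
prices = [20.0, 40.0, 60.0]
national = [1.0, 1.0, 2.0]   # national profile of the technology
plant_a = [1.0, 1.0, 2.0]    # follows the national profile
plant_b = [0.0, 1.0, 3.0]    # produces more during high-price periods

m0 = m0_index(prices, national)
income_a = monthly_income(prices, plant_a, m0, reference_income=70.0)
income_b = monthly_income(prices, plant_b, m0, reference_income=70.0)
print(m0, income_a / sum(plant_a), income_b / sum(plant_b))
```

Plant A, which follows the national profile, ends up exactly at the reference income per MWh, while plant B earns more because its own mean selling price is above M0 and the premium per MWh is the same for both.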
The change from the OA to the CR has no direct financial impact on the producer. The main change is that the producer must get access to the power market, and often does so via an aggregator, which can be freely chosen. In France the main ones are EDF, Engie, Statkraft, Enercoop and Solvay Energy Services. The role of the aggregator is to combine energy from different power plants in its portfolio and to sell it on the day-ahead or intraday markets. An aggregator nominates, that is to say estimates, its portfolio production for each half hour of day D at 12:00 (noon) on day D-1. If its production forecast for half hour N changes before the delivery time, it can trade directly with other traders on the intraday market. However, if there is an imbalance between the energy volumes traded on the market and the delivered volumes, the aggregator has to pay the imbalance costs introduced in the previous section. Nevertheless, by aggregating renewable energy from multiple non-correlated plants, an aggregator can reduce the risk of imbalance and decrease the cost of the intermittency of renewable production.
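The benefit of aggregation mentioned above can be illustrated with a small numerical sketch: forecast errors of opposite signs partly cancel when plants are settled together, so the imbalance of the portfolio is never larger than the sum of the individual imbalances. The error series below are hypothetical and chosen only for illustration.

```python
def total_abs_imbalance(errors):
    """Sum of absolute forecast errors (MWh) over all periods."""
    return sum(abs(e) for e in errors)


# Hypothetical forecast errors (actual - scheduled, in MWh) over four periods.
plant_a = [2.0, -1.0, 3.0, -2.0]
plant_b = [-1.0, 2.0, -2.0, -1.0]

# Each plant settled on its own.
separate = total_abs_imbalance(plant_a) + total_abs_imbalance(plant_b)

# Both plants settled as one portfolio: errors are netted period by period.
portfolio_errors = [a + b for a, b in zip(plant_a, plant_b)]
portfolio = total_abs_imbalance(portfolio_errors)

saving = 1 - portfolio / separate
print(separate, portfolio, round(saving, 2))
```

The netting only helps in periods where the errors have opposite signs (the first three here); in the last period both plants under-produce and the portfolio imbalance equals the sum of the two.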
3 Physical approach of PV and wind power production
Two kinds of renewable power plants have been studied in this thesis: photovoltaic power plants and wind power plants. This part briefly introduces the physical processes that convert solar or wind power into electrical power, in order to determine the main predictors to consider when forecasting the production of such power plants.
3.1 Photovoltaic power production
3.1.1 Photovoltaic power plant performance
Photovoltaic cells are the basis of a solar power plant's electricity production. They directly convert energy from solar radiation into electrical power. However, since each cell produces only a few watts, PV cells need to be connected together to build larger power plants. Today a large-scale photovoltaic power plant generally has an installed capacity higher than 5 MWp and can reach 900 MWp, as for the Kurnool Ultra Mega Solar Park in India. These installations combine a huge number of cells within modules and arrays. The performance of a photovoltaic power plant therefore depends on the numerous variables affecting the production of the cells and of the modules themselves. Several types of cell exist, such as monocrystalline, polycrystalline and thin film, each with a specific efficiency depending on the operating conditions. Their performance can be visualized through the current-voltage (I-V) and power-voltage (P-V) curves of the cell, as displayed on Figure 4 and Figure 5.
Figure 5 : P-V curve of a photovoltaic cell 
Figure 4 shows the operating range of the cell: the current is Isc when the cell is short-circuited, and the voltage is Voc when it is open-circuited. The power output of the cell is equal to the product of I and V. The cell is kept at the maximum power point displayed on Figure 5 by an electronic regulation device called a maximum power point tracker.
However, the I-V curve changes with two main parameters, the irradiance and the temperature, as shown on Figure 6 and Figure 7.
Figure 4: I-V curve of a photovoltaic cell 
Figure 7: P-V curve evolution depending on the cell temperature
On Figure 6, it can be observed that the irradiance has a strong impact on Isc: at fixed temperature, the current at the maximum power point increases when the irradiance increases.
On Figure 7, it can be observed that the temperature has an impact on the open-circuit voltage Voc: at fixed irradiance, the voltage at the maximum power point decreases when the temperature is higher. In short, these two plots show that the power produced by a cell, and consequently by a whole solar farm, increases when the irradiance increases and when the temperature decreases. Note that the temperature considered here is that of the cell, which depends on the ambient temperature, the solar radiation and the wind.
Moreover, the performance of a power plant also depends on technical and environmental characteristics. Indeed, the type of modules, their connection, the inverters used and the shading effect can have a direct impact on the efficiency of the power plant. However, these parameters are directly linked to the design of the plant and to the ambient conditions.
3.1.2 Sun position
As described in the previous section, the irradiance is the main parameter affecting the power output of a solar farm. The total irradiance has two components, the direct and the diffuse irradiance, as described by equation 1.
$$ I_t = I_b \cos\theta_z \cos\theta + I_d \qquad (1) $$
I_t: Total surface irradiation
I_b: Direct normal irradiation
I_d: Diffuse irradiation
\theta_z: Solar zenith angle
\theta: Surface incidence angle
Figure 8: Solar angle definition schemas
Figure 6: I-V curve evolution depending on the irradiance
The surface incidence angle depends on the surface tilt angle of the panels and on the sun position, as described by equation 2.
$$ \theta = \arccos\left(\cos\beta_c \cos\theta_z + \sin\beta_c \sin\theta_z \cos(\gamma_s - \gamma_c)\right) \qquad (2) $$
\beta_c: Surface tilt angle
\gamma_s: Solar azimuth angle
\gamma_c: Surface azimuth angle
Thus, the total irradiance depends on the sun position and, consequently, on the solar angles, which can be computed from the day of the year, the time of day and the location of the power plant. It also depends on how the PV panels are mounted, through the surface tilt angle, and on the direction they face, through the surface azimuth angle. Finally, it depends on meteorological data such as the direct normal irradiance and the diffuse irradiance.
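As an illustration of the incidence-angle relation, the following Python sketch evaluates the surface incidence angle from the solar and surface angles. It uses the standard form of the relation, with the cosine of the difference between the solar and surface azimuth angles; the function name and the example angles are hypothetical.

```python
import math


def incidence_angle(zenith_deg, solar_azimuth_deg, tilt_deg, surface_azimuth_deg):
    """Surface incidence angle in degrees.

    cos(theta) = cos(beta_c) cos(theta_z)
               + sin(beta_c) sin(theta_z) cos(gamma_s - gamma_c)
    """
    theta_z = math.radians(zenith_deg)
    beta_c = math.radians(tilt_deg)
    delta_gamma = math.radians(solar_azimuth_deg - surface_azimuth_deg)
    cos_theta = (math.cos(beta_c) * math.cos(theta_z)
                 + math.sin(beta_c) * math.sin(theta_z) * math.cos(delta_gamma))
    # Guard against rounding slightly outside [-1, 1] before arccos.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))


# Hypothetical case: sun 40 degrees from the zenith, due south (azimuth 180),
# panel tilted 25 degrees and facing south.
print(incidence_angle(40.0, 180.0, 25.0, 180.0))
```

Two limiting cases are easy to check: a horizontal panel (tilt 0) sees an incidence angle equal to the zenith angle, and a panel tilted by exactly the zenith angle towards the sun sees an incidence angle of zero.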
3.1.3 Predictors selected
From this brief introduction to the production performance of a solar power plant, some predictors can be selected to forecast the short-term production. They can be classified in two types, as shown in Table 3. These predictors are time dependent and must be computed or evaluated for each time interval of the forecast.
Meteorological predictors    Solar position predictors
Direct normal irradiance     Azimuth angle
Diffuse irradiance           Zenith angle
Temperature
Wind speed
Table 3: Predictors selected for solar power
Other data, such as the tilt angle of the panels, the surface azimuth angle and the location, might be needed. However, since the algorithms used in this study to forecast the production of power plants are based on machine learning, it can be assumed that the model is able to learn those characteristics from historical data. In the same way, it is assumed that it can learn the type of PV cell used (monocrystalline, polycrystalline, thin film…), the nominal capacity of the plant, and the losses caused by shading and by the electrical components.
3.2 Wind power production
3.2.1 Wind power conversion
The role of a wind turbine is to convert the kinetic energy of the air into electrical energy. The power available in the air depends on the wind speed and can be expressed by equation 3.
$$ P = \frac{1}{2} \rho A U^3 \qquad (3) $$
where P is the power in the wind, \rho the density of the air, A the area swept by the blades, and U the wind speed. Note that the power is proportional to the cube of the wind speed; therefore a 10% error on the wind speed can imply a 33% error on the power available in the wind.
However, a wind turbine cannot extract all the power from the wind. Indeed, if it did, the air would come to rest behind the turbine and could no longer flow away from it. The maximum power that can be extracted is defined by the Betz limit, equal to 59% of the available power. Figure 9 displays the power in the wind and the maximum extractable power according to the Betz limit as a function of the wind speed, for blades 30 m long.
Figure 9: Power of the wind and maximum extractable power according to the wind speed
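The relations above can be checked numerically with a short Python sketch. It assumes the usual expression of the power in the wind, P = 1/2 ρ A U³, a standard air density of 1.225 kg/m³ (an assumption, not a value from this study) and a rotor swept area computed from the 30 m blade length; the function names are hypothetical.

```python
import math

BETZ_LIMIT = 16 / 27  # about 0.593, the maximum extractable fraction


def power_in_wind(wind_speed, blade_length, air_density=1.225):
    """Power carried by the wind through the rotor disc, in watts.

    wind_speed in m/s, blade_length in m, air_density in kg/m^3
    (1.225 kg/m^3 is an assumed standard value).
    """
    swept_area = math.pi * blade_length ** 2
    return 0.5 * air_density * swept_area * wind_speed ** 3


def max_extractable_power(wind_speed, blade_length, air_density=1.225):
    """Upper bound on the extracted power according to the Betz limit."""
    return BETZ_LIMIT * power_in_wind(wind_speed, blade_length, air_density)


p_10 = power_in_wind(10.0, 30.0)
p_11 = power_in_wind(11.0, 30.0)   # wind speed overestimated by 10%
relative_error = p_11 / p_10 - 1
print(p_10 / 1e6, max_extractable_power(10.0, 30.0) / 1e6, relative_error)
```

The last value is 1.1³ − 1, that is, about 33%, which is the sensitivity to wind speed errors mentioned in the text.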
For each wind turbine, a power curve is published by the manufacturer giving the power produced as a function of the wind speed. Figure 10 is an example of a typical power curve for a wind turbine.
Figure 10 : Typical power curve of a wind turbine
Four zones can be delimited by three characteristic wind speeds: the cut-in wind speed, the rated wind speed and the cut-out wind speed. The cut-in wind speed is the speed from which the turbine starts producing energy; the rated wind speed is the speed from which the turbine produces at its nominal power; and the cut-out wind speed is the speed from which the turbine is stopped to avoid mechanical or electrical damage. Between the cut-in and the rated speeds, the power produced follows a cubic function of the wind speed, which is directly linked to the power available in the air. This is also the zone where the power is the most sensitive to the wind speed.
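The four zones can be sketched as a piecewise function. This is an idealized power curve, not a manufacturer's curve: the cubic section is scaled so that the power is zero at the cut-in speed and equal to the rated power at the rated speed, and all default parameter values are hypothetical.

```python
def turbine_power(wind_speed, cut_in=3.0, rated_speed=12.0,
                  cut_out=25.0, rated_power=2000.0):
    """Idealized power output (kW) of one turbine for a wind speed in m/s."""
    # Zone 1 (below cut-in) and zone 4 (at or above cut-out): no production.
    if wind_speed < cut_in or wind_speed >= cut_out:
        return 0.0
    # Zone 3: production capped at the nominal power.
    if wind_speed >= rated_speed:
        return rated_power
    # Zone 2: cubic growth between the cut-in and the rated speed.
    fraction = ((wind_speed ** 3 - cut_in ** 3)
                / (rated_speed ** 3 - cut_in ** 3))
    return rated_power * fraction


for speed in (2.0, 6.0, 9.0, 12.0, 20.0, 26.0):
    print(speed, round(turbine_power(speed), 1))
```

In the cubic zone, a forecast error of 1 m/s changes the output far more than the same error above the rated speed, where the output stays at the nominal power.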
3.2.2 Wind farm design
When designing a wind farm, several aspects must be considered to obtain the best performance from the plant. The characteristics of the site and the surrounding topography can have a direct effect on the wind shear, that is, on the variation of the wind speed with height. On flat terrain with no obstacles, the wind speed can be described as a function of the height by the power law of equation 4, where U_{ref} is the wind speed at the reference height z_{ref}, U the wind speed at the height z, and \alpha the power law coefficient.
$$ \frac{U}{U_{ref}} = \left(\frac{z}{z_{ref}}\right)^{\alpha} \qquad (4) $$
The power law coefficient is directly linked to the roughness length z_0, which is the height above ground at which the wind speed is zero due to friction with the terrain.
$$ \alpha = 0.24 + 0.096 \log_{10}(z_0) + 0.016 \left(\log_{10}(z_0)\right)^2 \qquad (5) $$
The order of magnitude of the roughness length is 0.01 m above water, 0.1 m above bushland and 1 m above towns or forests. Figure 11 represents the wind speed as a function of the height for two roughness lengths.
Figure 11 : Wind shear according to the roughness length
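The wind shear relations can be turned into a short Python sketch that extrapolates a wind speed from a reference height to another height. The logarithm is taken in base 10 and the roughness length in metres, which is an assumption about the correlation used; the function names and the 80 m hub height in the example are hypothetical.

```python
import math


def power_law_exponent(roughness_length):
    """Power law coefficient alpha from the roughness length z0 (m).

    Assumes a base-10 logarithm in the correlation.
    """
    log_z0 = math.log10(roughness_length)
    return 0.24 + 0.096 * log_z0 + 0.016 * log_z0 ** 2


def wind_speed_at_height(u_ref, z_ref, z, roughness_length):
    """Wind speed (m/s) at height z from the speed u_ref at height z_ref."""
    alpha = power_law_exponent(roughness_length)
    return u_ref * (z / z_ref) ** alpha


# Hypothetical case: 5 m/s at 10 m, extrapolated to an 80 m hub height.
for z0 in (0.01, 0.1, 1.0):
    print(z0, round(power_law_exponent(z0), 3),
          round(wind_speed_at_height(5.0, 10.0, 80.0, z0), 2))
```

The rougher the terrain, the larger the exponent and the stronger the increase of wind speed with height, which matters when a wind speed given at 10 m is used to describe the wind seen by a turbine rotor.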
In addition, local obstacles and discontinuities in the topography can disturb the air flow and alter the power law describing the wind shear.
The wake effect, defined as the fact that “within a wind farm it is common for one turbine to be operating wholly or partly in the wake of another”, can reduce the production of certain wind turbines depending on their location. Figure 12 is a well-known illustration of this effect.
Figure 12: Illustration of the wake effect
The impact of this effect depends mainly on the direction of the wind. During the design phase, the layout of the wind farm is therefore arranged so that most of the wind turbines face the prevailing wind direction.
3.2.3 Predictors selected
Considering the brief review in the previous sections and the predictors selected in the literature, it can be concluded that only a few meteorological variables have a direct impact on the hourly variations of the power output of a wind farm. The wind speed and the wind direction are thus the two main predictors that can be used to predict the production of a wind farm. As for solar power plants, the intrinsic characteristics of the farm, such as the roughness of the ground, the topography, the farm layout or the type of turbine in operation, are assumed to be learnt by the machine learning algorithm during its training.
4 Methodological approach
For both wind and solar power production prediction, a common approach has been implemented to develop forecasting models based on historical data. These models are all based on machine learning algorithms. This part introduces the steps involved in building such data-driven models: first data collection, then data cleaning and preparation, then training, and finally evaluation of the model.
4.1 Data collection
Even if this step seems basic at first sight, it is surely the one that can have the biggest impact on the outputs, since bad data cannot give good results, even with good models. The data collection consists in looking for all relevant data that can be exploited to train the statistical models. The data must be valid, consistent and representative of the information available in operation.
Three types of data are planned to be used: power production data from wind farm or solar farm operators, meteorological and environmental data from external sources.
Production data from solar and wind farms are used as targets value by the models, meaning that they are values that models must estimate as outputs. These data are from solar and wind power producers that own or operate multiple farms in France. Depending on the dataset considered, different variables can be included, but all of them contain at least the hourly production for two years.
Meteorological data and environmental data are used as the predictive features, meaning that these values make up the inputs vectors on which the production predictions are based on. Meteorological datasets are from an external provider, Météo-France. It is a national operator that provides data directly from its own numerical weather prediction models, AROME and ARPEGE . These models provide data on a grid with up to a 1.5km resolution. The closest point from the central location of the power plant will then be chosen.
Consequently, several parameters are available (more than 30), but they are standards and not always adjusted to our problem. For instance, the wind speed is forecasted at a 10m height, which doesn’t correspond to the wind that a wind turbine will face.
4.2 Data exploration and preparation
Once the data have been collected, they must be prepared to be usable. This step aims to adapt and complete the inputs and outputs and to check their consistency. It is the most time-consuming step, but it cannot be avoided.
Among the first manipulations to be carried out are data cleaning and time synchronisation. The first consists in removing records that contain an error message or an obviously implausible value. The second ensures that the data are synchronised by converting all time stamps to the same time zone (often from local time to UTC).
Then, a first exploration of the dataset can quickly check the consistency of the values coming from different sources, for instance by calculating the correlation between the direct normal irradiance and the solar production, or between the wind speed and the wind power production.
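The consistency check described above can be sketched with a plain Pearson correlation. This is only an illustration under stated assumptions: the function name `pearson` and the hourly values are hypothetical and do not come from the datasets used in this study.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical hourly values, invented for illustration only.
dni_w_m2 = [0.0, 120.0, 450.0, 700.0, 820.0, 640.0, 300.0, 20.0]
production_mw = [0.0, 0.6, 2.4, 3.9, 4.5, 3.4, 1.5, 0.1]

r = pearson(dni_w_m2, production_mw)
print(f"correlation between DNI and production: {r:.3f}")
```

A correlation close to 1 between two sources that should be physically linked suggests that the time stamps and units are aligned; a low value is a hint that one of the sources may need further cleaning or synchronisation.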
Besides, since the energy production and the meteorological data are time series, it is important to keep the temporal characteristic within the inputs. To do so, lag and lead windows should be added to the input features to keep, for instance, the wind speed of the previous and of the next hour.
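A minimal sketch of such a lag and lead window is given below. The helper name `add_lag_lead` and the wind speed values are hypothetical; the sketch also assumes that hours without both neighbours are simply dropped, which the text does not specify.

```python
def add_lag_lead(values):
    """Return one row (previous, current, next) per hour that has both neighbours."""
    rows = []
    for h in range(1, len(values) - 1):
        rows.append((values[h - 1], values[h], values[h + 1]))
    return rows

# Hypothetical hourly wind speeds in m/s.
wind_speed = [4.0, 5.5, 6.1, 5.8, 3.9]
features = add_lag_lead(wind_speed)
for row in features:
    print(row)
```

Each row now carries the value at h-1, h and h+1, so a model that treats rows independently still sees the local trend of the series.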
Figure 13 illustrates how the schema of a dataset used for wind power prediction changes before and after the data preparation.
Figure 13 : Scheme evolution during the data preparation phase
4.3 Machine learning algorithms
Machine learning methods are all structured to take advantage of the large amount of data available and to model non-linear processes. To do so, a so-called ‘training dataset’ is provided. This dataset is composed of historical observations on which the algorithm optimizes its objective function. In general, the larger the training dataset, the more accurately the model can be trained.
A multitude of machine learning algorithms have been developed in recent years. They can be classified between unsupervised and supervised learning, and between regression and clustering algorithms .
Unsupervised learning corresponds to the case in which the task of the machine is to find a hidden function describing the ‘unlabelled data’. As a result, there is no evaluation criterion for the accuracy of the algorithm.
This is the main difference with supervised learning, in which the training dataset is given with an output to predict. The machine then tries to model the function that best determines the output. If the output is a class label, it is a classification process, whereas if the output is a numerical value, it is a regression process .
This study focuses on giving a deterministic forecast of the hourly production of a power plant, based on historical meteorological and production data. As a result, it deals precisely with supervised regression algorithms.
To develop such a predictive model, three main machine learning algorithms have been chosen because they appear to be the most common and the best performing in the literature, as for instance in , , , , , : Support Vector Regression (SVR), Artificial Neural Networks (ANN), and Random Forest (RF).
The following sections briefly present these algorithms and explain their differences.
4.3.2 Support Vector Regression (SVR)
Support Vector Regression (SVR) is a general learning method, introduced by Vapnik in 1986 , developed from the Statistical Learning Theory over the last four decades . The main idea of SVR is to map an input vector into a high-dimensional feature space by using a nonlinear mapping process and then to perform linear regression in this feature space. The goal is to find a balance between two constraints: the linear regression in the multi-dimensional space must be as flat as possible, and the loss function must be as low as possible.
The SVR tool makes use of two free parameters: C, which is the penalty parameter of the error term, and ε, which is “the value for which no penalty is imposed to the training loss function as long as the predicted values are within a distance ε of the actual value of the training examples.” .
[Figure 13 content. Before preparation, three sources: DateTime (UTC), wind speed (m/s) and wind direction (°) as inputs; DateTime (Paris time), available capacity and error; DateTime (Paris time) and production as output. After preparation: all sources on DateTime (UTC); wind speed and wind direction each at h-1, h and h+1; available capacity; error; production.]
Since a more precise description of this algorithm cannot be given without going into more technical detail, the reader is invited to refer to the literature, such as ,  and , to get a deeper insight into this algorithm.
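To make the roles of C and ε more concrete, the sketch below evaluates the standard SVR objective for a linear model: a flatness term plus C times the sum of ε-insensitive losses. It is a simplified illustration, not the implementation used in this study: the nonlinear mapping into the feature space is omitted, and all names and numbers are hypothetical.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """No penalty while the prediction stays within epsilon of the actual value."""
    return max(0.0, abs(y_true - y_pred) - epsilon)

def svr_objective(weights, bias, X, y, C, epsilon):
    """Flatness term 0.5 * ||w||^2 plus C times the summed epsilon-insensitive loss."""
    flatness = 0.5 * sum(w * w for w in weights)
    total_loss = 0.0
    for features, target in zip(X, y):
        prediction = sum(w * f for w, f in zip(weights, features)) + bias
        total_loss += epsilon_insensitive_loss(target, prediction, epsilon)
    return flatness + C * total_loss

# Hypothetical one-dimensional training set.
X = [[1.0], [2.0], [3.0]]
y = [1.0, 2.05, 3.5]

objective = svr_objective(weights=[1.0], bias=0.0, X=X, y=y, C=10.0, epsilon=0.1)
print(objective)
```

In this example only the third sample lies outside the ε-tube, so it is the only one that contributes to the loss. A larger C penalizes such deviations more strongly, at the expense of flatness.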
4.3.3 Artificial Neural Network (ANN)
ANN is an algorithm inspired by the neural structure of the brain, which connects multiple units standing for neurons into a bigger network. An artificial neuron is typically characterized by an activation function (generally the same across the whole network) and by weights for each input. The activation function is equivalent to a transfer function linking the input of a neuron to its output. Many functions can be used, but the most popular are the identity function, the binary step and the hyperbolic tangent. The weights are optimized during the training period with a gradient descent algorithm .
Figure 14 : Representation of an artificial neuron
Many different network structures have been developed, but the most widely used, which also has the most basic structure, is the multi-layer perceptron (MLP). It is typically composed of multiple layers of a certain number of neurons with an added bias. Each layer takes as inputs the results of the previous one. The first layer consists of a set of neurons receiving the input features. The following ones are called the hidden layers. They transform the values from the previous layer with a weighted linear summation and link the input to the output. The last layer is generally composed of one neuron leading to the final output. The main free parameters that can be set are: the number of hidden layers, the number of neurons in each layer, the type of activation function, and the convergence algorithm used (Levenberg-Marquardt algorithm, damped least squares) .
The artificial neural network is trained by optimizing the weights of each neuron with a gradient descent algorithm applied to a metric such as the R-square or the RMSE.
Figure 15 shows an MLP network with only one hidden layer, with neurons called “a1, a2, ak …”. The arrows show how the information spreads through the network.
Figure 15 : Basic representation of an artificial neural network with one hidden layer .
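The forward pass of such a network can be sketched in a few lines. The example assumes a hyperbolic tangent activation in the hidden layer and an identity activation on the single output neuron; the weights are hypothetical and no training step is shown.

```python
import math

def neuron(inputs, weights, bias, activation=math.tanh):
    """Weighted sum of the inputs plus a bias, passed through the activation function."""
    return activation(sum(w * x for w, x in zip(weights, inputs)) + bias)

def mlp_forward(inputs, hidden_layer, output_neuron):
    """One hidden layer of tanh neurons followed by a single identity output neuron."""
    hidden = [neuron(inputs, weights, bias) for weights, bias in hidden_layer]
    out_weights, out_bias = output_neuron
    return neuron(hidden, out_weights, out_bias, activation=lambda z: z)

# Hypothetical weights: two input features, two hidden neurons (a1, a2).
hidden_layer = [([1.0, 0.0], 0.0), ([0.0, 1.0], 0.0)]
output_neuron = ([1.0, 1.0], 0.0)

prediction = mlp_forward([0.5, -1.0], hidden_layer, output_neuron)
print(prediction)
```

Training would then consist in adjusting the weights and biases so that the outputs get closer to the observed production, which is the role of the gradient descent step described above.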
4.3.4 Random Forest (RF)
4.3.4.1 Decision tree
A decision tree is a statistical model introduced by Breiman in 1984 in  and depicting the different values that an output can take depending on a set of input values . It is composed of nodes and branches organized in a hierarchy with no loops. Each node corresponds to a test function applied to the incoming data and has two outgoing branches, one of which is activated depending on the result of the test function. At the end of each branch is a leaf, which represents a possible final result. The process consists in going down along the branches from the root to a leaf, which gives the output.
During the training phase, the algorithm optimizes the split made at each node and the values of the test parameters to optimize the metric function. The tree training is generally stopped by a termination criterion, which can be the number of nodes or the number of samples in a leaf.
Figure 16 : Representation of a decision tree used for wind power production prediction
4.3.4.2 Random forest
Depending on the training sample, the results of a decision tree can be very unstable. To solve this problem, the random forest algorithm has been developed, which combines the predictions of several decision trees. It consists in growing several trees in parallel and averaging their outputs. The random forest selects a defined number of inputs for each of its trees. This subset is drawn at random, which makes the trees more independent and protects the model from noisy information .
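The principle can be illustrated with a strongly simplified stand-in for a random forest: depth-one regression trees, each fitted on a bootstrap sample and on one randomly chosen input, whose predictions are averaged. This is a sketch under these assumptions, not the algorithm as implemented in a production library; all names and data are hypothetical.

```python
import random

def fit_stump(rows, targets, feature):
    """Depth-one regression tree: best threshold on one feature by squared error."""
    best = None
    values = sorted(set(row[feature] for row in rows))
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [t for r, t in zip(rows, targets) if r[feature] <= threshold]
        right = [t for r, t in zip(rows, targets) if r[feature] > threshold]
        mean_l, mean_r = sum(left) / len(left), sum(right) / len(right)
        sse = sum((t - mean_l) ** 2 for t in left) + sum((t - mean_r) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, mean_l, mean_r)
    if best is None:  # all feature values identical: predict the sample mean
        mean = sum(targets) / len(targets)
        return (feature, float("inf"), mean, mean)
    return (feature, best[1], best[2], best[3])

def predict_stump(stump, row):
    feature, threshold, mean_l, mean_r = stump
    return mean_l if row[feature] <= threshold else mean_r

def fit_forest(rows, targets, n_trees, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # bootstrap sample
        feature = rng.randrange(len(rows[0]))           # one random input per tree
        forest.append(fit_stump([rows[i] for i in idx], [targets[i] for i in idx], feature))
    return forest

def predict_forest(forest, row):
    """Average of the predictions of all trees."""
    return sum(predict_stump(stump, row) for stump in forest) / len(forest)

# Hypothetical samples: (wind speed in m/s, wind direction in degrees) -> production in MW.
rows = [(3.0, 200.0), (5.0, 180.0), (7.0, 220.0), (9.0, 190.0), (11.0, 210.0), (13.0, 185.0)]
targets = [0.2, 1.1, 3.0, 5.6, 7.9, 8.8]

forest = fit_forest(rows, targets, n_trees=25, seed=0)
low = predict_forest(forest, (4.0, 200.0))
high = predict_forest(forest, (12.0, 200.0))
print(low, high)
```

Because each tree sees a different sample and a different input, an unstable split in one tree has a limited effect on the averaged prediction.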
4.4 Development of the predictive models
The structure of the models has been studied throughout the thesis. This part explains their structure as it stands at the end of the project. The general framework is common to both wind and solar predictive models to simplify the problem; however, the models themselves have been developed and evaluated independently.
The first step consists in normalizing and splitting the dataset in two parts, as introduced before, the train dataset and the test dataset which will be using
Two structural parts can be identified: the first one is the training and the second one is the testing . During the training phase, the machine learning algorithm tunes its prediction by fitting its parameters (for instance the weights of the neurons within an artificial neural network) and analysing the large amount of data available in the training dataset. In this way, the machine looks for relationships between inputs and outputs and tries to learn how to predict non-linear and multi-variable systems. In practice
During the testing phase, the model is evaluated on the test dataset, which is independent from the training dataset. Thus, it gives an unbiased evaluation of the model.
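The split and the normalization can be sketched as follows. The text does not specify the normalization method, so this example assumes a min-max scaling whose bounds are fitted on the training part only, a common precaution to keep the test dataset independent. The function names and values are hypothetical.

```python
def chronological_split(samples, train_fraction):
    """Keep the time order: the first part is used for training, the rest for testing."""
    cut = int(len(samples) * train_fraction)
    return samples[:cut], samples[cut:]

def fit_min_max(values):
    """Scaling bounds, computed on the training data only."""
    return min(values), max(values)

def apply_min_max(values, low, high):
    return [(v - low) / (high - low) for v in values]

# Hypothetical hourly irradiance values in W/m2, in chronological order.
irradiance = [0.0, 200.0, 400.0, 800.0, 600.0, 300.0, 1000.0, 500.0]

train, test = chronological_split(irradiance, train_fraction=0.75)
low, high = fit_min_max(train)
train_scaled = apply_min_max(train, low, high)
test_scaled = apply_min_max(test, low, high)
print(test_scaled)
```

With this choice a test value can fall outside the interval [0, 1], as the first test sample does here, which simply reflects that it was never seen during training.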
4.5 Performance criteria
Three performance criteria have been used for the performance evaluation of the models: RMSE, NRMSE, and R-square.
Root Mean Square Error (RMSE) is calculated with equation 6:
$$RMSE = \sqrt{\frac{\sum_{t=1}^{n} (y_t - \hat{y}_t)^2}{n}} \qquad (6)$$
where n is the number of samples, $\hat{y}_t$ the forecasted power generated by the power plant at time t, and $y_t$ the actual power generated at this time. This criterion indicates how accurate the model is. To be able to compare this value between datasets, the Normalized Root Mean Square Error (NRMSE, also written RMSEN in this document) is introduced, calculated with equation 7:
$$NRMSE = \frac{RMSE}{P_{installed}} \qquad (7)$$
where $P_{installed}$ is the installed power capacity of the power plant.
R-square is the coefficient of determination of the forecast. It indicates how reliable the forecasting model is.
If the outputs of the model fit the actual values perfectly, the R-square is equal to 1. A constant model that always predicts the expected value of y, disregarding the input features, would get an R-square score of 0 .
The R-square value is calculated with equation 8:
$$R^2 = \frac{\sum_{t=1}^{n} (\hat{y}_t - \bar{y})^2}{\sum_{t=1}^{n} (y_t - \bar{y})^2} \qquad (8)$$
where n is the number of samples, $\hat{y}_t$ the forecasted power generated by the power plant at time t, $y_t$ the actual power generated at this time, and $\bar{y}$ the mean value of $y_t$ for t = 1..n.
These criteria are the best suited to the problems raised in this study, since they measure the prediction errors that lead to balancing costs and the reliability of the model. They are also the most frequently mentioned in the literature, such as in  and in .
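The three criteria can be computed as in the sketch below. The RMSE and NRMSE follow the definitions given above. For the R-square, the sketch assumes the residual form, 1 minus the ratio of the residual sum of squares to the total sum of squares, which is the form used by common libraries; it coincides with a ratio of explained to total variance for least-squares fits with an intercept, but can differ otherwise. The values are hypothetical.

```python
import math

def rmse(actual, forecast):
    n = len(actual)
    return math.sqrt(sum((y - f) ** 2 for y, f in zip(actual, forecast)) / n)

def nrmse(actual, forecast, installed_capacity):
    """RMSE normalized by the installed capacity of the plant."""
    return rmse(actual, forecast) / installed_capacity

def r_square(actual, forecast):
    """Residual form of the coefficient of determination (an assumption, see above)."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((y - f) ** 2 for y, f in zip(actual, forecast))
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    return 1.0 - ss_res / ss_tot

# Hypothetical hourly production in MW for a plant with 10 MW installed.
actual = [0.0, 2.0, 4.0, 6.0]
forecast = [1.0, 2.0, 3.0, 6.0]

print(rmse(actual, forecast))
print(nrmse(actual, forecast, installed_capacity=10.0))
print(r_square(actual, forecast))
```

A forecast that always returns the mean of the actual values gives an R-square of 0 with this definition, in line with the property quoted above.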
5 Solar energy production prediction
5.1 Assumptions and boundaries
As described in the previous sections, to best integrate renewable energy sources into the wholesale market, balance responsible entities need to provide, at noon the day before, the production planning of the power plants making up their portfolio. Therefore, the prediction model will be based on the information available at 12:00 on the day ahead, to be representative of the operating situations met by the asset managers.
Moreover, the prediction time interval is set at one hour.
Besides, to simplify the problem and to be able to compare the performances obtained with the various power plants, the structure of the prediction model should be common to all the sites. Consequently, the parameters of the machine learning algorithms should be constant and optimized to get the best global results.
Finally, to evaluate the true performance of the models and to avoid overstating the criteria, the night times, during which the production of solar energy is always equal to zero, are ignored. Thus, only the times between sunrise and sunset are studied.
5.2 Available data description
The available data can be classified in three parts, each one coming from a specific source: production data, meteorological data and environmental data. All these data are provided with a time interval of one hour over two consecutive years (2014 and 2015).
The production data come from a third-party company which operates numerous solar farms in France. In this study, 20 farms have been considered. Figure 17 shows where these power plants are located in France. As can be observed, they are mainly located in the south-east region, where the solar potential is the highest in the country.
The annex shows the main characteristics of the plants, which are the installed capacity, the tilt angle and the orientation of the panels. The data obtained cover the years 2014 and 2015 and give the hourly production of an entire power plant. However, there is no data concerning the availability of the panels and, as a result, it is assumed that a plant is either totally available or not available at all (all or nothing, in other words).
Figure 17 : Location of the solar power plants in France
As described in the previous part, the meteorological dataset comes from a national provider which has developed NWP models. It is composed of raw data for 4 parameters: temperature, wind speed, global horizontal irradiance and direct normal irradiance.
Other environmental data have been obtained through the Python package PVLib . This package gives access to the sun position and to the clear sky irradiance for a defined location. The sun position is defined by three angles described in a previous part: zenith, elevation and azimuth. The clear sky irradiance represents the irradiance received by the ground on a clear day. Several models have been developed to evaluate this last parameter. The PVLib package uses by default the Ineichen and Perez model to calculate the GHI and the DNI. This model is described in  and is reported to have excellent performance with a minimal input data set according to .
Finally, Table 4 summarizes the predictors used for the development of solar production forecast model.
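Predictors                     Units   Source
Wind speed                     m/s     Météo-France
Global horizontal irradiance   W/m2    Météo-France
Direct normal irradiance       W/m2    Météo-France
Solar zenith                   °       PVLib Python package
Solar azimuth                  °       PVLib Python package
Clear sky GHI                  W/m2    PVLib Python package
Clear sky DNI                  W/m2    PVLib Python package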
Table 4 : Predictors used for solar power production forecasting and their source
5.3 Models evaluation
The models developed during this study are based on machine learning algorithms. As described before, these algorithms are trained and tuned on a training dataset composed of multiple features. These features compose the input vector used for each prediction. The main ones are those summarized in Table 4.
However, the time dimension of these variables also has to be considered by the model. As described in Section 4.2, a shift down and a shift up of one hour can be applied to the time-dependent variables (which concerns all the predictors described in Table 4). However, to limit the number of inputs, the time shift is applied only to the meteorological data, which are much more erratic than the environmental data. Equation 9 shows the structure of the input vector considering the time dimension.
$$I(h) = [P_1(h), \dots, P_i(h), \dots, P_n(h)], \quad i = 1, \dots, n \qquad (9)$$
where $I(h)$ is the input vector for the hour h, and n the number of predictors,
with $P_i(h) = [p_i(h-1), p_i(h), p_i(h+1)]$ where $p_i$, for $i = 1, \dots, p$, are the p meteorological predictors, and with $P_i(h) = [p_i(h)]$ where $p_i$, for $i = p+1, \dots, n$, are the n-p environmental predictors.
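The construction of this input vector can be sketched as follows. The function name `build_input_vector` and the hourly values are hypothetical; the sketch only reproduces the structure of the vector, with three values per meteorological predictor and one value per environmental predictor.

```python
def build_input_vector(meteo, env, h):
    """I(h): values at h-1, h, h+1 for meteorological series, value at h for the others."""
    all_series = meteo + env
    n_hours = len(all_series[0])
    if h < 1 or h + 1 >= n_hours:
        raise IndexError("hour h needs both a previous and a next hour")
    vector = []
    for series in meteo:
        vector.extend([series[h - 1], series[h], series[h + 1]])
    for series in env:
        vector.append(series[h])
    return vector

# Hypothetical hourly series: p = 2 meteorological and n - p = 1 environmental predictor.
ghi = [100.0, 300.0, 500.0, 400.0]
temperature = [12.0, 14.0, 17.0, 16.0]
solar_zenith = [70.0, 55.0, 40.0, 45.0]

vector = build_input_vector([ghi, temperature], [solar_zenith], h=1)
print(vector)
```

The resulting vector has 3p + (n - p) components, here 7, which shows why restricting the time shift to the meteorological predictors limits the number of inputs.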
Besides, three kinds of machine learning algorithms have been evaluated for the forecast of the production of the 20 solar power plants: Support Vector Regression, Neural Networks and Random Forest. The evaluation is based on the root mean squared error and the R-square value between the predictions and the actual values.
The evaluation process is common to all the solar farms. It consists in tuning and training the machine learning algorithms on the first 18 months of the available historical data, which correspond to the values recorded between the 1st of January 2014 and the 30th of June 2015. The evaluation itself is done on the remaining data, i.e. the values recorded between the 1st of July 2015 and the 31st of December 2015. Even if the “ideal” evaluation dataset would have been a whole year, the limited amount of available data and the need for a fairly large training dataset imposed a compromise. Nevertheless, the test dataset is assumed to remain representative of the results that would be obtained over a full year, since it covers two seasons (summer and autumn) with weather variations.
Besides, the persistence method has also been evaluated to quantify the gain of sophisticated methods compared to a very naive one. The ‘persistence method’ is defined here as P(h)=P(h-48), where P(h) is the power generated at hour h.
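The persistence method as defined above can be written in a few lines. The function name and the production series are hypothetical; hours for which no value 48 hours earlier exists are given no forecast.

```python
def persistence_forecast(production, lag_hours=48):
    """P(h) = P(h - lag_hours); the first lag_hours hours have no forecast (None)."""
    return [None if h < lag_hours else production[h - lag_hours]
            for h in range(len(production))]

# Hypothetical hourly production in MW over 50 hours.
production = [float(h) for h in range(50)]

forecast = persistence_forecast(production)
print(forecast[47], forecast[48], forecast[49])
```

A lag of 48 hours rather than 24 is consistent with a forecast issued at noon the day before, since the production of the current day is not yet fully known at that time.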
As a result, the graphs below present the results obtained for each solar farm studied, displaying the RMSEN and the R-square obtained for the models developed with the available data.
Figure 18 : R-square of the prediction of the production of 20 solar farms
Figure 19 : RMSEN of the prediction of the production of 20 solar farms
Most of the values obtained seem consistent, except those from solar farm #17, which shows quite surprising results: a very low R-square with the machine learning algorithms, and a very low RMSEN with the persistence model. After investigation, these results can be explained by the fact that the available installed capacity of the solar farm changed from 1,5 MWc to 4,3 MWc during the test phase, as shown in Figure 20. The consequences are uncorrelated forecast and actual production, and a very low error, since it is normalized by the maximum installed capacity, which was higher than the capacity available during the test phase. As a result, the following calculations are made without considering the results obtained with solar farm #17.
Figure 20 : Actual and forecasted production of solar farm #17
Figures 18 and 19 show the improvement due to the use of machine learning models compared to the persistence model. Whereas the R-square of the persistence model is about 0,4, which means a very low correlation between the forecasted and the actual values, the models using machine learning algorithms have an R-square of about 0,85, which is a great improvement. This level of correlation allows some conclusions to be drawn. First, it confirms that the datasets provided by the different third parties are consistent. Indeed, the risk of having unreliable data was quite high, since multiple sources and multiple sites are considered. In this case, having an R-square at almost the same level for all the plants, and at a quite high level, gives confidence in the values obtained. Second, the models developed give fairly good scores for the prediction of RES production, which is erratic and hard to predict one day ahead. The tuning and the methodology used seem adapted to the problem. Finally, these first results support the assumption that a single common structure can give suitable and consistent results.
Also, the RMSEN, which estimates the prediction error, decreases considerably with these models, from about 25% with the persistence model to about 13% with the developed models. Since the error is directly linked to the balancing cost for a balance responsible entity, this value shows how high the savings can be when a machine learning based model is used instead of a simple model.
Figure 21 : RMSEN boxplots for the three algorithms
Figure 22 : R-square boxplots for the three algorithms
Figures 21 and 22 show the dispersion of the two performance criteria considered, RMSEN and R-square, for the three machine learning algorithms. They help to determine whether one algorithm is more efficient, and which one it is. On both figures, the same performance order appears: Random Forest is the best performing model, Support Vector Regression the worst one, and the Artificial Neural Network lies in between. Thus, the Random Forest based model can be retained as the globally best performing model, with a mean R-square value of 0,84 and a mean RMSEN value of 12%.
As a result, the more detailed description of the results will consider the model using this algorithm.
5.4 A deeper description of the results obtained with the best model
Figure 23 : RMSEN and R-square for the 19 solar farms with model based on the random forests
Figure 23 shows the results obtained for the 19 remaining solar farms (n.b: solar farm #17 has been removed previously) considering the RMSEN and the R-square values between the power produced and the value predicted the day ahead. To get a more accurate description of the results obtained, the study will focus on a typical solar farm, for instance solar farm #7, which has average results and is in the south-east region of France, where most of the solar farms considered are located.
As a reminder, the evaluation is based on the results obtained during the test phase, from the 1st of July 2015 to the 31st of December 2015. Figure 24 is a scatter plot of the predicted versus the actual power production for solar farm #7. Each point represents the actual power production during one precise hour versus the power production predicted the day before. The ideal prediction would lead to a perfect y=x straight line, represented by the black line.
Figure 24 : Prediction and actual value scatter of the prediction of the power produced by solar farm #7
As can be observed, some points are far from this line, which is of course due to the randomness of the power production from a must-run renewable energy source. However, a clear trend appears along the black line y=x on Figure 24. To get a better insight into the results obtained, the following figures show production profiles of solar farm #7.
Figure 25 is an example of production profile during the summer period (June and July), when the weather is almost always nice in the south-east of France. Indeed, one observes that the daily production peaks of the farm are always between 5,5 and 6 MW, probably depending on the cell temperature, which impacts the efficiency of the PV panels.
Figure 25 : Production profile comparing the forecast and the actual value of the power produced by the solar farm #7 on three weeks in June 2015
From Figure 25, it can be concluded that the model developed is very efficient during sunny periods, which are the easiest periods to forecast since the production cycles are almost the same from one day to another and are sufficiently represented in the training dataset. The production forecast seems to be very close to the actual production at all times, even if it can be noticed that the model has difficulties reaching the daily peaks. Those peaks are often underestimated by the model. The following section will try to give some explanation for these observations.
Figure 26, which is a zoom on Figure 25, shows how good the results are for such days, with very low absolute differences (<1MW) between the actual and the predicted production. Moreover, this figure shows that the synchronism of the model is very good. Indeed, there is no delay between the cycles on the upward and downward ramps.
Figure 26: Production profile comparing the forecast and the actual value of the power produced by the solar farm #7 on three days in June 2015
On the other hand, Figure 27 represents an example of production profile from the autumn/winter period, in December 2015, which is a cloudier period with great variations of the irradiance within a day and from one day to the next.
Figure 27 : Production profile comparing the forecast and the actual value of the power produced by the solar farm #7 on three weeks in December 2015
One notices that the prediction during this period is not as good as it is during sunny days, mainly during the production peaks. The production peaks are often underestimated when they reach a power higher than 2 MW, and often overestimated by the forecasting model when they stay under 2 MW. However, the forecast production clearly follows the trend of the actual one. Indeed, if the problem is considered at a daily level, it is often well predicted whether a day will have a high or a low production: for example, the days before the 17th of December have a production peak higher than 2 MW, as predicted, whereas after the 17th of December their peak is lower than 2 MW, as predicted.
Finally, Figure 28 is a zoom on Figure 27 for three days. For example, on the 11th of December, it can be observed that the synchronism is not as good as it was during the sunny periods, which leads to high absolute differences, even if the global trend is well forecasted. This can be explained not only by the fact that the weather forecasts are less precise at an hourly scale when the weather is changing, but also by the fact that the training dataset might be too short to model such periods of changing weather.
Figure 28 : Production profile comparing the forecast and the actual value of the power produced by the solar farm #7 on three days in December 2015
As seen in those figures, the forecasts often underestimate the production at the peaks, or at extreme values. This phenomenon can be explained by the fact that extreme values are less represented than other values in the training dataset. Therefore, in general, machine learning models are less efficient with extreme values and tend to underestimate maxima and to overestimate minima. In the benchmark done by Zamo et al. in , such results have already been observed when using the random forest model for short-term forecasting of photovoltaic production.
Figure 29 shows the differences between the forecasted and the actual power produced, normalized by the maximum production of the power plant, versus the normalized actual power produced. The normalization makes it possible to draw a single scatter plot with the results obtained for the 19 power plants, i.e. more than 200.000 points.
A quadratic regression has been drawn in red. Once again, it can be observed that when the production is high enough the model tends to predict a lower value (negative bias), and when the production is low enough the prediction gives a higher value (positive bias). Zamo et al. explain this phenomenon by the fact that the model tends to average the prediction, being under-dispersive compared to the observed production. That is, the results of the model are on average closer to the mean value than the actual values are.
Figure 29 : Scatter of the normalized error of prediction vs the power produced normalized, considering the 19 solar farms
The physical model giving the power produced by a solar farm as output depends non-linearly on several input variables. As described before, two classes of variables can be distinguished: meteorological data, such as irradiance, temperature or wind speed, and specific data, such as the characteristics of the electronic components (inverters, cabling…) or local conditions. This part aims to explain how the model performances can be affected by the quality of the input data, and what is missing to get perfect forecasts.
As described before, the main advantage of machine learning algorithms is their ability to learn from the input data all the specificities of the plant that are not time dependent, such as the installed power capacity, the panel orientation and the characteristics of the electronic components. Thus, there is no need to provide such specifications, which are often quite complex to model when using a physical approach.
However, if these characteristics change, the model will hardly be able to forecast the power production. For instance, if the actual availability of the plant changes due to a failure in a module or in the cabling, the production will be highly impacted, but the forecasting model will take a fairly long time to adapt its structure and to deliver a corrected forecast.
Besides, the quality of the weather forecasts can be questioned. The trained models are based on a single NWP model, the AROME model from Météo-France. Combining weather forecasts from several providers could help to get more reliable irradiance and temperature forecasts for the model, especially by reducing the bias of the input values.
6 Wind energy production prediction
6.1 Assumptions and boundaries
As for the solar energy production prediction model, the models developed for wind power are based on the information available at 12:00 on the day ahead, to be representative of the operating situations met by the asset managers of a balance responsible entity, with a prediction time interval set at 1 hour. Also, to simplify the comparison, the models are common to the two wind farms, or to all wind turbines when the prediction is made per turbine.
6.2 Available data description
The data used to build the production forecast models come from various third parties. Power production data come from several power plants owned by a wind farm operator. However, an important constraint appears when wind turbines are considered. Unlike for a solar farm, the available capacity of a wind farm cannot be neglected when modelling the power production. Indeed, the production capacity of a wind farm is much more concentrated than that of a solar farm (capacity per wind turbine~1 MW vs. capacity per PV module~500 W). Considering that, the data used should contain continuous power production data for each wind turbine, which gives the number of wind turbines able to produce. Only two wind farms have such detailed data. For the others, only total production data were available, which is not sufficient to build reliable predictive models.
The characteristics of the wind farms are described in Table 5.
Wind farm #   Region of France   Hub height   Number of WT   Capacity per WT   Total capacity
#1            East               85 m         8              2,3 MW            18,4 MW
#2            North              97 m         4              2,3 MW            9,2 MW
Table 5 : Wind farms studied characteristics
The production data cover slightly less than 2 years, from the 1st of January 2015 to the 15th of November 2016, and, as explained before, are given per wind turbine.
Two new variables are introduced. The first one, noted Wi(h), is a time-dependent binary variable, equal to 1 when the ith turbine of the park produces more than 1% of its nominal power during the hth hour, and equal to 0 otherwise. The second one, noted Av(h), is the instantaneous number of wind turbines available, computed as the number of turbines producing more than 1% of their nominal power during hour h. Av(h) can then be defined as
$$Av(h) = \sum_{i=1}^{N} W_i(h) \qquad (10)$$
Where N is the number of turbines installed in the wind park.
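These two variables can be computed from the production per turbine as in the sketch below. The function names and the hourly values are hypothetical; the 1% threshold and the nominal power of 2,3 MW come from the text.

```python
def availability_flags(turbine_power, nominal_power, threshold=0.01):
    """W_i(h) for one hour: 1 if turbine i produces more than 1% of its nominal power."""
    return [1 if power > threshold * nominal_power else 0 for power in turbine_power]

def available_turbines(turbine_power, nominal_power):
    """Av(h): number of turbines considered available during the hour."""
    return sum(availability_flags(turbine_power, nominal_power))

# Hypothetical production in MW of the 4 turbines of a farm during one hour.
hour_power = [1.9, 0.0, 0.02, 2.1]

flags = availability_flags(hour_power, nominal_power=2.3)
count = available_turbines(hour_power, nominal_power=2.3)
print(flags, count)
```

Note that with this definition a turbine that is available but facing no wind is counted as unavailable, since only its production is observed.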
To add those new predictors, the assumption of perfect information about the hourly availability of the wind turbines, introduced in the previous section, is used. Indeed, since Wi(h) and Av(h) are computed from variables used as targets by the model, they cannot be computed in the same way in operation. It is therefore assumed that, in operation, the turbine availability schedule is perfectly known the day before.
Besides, the meteorological data used are from the national meteorological service, and are raw data computed by a numerical weather prediction model. The parameters used are the wind speed and direction at a height of ten meters, the air pressure and the air temperature (at two meters).
To consider the specificities of the turbine, a new predictor is introduced: the theoretical power production, calculated from the rescaled forecasted wind speed and the characteristic power curve. To rescale the forecasted wind speed to the appropriate height, the power law described by equation 11, which is used by many researchers, is applied.