
DEGREE PROJECT IN INDUSTRIAL ENGINEERING AND MANAGEMENT, SECOND CYCLE, 30 CREDITS

STOCKHOLM, SWEDEN 2020

Intelligent hydropower

Making hydropower more efficient by utilizing machine learning for inflow forecasting

JAKOB CLAESSON SAM MOLAVI

KTH

SCHOOL OF INDUSTRIAL ENGINEERING AND MANAGEMENT


Intelligent Hydropower

Making hydropower more efficient by utilizing machine learning for inflow forecasting

by

Jakob Claesson Sam Molavi

Master of Science Thesis TRITA-ITM-EX 2020:247 KTH Industrial Engineering and Management

Industrial Management

SE-100 44 STOCKHOLM


Intelligent Hydropower

Making hydropower more efficient by using machine learning

by

Jakob Claesson Sam Molavi

Master of Science Thesis TRITA-ITM-EX 2020:247 KTH Industrial Engineering and Management

Industrial Management

SE-100 44 STOCKHOLM


Master of Science Thesis TRITA-ITM-EX 2020:247

Intelligent Hydropower

Jakob Claesson, Sam Molavi

Approved: 2020-06-05
Examiner: Niklas Arvidsson
Supervisor: Milan Jocevski
Commissioner: Fortum Generation
Contact person: Kent Pettersson

Abstract

Inflow forecasting is important when planning the use of water in a hydropower plant. The process of making forecasts is characterized by using knowledge from previous events and occurrences to make predictions about the future. Traditionally, inflow is predicted using hydrological models. The model developed by the Hydrologiska Byråns Vattenbalansavdelning (HBV model) is one of the most widely used hydrological models around the world. Machine learning is emerging as a potential alternative to the current HBV models but needs to be evaluated.

This thesis investigates machine learning for inflow forecasting as a mixed qualitative and quantitative case study. Interviews with experts with various backgrounds within hydropower illustrated the key issues and opportunities for inflow forecasting accuracy and laid the foundation for the machine learning model created. The thesis found that noise in the realised inflow data was one of the main factors that affected the quality of the machine learning inflow forecasts. Other notable factors were the precipitation data from the three closest weather stations. The interviews suggested that the noise in the realised inflow data could be due to faulty measurements. The interviews also provided examples of additional data, such as snow quantity measurements and ground moisture levels, which could be included in a machine learning model to improve inflow forecast performance.

One proposed application for the machine learning model was as a complementary tool to the current HBV model to assist in making manual adjustments to the forecasts when considered necessary.

The machine learning model achieved an average Mean Absolute Error (MAE) of 1.39 compared to 1.73 for a baseline forecast for inflow to the Lake Kymmen river system 1-7 days ahead over the period 2015-2019. For inflow to the Lake Kymmen river system 8-14 days ahead the machine learning model achieved an average MAE of 1.68 compared to 2.45 for a baseline forecast. The current HBV model in place had a lower average MAE than the machine learning model over the available comparison period of January 2018.

Key words:

machine learning, inflow forecasting, hydropower, HBV model


Master of Science Thesis TRITA-ITM-EX 2020:247

Intelligent Hydropower

Jakob Claesson, Sam Molavi

Approved: 2020-06-05
Examiner: Niklas Arvidsson
Supervisor: Milan Jocevski
Commissioner: Fortum Generation
Contact person: Kent Pettersson

Summary

Inflow forecasting is important when planning the use of water in a hydropower plant. The forecasting process consists of using prior knowledge to make predictions about the future. Traditionally, inflow forecasting has been done using hydrological models. The Hydrologiska Byråns Vattenbalansavdelning model (the HBV model) is one of the most widely used hydrological models and is used around the world. Machine learning is currently emerging as a potential alternative to the current HBV models but needs to be evaluated.

This thesis uses a mixed qualitative and quantitative method to explore machine learning for inflow forecasting in a case study. Interviews with experts with various backgrounds within hydropower highlighted key issues and opportunities for inflow forecasting accuracy and laid the foundation for the machine learning model that was created. The study found that noise in the realised inflow data was one of the main factors affecting the quality of the machine learning inflow forecasts. Other notable factors were the precipitation data from the three closest weather stations. The interviews suggested that the noise in the realised inflow data could be due to faulty measurements. The interviews also provided examples of additional data that could be included in a machine learning model to improve the inflow forecasts, such as snow quantity measurements and ground moisture levels. One proposed application for the machine learning model was as a complementary tool to the current HBV model to assist in making manual adjustments to the forecasts when considered necessary.

The machine learning model achieved an average Mean Absolute Error (MAE) of 1.39 compared to 1.73 for a baseline forecast of the inflow to the Lake Kymmen river system 1-7 days ahead over the period 2015-2019. For the inflow to the Lake Kymmen river system 8-14 days ahead, the machine learning model achieved an average MAE of 1.68 compared to 2.45 for a baseline forecast. The current HBV model had a lower average MAE than the machine learning model over the available comparison period of January 2018.

Keywords:

machine learning, inflow forecasting, hydropower, HBV model


Table of Contents

1. Introduction
1.1 Background
1.2 Research problem
1.3 Research questions
1.4 Contribution
1.5 Delimitations
2. Research background
2.1 Machine learning
2.1.1 Introduction to machine learning
2.1.2 Machine learning best practice and algorithm descriptions
2.1.3 Inflow forecasting using machine learning
2.1.4 Data quality for machine learning
2.1.5 Manifestations of poor data quality
2.2 Single Source of Truth
2.3 Multi-level perspective
3. Methodology
3.1 Research process
3.2 Qualitative method
3.2.1 Interviews
3.2.2 Field trip
3.3 Machine learning and data
3.3.1 Data collection and data cleaning
3.3.2 Model building machine learning
3.3.3 Evaluating the machine learning models
3.4 Research quality
3.4.1 Reliability
3.4.2 Validity
4. Empirical research
4.1 Area description
4.1.2 Regulations
4.1.3 Other parties affected by PHES water operations
4.1.4 Inflow forecasting at the case company
4.2 Insights from interviews
4.2.1 Time horizon of inflow forecasts
4.2.2 Information sharing
4.2.3 Unreliability of data
4.2.4 Other areas affected by inflow forecasts
4.2.5 Additional data used for manual changes to the inflow forecasts
5. Results
5.1 Framework for evaluating results
5.2 Data set description
5.3 Model results and comparison
5.4 Final model configuration
5.5 Feature importance in random forest
5.6 Accuracy of HBV model
5.7 Accuracy of other machine learning algorithms
6. Analysis and Discussion
6.1 Scrutiny of machine learning results
6.2 Comparison to HBV model
6.3 Data quality
6.4 Data storage
6.5 Sources of error and model improvement
6.6 Manual changes to the inflow forecast
6.7 Scalability of machine learning model
6.8 Inflow forecasting from a multi-level perspective
7. Conclusion and future research
7.1 Main findings
7.2 Future research
References
Appendices
A.1 Hydrology
A.1.1 Introduction to Hydrology
A.1.2 Precipitation
A.1.3 Interception
A.1.4 Runoff
A.1.5 Infiltration
A.2 Feature description

List of Tables

Table 1 - Overview of the interviews.
Table 2 - Raw data before cleaning missing values and outliers.
Table 3 - Inflow data aggregated to a weekly level.
Table 4 - Final model (random forest), mean and baseline comparison.
Table 5 - MAE for the HBV model, baseline and random forest model.
Table 6 - Other tried machine learning algorithms.


List of Figures

Figure 1 - Multiple levels as a nested hierarchy.
Figure 2 - A dynamic multi-level perspective on technological transitions.
Figure 3 - Exploratory sequential research design.
Figure 4 - SMHI weather stations.
Figure 5 - Overview of the Lake Kymmen river system.
Figure 6 - Rolling 7-day inflow average.
Figure 7 - Feature importance.


Acknowledgements

This master’s thesis was conducted at the division of Sustainability and Industrial Dynamics in the department of Industrial Engineering and Management at the Royal Institute of Technology. We would like to thank our supervisor Milan Jocevski at the Royal Institute of Technology for his guidance and invaluable feedback throughout the whole process. Thank you for always being available to us whenever we needed to brainstorm. We would also like to thank the students in our seminar group and our seminar chairs Cali Nuur and Niklas Arvidsson for all the valuable feedback and inspiration.

The main idea behind the thesis was brought forward by our case company, Fortum Generation in Stockholm. We would like to extend our special gratitude to our supervisor Kent Pettersson and to our coordinator Hans Bjerhag at Fortum for helping us along the way and making this thesis possible. We are particularly appreciative of the field trip, which gave us a tangible and interesting learning experience. We would also like to thank all of the interviewees who lent their time to provide useful insights that we will keep with us.

Jakob Claesson Sam Molavi

Stockholm, June 2020


1. Introduction

This chapter gives an introduction to the topic investigated in this thesis. Additionally, the research problem and research questions are presented as well as academic contributions and delimitations.

1.1 Background

The transition to renewable electricity production is gaining momentum. The transition is driven by decreased costs of wind turbines and solar panels [1]. The power output from wind and solar power is difficult to predict as it is dependent on weather [2]. When the share of wind and solar power increases in an electricity network, other energy sources are needed to stabilize the grid - that is, alternatives that can on short notice increase or decrease electricity production [3]. These are called regulating power plants.

As the importance of regulating power plants in the grid increases, so does the importance of their planning and operation. Hydroelectric power production is an example of a power source which can regulate the amount of electricity produced by altering the amount of water let through the power plant. A reservoir of water in conjunction with the power plant acts as a physical battery or energy storage, where one can decide, under certain restrictions, when to use the water and generate electricity.

Hydropower is an important power source for regulating and balancing the power system in many parts of the world [4]. In Sweden, hydropower constitutes about 39 % of the total electrical energy production [5].

The current installed hydropower capacity in Sweden has not changed significantly in the last decades [6].

The river system in proximity to the hydropower plants, with its lakes and rivers, is regulated by strict water management laws. Such regulations can, for example, prescribe maximum and minimum water levels for lakes and minimum flows in rivers. Inflow forecasting, that is, the prediction of how much water will flow into the reservoir, is important for planning the operation of the hydropower plant [7]. The inflow of water to a reservoir is dependent on several factors. Accurate and reliable inflow forecasting is vital for making decisions for reservoir operations and management [7].

One particular type of hydropower plant is the pumped hydro energy storage (PHES) plant, where the pump-turbine can both generate electricity with water flowing from the upper to the lower reservoir and pump water from the lower reservoir to the upper [8]. This could add complexity when planning the hydropower plant and could thus make inflow forecasting even more important. PHES plants are uncommon in Sweden, and the PHES plant connected to the studied river system is the biggest in operation in Sweden.

Improved accuracy in inflow forecasting is important as it leads to more efficient use of the water in the reservoirs [9]. Traditionally, inflow forecasts are generated using different mathematical and physical models based on a number of hydrological parameters. One of the most widely used hydrological models for inflow forecasting is the model by Hydrologiska Byråns Vattenbalansavdelning, also known as the HBV model [10]. The HBV model describes hydrological processes using a number of parameters to forecast the amount of water inflow into a reservoir. This is also the model used in the case river system for inflow forecasting. As inflow forecasting is dependent on weather, the further into the future the forecasts look, the more uncertain they become. Typically, the forecasts can be divided into two categories: one which covers the upcoming 14 days (or a time span of similar duration) and one which is broader and can include years ahead. For this thesis, the developed machine learning model and the research questions presented will focus on inflow forecasting for the coming 14 days. As a measure to try to improve inflow forecasting accuracy, researchers around the world have become increasingly interested in developing machine learning models for inflow forecasting [11].

Machine learning is a branch within artificial intelligence which utilizes statistical, optimization and probabilistic tools to learn from data in order to classify new data and identify trends without explicit programming [12]. With the increasing availability of data sets, machine learning is being tested and utilized in various industries for a variety of problems, often involving prediction [13]. In recent years, machine learning models have been evaluated as an alternative to hydrological models [14]. There are some potential advantages of using machine learning for inflow forecasting, for example that the underlying physical hydrological processes do not have to be considered. Instead, the machine learning models depend on historical hydrological and meteorological data, which in turn reduces the number of input parameters. This allows for a more novel approach to creating an inflow forecasting model, utilizing the full extent of the data set and requiring the machine learning algorithm to make inferences instead of explicitly defining relationships between parameters as in current hydrological models [7]. In some research, machine learning models have outperformed traditional inflow forecasts in certain scenarios [15][16][17].

1.2 Research problem

Inflow forecasts are dependent on weather. The inflow model for a certain reservoir is specific to the local area and its conditions. In some cases and some areas, the inflow forecasts are not accurate. Inaccuracies in the inflow forecasts affect the planning and operations of the hydropower plants, leading to less efficient use of the water in the reservoirs and more spillage. Inflow forecasts are directly used by the teams planning the hydropower plants throughout the entire planning process. This ultimately means that inaccuracy in the inflow forecasts affects several different divisions in the company and has a direct impact on the real-time operation of the hydropower plant. The main problem is inaccuracy in inflow forecasts, which leads to worse decision making in terms of how to utilize the water in the most efficient way. Machine learning has emerged as a potential alternative to current inflow forecasting models. This thesis aims to investigate the factors that affect the quality of inflow forecasts by creating and evaluating a machine learning model for the inflow to reservoirs connected to a hydro plant. The machine learning model evaluation will explore what elements are important for inflow forecast accuracy. The machine learning model performance will be compared to the performance of the HBV model currently used for inflow forecasts for the case river system. The comparison between the different models will allow for a discussion regarding how machine learning can be utilized for reservoir inflow forecasting.

1.3 Research questions

In order to achieve the objective of the study, two research questions will be answered:

1. What factors affect the quality of inflow forecasts?

2. How can reservoir inflow forecasting be improved by utilizing machine learning?


1.4 Contribution

The thesis contributes to previous research on applied machine learning, in this case more specifically to its potential use in inflow forecasting. Furthermore, it contributes to the theory behind inflow forecasting by highlighting how different factors affect inflow. The thesis adds the context of a pumped hydropower storage plant in Värmland, Sweden, which historically has been a difficult area to produce accurate inflow forecasts for. The thesis also discusses what measures need to be taken to allow for even better machine learning use in this area in the future.

1.5 Delimitations

The machine learning model and the research questions presented will focus on inflow forecasting for the coming 14 days. This is because weather forecasts, and more specifically precipitation forecasts, primarily cover a 14-day time horizon. Beyond 14 days, the weather forecasts become increasingly inaccurate. Only the period of the years 2015 to 2020 is used when building and evaluating the machine learning model. Data from years prior to 2015 were deemed not to be useful as there are many uncertainties in the validity of those data. The created machine learning model is not intended to be used in production. Instead, the model aims at evaluating what factors need to be improved before implementing a machine learning model for actual inflow forecasts at Fortum, which from here on will be referred to as ‘the case company’. The model will be compared and evaluated against the current inflow forecasting tool used in the case river system.


2. Research background

This chapter aims to cover relevant theory and research background for the reader. This background will constitute a foundation for the deeper discussion needed to answer the research questions.

2.1 Machine learning

This section will present important considerations of machine learning. First, a general introduction to machine learning will be provided. Then, research on best practices for machine learning models will be presented. Finally, previous research on machine learning models for inflow forecasting will be presented, as well as research regarding data quality.

2.1.1 Introduction to machine learning

Machine learning is a branch within artificial intelligence which utilizes statistical, optimization and probabilistic tools to learn from data in order to classify new data and identify trends without explicit programming [12]. Machine learning is a field within computer science where the computer learns from data sets in order to make predictions. One way of describing machine learning is that the computer teaches itself rather than being explicitly told exactly what to do. For a machine learning model to perform well in a predictive task it is very important that it is trained on a sufficient data set [18]. Machine learning can be divided into supervised learning and unsupervised learning. Supervised learning is when there is a target variable which the algorithm tries to predict. Supervised learning, which is used in prediction, can be divided into classification and regression depending on whether the output variable is discrete or continuous, respectively. In this thesis, the focus is on supervised learning and regression. Unsupervised learning works without a target variable and instead focuses on finding patterns and clusters in the input data [19].

2.1.2 Machine learning best practice and algorithm descriptions

Machine learning as a science includes a number of best practice techniques for creating the most reliable models. In the subsequent sections, some of these best practice techniques and algorithms will be described briefly.

Data pre-processing for supervised learning

Data pre-processing is a broad term for the data activities performed before the training of the model. It includes activities such as data cleaning, normalization, transformation and feature selection. Data cleaning ensures quality and coherence within the data set. Large data sets often include missing values and misrepresented values. When cleaning the data, one must make choices on how to handle these exceptions. How the cleaning should be performed depends on the data set. Missing values and outliers can be dropped entirely, set to zero, set to an average of the data set, or handled in other ways [20]. Normalization is a technique of transforming the values of a feature to within a specified range, such as between zero and one. This often improves the data representation for the machine learning model, which can improve performance [21].

Transformation is how one can change the data into new representations. For example, the machine learning models can only handle numerical values and thus dates need to be represented by a numerical value. Transformation can also be the activity of creating new features based on the data, adding together different values, or taking the difference between different features. Feature selection is the decision of which features, that is, what data, should be used when training the model. Using too many features can increase noise in the model and decrease performance. Feature selection happens both before training and iteratively during training [22].
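As a minimal illustration of these pre-processing steps, the sketch below uses pandas and scikit-learn on a small made-up data set; the column names, values and chosen fill strategies are illustrative assumptions and not taken from the case data set.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Tiny made-up hourly data set (values are illustrative only).
df = pd.DataFrame({
    "precipitation": [0.0, 1.2, None, 0.4, 0.0],    # mm
    "temperature":   [3.1, None, 2.8, 2.5, 2.9],    # degrees C
    "inflow":        [10.2, 11.0, 11.5, None, 12.1] # target variable
})

# Data cleaning: handle missing values per feature.
df["precipitation"] = df["precipitation"].fillna(0.0)  # missing precipitation -> 0 mm
df["temperature"] = df["temperature"].ffill()           # carry the previous value forward
df = df[df["inflow"].notna()].copy()                     # drop rows that lack a target value

# Transformation: derive a new feature from existing ones.
df["precip_3h_sum"] = df["precipitation"].rolling(3, min_periods=1).sum()

# Normalization: scale the input features to the range [0, 1].
features = ["precipitation", "temperature", "precip_3h_sum"]
df[features] = MinMaxScaler().fit_transform(df[features])
print(df)
```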

K-fold cross validation

In order to reliably train and test the machine learning models, the data set at hand is divided into a training set and a test set. The model is trained on the training set and evaluated on the test set. Thus, the models make predictions on the test set without having seen that data before. This shows the model's generalizability and how it performs on unseen data, which would be the case if put into production. K-fold cross validation is a specific technique for utilizing the data set for training and testing the model. The data set is divided into K subsets, and the model is trained K times, each time on all the data except for one of the subsets, which is used for testing. This makes the evaluation of the model performance more reliable as it is tested on all parts of the data set [23].
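A minimal sketch of k-fold cross validation with scikit-learn, on synthetic data standing in for a real feature matrix and target, could look as follows:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic data standing in for the real feature matrix X and target y.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=100)

kf = KFold(n_splits=5, shuffle=False)  # each subset is used exactly once for testing
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    score = model.score(X[test_idx], y[test_idx])  # R^2 on the held-out fold
    print(f"Fold {fold}: R^2 = {score:.3f}")
```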

Evaluation metric choice

When evaluating the performance of a machine learning model one must choose what metric the model should be measured on. The choice depends on the task and what is considered important. Accuracy is often used in a classification task, that is, when the output variable is discrete and belongs to a specific class. When accuracy is used as an evaluation metric, some additional analysis should be performed in terms of false positives and false negatives. That is, a high accuracy might not be sufficient when false negatives have large consequences. In regression, Mean Square Error (MSE) and Mean Absolute Error (MAE) are common metrics to evaluate performance on. Both these evaluation metrics measure the error between the predicted value and the observed value of a phenomenon. The lower the MSE and MAE values, the better the model performs compared to the observed values of the target variable. It is also important to compare against a baseline. In prediction, this baseline could be a random guess, the average or the previous value [24].
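The sketch below illustrates how MAE and MSE can be computed with scikit-learn and compared against a simple baseline; the observed and predicted values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical observed values and model predictions (numbers are made up).
observed  = np.array([10.0, 12.0, 9.0, 11.0, 14.0])
predicted = np.array([10.5, 11.0, 9.5, 12.0, 13.0])

print("MAE:", mean_absolute_error(observed, predicted))  # average absolute error
print("MSE:", mean_squared_error(observed, predicted))   # penalizes large errors more

# A simple baseline: always predict the average of the observed series.
baseline = np.full_like(observed, observed.mean())
print("Baseline MAE:", mean_absolute_error(observed, baseline))
```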

Algorithm descriptions

The following algorithms were chosen for the subsequent inflow forecasting model building; a brief code sketch instantiating them is provided after the list.

1. Random forest

Random forest is a supervised machine learning algorithm which can be used both for classification and regression. It is an ensemble algorithm, meaning that it combines many ‘weak learners’ into one strong model. During training, the random forest algorithm constructs many ‘weak learners’ known as decision trees. Each tree is different and tries to gain information about the data based on different binary split conditions at each node of the tree; one example could be dividing the data based on whether one feature is greater than or less than some value [25]. Random forest utilizes a technique called feature bagging, which makes sure that the split conditions at the nodes only consider a subset of all features available. The resultant output from the random forest algorithm is the average over all decision trees constructed [25].

2. Support vector machines

Support vector machines are used for classification, regression and anomaly detection. The support vector machine creates a mapping of all data points into a p-dimensional space, where p is equal to the number of features used. In regression, this mapping allows a hyperplane which maximises the margin to be fitted to the data points [26].

3. K-nearest neighbour

The k-nearest neighbour algorithm is relatively simple, which makes it very fast. The algorithm's main component is the size of the neighbourhood, that is, how many of the most similar examples from the training data should be used when predicting a new value. When used for regression, the resultant prediction from the algorithm becomes the weighted or unweighted average of the neighbourhood [27].

4. Linear regression

Linear regression uses coefficients for each of the features in the input space to map to a prediction. If a feature is more important, its coefficient becomes larger. The algorithm tries to minimize the residual sum of squares between the approximation made by the algorithm and the targets in the data set [28].
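As an illustration, the sketch below instantiates the four regression algorithms described above with scikit-learn and evaluates them on synthetic data; the hyperparameter values shown are arbitrary examples and not the settings used in this thesis.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Synthetic data standing in for the inflow features and target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = X[:, 0] * 2.0 - X[:, 1] + rng.normal(scale=0.2, size=300)
X_train, X_test, y_train, y_test = X[:240], X[240:], y[:240], y[240:]

# The four regressors described above; hyperparameter values are arbitrary examples.
models = {
    "random forest": RandomForestRegressor(n_estimators=100, max_features="sqrt", random_state=0),
    "support vector regression": SVR(kernel="rbf", C=1.0),
    "k-nearest neighbours": KNeighborsRegressor(n_neighbors=5, weights="distance"),
    "linear regression": LinearRegression(),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    mae = mean_absolute_error(y_test, model.predict(X_test))
    print(f"{name}: MAE = {mae:.3f}")
```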

2.1.3 Inflow forecasting using machine learning

Machine learning models have recently been widely used in inflow forecasting. Researchers have proposed many different machine learning algorithms for inflow forecasting. In some of the research, the machine learning models are compared to other types of inflow forecasting methods which could shed light on the performance of the machine learning models. Previous research could also give an indication of how the machine learning model created in this thesis could perform compared to the current HBV tool used in the case river system.

One study developed a machine learning model for predicting one-month-ahead inflow [29]. The researchers compared the machine learning model to the existing inflow tool, which was an Auto Regressive Moving Average (ARMA) tool. The machine learning model was trained on data from the 30 previous years. The machine learning model performance was evaluated using the Root Mean Squared Error (RMSE), which is a common measure of the difference between values predicted by a model and observed values. The average RMSE value was decreased by 9.9-21 % using machine learning models compared to the forecasts of the ARMA model.

Vos et al. compared the performance of hourly inflow forecasting between a machine learning model and an HBV model [30]. The researchers evaluated the performance of the models using three different metrics, including the MSE. For one-hour forecasts, the machine learning model slightly outperformed the HBV model according to the researchers. For longer time frames, the HBV model performed slightly better.

Tokar et al. performed a number of evaluations of the difference in performance between inflow forecasts from conceptual hydrological models and inflow forecasts from a machine learning model for different reservoirs in the United States [31]. The machine learning model was trained on data from the 29 previous years. The evaluation metrics were the R2-value and the ratio between the RMSE and the standard deviation (Std). For one-month-ahead inflows, the machine learning forecast had an average R2-value of 0.58 while the hydrological model had an R2-value of 0.38. The RMSE-Std ratio was on average 20 % lower for the machine learning model compared to the hydrological models. For one-day-ahead inflow forecasts, the machine learning model produced similar results, with an R2-value of 0.85 compared to 0.6 for the hydrological model and a lower RMSE-Std ratio of on average 38 %. Thus, in this research, the machine learning model performed better than the hydrological model. Additionally, the machine learning model shortened the amount of time spent training the model compared to the hydrological model.

Gaume et al. also compared the performance of inflow forecasts using a machine learning model and an HBV model [32]. The machine learning model performed considerably worse than the HBV model. The MSE was roughly 50 % higher for the machine learning model compared to the HBV model. The researchers expressed that the performance of the machine learning model was disappointing, although they admit that improvements could be made to the model by using cross-validation, which was not used in their model.

Xu et al. developed an Artificial Neural Network (ANN), which is a type of machine learning algorithm, for short-term inflow forecasting [33]. The model forecasts inflows into a hydropower reservoir one to seven hours ahead. The input data for the model were precipitation, mainstream inflow and local inflow for the preceding four to six hours, as well as precipitation and inflow for the current hour. The authors deemed the machine learning model used to be suitable for short-term inflow forecasts. For one-hour-ahead forecasts the R-value for the model was 0.98 and for seven-hours-ahead forecasts it was 0.95. They stress that the training set is of importance when constructing machine learning models. They strongly advise including more data in the training sets, particularly the peak flows with extreme values. As more data becomes available, they suggest retraining the model.

Farias et al. developed a machine learning model for monthly inflow forecasting to be used in water reservoir management for a particular reservoir located in a city in Japan [34]. The water management tool finds the optimal water release, using inflow forecasts as input. As such, the researchers tested whether the machine learning inflow forecasts could result in better performance by the water management tool. The machine learning model used in the research predicts the current month's inflow as a function of the inflow of the previous month and the precipitation of the current month. The machine learning model provided excellent inflow forecasts and the researchers deemed it very suitable for monthly inflow forecasts. The machine learning model was provided with training data for the previous 12 years. Although this thesis will not focus on reservoir performance, this research gives an indication that machine learning models for inflow forecasting could perform well for different applications.

The research above points to the fact that machine learning models can perform at least as well as other inflow forecasting methods in certain scenarios. This becomes important when comparing the results of the machine learning model created in this thesis to the current inflow forecasting tool used in the case reservoir. Should the results acquired in this thesis differ significantly from the results of previous research, it could give an indication that the conditions are not optimal for developing a machine learning model and that further analysis needs to be conducted.

2.1.4 Data quality for machine learning

The quality of data is an important consideration for building any machine learning model. Data quality refers to the fitness of the data [35]. More specifically, data quality is often measured as a function of a set of dimensions such as data accuracy, data currency and data consistency [35]. Although data quality has always been an important issue, the recent rise of big data and machine learning has made data quality even more important. In recent years, there has been an increase in the number of companies conducting data-driven analysis for operational and strategic decision making. Lack of data quality is the number one factor impeding advanced analytics implementations in organizations [36]. In machine learning models for prediction, data quality is of particular importance as the machine learning model learns from and looks for patterns in the training data in order to make predictions. The effectiveness, i.e. how well the model performs, is evaluated using a subset of the data which was not used for model building. Thus, the performance of machine learning models is used as an indirect measure of data quality [35]. In other words, the model can only perform as well as the data it is trained and tested on.

2.1.5 Manifestations of poor data quality

There are often many misconceptions in organizations about how poor data quality affects them. Organizations often overestimate data quality and underplay the implications of poor-quality data [35]. The consequences of poor data quality can vary drastically and sometimes have detrimental effects on the organization. The reality, however, is that data extracted from real-world scenarios often contain noise, which decreases the quality of the data [37]. This can also affect the learning process of the machine learning model, leading to inaccuracies, as machine learning algorithms are developed based on the assumption of clean data [37]. The amount of inaccuracy in the model due to noise depends on the amount of noise in the data as well as the type of machine learning algorithm used. There have been many studies quantifying the effects of noise on machine learning performance [37][38][39]. The general conclusion seems to be that some noise in the data is both expected and accepted, especially for real-world applications. When the noise-to-signal ratio becomes too large, however, the model starts performing increasingly worse.

2.2 Single Source of Truth

For any corporation, it is important that data sets are accurate and that everyone in the organization uses the same data when making business decisions. Single Source of Truth is the practice of organizing and structuring data sets so that any change in a data set can only be made in a centralized manner [40]. One way of doing this is to set up a centralized storage point where only authorized personnel are allowed to make changes. The storage point, which is often cloud based, contains one authoritative copy of all crucial business data [39]. When changes are made or data gets updated, any data linking to the centralized data set is automatically updated. This way the organization can guarantee that every employee has access to the same data, structured in the same format and logic. The purpose of this is that all divisions in an organization base business decisions on the same data and that data silos are prevented. This is becoming even more important since many organizations are aiming to make more data-driven decisions, as these have been shown to increase firm performance [41]. Inconsistent and erroneous data, however, impede the organization's ability to understand its current performance and make forecasts of the future [42]. This highlights the importance of basing data-driven decisions on accurate data.


2.3 Multi-level perspective

The multi-level perspective is a means to describe and explain technological transitions. The multi-level perspective explains technological transitions both as a process of variation, selection and retention and as a process of unfolding and reconfiguration [43]. The multi-level perspective allows for a broader analysis of the mechanisms and environments surrounding an innovation instead of looking at discrete technological developments in isolation. With this frame of analysis, the focus is put on the interactions between different systems and the systems' inherent resistance to change [43].

The multi-level perspective contains three different levels: the landscape, the socio-technical regime and the technological niche. The levels are nested, as shown in Figure 1, meaning that the technological niches are embedded in the socio-technical regimes, which in turn are embedded in the landscape [43]. The landscape is on a macro level, is defined as an external structure, and is affected by larger outside forces. Changes to the landscape are slow and provide trajectories for developments within the socio-technical regime [43]. The socio-technical regimes are on a meso level and provide stability for existing technological development. The niches are on a micro level and exist in a vacuum where experiments and development of radical innovations take place [43].

Figure 1 - Multiple levels as a nested hierarchy [43].

The interaction between the different levels is determined by their respective continuous changes and developments. The landscape can be described as the deep-rooted structures of society including, for example, physical compositions of cities, factories, electrical infrastructure and highways [43]. The landscape includes factors such as wars, economic growth, large political developments, cultural and normative values and environmental problems. Changes to the landscape are slow and independent of individual actors. The individual actors in the regimes and niches are pressured by the landscape and are thus affected by trends in the landscape factors exemplified above [43].

Whilst developments in the landscape might act as a destabilizing force, the regimes act to protect existing technological development. The regime cultivates incremental improvement along a trajectory [43]. The regimes contain ‘rule-sets’ for processes, technologies and culture in corporations and institutions. Regime shifts occur as an accumulation of smaller changes over time. The niches are places for radical innovation within the regime [44]. The initial innovation process in the niche is described as separated from markets and normal regulation. The radical innovations are often fundamentally different from those in the regime and are subsequently often misaligned with the current structures and systems in place [44].

Stability in the regime is influenced by the developments in the landscape and the pressure exerted by radical innovations stemming from the niches. Usually it is the slow developments in the landscape that open up windows of opportunity in the regimes for innovations from the niches [43]. The opportunities occur as tensions between different parts of the regime which can be filled by niche innovation. Innovation coming from the niches can initially link up with old technology and function together with it to avoid head-to-head competition [43]. Figure 2 illustrates the path of a radical innovation successfully transitioning from the niche into the regime and, in the end, adding pressure to the landscape. The small vertical arrows show the pressure and tension between the levels, while the innovation trajectories are shown as the long arrows [43]. At times the arrows for development within the socio-technical regime diverge briefly as a result of tensions, which might lead to a window of opportunity for niche innovation [43]. The multi-level perspective explains technological transitions as a combination of the stabilizing forces in the regime and the destabilizing forces from the landscape and the niche.

Figure 2 - A dynamic multi-level perspective on technological transitions [43].


3. Methodology

In this section, we will outline our research methodology and how it will help us to answer the research questions.

3.1 Research process

The purpose of the study was to investigate what factors affect the quality of inflow forecasts. This was done by developing and analysing the performance of a machine learning model on the inflow to the case river system described in Chapter 4. The study explores how machine learning can be used to improve inflow forecasting accuracy for a pumped hydro storage plant. The first part of the study consisted of identifying the study problem together with the case company supervisor and formulating research questions to guide the study. In order to gain a broad knowledge base of hydropower in general and to help guide what factors could be important when building a machine learning model for inflow forecasting in the case river system, interviews with experts in various areas at and in connection to the case company were conducted.

The experts interviewed were all employees of the case company, except for two who were in close contact with the case company. In addition to the interviews, our supervisor at the case company provided us with knowledge regarding the case river system through presentations and a field trip. The interviews were important for our general understanding of hydropower and inflow forecasting. The interviews had a direct impact on the architecture of the machine learning model, both in terms of which time horizon to look at and which data should be used. After the interviews, the machine learning model building and evaluation took place. Before building the machine learning model, we decided upon the model characteristics and which variables to include in the model. Then the data were gathered and cleaned, after which the model building began. In the last stage, the model was compared for the period of one month to the current HBV model used for inflow forecasts in the case river system. The results from the model, along with insights from the interviews with experts, formed the basis of our conclusions and further discussions. The research design used in this thesis thus fits well within the mixed methods research design.

Mixed methods is a type of research where the researchers use a combination of qualitative and quantitative approaches [45]. Often, mixed methods are used when the research problem cannot be answered using only a qualitative or a quantitative approach. In this study, Research question 1 was answered through a mix of interviews and interpretation of results from the machine learning model. Research question 2 was mostly answered through the machine learning model itself and through evaluating it against the current HBV model used by the case company today. As such, a single data source, qualitative or quantitative, was deemed insufficient for this study, which is why a mixed methods approach is suitable.

The mixed methods approach was determined at the beginning of the research process, making it a fixed mixed research design, as opposed to an emergent mixed research design. The exploratory sequential design process according to Creswell was used in this thesis [45]. This research design prioritizes the collection of qualitative data in the first phase of the research. Based on the qualitative data, the researcher conducts a development phase by defining features of the quantitative model. Finally, quantitative testing is conducted, after which the results are interpreted and analysed.

The exploratory sequential design research process chart below shows the research process. Qualitative data collection and analysis took place in the first phase of the research. The qualitative data collection was in the form of semi-structured interviews. The interviews helped in the thought process of structuring the machine learning model. Important factors such as measurement of precipitation and inflow, inflow time horizons and what features are of special interest in regard to inflow were considered for the next phase of the study. This information was used in the development phase where the actual machine learning model was structured and designed. Then, the model was developed according to the process described in section 3.3.

Figure 3 - Exploratory sequential research design.

3.2 Qualitative method

For the qualitative part of the study, interviews and a field study were conducted. The interviews and the field study were held early in the research process.

3.2.1 Interviews

Blomkvist and Hallin [46] recommend that interviews be conducted early in the research process to get a good base understanding of the study problem. The interviews were also conducted to assist in constructing and designing the machine learning model. The initial interview selection was made by our supervisor at the case company according to who he thought would provide valuable knowledge of the research problem. The interviews were limited to experts at, or in contact with, the case company. An overview of the conducted interviews can be found in the table below.

The interviews were conducted in a semi-structured manner [47]. This means that the topics were predetermined and the questions asked were of an open-ended character. Using this interview style, the researcher has more control over the topics than in an unstructured interview, but there is no fixed range of answers [48]. This interview style was chosen as we did not want to restrict the interviews unnecessarily, while also not getting too far outside the study scope. Open-ended questions were asked to allow the interviewees to explore the area freely and to allow them to spot useful leads and pursue them [47]. Before the interviews a set of topics was decided upon based on the expertise of the interviewee. Some specific questions were written beforehand, but most of the questions asked were follow-up questions based on the interviewees' statements.


A basic form of coding was conducted for the analysis of the interviews. In qualitative research, coding is the process of generating ideas and concepts from raw data [49]. The data could be in many different forms, including interviews, articles and field notes. Specifically, coding was done to find patterns which could be used in the thought process of developing the machine learning model. Notes from the interviews were taken together with audio recordings. Based on the notes and the recordings, a number of themes emerged across the different interviews.

Interviewee | Area of expertise | Length | Date
A | Mid-term planner | 1 hour | 18/2-20
B | Mid-term planner and water regulation expert | 45 min | 19/2-20
C | Short-term planner | 1 hour | 20/2-20
D | Operations manager | 1 hour | 21/2-20
E | Dam safety expert | 45 min | 21/2-20
F | Turbine expert | 30 min | 26/2-20
G | Värmland and water expert | 1 hour | 19/3-20
H & I | Hydrological experts, not employees of the case company | 1 hour | 11/3-20
J | Hydrological expert | 1 hour | 19/3-20
K | Business developer | Continuous interviews | 15/1 to 30/4-20

Table 1 - Overview of the interviews

3.2.2 Field trip

In addition to the interviews, a field trip to the case river system was made. The reason for the field trip was to get hands-on insight into how the river system is connected, as well as to have the opportunity to talk to on-site personnel. As the field trip was made very early in the thesis process and the full scope of the thesis was not yet clear, the questions asked were broad and aimed at increasing the overall knowledge of hydropower and the river system at large. The on-site personnel possess knowledge regarding details of the area, such as regulations and the hydropower plant itself.

3.3 Machine learning and data

In this section, the machine learning method is explained. To answer the research question of how local inflow forecasting can be improved by utilizing machine learning, the approach taken in this thesis focused to a great extent on how to create the best conditions for a machine learning model for inflow forecasting. The purpose of the machine learning model created in this thesis, which predicts local inflow into a river system, was to evaluate what features and data are important when predicting inflow rather than to build the most accurate model possible. Thus, the model created will not be used in production and is part of an explorative approach to answering the research questions posed.

3.3.1 Data collection and data cleaning

The data were obtained from case company sources as well as from the Swedish Meteorological and Hydrological Institute (SMHI). The data were limited to the years 2015-2020. The weather data (precipitation and temperature) were gathered from the three weather stations closest to the river system, as seen below in Figure 4. The historical data on realised inflow to the river system were taken from the case company database.

Figure 4 - SMHI weather stations with added annotation of the location of the Lake Kymmen PHES.

The data used in the machine learning model is provided below. The data gathered was on an hourly basis during the years 2015-2020.

Hourly precipitation - How much rain has fallen in mm to one decimal point, gathered from SMHI weather stations Sunne A, Gustavsfors A and Arvika A.

Hourly temperature - Average temperature in Celsius to one decimal, gathered from SMHI station Sunne A.

Hourly inflow - The combined water inflow to Lake Kymmen, Lake Skallbergssjön and Lake Gransjön was gathered from the case company database. Inflow into Lake Skallbergssjön and into Lake Gransjön ultimately ends up in Lake Kymmen, which is why these lakes were of interest in this thesis. The historical inflow data are calculated based on measured changes in water levels in the lakes minus the estimated output from all gates currently open and the estimated water quantity going through the pump-turbine.

All the data passed through Microsoft Excel as an intermediary and were later loaded into a Pandas data frame.

Pandas is an open source data analysis and manipulation tool for the programming language Python. The data were cleaned in order to handle missing values and anomalies. For missing precipitation values, a value of zero mm was used. For missing temperature values, the value of the previous hour was used. As very few values were missing, no hourly data were discarded and the few missing values were filled in as described above. The calculated values of the inflow to the case river system needed more cleaning. As there are uncertainties in the output quantities from the gates, in the quantity going through the pump-turbine (in either direction) and in the reservoir water level measurements, some approximations had to be made. An example of this is when wind causes the water in the reservoir to move, which can affect the water level measurement. This can cause an initial increase in calculated inflow, and when the wind stops the calculated inflow can become negative (which is not possible considering that the evaporation from the lakes is next to nothing). Hourly values were aggregated to a weekly level. The date was cyclically encoded to a sine and a cosine value to better represent the cyclical nature of a year. Negative historical inflow values were set to zero. The data cleaning process was performed mostly using fill functions in the Pandas library, filling with zero or the previous value as described above.
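A minimal sketch of these cleaning and transformation steps, applied to made-up hourly data (the values, fill strategies and aggregation choices are illustrative assumptions, not the case data set), could look as follows:

```python
import numpy as np
import pandas as pd

# Three weeks of made-up hourly data standing in for the real measurements.
idx = pd.date_range("2015-01-01", periods=24 * 21, freq="h")
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "precipitation": rng.exponential(0.2, size=len(idx)),
    "temperature": 5 + 3 * rng.standard_normal(len(idx)),
    "inflow": 10 + rng.standard_normal(len(idx)),
}, index=idx)
df.loc[df.sample(frac=0.01, random_state=0).index, "precipitation"] = np.nan  # fake gaps

df["precipitation"] = df["precipitation"].fillna(0.0)  # missing precipitation -> 0 mm
df["temperature"] = df["temperature"].ffill()           # missing temperature -> previous hour
df["inflow"] = df["inflow"].clip(lower=0.0)             # negative calculated inflow -> 0

# Aggregate the hourly values to a weekly level.
weekly = df.resample("7D").agg({"precipitation": "sum",
                                "temperature": "mean",
                                "inflow": "mean"})

# Cyclically encode the time of year with sine and cosine components.
day_of_year = weekly.index.dayofyear
weekly["doy_sin"] = np.sin(2 * np.pi * day_of_year / 365.25)
weekly["doy_cos"] = np.cos(2 * np.pi * day_of_year / 365.25)
print(weekly.head())
```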

3.3.2 Model building machine learning

To allow for several predictive outputs from the model, a recursive machine learning model structure was chosen. The chosen outputs had equal interval lengths, the coming 1-7 days and 8-14 days, which allowed for this approach rather than creating several different models. Subsequently, data transformation included aggregating the hourly data into seven-day chunks. Thus, the 8-14 day forecast used the 1-7 day forecast output as input.
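The sketch below illustrates the idea of such a recursive structure under assumed, simplified conditions: a single model predicts next week's inflow, and its 1-7 day prediction is fed back as an input when producing the 8-14 day forecast. The feature set and data are made up and do not reflect the actual model configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Made-up weekly training data: this week's inflow and precipitation predict next week's inflow.
rng = np.random.default_rng(0)
n = 260
inflow_now = 10 + rng.normal(size=n)
precip_now = rng.exponential(5, size=n)
inflow_next = 0.7 * inflow_now + 0.1 * precip_now + rng.normal(scale=0.3, size=n)

X = np.column_stack([inflow_now, precip_now])
model = RandomForestRegressor(random_state=0).fit(X, inflow_next)

# Forecast: days 1-7 from observed data, days 8-14 from the model's own prediction.
latest = np.array([[10.5, 4.0]])                        # [current inflow, current precipitation]
pred_1_to_7 = model.predict(latest)[0]
pred_8_to_14 = model.predict([[pred_1_to_7, 4.0]])[0]   # assumed precipitation forecast reused
print(pred_1_to_7, pred_8_to_14)
```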

The scikit-learn machine learning library for Python was used to create and compare different machine learning models. Several different machine learning algorithms were tested and evaluated: random forest regression, k-nearest neighbour regression, support vector regression and linear regression.

Ten-fold cross validation was used to test and evaluate the algorithms, which means that the data were divided into a training set and a test set in the ratio 9:1 ten times to train and test the models. In the cross validation, the data were not shuffled as the data sets are dependent on time and thus in chronological order. MAE was used as the evaluation metric. The different algorithms were compared by their MAE and the algorithm with the lowest error, which in this thesis was the random forest algorithm, was used for further development. An iterative approach was used to continue to improve the data transformation and allow for a better model, that is, adding and removing different features depending on the performance of the model. For the best algorithm, randomized grid search was used for hyperparameter tuning. Hyperparameters are parameters that are set before training the model. Therefore, in order to choose the best hyperparameters for the model, several combinations have to be tested in a grid search. A randomized grid search runs tests over random combinations of the chosen hyperparameters instead of testing all possible combinations. Randomized grid search was performed due to limited computational power. After keeping the best performing model and hyperparameters, an additional non-randomized grid search was performed only on hyperparameters slightly different from the result of the randomized grid search. This resulted in the finalized model.
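The following sketch shows how a randomized search followed by a narrower grid search can be set up with scikit-learn; the parameter ranges and data are illustrative assumptions, not the values used for the final model.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold, RandomizedSearchCV

# Synthetic weekly data standing in for the real feature matrix and inflow target.
rng = np.random.default_rng(0)
X = rng.normal(size=(260, 6))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.2, size=260)

cv = KFold(n_splits=10, shuffle=False)  # time-ordered folds, no shuffling

# Step 1: randomized search over a broad set of candidate hyperparameters.
random_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions={"n_estimators": [50, 100, 200, 300],
                         "max_depth": [3, 5, 10, 20]},
    n_iter=10, cv=cv, scoring="neg_mean_absolute_error", random_state=0)
random_search.fit(X, y)
best = random_search.best_params_

# Step 2: a narrow, exhaustive grid search around the best randomized-search result.
grid_search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [best["n_estimators"] - 25,
                                 best["n_estimators"],
                                 best["n_estimators"] + 25],
                "max_depth": [best["max_depth"] - 1,
                              best["max_depth"],
                              best["max_depth"] + 1]},
    cv=cv, scoring="neg_mean_absolute_error")
grid_search.fit(X, y)
print("Final hyperparameters:", grid_search.best_params_)
```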

3.3.3 Evaluating the machine learning models

As mentioned previously, the different machine learning algorithms were evaluated and compared against one another in a ten-fold cross validation using the MAE. The algorithm with the lowest MAE was then further developed and compared to baselines and the current HBV model in place. The aim was to compare the final machine learning model with the HBV model on the whole data set. However, due to complications, we were not able to receive historical HBV model inflow forecasts for more than the period of January 2018, in total 30 data points (as the forecast for January 16th was missing). Thus, the resultant comparison is not complete. To provide some additional comparison, two baselines were created. The first baseline predicts the coming inflow using the previous value, meaning that the average inflow for the next week and the week after that is predicted to be the same as for the week before. The second baseline predicts the coming inflow using the average over the whole data set. This comparison allowed for an evaluation of the machine learning model over the whole period whilst having a more specific comparison with the HBV model for the period of January 2018.
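A minimal sketch of the two baselines on a made-up weekly inflow series (values and units are illustrative only) could look as follows:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error

# Made-up weekly inflow series standing in for the realised inflow data.
weeks = pd.date_range("2015-01-05", periods=260, freq="7D")
inflow = pd.Series(10 + np.random.default_rng(0).normal(size=260), index=weeks)

# Baseline 1: next week's inflow is predicted to equal the previous week's inflow.
previous_week = inflow.shift(1).bfill()

# Baseline 2: next week's inflow is predicted to equal the average over the whole data set.
overall_average = pd.Series(inflow.mean(), index=weeks)

print("Previous-value baseline MAE:", mean_absolute_error(inflow, previous_week))
print("Average baseline MAE:", mean_absolute_error(inflow, overall_average))
```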

3.4 Research quality

In the following section, the reliability and validity of the research will be presented.

3.4.1 Reliability

Reliability refers to the degree of consistency of the research. It indicates to what degree the research is bias-free and measures the precision, repeatability and trustworthiness of the research [50].

In terms of the qualitative part of the thesis, which was the interviews, the main objective was to increase our general knowledge of the research area and to aid the development of the machine learning model. As the interviewees were asked to give their own personal answers to each question, we cannot completely rule out possible biases. Moreover, one disadvantage of using semi-structured interviews, as was done in this thesis, is that it becomes difficult to exactly replicate the interviews [51]. This was partly accounted for by researching the field in which the interviewee was knowledgeable and asking relevant in-depth follow-up questions.

In terms of the quantitative part of the thesis, which was the development of the machine learning model, the relevant data used were provided by the case company, together with open weather data from SMHI. A discussion could be had regarding the amount of data used in the machine learning model. It is possible that repeating this study with more data used in the model would have produced different results. However, given the limitations of this thesis, we deem the developed model to be reliable. This is discussed further in Chapter 6.


3.4.2 Validity

Validity refers to the extent to which a research instrument measures what it aims to measure [52]. In the case of this thesis, the research instrument was the machine learning model developed for inflow forecasting in the case river system.

The machine learning model was developed to explore what factors affect the quality of inflow forecasts and to investigate how a machine learning model could improve the inflow forecasts in a river system. The model was developed using insights from interviews with experts in the field as well as from other research, which increases the validity of the research. The developed machine learning model aided us in exploring both of the aims mentioned above.


4. Empirical research

This section starts by describing characteristics of the investigated power plant and surrounding river flows and reservoirs. Then, insights from the interviews will be presented.

4.1 Area description

This thesis is delimited to the inflow to Lake Kymmen, located in Värmland, Sweden. The study area is the Lake Kymmen river system, which includes Lake Kymmen, Lake Skallbergssjön, Lake Gransjön and Lake Rottnen, the rivers River Granån, River Kymsälven and River Rottnan, and a throughput tunnel between Lake Skallbergssjön and Lake Kymmen. The Lake Kymmen reservoir, where the hydropower plant included in our scope is located, receives inflow mainly from the tunnel from Lake Skallbergssjön, with additional water entering the tunnel from Lake Gransjön. Inflow into Lake Gransjön and Lake Skallbergssjön ultimately ends up in Lake Kymmen, which is why the combined inflow into Lake Gransjön, Lake Skallbergssjön and Lake Kymmen is of interest in this thesis.

The main inflow to Lake Skallbergssjön is River Rottnan, with smaller rivers also contributing to the inflow. When the gates of the throughput tunnel from Lake Skallbergssjön to Lake Kymmen are open, water flows freely. The amount of water flowing into Lake Kymmen through the tunnel is determined by the respective water levels of the two lakes: the bigger the difference in water level between Lake Skallbergssjön and Lake Kymmen, the more water flows from the former to the latter through the tunnel. The water can also flow the other way in some special cases. There is currently no way of knowing the exact amount of water flowing from Lake Skallbergssjön to Lake Kymmen through the tunnel. Rough approximations have been made by the case company to try to calculate the flow, however the approximations are not 100 % accurate.

Figure 5 - Overview of the Lake Kymmen river system.
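Purely as an illustration of what such a head-difference approximation can look like, a simple orifice-style estimate is sketched below. Neither the formula nor the constants (tunnel cross-section area, discharge coefficient) come from the case company; they are assumptions chosen only to show the shape of the relationship.

```python
# Purely illustrative sketch of an orifice-style tunnel flow estimate.
# The formula and all constants are assumptions, not the case company's method.
import math

def tunnel_flow_estimate(level_skallbergssjon_m: float,
                         level_kymmen_m: float,
                         area_m2: float = 10.0,          # hypothetical cross-section
                         discharge_coeff: float = 0.6):  # hypothetical coefficient
    """Rough flow estimate (m^3/s) from Lake Skallbergssjön to Lake Kymmen.

    A negative value indicates flow in the opposite direction, which according
    to the interviews can occur in some special cases.
    """
    head = level_skallbergssjon_m - level_kymmen_m
    g = 9.81  # gravitational acceleration, m/s^2
    flow = discharge_coeff * area_m2 * math.sqrt(2 * g * abs(head))
    return math.copysign(flow, head)

# Example with made-up levels: a 0.5 m head difference gives roughly 18.8 m^3/s
# with these assumed constants.
# print(tunnel_flow_estimate(82.5, 82.0))
```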

4.1.2 Regulations

There are many smaller rivers both upstream and downstream of these lakes. According to the interviews, the minimum streamflow of these rivers is governed by local regulations and water rights permits.

In practice, this means that there is a minimum flow that needs to be maintained in each river. Water running in these smaller rivers is, in other words, taken from the lakes. One interviewee mentions that the exact amount of water flowing out of the lakes is not measured accurately. For example, minimum flows to some of the nearby rivers are controlled by gates placed at different levels. The higher the water level of the lakes, the more water flows through the gates, which makes it impossible to control exactly how much water flows through to the rivers. The regulated minimum flows also vary throughout the year.

Another important consideration is the water level regulation of the lakes. According to one interviewee, each lake has a maximum and a minimum reservoir water level that must be respected at all times. These levels differ greatly between summer and the rest of the year. There are also regulations regarding pumping in the PHES plant; as per the water rights permits, no pumping is currently allowed during the summer months. This affects the planning by limiting the dispatch of the hydropower plant, as the regulations regarding reservoir water levels must be obeyed.

4.1.3 Other parties affected by PHES water operations

The Lake Kymmen river system has many different stakeholders, each with their own interests. There is a large fishing community to whom the minimum flow of the smaller rivers is important. The case company makes large efforts to provide a suitable habitat for the surrounding nature within reason. The water levels are regulated within what is governed by law, but according to one interviewee additional consideration is given to the fishing community as well as to the nearby community in general. When possible, the water level is kept neither too high nor too low during the summer to allow for swimming and other activities along the shore.

4.1.4 Inflow forecasting at the case company

Inflow forecasting for the case company in the province of Värmland is currently assigned to a third-party vendor. The vendor uses a hydrological model, which aims to describe the physical movement of water, to predict the water inflow into the river system. As mentioned previously, the hydrological model currently used for inflow forecasting is based on the HBV model developed by the Swedish Meteorological and Hydrological Institute, one of the most widely used hydrological models for inflow forecasting around the world. This section describes the HBV model used in more detail. An overview of some of the hydrological concepts used in hydrological models is provided in Appendix A.1.

The HBV model was created to assist hydropower operations and is one of the main hydrological models used in hydropower plants for inflow forecasting [53]. It is a precipitation-runoff model, which means that it describes water inflow into a reservoir using precipitation as the main input.

Precipitation is measured at nearby weather stations and input into the model every day. The specific type of HBV model used in Värmland has roughly 60 parameters according to one interviewee. These parameters describe conditions in the surrounding area that affect the movement of water, such as soil characteristics, snow melt and evaporation. A description of some of these parameters (e.g. soil moisture and interception) is provided in Appendix A.1. Note that this is a completely different approach from inflow forecasts generated by machine learning models, which do not explicitly model the physical relationships that govern water movement. Although these parameters are central to the HBV model, they are not measured directly. Instead, they are calibrated and fitted using the realised inflow data to create the best fitting model. In practice, this means that the parameter values change every day depending on how the prediction of the HBV model performs compared to the actual realised inflow.
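The calibration idea can be sketched with a deliberately simplified toy model, shown below. The two-parameter precipitation-runoff function, the optimizer and the example data are all assumptions made for illustration; the actual HBV model has roughly 60 parameters and a far more detailed water balance.

```python
# Toy sketch of calibrating unobserved parameters against realised inflow.
# This is NOT the HBV model; it only mimics the fitting idea described above.
import numpy as np
from scipy.optimize import minimize

def simulate_inflow(params, precipitation):
    """Toy precipitation-runoff model with two parameters."""
    runoff_coeff, baseflow = params
    return runoff_coeff * precipitation + baseflow

def calibration_error(params, precipitation, observed_inflow):
    """MAE between simulated and observed inflow for a given parameter set."""
    return np.mean(np.abs(simulate_inflow(params, precipitation) - observed_inflow))

# Hypothetical observations (mm/day and m^3/s respectively).
precipitation = np.array([0.0, 5.0, 12.0, 3.0, 0.0, 8.0])
observed_inflow = np.array([1.1, 3.8, 8.0, 2.9, 1.2, 5.6])

result = minimize(calibration_error, x0=[0.5, 1.0],
                  args=(precipitation, observed_inflow), method="Nelder-Mead")
calibrated_params = result.x  # would be re-fitted as new observations arrive
```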

The HBV model used by the third-party vendor produces long-term inflow forecasts. The forecasts are made once every day, and each forecast covers the daily inflow for the coming two years. The model uses precipitation observations from 51 historical years to create inflow scenarios, which makes it computationally heavy. Thus, for each forecast, 51 different inflow scenarios are produced by the model, of which the third-party vendor provides only a few to the case company.
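The scenario generation can be sketched as follows. The function name, the simulate method and the data structure are placeholders; the snippet only illustrates the idea of replaying each of the 51 historical precipitation years through the calibrated model to obtain one inflow scenario per year.

```python
# Illustrative sketch of ensemble inflow scenarios: each historical precipitation
# year is replayed through a calibrated model, giving one scenario per year.
# `calibrated_model` (with a `simulate` method) and the input dict are placeholders.

def generate_inflow_scenarios(calibrated_model, precipitation_by_historical_year):
    """Return a mapping from historical year to a simulated daily inflow series."""
    scenarios = {}
    for year, precipitation_series in precipitation_by_historical_year.items():
        # Each historical precipitation record represents one possible future.
        scenarios[year] = calibrated_model.simulate(precipitation_series)
    return scenarios

# With 51 historical years this yields 51 inflow scenarios, of which, according
# to the interviews, only a few are passed on to the case company.
```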

For every reservoir, there is an area in which precipitation leads to inflow to that specific reservoir. This area is called the river basin. Each river basin is in turn divided into sub-basins according to the land characteristics, so that the physical calculations are roughly the same within each sub-basin. For example, should a basin consist of one part with a lot of vegetation and another part with no vegetation, this basin is divided into two sub-basins. According to one interviewee, in the case of the Lake Kymmen river system, the river basin of Lake Skallbergssjön is divided into 12 different sub-basins, while the river basins of Lake Kymmen and Lake Gransjön are divided into 4 sub-basins each. This is done to get the most accurate model of water inflow.

4.2 Insights from interviews

The interviews were held at the beginning of the thesis to obtain a broad understanding of the planning and operations of the case hydropower plant. The interviews led to interesting discussion points and, most importantly, helped shape the structure of the machine learning model. Below, key takeaways from the interviews are grouped into themes considered important for building a machine learning inflow forecast model to compare with the current HBV model in place.

4.2.1 Time horizon of inflow forecasts

One important aspect of creating a forecast is choosing a time horizon that makes the forecast as useful as possible. This would also be essential when later designing the machine learning model. Two interviewees explained that inflow forecasts more than two weeks ahead decrease rapidly in accuracy as weather and precipitation become more difficult to predict. All interviews indicated that time frames longer than two weeks are not of interest for a machine learning model for inflow forecasting at this time, as precipitation forecasts become increasingly inaccurate beyond two weeks ahead. This stands in contrast to the desire to also improve the long-term forecasts, which are predictions for years ahead. As two interviewees highlighted, it is also simpler to verify and compare results on shorter inflow forecasts, and there are more data points to train and test on when considering a two-week time horizon rather than a horizon of years. For answering the research questions by creating and analysing a machine learning model for inflow forecasting of this particular river system, the interviewees indicated that the predictions do not need to be on an hourly basis. Furthermore, having larger increments of time between forecast values simplifies the comparison with the realised inflow. One interviewee stated that it does not matter whether the precipitation, and thus the inflow, happens during a particular hour; what matters is that it occurs within a larger time range. Consequently, there is no need to compare forecasts on an hourly basis, and a longer time step is used in the machine learning model. As the reservoirs are relatively large, small changes in inflow have a minimal effect on the reservoir water levels.

One interviewee showed how inflow forecasts are used in short-term planning and operations. The interviewee made clear that the inflow forecasts themselves are not considered at the day-to-day short-term time horizon. What the short-term planning team considers in terms of inflow is the water value curve, which is created with the inflow forecasts as one of several inputs. The interviewee also stated that when inflow to a river system is particularly high, the team receives instructions on what to do, for example whether the hydropower plants need to be operated to a higher degree than originally planned or whether some spillway gates need to be opened.

Two interviewees presented their ways of working with the inflow forecasts. Both the mid-term planning and the inflow forecasts can be divided into two time horizons: two weeks ahead and one to two years ahead. The two-week inflow forecast is important for running current operations and optimizing electricity generation towards the highest electricity price on the spot market. The longer inflow forecast is important for a broader operational perspective as well as for abiding by the water regulations for different reservoirs, which require the water level to be higher during summer. One interviewee described how the water regulation team often makes manual changes to the inflow forecasts for the upcoming two weeks in the Värmland region specifically. This indicated that investigating the use of a machine learning model on a two-week time horizon could be beneficial and of interest for the given region. At the very least it could be used as support for making these manual changes, which according to the same interviewee are made ad hoc, looking at different data depending on who in the water regulation team is responsible for the changes.

4.2.2 Information sharing

In order for machine learning to be utilized in the setting of the case company, the communication between different divisions of the company needs to be taken into consideration. To improve inflow forecasting with the use of machine learning, it is important to understand where and in what way it could be useful.

The short-term and the mid-term planners work at opposite ends of the planning process, with different hierarchies of considerations and priorities. The short-term planners focus less on the longer time perspective, prioritizing prices, abiding by regulations and planning the operations of the hydropower plants. The mid-term planners showed how they use the inflow forecasts to produce a water value curve for the short-term planners to use when bidding on the electricity spot market. One interviewee mentions that the current HBV model takes a lot of time to simulate. A machine learning model runs a lot faster than the current HBV model.
