
Degree project 30 credits (Examensarbete 30 hp), January 2021

Using machine learning to predict power deviations at Forsmark

Albin Björn


Faculty of Science and Technology (Teknisk-naturvetenskaplig fakultet), UTH unit

Visiting address: Ångströmlaboratoriet, Lägerhyddsvägen 1, Hus 4, Plan 0
Postal address: Box 536, 751 21 Uppsala
Telephone: 018 – 471 30 03
Fax: 018 – 471 30 00
Website: http://www.teknat.uu.se/student

Abstract

Using machine learning to predict power deviations at Forsmark

Albin Björn

The power output at the Forsmark nuclear power plant sometimes deviates from the expected value. The causes of these deviations are sometimes known and sometimes unknown. Three types of machine learning methods (k-nearest neighbors, support vector machines and linear regression) were trained to predict whether or not the power deviation would be outside an expected interval. The data used to train the models was gathered from points in the power production process and the data signals consisted mostly of temperatures, pressures and flows. A large part of the project was dedicated to preparing the data before using it to train the models.

Temperature signals were shown to be the best predictors of deviation in power, followed by pressure and flow. The model type that performed the best was k-nearest neighbors, followed by support vector machines and linear regression.

Principal component analysis was performed to reduce the size of the training datasets, and models trained on the reduced data performed as well in the prediction task as models trained without principal component analysis.


Popular science summary (Populärvetenskaplig sammanfattning)

Electricity production in a nuclear power plant rests on water being pressurized, boiled, made to spin a turbine and finally condensed. The turbine is connected to a generator, in which the rotation is converted into electricity. The amount of electric power delivered from the Forsmark nuclear power plant varies as a result of a range of conditions. Many of the factors that affect production are known and are taken into account when production forecasts are made. Among the known factors is the temperature of the water used to cool the steam after it has delivered energy to the turbine. The relationship between cooling water temperature and delivered electric power is a negative correlation: the higher the cooling water temperature, the less electric power is delivered. This relationship between cooling water temperature and delivered power is called the cooling water curve and is used as a tool for forecasting how much electric power will be delivered in the future. Sometimes the amount of delivered electric power deviates from the value calculated from the cooling water curve, and at those moments it is of interest to Forsmark to understand why the deviation occurs, since it is desirable to have forecasting tools that are as accurate as possible.

At Forsmark, large amounts of data are collected from all parts of the power production process. Thousands of measurement signals per reactor are logged and stored in databases. Among all these data signals there may be clues pointing to why the deviations from the cooling water curve arise. This project has studied machine learning as a method for predicting deviations from the cooling water curve, and thereby for building a better understanding of which measurement signals are indicators that electricity production will be lower than expected. Three types of signals were used to train the machine learning models: temperatures, pressures and flows.

Three different types of models were tested and evaluated: k-nearest neighbors (kNN), support vector machines (SVM) and linear regression. Training and evaluating machine learning models takes time. The larger the amounts of data used and the more complex the models, the more time is required, and during this project no extra computing power was used; all computations were done on a laptop. For this reason, methods were investigated for reducing the amount of data required to train the models without degrading model performance too much. Two approaches were examined: using only a few of all available signals instead of all of them, and a method called principal component analysis (PCA).

The models trained on temperature signals performed better than the models trained on pressure signals, which in turn performed better than the models trained on flow signals. The differences in performance between the signal types were not very large. The largest differences in performance were between the model types themselves, where kNN was best, followed by SVM, followed by linear regression. Many models took a very long time to train, and some programs that trained many models took so long to run that they had to be aborted. The results are therefore not entirely "complete"; there are gaps where, for example, SVM models could not be evaluated. With the help of PCA, all model types could be trained and compared. Here too, kNN models trained on temperature signals performed best. The performance of the models trained after PCA was as good as that of the models trained without PCA. PCA can therefore be seen as a useful method for shrinking the dataset without losing much valuable information.

As the work progressed, the methods for selecting which signals to train the models on were refined. At first, signals were chosen essentially at random, i.e. in the order they appeared in the data. A more careful way of selecting data signals was implemented partway through the work: a so-called greedy algorithm, where the currently best-performing signal is always chosen. Some signals appeared as strong indicators of deviation in electric power more often than others. Among these strong indicators were signals already known to affect deviations from the cooling water curve, but there were also signals whose influence is unknown. This may indicate that the machine learning algorithms have found relationships between these signals and occasions when the amount of delivered electricity deviates from the expected value. More studies should be done to see whether these relationships can be reproduced, and if so, the signals that are strong indicators should be examined more closely. There are still large numbers of models to test, as well as more areas at Forsmark where machine learning can be applied. This work is a first step in exploring machine learning as a tool for extracting information from the data collected at Forsmark, and the methods have been shown to find known relationships for deviations in delivered electricity, which can be seen as a sign that the methods achieved their goals.


Executive summary

Three machine learning models have been applied to the process data from Forsmark 1 to identify moments when the power production ends up below the value predicted by the cooling water curve. Among the three model types, k-nearest neighbors was the best at predicting deviations from the cooling water curve, and the best signal type for making these predictions was temperature signals. Principal component analysis was performed on the input data to reduce training times. The models performed equally well with principal component analysis as without it, meaning much training time can be saved. Many signals in the process turned out to be faulty; the signal type most commonly experiencing technical difficulties was flow signals. The signal that was the strongest predictor of deviations from the cooling water curve was a pressure signal named PISF1 K12414K163. The methods were able to find known connections between delivered power and process data, which can be viewed as an indication that they do work when applied to this particular problem. This project is an early investigation of machine learning at Forsmark; there are many methods that have not yet been attempted and many areas at Forsmark where machine learning can be applied.


Acknowledgements

This project was done at the NMTU unit at Forsmark. I want to thank everyone at NMTU, where I felt welcome from day one. You have been very helpful in answering my questions and fun to hang out with, both in person and at a distance in these times of COVID-19.

A special thanks to my supervisor Thomas Smed, whose knowledge of the power plant and Matlab runs deep. Thank you Niklas Wahlström for all the help with machine learning and report writing.

Albin Björn, March 2021


Contents

1 Introduction and background
1.1 Aim of the project
1.2 Project boundaries
1.3 Related work
2 Theory
2.1 Power production at a nuclear power plant
2.1.1 The Rankine cycle
2.1.2 The efficiency of the process
2.1.3 The cooling water curve
2.2 Machine learning
2.2.1 Regression and classification
2.2.2 Measuring the performance of the methods
2.2.3 Linear regression
2.2.4 k-nearest neighbors
2.2.5 Support vector machines
2.2.6 Training time
2.2.7 Principal component analysis
3 Method
3.1 The input data for the models
3.2 Deviation from the cooling water curve
3.3 Applying the methods to the data
3.3.1 Training kNN models and changing the amount of features
3.3.2 Selecting features in a more careful way
3.3.3 SVM and linear regression
4 Results
4.1 Varying both k and amount of features
4.2 Choosing features based on performance
4.3 Linear regression
4.4 PCA
4.5 Training models after using PCA
5 Discussion
5.1 Data
5.2 Classification or regression?
5.3 Model performance
5.4 Time
5.5 Selecting features
5.6 PCA
5.7 Further studies
6 Conclusion


1 Introduction and background

In 1980 Sweden held a referendum on nuclear power. The result of the referendum was that Swedish nuclear power was to be decommissioned. Shortly after the referendum, the Swedish government decided that all Swedish reactors were to be shut down by the year 2010, a goal which turned out to be harder to reach than perhaps initially thought. For the ten years following the 2010 deadline, nuclear power has provided more than a third of Sweden's annual electric energy every year except 2015. In 2019, nuclear and hydro power were the two largest producers of electric energy in Sweden, producing just below 65 TWh each. All other means of electric power production combined added up to just above 36 TWh that same year [1]. The Forsmark nuclear power plant is one of three active nuclear power plants in Sweden. As of December 2020, Forsmark houses three of Sweden's six remaining active nuclear reactors, as one of the reactors at the Ringhals nuclear power plant is coasting down to be taken off the grid at the start of 2021 [2]. The three reactors at Forsmark have been producing electricity since the 1980s and have a combined electric power output of more than 3 GW. The smallest producer of the three is Forsmark 1, with a rated electric power output of 984 MW, and the largest is Forsmark 3, rated at 1167 MW [3].

The power plant is situated at the coast and all the reactors are cooled using sea water. The layout of the plant is such that reactors 1 and 2 are right next to each other and reactor 3 is further away. All three reactors source their cooling water from a channel. Forsmark 1 and 2 have a shared cooling water intake from the channel while Forsmark 3 has its own intake.

No electrical energy can be stored in a power grid. Electric power is generated and consumed simultaneously, meaning that supply and demand must constantly be matched. Nuclear power is used as base power in the Swedish electric power system. This means that the amount of nuclear power on the grid is almost constant and acts as a "base" on which other power sources stand. Other sources of power, such as hydro power, are much better suited to meet the fluctuating demand on the grid and are used as regulatory power, balancing supply and demand. More concretely, this means the reactors at Forsmark mostly run at full thermal power, creating the base for other power sources. The amount of electric power delivered from the reactors will, however, not always be the rated amount. Even when reactors are working at full thermal power there are fluctuations in electricity production. These fluctuations happen for a variety of reasons, some of which are known and expected and some of which are unknown. One of the known factors is the temperature of the cooling water going into the plant. The general trend is that the lower the cooling water temperature, the higher the production of electricity. The ratio of electric power to thermal power (or output power to input power) is known as the efficiency, meaning that a higher cooling water temperature lowers the efficiency. This known relationship between cooling water temperature and production is called the cooling water curve, and it is used as a tool to predict how much power will be produced.

Temperatures, pressures, flows and many other types of signals are measured at thousands of points at each of the reactors at Forsmark. Many of these signals depend on each other via physical relationships. Some relationships, such as the one between power output and cooling water temperature, are known and accounted for. Other, unknown relationships may also exist given that there are thousands of signals per reactor.

It is of interest for plant operators to find these relationships. Knowledge of which signals are strong predictors for the deviations in efficiency can be used both to make predictive models more accurate and might point to locations that need maintenance in the power plant. There is interest at Forsmark to investigate whether machine learning can be used to find hitherto unknown signals or sets of signals that are strong predictors of deviations in efficiency.

Machine learning can broadly be divided into supervised learning and unsupervised learning. Unsupervised learning focuses on finding patterns and associations within data without any type of response variable. Supervised learning, on the other hand, uses input-output pairs to try to learn a link between the inputs and the outputs. Supervised machine learning, which this project has investigated, can be used for two purposes: classification, where the goal for the model is to assign categories, and regression, where the goal is to generate numeric values [4]. The problem of power deviations happening for unknown reasons can, in a machine learning context, be viewed as both a classification problem and a regression problem. Machine learning models can be trained either to predict whether or not the power deviation will lie within a certain interval, or to predict the magnitude of the deviation. The outputs of a regression model can be used for classification by determining whether the predicted value lies in the same interval used in the classification models. This enables comparison of performance between the classification models and the regression models.

Training machine learning models can be time consuming, especially if the models are complex, the amount of data used as input is large and the available computing power is limited. For this reason it is also of interest to investigate methods for reducing the amount of data models need to perform well.

1.1 Aim of the project

The goal of this project was to investigate if and how machine learning methods can be used to detect deviations in power production at Forsmark. Before the machine learning methods could be applied, the data needed to be processed, so an additional goal was to understand and structure the raw data that was used.

1.2 Project boundaries

This project will not investigate the dynamics of the power production process, but rather static "snapshots" of the state of the plant. Properly studying the dynamics of the process would require other data sources that are too difficult to acquire in a reasonable amount of time for the scope of this work, so the dynamics will not be investigated. All three reactors will be investigated, but most of the work will focus on data from one of them. There are slight differences in signal names and in which signals are included in the data from each reactor, meaning it would be too time consuming to try to match up the data from one reactor perfectly with data from another. Models could be trained on data stretching as far back in time as 2005. Training and evaluating models on such large amounts of data would however take very long, and there are also some problems with the output data which will be discussed later in this report. For these reasons, the machine learning methods are applied to data between the years 2015 and 2020.

1.3 Related work

Methods from machine learning have not been explored to a great extent at Forsmark, but the subject of machine learning applied at nuclear power plants has been studied elsewhere, meaning there exist works to draw inspiration from and make comparisons to. In Continuous machine learning for abnormality identification to aid condition-based maintenance in nuclear power plant, the authors applied machine learning with a similar goal to this project, using similar input signals consisting of temperatures, pressures and flows. The authors showed that machine learning worked well for detecting faults. A difference between that work and this one is that they studied transients and worked on much smaller timescales, whereas this project focuses on steady states [5]. Machine learning has been tried at nuclear power plants in a variety of applications. In A survey of the state of condition-based maintenance (CBM) in the nuclear power industry [6], several works are listed where machine learning has been used to make maintenance more efficient. Among the works listed is Predictive based monitoring of nuclear plant component degradation using support vector regression [7], where models called support vector machines were used to predict degradation of components. Another article where support vector machines were used with a purpose more closely related to this project is A support vector machine integrated system for the classification of operation anomalies in nuclear components and systems [8]. The authors used support vector machines and temperatures to classify anomalies in the reactor, which is similar to the goal of this project. There are of course also differences between the projects, such as the kernel function in this project being linear, whereas [8] used a Gaussian kernel. [8] also used temperatures and positions of control valves as inputs for the models, whereas this project used temperatures, pressures and flows. In Plant monitoring and fault detection: Synergy between data reconciliation and principal component analysis [9], a method called principal component analysis was used to make the process of fault detection more efficient. Principal component analysis, as well as a nearest-neighbor method, were used in Monitoring and fault diagnosis of the steam generator system of a nuclear power plant using data-driven modeling and residual space analysis [10] to successfully identify faults in static states of a nuclear power plant, much like this project. The input data used in [10] was however of much smaller dimension than in this project. Also, in addition to flows and pressures, [10] used turbine control valve positions as inputs.


2 Theory

2.1 Power production at a nuclear power plant

One of the most widely used methods for electric power production today is the use of so-called heat engines [11]. A heat engine is a concept from thermodynamics and describes a device or system that transforms heat (thermal energy) into mechanical energy. Nuclear power falls under this category since thermal power is added by the fuel and is transformed to mechanical work in the turbine. There are two main types of nuclear reactors in use for electricity production today: boiling water reactors (BWR) and pressurized water reactors (PWR). PWRs are the most common globally [12], but the three reactors at Forsmark are all BWRs. In a BWR the water is turned into steam directly inside the reactor, while in a PWR the hot water from the reactor is pumped to a heat exchanger where the heat is transferred to water of a lower pressure (and thereby lower boiling point) in a secondary circuit. Both BWRs and PWRs operate at a significantly higher pressure than the surrounding atmosphere. A typical working pressure inside a BWR is 70 bar, or about 70 times atmospheric pressure at sea level. A consequence of this increased pressure is that the boiling point of the water is higher than at atmospheric pressure: at 70 bar, water boils at 286 °C. The working pressure in a PWR is even higher, typically around 155 bar, but the high pressure water in a PWR is not brought to a boil; it is instead pumped through a heat exchanger where it transfers thermal energy to water at a lower pressure, making the water in the secondary circuit boil.

The differences between the two reactor types are shown in figures 1 and 2.

Figure 1: Schematic of BWR
Figure 2: Schematic of PWR

After steam has been created it is diverted to the turbine, where the steam goes from high pressure to low pressure by passing the turbine blades, making the turbine axle spin. The turbine axle is connected to a generator where the rotational energy is transformed into electric energy. After passing through the turbine, the low pressure steam needs to be converted back to liquid water, which is done by removing the heat of vaporization. The heat is moved from the steam to the cooling water through heat exchangers. The cooling water never comes in direct contact with the water from the reactor, since that would irradiate the cooling water, making the ocean radioactive. After taking up the heat from condensing the steam, the cooling water experiences an increase in temperature and is pumped back into the ocean. Since the cooling water is sourced from the ocean, it will have different temperatures depending on the seasons.

Energy and power are two concepts familiar to those who study energy systems, but since this project ventures into fields a bit further removed from energy, a short explanation will be given here. A useful analogy to understand the difference between energy and power is that power is to energy what speed is to distance. Distance tells us how far we can go while speed tells us how fast we will get there and similarly, energy tells us how much work we can do while power tells us how fast it will get done. A more mathematical way to describe it is that power is the first derivative of energy with respect to time.
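The analogy above can be made concrete with a minimal numeric sketch: at constant power, delivered energy is simply power multiplied by time. The 984 MW figure is the rated output of Forsmark 1 given in the introduction; the 24-hour span is an illustrative assumption.

```python
# Minimal sketch of the power/energy relationship described above: at
# constant power, energy is power multiplied by time (E = P * t).
# The 984 MW figure is Forsmark 1's rated output; the 24 h span is illustrative.

rated_power_mw = 984.0   # rated electric power output of Forsmark 1 (MW)
hours = 24.0             # illustrative time span

energy_mwh = rated_power_mw * hours   # energy delivered over the span (MWh)
energy_gwh = energy_mwh / 1000.0      # same quantity in GWh

# 984 MW sustained for 24 h yields 23616 MWh, i.e. about 23.6 GWh
```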

2.1.1 The Rankine cycle

The thermodynamic process where heat of a working fluid is transformed to mechanical work via a pump, boiler, turbine and condenser, as described in the previous section, is called the Rankine cycle. In its simplest form the Rankine cycle consists of four steps describing what happens to the working fluid:

1. Pressurizing
2. Boiling
3. Depressurizing
4. Condensing

The four steps of the cycle can be viewed as two pairs of opposite actions: pressurization and depressurization (steps 1 and 3), and boiling and condensing (steps 2 and 4). Viewing the Rankine cycle as a closed system, there are different energy exchanges with the environment at each of the four steps. There are two points where an exchange of heat with the environment takes place and two points where an exchange of mechanical work with the environment takes place [13]. Listing the steps in the same order as previously, the energy exchange at each step is as follows:

1. Mechanical energy is added to the system
2. Thermal energy is added to the system
3. Mechanical energy is removed from the system
4. Thermal energy is removed from the system
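The four exchanges above can be combined into a simple energy balance for the cycle: the net mechanical work is the turbine work minus the pump work, and conservation of energy fixes the rejected heat. The numbers in this sketch are invented round figures purely to illustrate the bookkeeping; they are not Forsmark data.

```python
# Hedged sketch of the Rankine-cycle energy balance implied by the four steps:
# net mechanical work = (work out at the turbine) - (work in at the pump),
# and energy conservation requires Q_out = Q_in - W_net.
# All numbers below are illustrative, not Forsmark data.

q_in = 3000.0    # thermal energy added in the boiler (MJ), step 2
w_in = 30.0      # mechanical energy added by the pump (MJ), step 1
w_out = 1020.0   # mechanical energy extracted by the turbine (MJ), step 3

w_net = w_out - w_in        # net mechanical work delivered by the cycle
q_out = q_in - w_net        # heat rejected to the cooling water, step 4
efficiency = w_net / q_in   # thermal efficiency of the cycle

# w_net = 990 MJ, q_out = 2010 MJ, efficiency = 0.33
```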


To be more concrete and place this in the context of this project, mechanical work is added in the pump pressurizing the water, and mechanical work is extracted in the turbine. It is this rotational energy, converted to electric energy in the generator, that is measured and might deviate from the predicted value. Heat is added in the boiler by the fissioning nuclear fuel, and in the condenser heat is transferred from the steam to the cooling water.

The Rankine cycle is a useful tool for analyzing the energy balance of the system, enabling calculations of both power and efficiency. The following figure shows the flow of heat and mechanical energy for a basic Rankine cycle.

Figure 3: Schematic of a basic Rankine cycle where dark blue indicates liquid water and light blue indicates steam

In figure 3, Q denotes heat and W denotes mechanical work. The temperature, pressure and flow data used in this project will come from different points depicted in figure 3. The places in the cycle where heat is added and removed are often referred to as heat source and heat sink in thermodynamics [13].


2.1.2 The efficiency of the process

In power production, efficiency (usually denoted by the symbol η) is typically defined as a ratio between the amount of energy coming out of the system and the amount of energy going into the system. More concretely this corresponds to the ratio between the electric power being produced by the generator and the heat produced by the nuclear fuel. As an equation, the efficiency can be stated as follows:

η = P_out / P_in    (1)

where P denotes power. Rearranging equation 1 shows that the output power can be estimated if the efficiency and input power are known. The theoretical maximum efficiency a heat engine can achieve is called the Carnot efficiency. The Carnot efficiency can be expressed in such a way that neither the produced power nor the required heat in the process needs to be quantified. Instead it uses the temperatures of the heat source and the heat sink:

η_Carnot = 1 − T_C / T_H    (2)

where T_C is the temperature of the heat sink and T_H is the temperature of the heat source. Absolute temperature is used to calculate the Carnot efficiency, so the temperatures need to be in kelvin. The actual efficiency of the process will be lower, as there are losses at every step of the thermodynamic cycle [13].
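Equations 1 and 2 can be sketched in code. The 286 °C hot-side temperature is the BWR boiling temperature mentioned in section 2.1; the 10 °C cooling water temperature and the input power figure are illustrative assumptions, not measured Forsmark values.

```python
# Sketch of equations (1) and (2). Temperatures must be absolute (kelvin),
# as noted in the text. The 286 C hot-side temperature is the BWR boiling
# temperature from section 2.1; the cold-side temperature and the thermal
# power figure are illustrative assumptions.

def carnot_efficiency(t_hot_c, t_cold_c):
    """Maximum theoretical efficiency of a heat engine, eq. (2)."""
    t_hot_k = t_hot_c + 273.15    # convert Celsius to kelvin
    t_cold_k = t_cold_c + 273.15
    return 1.0 - t_cold_k / t_hot_k

def efficiency(p_out, p_in):
    """Actual efficiency as output power over input power, eq. (1)."""
    return p_out / p_in

eta_max = carnot_efficiency(286.0, 10.0)   # about 0.49
eta_actual = efficiency(984.0, 2928.0)     # rated electric over assumed thermal power
```

Note that the actual efficiency (roughly a third) is well below the Carnot limit, consistent with the losses mentioned above.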

2.1.3 The cooling water curve

The efficiency of the reactors at Forsmark depends on the temperature of the sea water. The maximum theoretical efficiency gets lower as the sea water gets warmer, which can be seen in equation 2. The pressure inside the reactor is constant, which means that the temperature in the reactor is also constant. The only variable in the expression for the Carnot efficiency is therefore the temperature of the cooling water. When predicting how much power will be produced, workers at Forsmark use a tool called the cooling water curve. The cooling water curve shows the electric power output as a function of cooling water temperature when the reactor is working at full thermal power. The cooling water curve is frequently checked for accuracy and calibrated if needed. An example of a cooling water curve from Forsmark is shown in figure 4.


Figure 4: Graph showing how the power output at Forsmark 1 depends on cooling water temperature

The thick dashed line in figure 4 represents the expected power output and the thinner dashed lines represent the interval where the power output is expected to be. The points are actual observed power delivered over a year, from July of 2016 to July 2017. The power output of the plant (and thereby the efficiency) has a clear negative correlation with the temperature of the cooling water. The cooling water curve is only used for predictions when the reactor is running at full thermal power. Some of the observed points end up outside the dashed lines, meaning the power production deviated from the expected value. The aim of this project is to train machine learning models to predict when these deviations happen.
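A hedged sketch of the idea behind figure 4: fit a straight line to (cooling water temperature, power) observations and flag points that fall more than 2 MW below the fitted value. The data points and the 2 MW band are invented for illustration; the actual cooling water curve at Forsmark is a calibrated tool, not a fit made here.

```python
# Hedged sketch of flagging deviations from a cooling-water-curve-like line:
# fit an ordinary least squares line to (temperature, power) observations and
# mark points more than 2 MW below the fitted value. Data are invented.

def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

temps = [2.0, 6.0, 10.0, 14.0, 18.0]         # cooling water temperature (C)
power = [990.0, 987.0, 984.0, 981.0, 970.0]  # delivered power (MW)

a, b = fit_line(temps, power)
# flag observations more than 2 MW below the fitted line
flags = [p < (a + b * t) - 2.0 for t, p in zip(temps, power)]
# only the last point lies far enough below the fitted line to be flagged
```

The negative slope `b` mirrors the negative correlation described above; the flagged points are the ones the machine learning models in this project try to predict.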

2.2 Machine learning

Most engineering students will be familiar with the concept of linear regression, which is in fact one of many methods included in the field of machine learning. Machine learning is an umbrella term for computer algorithms that "learn" a behavior based on input data. Tom Mitchell, in the book Machine Learning, defines the overall task of machine learning as: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." [14]. Machine learning can, slightly simplified, be divided into two separate fields: supervised learning and unsupervised learning. This project will focus on supervised learning. A supervised machine learning model is trained by being shown samples of data. A sample in the context of machine learning is an input-output pair, i.e. a set of inputs with the associated output. The goal is for the model to find a relationship between the inputs and the outputs. The next step in the machine learning process is to provide the model with samples it has not seen before and have it predict what the output will be. A common way to test the performance of machine learning models is to split the available data into a training set and a validation set, where the model is trained with the training set and the model's performance is gauged by how well it makes predictions on the validation set [4]. Features are the properties being measured and used as inputs for the model. All the features together make up what is called the feature space. More concretely, in this project the different signals are the features, and a combination of feature values at one specific time is a sample.

Figure 5: An example of five samples with three features and one output

The aim of figure 5 is to clarify and make concrete the terminology of features and samples. The rows of the table are the samples, each having three input features in the form of pressure signals and one output stating whether or not the power production was below the expected value.

2.2.1 Regression and classification

Machine learning models can be trained to perform regression or classification. The difference between regression and classification is what type of values the output can assume. In regression problems the output is a numeric value (like a continuous variable), whereas in classification problems the output is a class or a category [4]. When a classification model is trained, it partitions the feature space into regions that represent the different classes. The feature space is, as mentioned earlier, the space created by viewing each input signal, or feature, as an axis. The boundary between the regions created by this partition of the feature space is called the decision boundary. When machine learning models classify new points, they do so by finding out which side of the decision boundary the new point is on.


Figure 6: Example of a decision boundary dividing a two dimensional feature space into two regions with classes represented by ’x’ and ’o’

Figure 6 exemplifies a decision boundary between two classes. In this example, the model has been trained on two features and has generated a decision boundary which is a horizontal line. When classifying new points, they will be assigned ”category x” if they are above the dashed line, and ”category o” if they are below the line.

The goal of this project was to train models that predict whether or not the power production will be below an expected lowest limit, meaning the outputs can be either categorical or numerical. The limit was determined historically and set at 2 MW, meaning the power output could deviate down to -2 MW and that would be considered ”normal”. In this project, training classification models means training models to assign the categories ”below expected power” or ”not below expected power” to points in the feature space. If models are instead trained to predict the numeric value of the deviation, those models are regression models. The outputs of the regression models can be converted to categorical values by applying a numeric threshold: if the predicted deviation is below -2 MW, the associated category will be ”below expected power”. By doing this, more information may be preserved than if classification models are used. Classification only predicts whether or not the value will be within the expected range, whereas regression models yield that information plus the magnitude of the deviation. This project trains both classification models and regression models and compares the results.
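The threshold rule described above can be sketched in a few lines. The project itself used Matlab; the following is an illustrative Python stand-in, with an invented function name and made-up example values:

```python
# Illustrative sketch (not the project's Matlab code): converting a
# regression model's predicted deviation into the two categories used
# in this project, with the -2 MW threshold described in the text.

def deviation_to_category(predicted_deviation_mw, threshold_mw=-2.0):
    """Map a predicted power deviation (MW) to a class label.

    Deviations below the threshold are labeled "below expected power";
    everything else is "not below expected power".
    """
    if predicted_deviation_mw < threshold_mw:
        return "below expected power"
    return "not below expected power"

predictions = [-3.1, -0.5, 0.4, -2.7]   # made-up regression outputs, in MW
labels = [deviation_to_category(p) for p in predictions]
print(labels)
```

This keeps the regression output available (the magnitude of the deviation) while still producing the categorical labels a classifier would give.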


2.2.2 Measuring the performance of the methods

The goal of these models is to make good predictions on data the models have never ”seen” before. A problem here is that the model has already seen all the data samples it has been trained on, so performance on the training data does not say much about how well the model performs ”in the real world”, or how well the model generalizes. A common method used to get an idea of how well a model generalizes is cross validation. This means a subset of all available data samples is withheld. The model is then trained on the rest of the data and the model’s performance is judged by the predictions the model makes on the withheld data [15].

Being able to put a number on how well a model performs is important since that enables comparison between different models. A common measure of performance for regression models is the mean squared error (MSE). Simply measuring how far away a prediction is from the actual value is not possible in classification problems since the prediction is either true or not true. This does not mean that measuring the performance of a classification model is impossible. If a classification model makes a number of predictions, the number of correct predictions can be compared to the total number of predictions. Confusion matrices are helpful tools to visualize the performance of classification models. A confusion matrix is constructed by letting one of the two axes be the actual occurrences of each class and the other axis be the model’s predicted classes [4]. A confusion matrix for two classes is shown in the following figure.

Figure 7: Example of a confusion matrix

The confusion matrix in figure 7 comes from a model classifying deviations in power output. The numbers in the confusion matrix are the enumerated occurrences of each of the four possible outcomes. The two different classes are often referred to as the positive and negative classes. When performing binary classification, there are four possible outcomes for a prediction:

• True positive (TP) - Both the predicted and true class are positive

• True negative (TN) - Both the predicted and true class are negative

• False positive (FP) - The predicted class is positive, but the true class is negative

• False negative (FN) - The predicted class is negative, but the true class is positive.

This means that in the example given in figure 7 there are 662 true positives, 5474 true negatives, 60 false positives and 205 false negatives.

Enumerating the number of true positives, false positives, false negatives, and true negatives enables the calculation of useful metrics. In Matlab, the standard measure of a model’s performance is the accuracy, which is calculated by dividing the number of correct predictions by the total number of predictions. As an equation:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

which can assume values between 0 and 1, 0 meaning that none of the model’s predictions were correct and 1 meaning all of the model’s predictions were correct. Accuracy is a good measure of a model’s performance if the categories in the data are fairly evenly distributed, i.e. there are about as many examples of one class as the other. However, if the classes are unevenly distributed (for example when categorizing anomalies) accuracy might not give an accurate picture of the model’s performance. As an example, consider a model that is trained on a data set where 95% of the samples belong to the positive class and the remaining samples belong to the negative class. If this model always predicts the positive class it will end up with a 95% accuracy on the training set. This means a model which always predicts positive could be chosen if accuracy is used as the metric of success, despite the model not actually performing that well. In the context of classifying anomalies this is not desirable. For imbalanced data sets it may be better to use metrics called precision and recall [16]. Precision is the ratio of true positives to the total number of positive predictions made by the classifier. Recall is the ratio between the true positives and the total number of actual positives. Precision and recall are defined by the following equations:

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

The precision measure can be interpreted as the proportion of positive predictions that were actually positive. Recall can be interpreted as the proportion of actual positives that were predicted correctly. Both precision and recall can, like accuracy, assume values between 0 and 1. Precision and recall can be combined into one single performance measure called the F1-score. The F1-score is the harmonic mean of precision and recall:

F1 = 2 · (precision · recall) / (precision + recall)

which also assumes values between 0 and 1 depending on how well the model performs.
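Plugging the counts from figure 7 into these formulas gives a quick numeric check of how the metrics relate to each other. This is an illustrative Python sketch, not code from the project:

```python
# Recomputing the performance metrics from the confusion-matrix counts
# given in the text (TP = 662, TN = 5474, FP = 60, FN = 205).

TP, TN, FP, FN = 662, 5474, 60, 205

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy  = {accuracy:.3f}")   # high, driven by the many true negatives
print(f"precision = {precision:.3f}")
print(f"recall    = {recall:.3f}")
print(f"F1        = {f1:.3f}")         # always lies between precision and recall
```

Note how the accuracy ends up much higher than the F1-score here, exactly because the negative class dominates the data.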

There are more options for measuring the performance of both classification models and regression models.

Figure 8: Example of all mentioned performance metrics used to evaluate models trained on flow signals

Figure 8 shows an example of how the different performance metrics work. Note the high accuracy of the models caused by the imbalanced data. Since the F1-score is the harmonic mean between precision and recall, the F1 score will always be between them.

In this project the F1 score will be used to compare the performance of the classification models and MSE will be used for the regression models.


2.2.3 Linear regression

Figure 9: Illustration of a simple linear regression

As mentioned in the beginning of this chapter, most engineers will be familiar with linear regression, exemplified in figure 9. Graphically it is fairly easy to understand the goal of linear regression. Things get a bit more complicated however when we leave simple linear regression and start using more explanatory variables. The technique can still be used though and is mathematically summarized by the expression

Y = Xβ + ε

where Y denotes a vector of outputs, X denotes a matrix of inputs, β denotes the regression parameters and ε is the error term. The goal is to find, or ”learn”, a set of parameters β on a training set and then make predictions with those β on a validation set. Finding the parameters β is based on the maximum likelihood method, ending up with parameters that make the output Y as likely as possible. Assumptions need to be made about the error term and a common approach here is to assume the error follows a normal distribution. It can be shown that solving for β under the assumption of a normally distributed error gives rise to a closed form solution [15].

β = (XᵀX)⁻¹XᵀY

Minimizing the error and thereby finding the optimal β can be done in a variety of ways. Matlab has several options and uses stochastic gradient descent as default. The goal with stochastic gradient descent is to view the error as a surface and always move in the direction that minimizes the error. This direction is given by the negative of the gradient, hence the name gradient descent.
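For the simplest case of one explanatory variable plus an intercept, the closed-form solution β = (XᵀX)⁻¹XᵀY reduces to solving a 2×2 system. The following is a minimal Python sketch of that case (the project used Matlab's built-in solvers; the data here are made up):

```python
# A minimal sketch of the closed-form least-squares solution
# beta = (X^T X)^(-1) X^T Y for one explanatory variable plus an
# intercept, written without any libraries.

def fit_linear(xs, ys):
    """Solve the 2x2 normal equations for (intercept, slope)."""
    n = len(xs)
    # Entries of X^T X and X^T Y for the design matrix X = [[1, x_i]]
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx            # determinant of X^T X
    intercept = (sxx * sy - sx * sxy) / det
    slope = (n * sxy - sx * sy) / det
    return intercept, slope

# Points on the line y = 1 + 2x are recovered exactly
b0, b1 = fit_linear([0, 1, 2, 3], [1, 3, 5, 7])
print(round(b0, 6), round(b1, 6))  # → 1.0 2.0
```

With many features the same normal equations are solved with matrix routines (or iteratively with gradient descent, as Matlab does by default), but the underlying computation is the one shown.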

2.2.4 k-nearest neighbors

The k-nearest neighbors (kNN) algorithm is quite intuitive. For a point in the feature space to be classified, the model looks at the k nearest neighbors to that point and makes a prediction based on the majority vote among said neighbors. kNN can be used both for regression and classification. In regression, the predicted value is the mean of the values of all the neighbors. kNN is a non-parametric method, meaning it does not work by finding a set of parameters that fit some line or function to the data, as in the case of linear regression. The configuration of the feature space is what, in and of itself, gives rise to the model’s predictions [15]. As the name suggests, kNN computes the distance to different points in the feature space. There are many ways to define distance and in this project the common (and standard in Matlab) Euclidean distance was used. Since Euclidean distance is used in the kNN algorithm, it has been shown that normalizing the data can significantly improve the model’s performance [17]. If kNN is used for classification and the distribution of class members is uneven, i.e. there are more members of one class than another, alternatives to majority vote might yield better results. Improved classification performance for kNN was shown in [18] for skewed classes, where the authors used a more stringent rule than majority vote: the proportion of neighbors belonging to the majority class needed to be larger than the standard 50% for the new point to be classified as belonging to the majority class. Another way to improve the kNN model is to assign weights to neighbors based on the distance from the point to be classified. This weighted kNN gives nearer neighbors more say than those further away.

The choice of k is made by the user and the optimal choice of k depends on the given situation. A very low k might lead to an ”erratic” decision boundary and is thus prone to overfitting while a high k can have the opposite problem.

The steps to performing kNN classification are quite simple and can be written as:

1. Find the k points in the feature space with the shortest Euclidean distance to the point to be classified

2. Assign a class to the point by majority vote among the k points
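The two steps above can be sketched in a few lines of Python (the project used Matlab's fitcknn; the points and labels here are invented for illustration):

```python
# A pure-Python sketch of the two kNN classification steps listed above:
# find the k nearest training points by Euclidean distance, then take a
# majority vote among their labels.
import math
from collections import Counter

def knn_classify(train_points, train_labels, query, k):
    # 1. Sort training points by Euclidean distance to the query point
    dists = sorted(
        (math.dist(p, query), label)
        for p, label in zip(train_points, train_labels)
    )
    # 2. Majority vote among the k nearest neighbors
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

points = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5)]
labels = ["o", "o", "o", "x", "x"]
print(knn_classify(points, labels, (0.5, 0.5), 3))  # → o
print(knn_classify(points, labels, (5.5, 5.0), 3))  # → x
```

A weighted kNN would replace the plain vote count with distance-based weights, giving nearer neighbors more say, as described above.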


Figure 10: Illustration of kNN classification of the point X using k = 1 (small circle) and k

= 3 (larger circle)

Figure 10 illustrates how the kNN classification works. In this example, using a k of 1 would assign the positive class to X, while using a k of 3 (or greater) would assign the negative class to X.

When using kNN for classification there is a possibility that the vote ends up in a draw if k is an even number. A common way to resolve draws is to randomly assign a class to these points.

2.2.5 Support vector machines

Describing how support vector machines (SVMs) work is a bit more complicated than in the case of linear regression and kNN. A big picture way of understanding SVMs is that the goal is to construct a plane which separates the classes with as large a margin as possible. When classifying new points, the SVM model then finds which side of the plane the new point is on and assigns a class accordingly.

Getting into a bit more detail, SVMs work by constructing a so-called hyperplane that separates the classes in the given feature space. This hyperplane is a plane which has one less dimension than the feature space. The hyperplane in a one dimensional space would be a point, in two dimensional space it would be a line, in three dimensional space it would be a ”plain plane”, and so on. If the categories are completely separable, meaning there exists a plane where all instances of one class are on one side of the plane and all instances of the other class are on the other side, there can be a lot of hyperplanes that achieve that separation.

SVMs are a continuation of a type of classifier called maximal margin classifiers. The term maximal margin comes from the fact that if it is possible to construct a hyperplane that perfectly separates the classes, there exists an infinite number of such hyperplanes. The choice of hyperplane is in this case based on making the margin, or the shortest distance between the observed training points and the hyperplane, as large as possible. This is achieved by constructing vectors perpendicular to the hyperplane, pointing to the observations in the training data. These vectors are called support vectors. In many cases, there does not exist a hyperplane that perfectly separates the observations; classes will have some overlap with each other. A soft margin can then be introduced, allowing some of the observations to end up on the wrong side of the hyperplane. The optimal decision boundary might also not be linear. If that is the case, SVMs can use different kernels to accommodate more irregular decision boundaries [4]. The default kernel function in Matlab is linear. There are other options and users can also write their own kernel functions. The descriptions of SVMs can get quite technical. There are sources online [19] and in the literature [4] that give more detailed descriptions of SVMs.

Figure 11: Illustration of a linear SVM with the associated largest possible margin between two separable classes

Figure 11 shows the kind of large margin that is the goal when using SVMs. Many different lines would have fit between the classes, but the one chosen is the one that creates ”the widest street” through the data. Note that ”in real life” data will rarely be as neatly separable as in this illustration.


2.2.6 Training time

Training machine learning models and using the models to make predictions can take a long time. Factors that affect the time it takes include the size of the data set, the complexity of the models used, and the available computing power. If the models are very complex and the data sets large, extra computing power may be necessary. This project will not be using any external so-called ”cloud services” or connect to any computer clusters for extra computing power. Rather, all the scripts will be run locally on laptop computers, meaning some models may end up taking too long to train.

2.2.7 Principal component analysis

Principal component analysis (PCA) can, much like SVMs, be complicated to describe. A brief summary of PCA will be given, followed by a more in depth description of the technique. The goal of PCA in this project was to reduce the amount of data needed to train the models without degrading the models’ performance. The way PCA does this is by constructing a new coordinate system in the feature space. This new coordinate system has fewer dimensions than the feature space. All points are projected onto the new coordinate system and are then used as inputs for the machine learning models.

When talking about dimension reduction in a machine learning context, the goal is to reduce the number of dimensions in the feature space while losing as little information as possible. One method commonly used for this purpose is PCA. The goal of PCA is to switch out the features for so-called principal components and use fewer principal components than there are features. PCA works by constructing a new coordinate system within the feature space. This new coordinate system is constructed by maximizing the spread of data along each axis. The axes in this new coordinate system are called the principal components. The first principal component is found by maximizing the spread of data (or variance) along a line for the entire data set. The other principal components are then found by maximizing the variance along lines perpendicular to the already established principal axes. This means the principal components will be ordered in such a way that the first principal component describes the largest variance along a line in the dataset, the second principal component describes the second largest variance along a line perpendicular to the previously found lines, and so on in descending order. The number of principal components created is the same as the number of features in the feature space. For each principal component it is known how much of the variance it describes [4]. This means that a cutoff can be introduced where adding more principal components does not add much information about the spread of the data.

As an example, consider data in a 100-dimensional space. PCA is performed and 99% of the variance in the data is found in the 15 first principal components. This means that the remaining 85 dimensions in the principal component space cumulatively explain one percent of the variance in the data. In other words, only one percent of the information about the spread of the data is lost when the data are projected onto the first 15 principal components.
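The idea of ”variance explained” can be illustrated numerically in two dimensions, where the variances along the principal components are the eigenvalues of the 2×2 covariance matrix. The following Python sketch uses made-up data lying almost on a line, so the first principal component should carry nearly all the variance (the project used Matlab's pca function instead):

```python
# A small numeric illustration of "variance explained": for 2-D data
# that lie almost on a line, the first principal component carries
# nearly all of the variance. The eigenvalues of the 2x2 covariance
# matrix are computed in closed form; the data are made up.
import math

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 0.9, 2.1, 2.9, 4.0]   # roughly y = x, with small noise

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
cxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
cyy = sum((y - my) ** 2 for y in ys) / (n - 1)
cxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

# Eigenvalues of the covariance matrix [[cxx, cxy], [cxy, cyy]]
t, d = cxx + cyy, cxx * cyy - cxy ** 2
lam1 = t / 2 + math.sqrt(t * t / 4 - d)   # variance along PC 1
lam2 = t / 2 - math.sqrt(t * t / 4 - d)   # variance along PC 2

explained_first = lam1 / (lam1 + lam2)
print(f"first principal component explains {explained_first:.1%} of the variance")
```

Dropping the second component here would discard well under one percent of the variance, which is the same trade-off as in the 100-dimensional example above.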


3 Method

3.1 The input data for the models

Thousands of signals are measured and logged, each at a set time interval, from every reactor at the Forsmark power plant. In preparation for this project, a subset of all signals had been selected. The selected subset of signals consisted mostly of temperatures, pressures and flows from the reactors. In order to access the data from the reactors, Matlab functions were provided at the start of the project. These scripts read the selected subsets of all available signals between specified dates. The number of signals in these sets is about 300-400 per reactor, varying from year to year depending on a variety of factors that will be discussed later. Programmers at Forsmark have chosen these subsets of all signals to be a good starting point for applying machine learning methods. Feature spaces with several hundred dimensions are not as enormous as if all the thousands of signals were used, but they are still large, and since the project uses only laptop computers, training time may become an issue. For this reason, signals were grouped into smaller sets consisting of temperature signals, pressure signals and flow signals. To further save time, the period between 2015 and 2020 was chosen rather than a ten year interval, which was the original plan. The reason for choosing this specific time interval will be discussed later in this report. The data were read from Forsmark year by year, creating one table of data for each year between 2015 and 2020. The provided scripts remove signals based on a number of criteria, such as the signals containing too many instances of ”not a number”, or NaNs. The set of signals that were removed was different for each year. This may be caused by different sensors breaking and getting repaired at different times. Models have to be trained using the same set of features, so work had to be done to ensure that the tables contained the same set of features for each year.
The Matlab function setdiff [20] was used to find features present in the data from one year that were not present in the data for other years. Using this function to compare the set of signals for each year and removing those that appear in some years but not in others ensured that all the features could be combined in one big table and machine learning models could be trained on the data. After doing this to the Forsmark 1 data, 303 signals remained.
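The alignment step can be sketched with plain set operations. This Python sketch is only an analogue of the Matlab setdiff workflow described above, and the signal names are invented placeholders:

```python
# A sketch of the feature-alignment step: keeping only the signals that
# are present in every year's table, analogous to comparing signal sets
# with Matlab's setdiff. Signal names are invented placeholders.

signals_by_year = {
    2015: {"temp_A", "temp_B", "press_C", "flow_D"},
    2016: {"temp_A", "press_C", "flow_D"},            # temp_B sensor missing
    2017: {"temp_A", "temp_B", "press_C"},            # flow_D sensor missing
}

# Signals usable for training are those common to all years
common = set.intersection(*signals_by_year.values())
print(sorted(common))  # → ['press_C', 'temp_A']

# Signals to drop from a given year (what a setdiff comparison reports)
drop_2015 = signals_by_year[2015] - common
print(sorted(drop_2015))  # → ['flow_D', 'temp_B']
```

After dropping the non-common signals, every year's table has the same columns and the years can be concatenated into one training table.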

It was noticed that some signals behaved in unexpected ways. Some of the signals sometimes got stuck at a specific value after having varied before, and sometimes the hourly mean values were physically impossible. An example of a physically impossible value from a signal is a steam temperature of more than 700 °C coming out of the reactor. A temperature that high is not possible since the steam is at saturation temperature, which at 70 bars of pressure is 286 °C. When reading the logged data on the smallest time scale for this signal, rather than the hourly means, the problem became clear: most of the values were NaN and the few numeric values that existed were very high. The functions that read the signals ignored all the NaN-values and returned the high temperatures. For this reason a Matlab function was written to check all the signals for this problem. The function reads the actual (smallest time scale) logged values and calculates the fraction of NaN for each signal. The function then returns the names of those signals with a fraction of NaN-values higher than 0.25. The signals that are shown to contain a lot of NaN can then be inspected and, if needed, removed.
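The logic of that checking function can be sketched as follows. This is an illustrative Python version, not the project's Matlab function, and the signal names and values are made up:

```python
# A sketch of the NaN-checking function described above: compute the
# fraction of NaN values per signal and report the names of signals
# above the 0.25 cutoff.
import math

def noisy_signals(signals, max_nan_fraction=0.25):
    """Return names of signals whose NaN fraction exceeds the cutoff."""
    flagged = []
    for name, values in signals.items():
        nan_fraction = sum(math.isnan(v) for v in values) / len(values)
        if nan_fraction > max_nan_fraction:
            flagged.append(name)
    return flagged

nan = float("nan")
signals = {
    "steam_temp": [712.0, nan, nan, nan],    # mostly NaN, a few high values
    "feed_flow":  [512.0, 509.8, 511.2, nan],
}
print(noisy_signals(signals))  # → ['steam_temp']
```

A signal like the impossible 700 °C steam temperature above would be flagged here, since its few numeric values sit among mostly NaN entries.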

It was found that many of the faulty signals found by the newly written Matlab script were already caught and removed by the functions written by the programmers at Forsmark. Some signals did however slip through the filters in the provided data gathering functions. Many of the faulty signals repeat from year to year. All of the faulty signals were found to be measuring flows in the process, except one which measures pressure. No signals were found to contain too many NaNs before 2014 at Forsmark 1. After making sure none of these faulty signals were included in the input data, some more processing steps were taken.

Standardizing the data may enhance the performance of some machine learning methods and is also necessary for performing PCA [4]. Standardizing means transforming each signal/feature so that it has a mean value of 0 and a standard deviation of 1. Standardizing data in Matlab is quite trivial: the built in function zscore [21] performs the task efficiently. Performing PCA is similarly easy in Matlab. The built in function pca [22] returns both the principal components in decreasing order of variance explained as well as how much of the variance is explained by each principal component.
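The standardization itself is a one-line transform per signal. As a minimal Python stand-in for Matlab's zscore (with made-up values):

```python
# A minimal stand-in for Matlab's zscore: transform a signal so it has
# mean 0 and (sample) standard deviation 1.
from statistics import mean, stdev

def zscore(values):
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

standardized = zscore([10.0, 12.0, 14.0, 16.0, 18.0])
print([round(v, 3) for v in standardized])
print(round(mean(standardized), 10), round(stdev(standardized), 10))  # → 0.0 1.0
```

Each feature is standardized independently, so features measured in different units (temperatures, pressures, flows) end up on a comparable scale before distance-based methods like kNN or before PCA.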

3.2 Deviation from the cooling water curve

This project investigates the deviation from the so-called cooling water curve. The cooling water curve can be seen as an estimate of how much power will be delivered at a given temperature of cooling water. The cooling water curve can only be used for estimations when the reactors are working at full thermal power, i.e. when the deviations are not too large. Below, in figures 12 and 13, power deviations in reactors 1 and 2 of up to 10 MW are shown for the time interval 2010 through 2019. The deviations are the differences between predicted output power and actual output power and are obtained from another Matlab function written by the workers at Forsmark.


Figure 12: Historical power deviation at Forsmark 1

Figure 13: Historical power deviation at Forsmark 2

There is a clear jump in the production at Forsmark 2 in 2013. This is because the reactor was upgraded and started running at a higher power. The calculation of deviation does not seem to take this upgrade in power into account, meaning the reactor almost constantly delivers ”more than expected” after 2013. There are jumps and, seemingly, a slight positive trend in the deviation at Forsmark 1. The deviations are more centered around zero at Forsmark 1, which is why this data became the main focus of the project. A Matlab function was written, taking the deviations as input and converting them to the categories ”below” and ”not below” expected production. The threshold chosen in this project was -2 MW.


3.3 Applying the methods to the data

When the data has been processed it is possible to simply split it into a training and a validation set and apply some methods using all available signals. Doing this may yield a model that makes accurate predictions, but interpreting and understanding such a complex model can be difficult. In order to have more control over the models and gain a better understanding of which features affect performance, the features can be divided into groups.

Using all features can also lead to very long training times for the models [4]. Dividing the data into a training set and a validation set was done using Matlab’s built in function cvpartition [23], where an 80/20 split was used. That is, 20% of the data were withheld and used for validation.
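The 80/20 hold-out can be sketched in plain Python. This is only an analogue of the cvpartition call, with made-up data and a fixed random seed for reproducibility:

```python
# A simple stand-in for the cvpartition step: randomly hold out 20% of
# the samples for validation and train on the remaining 80%.
import random

def split_80_20(samples, seed=0):
    """Shuffle sample indices and split them 80% train / 20% validation."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    cut = int(0.8 * len(samples))
    train = [samples[i] for i in idx[:cut]]
    valid = [samples[i] for i in idx[cut:]]
    return train, valid

train, valid = split_80_20(list(range(100)))
print(len(train), len(valid))  # → 80 20
```

The key point is that the validation samples are never shown to the model during training, so the scores reported later reflect generalization rather than memorization.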

The separate types of signals were split into different tables, meaning there was one table consisting of all the flow signals, one table containing all the pressure signals and one with all the temperature signals. This was done because it may be interesting to see if one type of signal is a better (or worse) indicator of deviation in power production than other types.

Doing this for the 303 signals at Forsmark 1 yielded 37 flow signals, 72 pressure signals, 178 temperature signals and 16 remaining signals that could not be put in any of these categories.

3.3.1 Training kNN models and changing the amount of features

As an initial approach, kNN models were trained where both the amount of features and the parameter k were varied. In Matlab, kNN models are trained using the built in function fitcknn [24], where k can be chosen by the user.

The training and evaluation was done as follows:

1. Choose a signal to add to the training dataset

2. Train kNN models, varying k from 1 to 50, computing the F1-score for each model on the validation set

3. Return to step 1 until all the signals have been used

At this stage signals were not chosen in a ”smart” way, but rather added one by one in the order they appeared in the tables. This stage of the project, as well as attending a machine learning course given by Mathworks, was used to find a good value for the parameter k.

3.3.2 Selecting features in a more careful way

After having established a good value for k in the previous step, more care was put into choosing what features to include in the models. Features can be selected by implementing a so-called greedy algorithm. This is achieved by training a models on features that are added one by one, always selecting the feature that performs best [25]. This approach is called greedy since it always chooses the best performing feature, ignoring possible synergies between features. Two features might not be good predictors individually, but together form

(30)

a strong predictor for the given problem. There is a chance that a greedy algorithm will miss these kinds of behaviors. In this project, features are selected using a greedy approach to see if there is a point where the performance of the models plateaus. The implementation can be written as follows:

1. Train one model per feature and validate all the models

2. Choose the feature that trained the best model

3. Train one new model per remaining feature plus the feature(s) chosen in step 2

4. Validate the models and return to step 2
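The greedy loop above can be sketched as follows. The scoring function here is a stand-in for ”train a model and compute its validation F1-score”; in the project each candidate feature set trained a real Matlab model, and the feature names below are invented:

```python
# A sketch of greedy forward feature selection, as described in the
# numbered steps above. The score function is a placeholder for
# training and validating a model on the candidate feature set.

def greedy_select(features, score, n_select):
    chosen = []
    for _ in range(n_select):
        remaining = [f for f in features if f not in chosen]
        # Train/score one candidate model per remaining feature
        best = max(remaining, key=lambda f: score(chosen + [f]))
        chosen.append(best)
    return chosen

# Toy score: each feature has a fixed individual value (no synergies,
# which is exactly the situation where a greedy choice works well)
value = {"temp_A": 0.6, "press_B": 0.3, "flow_C": 0.1}
score = lambda subset: sum(value[f] for f in subset)

print(greedy_select(list(value), score, 2))  # → ['temp_A', 'press_B']
```

With a score like this there are no synergies between features; a pair that is only strong together is exactly the case such a greedy loop can miss, as discussed above.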

3.3.3 SVM and linear regression

The greedy approach to selecting features was used when training the other two model types.

The SVM models were trained using Matlab’s built in function fitcsvm [26], using the default (linear) kernel.

The linear regression models perform regression, meaning the output is not a category but a number. The regression was turned into a classification by applying the -2 MW threshold to the output. The linear regression models were trained using the built in function fitrlinear [27], using the minimum MSE as the criterion for choosing features rather than using the maximum F1-score. Attempts to use the F1 score were made, but failed.


4 Results

All models in this section were trained on the training set and then validated using the validation set. That is, all the F1-scores and MSEs are the results from the models being applied to data they have never seen before. The results are presented in the following order:

• Performance of kNN models with different values of k and different amount of features.

• Comparison of performance of the three model types when adding features according to the greedy algorithm.

• Performance of the linear regression models, both in terms of MSE and F1-score.

• PCA performed on the different signal types and the number of dimensions needed to explain 99% of the variance.

• Performance of models trained on the data transformed using PCA.


4.1 Varying both k and amount of features

The results are from the initial attempts with varied k and features added in the order they appear in the tables, as described in chapter 3.3.1. This was the first implementation of the machine learning algorithms and was done to explore how performance was affected by the features and the choice of k. A goal with doing this was to find a good value for k which could later be used when comparing kNN to the other model types. The results are shown in figures 14 and 15.

Figure 14: Performance of kNN-models trained on flow signals varying both k and amount of features

As can be seen in figure 14, performance plateaus at around 0.8. The total number of models trained and evaluated to create this surface was 37*50, or 1850, models. The larger the value of k and the more features used in the models, the longer the training time.


Figure 15: Performance of kNN-models using different values of k, trained on ”pressure features”

The climb in performance along the direction of adding more features in figure 15 is steeper than in the case of flow signals. There were more signals here, so more models were trained to create the surface. In total 72*50, or 3600, models were trained, which took many hours. There seems to be a plateau in performance both in the direction of k and in the direction of the number of features. A k larger than about 7 seems to give no further improvement, and choosing features in the order they appear in the data also seems to plateau at around 10 features. Based on this, a k of 10 was chosen since that value is a bit into the plateau without being too large.

Training models in this way on the 178 temperature features took too much time. It was assumed that a value of k of 10 is good in the case of temperature signals as well.


4.2 Choosing features based on performance

With k set at 10, all three types of models were trained on flow signals, pressure signals and temperature signals. Features were added to the models one by one using the greedy algorithm. The results are shown in figures 16, 17 and 18.

Figure 16: Performance of the models on flow features, added with the greedy algorithm

kNN outperforms both SVM and linear regression. It also reaches its peak faster than the other two model types, meaning it requires less data and still outperforms the other two. Compared to the surface in figure 14, the peak (or plateau) of the F1-score is reached using fewer features.


Figure 17: Performance of the models on pressure features, added with the greedy algorithm

The same pattern can be seen in the case of pressure signals as in the case of flow signals: kNN outperforms linear regression. It was not possible to generate results for SVM models, however, because those models took too long to train. It can also be noted that the climb in performance for the kNN models is about as fast as in the case where the greedy algorithm was not used. One possible explanation for the similar performance with respect to the number of features is that the order of the signals in the non-greedy case happens to be such that the features that perform well are at the beginning of the list.


Figure 18: Performance of the models on temperature features, added with the greedy algorithm

Only kNN and linear regression models were trained in the cases of pressure signals and temperature signals, because training SVM models on the pressure and temperature data took too long. In the case of temperature features (figure 18), the program was halted after 50 features had been added; running the program on the entire set took too long. The behavior of the linear regression performance graphs is a bit more erratic than that of the other two. This might have to do with the fact that the linear regression models were optimized based on MSE and not F1-score.


4.3 Linear regression

Linear regression models were trained, adding features according to the greedy algorithm. The next feature was chosen by minimum MSE rather than maximum F1-score.

Figure 19: Performance of linear regression models trained on flow signals, choosing the best performing features

The F1-score increases, but not as steadily as in the cases where the optimisation was based on F1-score rather than MSE. The MSE for models trained on flow signals lands at a rather high value: with a deviation limit of 2 MW, an MSE of approximately 1.5 does not indicate a strong model.
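One greedy step of this MSE-driven variant can be sketched as follows. This is an illustration assuming scikit-learn; the names and the sign convention for "below" (deviation less than minus the 2 MW limit) are assumptions, not taken from the project code.

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, f1_score

LIMIT = 2.0  # MW; sign convention for "below" is an assumption

def best_feature_by_mse(X_train, y_train, X_val, y_val, chosen, remaining):
    """One greedy step: pick the candidate feature that minimises the
    validation MSE of the regression, then derive the F1-score by
    thresholding the predicted deviation at the limit."""
    best_feat, best_mse, best_pred = None, float("inf"), None
    for feat in remaining:
        cols = chosen + [feat]
        model = LinearRegression().fit(X_train[cols], y_train)
        pred = model.predict(X_val[cols])
        mse = mean_squared_error(y_val, pred)
        if mse < best_mse:
            best_feat, best_mse, best_pred = feat, mse, pred
    # classification is derived from the regression output afterwards
    f1 = f1_score(y_val < -LIMIT, best_pred < -LIMIT)
    return best_feat, best_mse, f1
```

The F1-score is computed only after the feature has been chosen, which is why it can fluctuate even while the MSE decreases monotonically.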


Figure 20: Performance of linear regression models trained on pressure signals, choosing the best performing features

The models trained on pressure signals perform in a similar way to those trained on flow signals. The MSE is however lower for the models trained on pressure signals.


Figure 21: Performance of linear regression models trained on temperature signals, choosing the best performing features

The linear regression models were the fastest to train and could even be evaluated on the entire temperature data set. There is a jump in the performance of the temperature features after 100 features have been added as can be seen in figure 21. The models were chosen according to the smallest MSE rather than the highest F1-score. Choosing the signals with the highest F1-score did not yield an increase in performance.


4.4 PCA

PCA was performed both on the full set of features and on each of the feature groups (temperatures, pressures and flows). The following Pareto plots show, for each case, how much of the variance in the data the principal components explain, both individually and cumulatively.

The PCA was applied to all of the data, meaning both the training set and validation set were transformed.
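The quantities plotted in the Pareto diagrams can be computed directly from a fitted PCA. This is a minimal sketch assuming scikit-learn; standardising the signals first is an assumption (PCA is sensitive to the differing scales of temperatures, pressures and flows).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def explained_variance(X):
    """Return per-component and cumulative explained-variance ratios,
    plus the smallest number of components reaching 99% of the variance."""
    Xs = StandardScaler().fit_transform(X)  # put all signals on the same scale
    pca = PCA().fit(Xs)
    individual = pca.explained_variance_ratio_
    cumulative = np.cumsum(individual)
    n_99 = int(np.searchsorted(cumulative, 0.99) + 1)
    return individual, cumulative, n_99
```

The bars in the figures correspond to `individual` and the line to `cumulative`.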

Figure 22: PCA on all signals
Figure 23: PCA on temperature signals

Figure 24: PCA on pressure signals
Figure 25: PCA on flow signals

The bars in the figures show how much of the variance is explained by each principal axis, and the line shows the cumulative variance explained. All signal types have around 90% of the variance explained by the first ten principal components. The temperature signals (as seen in figure 23) have the highest explained variance at ten principal components. Table 1 shows the number of principal axes needed to explain 99% of the variance for each signal type.


Table 1: Amount of principal axes needed to explain 99% of the variance in the different signal types

Signal type           Axes for 99% variance   Reduction in dimensionality
All signals           51                      303 → 51 (83%)
Flow signals          21                      37 → 21 (43%)
Pressure signals      25                      72 → 25 (65%)
Temperature signals   27                      178 → 27 (85%)

The most dramatic reduction in the number of dimensions is in the temperature signals, where 27 out of 178 dimensions (or about 15%) can explain 99% of the variance.

4.5 Training models after using PCA

Models were trained on principal axes explaining 99% of the variance in the data, as shown earlier in this chapter. This means the dimensions of the feature spaces were 21 for flow features, 25 for pressure features, 27 for temperature features and 51 for the entire data set.
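Training on the reduced representation can be sketched as follows, assuming scikit-learn, where a float passed as `n_components` selects the number of axes by variance fraction directly. Note that this sketch fits the PCA on the training set only, whereas in the project the transform was applied to all of the data; fitting on the training set alone avoids information from the validation set leaking into the score.

```python
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score

def train_on_pca(X_train, y_train, X_val, y_val, var=0.99, k=10):
    """Project the features onto the principal axes explaining `var` of
    the variance, then train and score a kNN classifier."""
    pca = PCA(n_components=var).fit(X_train)  # float in (0, 1): pick by variance
    Z_train, Z_val = pca.transform(X_train), pca.transform(X_val)
    model = KNeighborsClassifier(n_neighbors=k).fit(Z_train, y_train)
    return f1_score(y_val, model.predict(Z_val)), pca.n_components_
```

The same projection can be reused for the SVM and linear regression models, since the transformed training and validation matrices are independent of the downstream model.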

Table 2: Performance of the different model types when applied to data after PCA

                               Flow signals   Pressure signals   Temperature signals   All signals
kNN (k = 10) (F1-score)        0.81           0.83               0.84                  0.84
SVM (F1-score)                 0.76           0.76               0.80                  0.79
Linear regression (F1-score)   0.40           0.75               0.80                  0.82
Linear regression (MSE)        1.62           0.77               0.66                  0.54

One outlier in table 2 is the low F1-score (0.40) achieved on the validation set by the linear regression model trained on flow signals after PCA. The MSE is also a bit lower than for the linear regression models trained without PCA.


5 Discussion

5.1 Data

Data from the power production process are messy. Attempts were made to clean the data and to remove faulty samples and/or entire signals, but since there are a lot of signals there is a possibility that some faulty signals slipped through the filters and were used.

This project investigates deviations from the expected power production. A problem with the definition of ”expected” can appear here. The deviations are defined as the difference between expected output and actual output. This means that if the model that computes the expected value is an accurate model, the deviations should be centered around zero.

This is not the case in the data from Forsmark 1 or Forsmark 2. It is especially visible in Forsmark 2 after 2013 (when the power output was increased), where the deviation is almost constantly above zero. In other words, the deviation seems to be the new normal state for the reactor, since the expected value is almost constantly lower than the actual value. For this reason, the data chosen for applying the machine learning methods was from Forsmark 1 between 2015 and 2020. The data is somewhat centered around zero during that period, but there is a slight upward trend. It would be interesting to revisit how these deviations are defined and/or use some other metric of performance for the process, and then apply the machine learning algorithms tested in this project.

There was some discussion about whether or not to detrend the power deviation data, but in the end it was decided to use the raw data without detrending. Some attempts were made with detrended data, and the models performed well on these; however, since the project is based on deviations in the raw data provided, there were uncertainties about how to interpret the detrended results, which is why the raw data was used instead.

5.2 Classification or regression?

The power deviation problem was initially conceived as a classification problem. However, since the classification is based on a numeric variable (the deviation), models can also be trained to predict this value directly, without the extra step of the classification. The difference between the two approaches is whether the assignment of the categories "below" or "not below" is done on the actual data before the models are trained, or on the predicted values after the models have been trained. The second approach is a regression problem rather than a classification problem. A strength of this approach is that it preserves more information: not only can the categories be acquired, but the magnitudes of the deviations are also obtained, which might be of interest. This was realized quite late in the project, which is why there was only time to train simple linear regression models.
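The two approaches can be contrasted in a short sketch, assuming scikit-learn. The 2 MW limit and the sign convention for "below" are assumptions for illustration, as are the array names.

```python
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

LIMIT = 2.0  # MW; sign convention for "below" is an assumption

def classify_then_predict(X_train, dev_train, X_val, k=10):
    """Approach 1: assign the categories before training, fit a classifier."""
    labels = dev_train < -LIMIT
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, labels)
    return clf.predict(X_val)          # only the category survives

def predict_then_classify(X_train, dev_train, X_val):
    """Approach 2: regress the deviation itself, threshold afterwards."""
    reg = LinearRegression().fit(X_train, dev_train)
    pred = reg.predict(X_val)
    return pred < -LIMIT, pred         # categories AND magnitudes
```

The second function returns both the category and the predicted deviation, which is the extra information the regression formulation preserves.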

5.3 Model performance

Judging from the results, it seems that among the three kinds of features, temperature gives
