
Linköping University | Department of Computer and Information Science

Master’s thesis, 30 ECTS | Datateknik

2020 | LIU-IDA/LITH-EX-A--2020/008--SE

Machine Learning for Predictive Maintenance on Wind Turbines

Using SCADA Data and the Apache Hadoop Ecosystem

Behovsstyrt Underhåll av Vindkraftverk med Maskininlärning i Apache Hadoop

John Eriksson

Supervisor: Rouhollah Mahfouzi
Examiner: Martin Sjölund


Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Abstract

This thesis explores how to implement a predictive maintenance system for wind turbines in Apache Spark using SCADA data. How to balance and scale the data set is evaluated, together with the effects of applying the algorithms available in Spark MLlib to the given problem. These algorithms include Multilayer Perceptron (MLP), Linear Regression (LR), Random Forest (RF), Decision Tree (DT), Support Vector Machine (SVM) and Gradient Boosted Tree (GBT). This thesis also evaluates the effects of applying stacking and bagging algorithms in an attempt to decrease the variance and improve the metrics of the model. It is found that the MLP produces the most promising model for predicting failures on the given data set and that stacking multiple MLP models is a good way of producing a model with a lower variance than the individual base models. In addition to this, a function that estimates the potential savings is developed and is used together with a time window function to explore how decisive a model is. The conclusion is made that a model is more decisive if the failure it predicts occurs in a turbine where it has been trained on failure data from that same component, indicating that there are unknown variables that affect the sensor data.


Acknowledgments

I would like to acknowledge my supervisor Fredrik Eklund at Attentec for providing valuable ideas about how to proceed with the research and what to prioritize when stuck at a crossroads, as well as for providing valuable feedback on the report.

I would also like to acknowledge the work of my supervisor Rouhollah Mahfouzi and my examiner Martin Sjölund at LiU for helping me improve and finalize this thesis.


Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables
1 Introduction
   1.1 Motivation
   1.2 Aim
   1.3 Research questions
   1.4 Delimitations
2 Theory
   2.1 Literature Study Method
   2.2 Wind Turbines
   2.3 Wind Turbine Failure Modes
   2.4 SCADA Systems
   2.5 Big Data
   2.6 Big Data Tools
   2.7 Machine Learning Theory
   2.8 Predictive Classifiers
   2.9 Dimensionality Reduction
   2.10 Model Tuning
   2.11 Ensemble Methods
   2.12 Homogeneous Ensemble Algorithms
   2.13 Related work
3 Method
   3.1 Data Discovery
   3.2 Data Preparation
   3.3 Model Planning
   3.4 Model Building
   3.5 Final System
4 Results
   4.1 Data Discovery
   4.2 Data Preparation
   4.3 Model Planning
   4.4 Model Building
5 Discussion
   5.1 Results
   5.2 Method
   5.3 The work in a wider context
6 Conclusion
   6.1 Future Work


List of Figures

2.1 Wind turbine component scheme. Retrieved from energy.gov. Image is work in the public domain according to EERE copyright policy.
2.2 Subsystem downtime per turbine
2.3 Subsystem failure rate
2.4 Monolithic SCADA architecture
2.5 Distributed SCADA architecture
2.6 Networked SCADA architecture
2.7 Big Data Life Cycle
2.8 MapReduce word count example
2.9 Bias variance trade-off
3.1 Amb_Temp_Avg over time. This variable describes the measured average ambient temperature.
3.2 Prod_LatestAvg_TotReactPwr over time. This variable describes the measured total reactive power produced.
3.3 Scaler comparison for a feature with no outliers, using the data from the Amb_Temp_Avg variable. 3.3a, 3.3b and 3.3c illustrate the data distribution of this variable when scaled using the MinMaxScaler, StandardScaler and PowerTransformer-Yeo-Johnson respectively.
3.4 Scaler comparison for features with no outliers, using the data from the Prod_LatestAvg_TotReactPwr variable. 3.4a, 3.4b and 3.4c illustrate the data distribution of this variable when scaled using the MinMaxScaler, StandardScaler and PowerTransformer-Yeo-Johnson respectively.
3.5 Visualization of precision, sensitivity and specificity metrics in relation to the ratio between positive and negative samples in training data
3.6 Visualization of decision tree based bagging algorithm performance
3.7 Visualization of multilayer perceptron based stacking algorithm performance
3.8 System that creates a warning Kafka topic using the prediction models
4.1 Output from Kafka producer that creates a stream of turbine sensor measurements

List of Tables

2.1 Related work metrics
3.1 Turbine failure summary
3.2 Baseline theory about turbine failures with respect to training and testing data
3.3 Costs for component operations
3.4 50-50 ratio with normalizing scaler
3.5 50-50 ratio with standardizing scaler
3.6 60-40 ratio with normalizing scaler
3.7 60-40 ratio with standardizing scaler
3.8 70-30 ratio with normalizing scaler
3.9 70-30 ratio with standardizing scaler
3.10 75-25 ratio with normalizing scaler
3.11 75-25 ratio with standardizing scaler
3.12 80-20 ratio with normalizing scaler
3.13 80-20 ratio with standardizing scaler
3.14 90-10 ratio with normalizing scaler
3.15 90-10 ratio with standardizing scaler
3.16 PCA evaluation using data with 70-30 ratio
3.17 Algorithm metrics for models produced by the hyperparameter evaluation
3.18 Cross-validation comparison using two and five folds
3.19 Bagging model evaluation
3.20 Stacking model evaluation
3.21 Savings evaluations for the gearbox component
4.1 Metrics for training data
4.2 Metrics for testing data
4.3 Cost estimation for all components using unfiltered predictions
4.4 Cost estimation for all components using window filtered predictions
4.5 Outcome of model predictions of turbine failures with respect to training and testing data

1 Introduction

This chapter describes the motivation and concepts discussed in this report, as well as the research questions and delimitations.

1.1 Motivation

Renewable energy sources play an increasingly important role in the global energy mix, as the effort to reduce the environmental impact of energy production increases. Out of all the renewable energy alternatives, wind energy is the most developed technology worldwide with over 597 GW of capacity in 2018 [1].

Over an estimated wind turbine life span of 20 years, the cumulative operation and maintenance costs are estimated to be 65-90% of the total investment cost. These costs include crane costs and inflation rates. The lower estimate is based on the Danish fleet of 600 kW wind turbines while the higher estimate is based on 600-750 kW machines located in North America [2]. From another perspective, maintenance costs are estimated to constitute 20-25% of the levelized cost per kWh for wind turbines [3]. It is clear that operation and maintenance costs have an impact on the profitability of the wind farm and on the competitiveness of wind turbines compared to other green energy alternatives. However, this also means that there is great room for improvement using new technologies.

The U.S. Department of Energy has put together a guide to achieving operational efficiency where maintenance practices are explained [4]. They define maintenance as either proactive or reactive, where the aim of proactive maintenance is to correct the error before failure occurs, while reactive maintenance reacts only on errors or failures. They claim that even though reactive maintenance has low running costs it usually leads to increased costs due to unplanned downtime. Proactive maintenance can be grouped into two sub groups: preventive and predictive [4]. Preventive maintenance means that maintenance is performed on a time-based schedule. This leads to an increased product component life cycle in most cases and an estimated 12-18% lower cost compared to reactive maintenance. Predictive maintenance uses sensor information and analysis methods to measure and predict degradation and future component capability. The idea behind predictive maintenance is that failure patterns are predictable. If the time when a component will fail can be predicted accurately and the component is replaced before it fails, the costs of operation and maintenance will be much lower. Predictive maintenance leads to an estimated 8-12% cost saving compared to preventive maintenance as it is less labor intensive [4].

There are many reasons why proactive, and especially predictive, maintenance leads to lower maintenance costs, such as:

• Wind turbines are often located in remote locations and downtime can last for days before the required spare parts reach their destination.

• Errors classified as major failures, meaning that the failure has an associated downtime greater than one day, constitute 25% of all errors but are responsible for 95% of the downtime.

• Not only can predictive maintenance reduce the amount of failures by correcting errors, it can also reduce the amount of redundant hours being spent on routine controls or maintenance of well functioning components.

With recent advances in technologies related to the internet of things, predictive maintenance is starting to become the norm for industrial equipment monitoring. Still, about 30% of all industrial equipment does not benefit from predictive maintenance technologies and instead relies on periodic inspections to detect anomalies. A study on predictive maintenance techniques by Hashemian et al. put these numbers into the perspective of common failure models and presented the conclusion that predictive maintenance is preferred in 89% of the cases [5]. Another argument for monitoring presented in the same study is that SKF Group, a leading manufacturer and supplier of bearings and condition monitoring systems, stress tested bearings and measured the time to failure. They presented data that showed a seemingly uniformly distributed failure pattern. In addition, the study displayed a high range in durability, with some of the 30 bearings lasting fewer than 15 hours and one lasting for 300 hours. Since bearings are a key component in wind turbines, this is an indication that monitoring using sensors is crucial for predictive maintenance in wind turbines.

1.2 Aim

The main objective of this thesis project is to enhance the current body of knowledge regarding predictive maintenance on wind turbines by implementing a big data analysis on a real world data set. The evaluation is done by comparing the implications of using the chosen data analysis techniques in this report with the respective data from related research. All machine learning algorithms available in Apache Spark are considered and the implications of applying stacking and bagging to these algorithms are evaluated. The produced models are compared with the models presented in related research with regards to implementation, run time, usability and accuracy. The researched techniques were chosen in part due to discoveries made when studying research on predictive maintenance for wind turbines and in part when studying research on big data analysis methods.

1.3 Research questions

1. Can a system for predictive maintenance in wind turbines be implemented using Apache Spark?

2. Will a model created by applying bagging or stacking algorithms on several base models perform better than the base models?


3. Which algorithms available in Spark are eligible for stacking and bagging?

4. How does the final solution compare to the current state of the art in predictive maintenance for wind turbines with regards to implementation, run time, usability and accuracy metrics?

1.4 Delimitations

This is a study that focuses on the Hadoop Ecosystem and Spark in particular. Inherently, Spark does not support algorithms that consider time series, meaning that the models produced in this study have been trained on, and make predictions from, one row of data points at a time.

2 Theory

This chapter describes the theory necessary to understand the thesis report as well as relevant background that has influenced this research. First, Section 2.1 describes how information was found, in order to make it easier for the reader to find relevant literature for a similar study. In order to understand, validate and prepare the data for the algorithms, some knowledge of the turbine construction and its components is required. Therefore, section 2.2 presents the necessary theory regarding how wind turbines are designed and function with regards to components and sensors, and is followed by section 2.3 that connects this theory to wind turbine failure modes. Section 2.4 describes what SCADA data is and gives a historical overview of how the architecture of SCADA systems has evolved into the cloud based systems that are being developed today. This is followed by sections 2.5 and 2.6, which present what big data is, followed by an overview of the tools used to work with big data in this study. Following this is section 2.7 that presents descriptions of the algorithms that have been used and compared for the predictive models in this study. Lastly, related work on predictive maintenance with regards to its relation to SCADA, big data and machine learning is presented in section 2.13.

2.1 Literature Study Method

Search terms included predictive maintenance, wind turbines, big data frameworks, machine learning, failure modes, SCADA, Hadoop, Spark and combinations of these. The platforms used for knowledge discovery were Science Direct, ResearchGate, Google Scholar, Springer and IEEE Xplore.

A paper was read in its entirety if the abstract indicated that it would answer one of the following questions.

• How is the Hadoop Ecosystem being used for predictive maintenance?

• Which machine learning algorithms and methods are used for state of the art predictive maintenance?


Figure 2.1: Wind turbine component scheme. Retrieved from energy.gov. Image is work in the public domain according to EERE copyright policy.

• What does the fault process for wind turbine components look like?

2.2 Wind Turbines

The majority of wind turbines are horizontal axis wind turbines, as opposed to vertical axis turbines, and that is also the type of wind turbine considered in this study.

2.2.1 Design and Components

There are many variations and many schematics that can be applied to wind turbines. The schematic described here captures the significant components needed to create a reference guide throughout the paper.

Figure 2.1 provides an illustration of how the majority of the components described in the following list are placed in the turbine.

• Tower

Made from tubular steel, concrete, or steel lattice. Supports the structure of the turbine. Because wind speed increases with height, taller towers enable turbines to capture more energy and generate more electricity.

• Blades

The blades are shaped to create a pressure differential when air moves across them, causing them to lift in the upwards direction relative to the blade. Most wind turbines have three blades for several reasons. A turbine with two blades is prone to a phenomenon called gyroscopic precession, causing a wobbly motion and unnecessary stress on the components. Four blades or more would increase the torque, but also the loads on the tower, the wind resistance and the cost, making the wind turbine less cost effective [6].

• Rotor and Pitch System

The rotor is where the blades are connected to the hub. Inside the hub resides a pitch system that can control the pitch of the blades to control the rotor speed.


• Brakes

A mechanical disk brake designed to stop the rotor in case of an emergency. The mechanical brake is used in case of failure of the aerodynamic brake, or during a turbine service.

• Low-speed Shaft

The low-speed shaft connects the hub to the gearbox. It rotates at 20-60 rpm depending on the turbine model. The pipes for the hydraulics system that enables the aerodynamic brakes are contained in the low-speed shaft.

• High-speed Shaft

The high-speed shaft rotates at about 1,000 to 1,800 rpm depending on the model and the current wind speed. This is because the electrical generator requires a high rotational speed to produce electricity.

• Gearbox

The gearbox connects the low-speed shaft which is also known as the main shaft, to the high-speed shaft and increases the rotational speed of the high-speed shaft. The gearbox is an expensive and heavy part of the turbine.

• Generator

Generates AC or DC depending on the type.

• Yaw System

Consists of the yaw drive and the yaw motor, as well as the anemometer and the wheels and pinions needed to drive the component. The yaw system aligns the wind turbine properly using wind speed information from the anemometer. This is necessary in an upwind turbine, however a downwind turbine will achieve this naturally.

• Nacelle

The term for the housing containing all of the electrical components.

• Wind Vane

Measures wind direction. This information is used by the yaw system to align the turbine.

• Controller

Starts and turns off the wind turbine in a controlled manner depending on the wind speed.

• Electric System

Transformer, fuses, switches, cables and connections needed to carry currents and signals between components.

2.2.2 Sensors and Monitoring Solutions

Sensors are the very heart of all monitoring systems. Modern wind turbines are usually equipped with sensors that provide fault detection on either system or subsystem level. There are other solutions available in addition to the mentioned SCADA system. These include blade monitoring systems and holistic models that may take weather data such as temperatures and salinity, or hours of continuous work into account [7]. However, these solutions will not be considered further as they are out of scope for this thesis.


The SCADA system provides fault detection on a subsystem level. Parameters being monitored include generator rpm, generator bearing temperatures, oil temperature, pitch angle, yaw system, wind speeds and more [7].

2.3 Wind Turbine Failure Modes

It is clear that wind turbines are expensive equipment, both with respect to procurement and maintenance. It is also clear that the current state of the art research regarding maintenance is focused on predictive maintenance, as it has been shown to be the most cost effective maintenance method at this point in time. To be able to understand and validate the SCADA data, as well as to prioritize research efforts, it is important to get a picture of component failure rates and how these affect turbine productivity. A study published by the National Renewable Energy Laboratory did a review using survey data from six publications on wind turbine failure rates. The failures were compared with regards to failure rate per year and downtime per year. The reasoning was that downtime can be used as an indicator of the cost and effort to repair a component, as well as a direct measurement of lost revenue [8]. When comparing the data in Figure 2.2 and 2.3, it is found that some subsystems such as the gearbox or the blades and pitch system generate a high amount of downtime even though they rarely fail. The electric system on the other hand fails much more frequently, however the downtime per failure is much lower.

Figure 2.2: Subsystem downtime per turbine (hours of downtime per year for each subsystem)

Figure 2.3: Subsystem failure rate (number of failures per turbine per year for each subsystem)

2.4 SCADA Systems

SCADA (Supervisory Control And Data Acquisition) systems can be used for both monitoring and controlling industrial systems remotely and provide an efficient way for industries to gather and analyze data in real time. Hundreds of thousands of sensors may be used in larger SCADA systems, generating large amounts of data.

2.4.1 Historical Overview

The first SCADA systems were developed in the late 1960s and have played an important part in improving maintenance efficiency ever since. Vendors of SCADA systems usually release one major and two minor versions every year to take advantage of new technological advances and meet the requirements of their customers, meaning that SCADA systems historically have stayed relatively up to date with technological progress. The first iteration of SCADA systems was based on a centralized computing architecture and is referred to as monolithic or stand-alone. This architecture is described in Figure 2.4. As internet technology advanced in the 1990s together with system miniaturization, SCADA systems adapted and were developed to run on distributed computing architectures. This improved the response times and reliability of the system, as more computing capacity and redundancy could be introduced at a lower cost [9]. This architecture is described in Figure 2.5.

Figure 2.4: Monolithic SCADA architecture

Figure 2.5: Distributed SCADA architecture

2.4.2 From Distributed to Cloud

Traditionally, SCADA servers have been large and expensive with an expected life span of 8 to 15 years. After that, the system is replaced and the old hardware is usually discarded. With a more open architecture, such as a cloud computing based solution, the lifetime of the system can be improved even further [10]. There have been multiple attempts at describing SCADA systems using a generalized architecture, but no single standard exists. In Church et al. "SCADA Systems in the Cloud" in Handbook of Big Data Technologies, some of the key attempts are summarized. Based on the IEEE Standard for SCADA and Automation Systems, a generalized cloud based architecture for a SCADA system is proposed [11]. When comparing this cloud based architecture to the generalized architecture proposed in What is SCADA? from 1999 it is clear that the body of knowledge regarding networked distributed computing and scalable solutions has improved [12]. The older architecture did have a network based distributed computing architecture in mind, which can be observed as the file server and the control server are described using similar internal structures. In the newer architecture however, each sensor is connected to a field device that has an internal processor, memory, power supply and network interface. Field devices are grouped and connected to a device server. The module responsible for reading and writing data to the data processing module is moved from the server to the field device to enable parallel reads and writes, thus making it possible to utilize one advantage of cloud computing. The field devices are connected to the file server to achieve a scalable solution where field devices can be added to the system when needed. This third generation of SCADA systems is referred to as networked SCADA systems and can be observed in Figure 2.6. One of the main improvements compared to the distributed architecture comes from the use of WAN protocols for communicating with servers and equipment, allowing the system to be spread across multiple LAN networks and thus also geographically, allowing for more cost-effective scaling for very large scale SCADA systems [11].

Figure 2.6: Networked SCADA architecture

2.5 Big Data

During the last three decades, there has been an exponential increase in data volumes. In the 1990s, data was measured in terabytes and could be managed using standard relational databases. A decade later, the data volumes had increased to being measured in petabytes. The increase in volume stems from an increase in connected hardware such as industrial machines, but also content repositories and network attached storage systems. Moving forward to the 2010s, data is being measured in exabytes, even though there are few applications or companies that store or process close to an exabyte of data. Everything from machines and human interaction with machines to the actual processing of data generates data. Mobile sensors, surveillance, smart grids, medicinal imaging, gene sequencing and more are driving this modern age deluge of data and it is clear that a paradigm shift has happened [13].

Big data has become a buzzword and is sometimes misused. Big data is not a framework or a technology in itself, but rather a problem statement. Tools designed to handle big data were designed with the size and complexity of big data in mind and might not be the best choice unless the data in question is actually big data. The first step towards choosing the right tools should therefore be to understand what big data is. Big data is a very wide term and consequently, there are many definitions for it. The most well known definition is what is described as "3V". The term springs from the words Volume, Variety and Velocity. Another common definition is known as "5V". This term includes the "3V" and adds Value and Veracity to the list of V's [14].

• Volume - big data volume is always increasing and can be billions of rows and millions of columns.

• Variety - big data reflects the variety of data sources, formats and structures. It can be both structured and unstructured, or a combination of both. This increases the complexity of storing and analysing the data.

• Velocity - big data can describe high velocity data, with high speed data ingestion and data analysis. Handling these demands is often a challenge.

• Value - a research project that does not produce value is not worth the investment, however it can be difficult to determine if and when big data research will deliver the desired value.

• Veracity - it is crucial to ensure correctness and accuracy of the obtained data. Factors to consider include trustworthiness, authenticity, accountability and availability.

Lately, the focus of big data research regarding storage and computing solutions has shifted from Message Passing Interface (MPI) and Distributed Database Management Systems (D-DBMS) to Cloud Computing. Reasons for this shift are the elasticity regarding the usage of computing resources and space, as well as the flexible costs and lower management efforts that are associated with Cloud Computing [15].

2.5.1 The Big Data Life Cycle

Jagadish et al. mention a best practices guide for big data where incremental steps and conditions for moving forward to the next step are defined [16]. This approach, illustrated in Figure 2.7, is the approach that was used during the research presented in this thesis to ensure that raw data, processed data, and models met the requirements before moving forward to the next step.

• Discovery. Learn the domain, find data sources and evaluate the quality and sustainability of the data sources. Define the problems, the aims and the hypotheses of the project.

• Data Preparation. Clean, integrate, transform, reduce and discretize the data according to the project needs. If analytical models are fed with poor quality data, the predictions will most likely be suboptimal or even misleading.

• Model Planning. The techniques, methods and workflows for the models are chosen. Ensure that the choices made will enable the earlier defined hypotheses to be proven or disproved.

• Model Building. The models defined in the model planning step are developed and executed. The models are fine tuned and the results are documented.

• Communicate Results. The criteria for success and failure are evaluated against the outcome of the research by assessing the results of the models. The results are discussed and recommendations for future research are made.

• Operationalize. The models are deployed and tested in a small scale production-like environment before a full deployment is made.

Figure 2.7: Big Data Life Cycle

2.6 Big Data Tools

The advances of wireless sensor technology and the introduction of SCADA systems have provided companies with new and easier ways of collecting more data about the performance and degradation of their industrial machines. One of the challenges that developers of such systems for wind turbines have to overcome is that the daily data volumes produced by a SCADA system are too large to be processed with traditional technology [17]. Even though some analysis could be made using traditional methods, an important part of big data analysis is ensuring a response within an acceptable time. This is where a natural connection between big data analysis tools and predictive maintenance for wind turbines is made. As data volumes are increasing at a faster rate than the computing resources meant to analyse them, new methods and tools have had to be discovered in order to facilitate these needs. These tools have made it possible not only to analyse more data, but also to discover new analysis methods made possible by the access to these kinds of data volumes. This section will cover Hadoop and relevant parts of the so-called Hadoop Ecosystem that has sprung from its creation.

2.6.1 Hadoop

Hadoop is frequently mentioned as the number one framework for big data management. It was introduced by Apache in 2007 as an open source implementation of the MapReduce processing engine bundled with a distributed file system. Hadoop thus solves the scalability problem of MapReduce on its own by using distributed storage and processing. Hadoop provides an extensible platform for applications that process large data volumes, such as machine learning. Because of that, many open source and commercial extensions have been based upon Hadoop since its release and it has grown into what is known as the Hadoop Ecosystem. The components of the Hadoop Ecosystem can be described as follows [18], [19].

Major Components

• HDFS: Hadoop Distributed File System
• YARN: Yet Another Resource Negotiator
• MapReduce: Programming based data processing
• Common: Set of common utilities needed by other components

Extensions

• Spark: In-memory data processing
• Kafka: Distributed publish-subscribe message streaming system
• HIVE: SQL-like data querying
• Pig: High level scripting
• HBase: Column oriented data store built on top of HDFS
• Mahout, Spark MLlib: Machine learning algorithm libraries
• Solr, Lucene: Searching and indexing
• Oozie: Job scheduling
• Hue: Web interface

HDFS

The Hadoop Distributed File System is designed specifically to store large amounts of structured and unstructured data across multiple nodes. There are two major components in HDFS: the name node and the data node. HDFS is designed using a master-slave architecture. The name node is the master, which holds references to file locations and metadata. Its primary responsibility is directing traffic to the data nodes. The data nodes are the slaves in this system. They can consist of commodity hardware which increases scalability, as commodity hardware is readily available, cheap and easily extendable.

HDFS stores files in blocks that it distributes over the cluster. A block size is typically 64 MB. If possible, the file blocks are stored on different machines, enabling parallel map step operations on the blocks. This design entails that for a system that has many files smaller than the block size, HDFS is most likely not the best solution.

YARN

YARN is a resource manager. As such, it schedules and allocates resources for the Hadoop system across the clusters. YARN was introduced together with Hadoop 2.0 in 2012 to handle some of the deficiencies of the older Hadoop version, where the MapReduce module was responsible for resource management and job scheduling. This introduced the possibility of running other types of distributed applications beyond MapReduce within the Hadoop framework. There are three main components in YARN: resource manager, node manager and application manager.

MapReduce

The MapReduce framework is used to break a task into smaller tasks, execute them in parallel and collect the individual outputs. As can be gathered from the name, a MapReduce job consists of two phases, a map phase and a reduce phase. The mapper contains the logic to be processed on each data block, which produces key/value pairs that are sent to their respective reducer based on the key value. A reducer thus receives many key/value pairs from multiple mappers, which it then aggregates according to the defined reducing logic into a smaller set of key/value pairs that are the final outputs. The MapReduce framework may be best explained with an example. In Figure 2.8, a word count algorithm is illustrated. The input data is split and stored on different machines according to block size, called map nodes. The map nodes execute the job, which is to count how many times each word occurs, and output the pairs. The map nodes are also responsible for the shuffling phase that sorts the output and writes it to disk. The sorted output is then sent to the reducer nodes, which reduce the output and write that part of the final result to the output folder.

One of the main deficiencies with MapReduce is that it is difficult to design an algorithm that uses iterative processing, which is common for machine learning and graph applications. Iterative computations in MapReduce not only require careful manual programming and scheduling of multiple MapReduce jobs but are also slow as the data is written and read from disk between each iteration. In addition to this, MapReduce can’t do real-time analysis as it was designed to do batch processing.

Figure 2.8: MapReduce word count example
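To make the phases concrete, the same word count can be expressed in a few lines of PySpark. This is an illustrative sketch and not code from the thesis; the input path words.txt is a hypothetical example file.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("WordCount").getOrCreate()

    lines = spark.sparkContext.textFile("words.txt")          # input + splitting into partitions
    pairs = lines.flatMap(lambda line: line.split()) \
                 .map(lambda word: (word, 1))                  # mapping: emit (word, 1) pairs
    counts = pairs.reduceByKey(lambda a, b: a + b)             # shuffling by key + reducing
    print(counts.collect())                                    # final result, e.g. [('Bear', 2), ('Car', 3), ...]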

2.6.2 Spark

Spark started as a project at the University of California, Berkeley, but is now a top-level project supported by Apache. Spark is based on MapReduce and is designed to resolve some of the deficiencies of MapReduce mentioned above. It supports iterative algorithms and provides fault tolerance without replication through its data storage model called Resilient Distributed Dataset (RDD). An RDD is an immutable data set that remembers each deterministic operation that was performed. As such, in case a worker node fails, the RDD can be recreated using the operation lineage. RDDs are primarily used for manipulating data with functional programming constructs. For high-level expressions, Spark has introduced DataFrames and Datasets.

Spark has been proven to be fast and highly scalable. In an article by García et al. [20], Spark is compared to Apache Flink by implementing the two popular machine learning algorithms SVM and LR and comparing the speed and scalability of the training process. Another article by Xianrui et al. [21] compared Spark to MapReduce with regards to speed and scalability by implementing five different algorithms. It was found that “MapReduce’s scheduling overhead and lack of support for iterative computation substantially slow down its performance on moderately sized datasets. In contrast, MLlib exhibits excellent performance and scalability, and in fact can scale to much larger problems”.

DataFrames and Datasets

When working with Spark, one will also encounter the terms DataFrame and Dataset. A DataFrame is described as a two-dimensional structure where each column contains values concerning one variable and each row contains one set of values. The Spark DataFrame was introduced as an extension of RDDs to improve the performance and scalability of Spark for semi-structured and structured data, as well as providing developers with high-level abstractions. As an example, these abstractions give developers access to SQL queries and to MLlib's Machine Learning API, making them very useful for machine learning applications.

2.6.3 Kafka

Kafka works on a publish-subscribe basis and delivers a fault tolerant messaging system that is scalable and distributed by design. It achieves fault tolerance by replicating messages within the cluster. Kafka has many use cases, for example aggregating statistics and logs from distributed applications and making these available to multiple consumers, or stream processing. Relative to many other messaging systems, Kafka has a low overhead because it keeps messages for only a set amount of time and thus makes the consumer responsible for tracking relevant messages [22].

Spark has an API named Spark Streaming which is used to integrate Kafka with Spark. It enables Spark to ingest the data stream in a scalable and fault-tolerant manner, divide it into batches and process the data using the Spark engine. Finally, the processed data can be either stored on the HDFS or pushed to another Kafka topic to be consumed by a subscriber.
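As an illustration of this integration, the sketch below reads one Kafka topic from Spark and writes records to another. It uses the newer Structured Streaming API rather than the DStream-based Spark Streaming API mentioned above, it assumes the spark-sql-kafka connector package is on the classpath, and the broker address and topic names are hypothetical.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("TurbineStream").getOrCreate()

    # Subscribe to the topic carrying raw sensor measurements.
    raw = (spark.readStream
                .format("kafka")
                .option("kafka.bootstrap.servers", "broker:9092")
                .option("subscribe", "sensor-measurements")
                .load())

    # Kafka delivers the payload as bytes in the "value" column; cast it to a string
    # before parsing it and feeding it to a prediction model.
    measurements = raw.selectExpr("CAST(value AS STRING) AS payload")

    # Write the (here unmodified) stream to a second topic, e.g. one carrying warnings.
    query = (measurements.selectExpr("payload AS value")
                         .writeStream
                         .format("kafka")
                         .option("kafka.bootstrap.servers", "broker:9092")
                         .option("topic", "warnings")
                         .option("checkpointLocation", "/tmp/stream-checkpoint")
                         .start())
    query.awaitTermination()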

2.7 Machine Learning Theory

Machine learning (ML) has, just like big data, become a buzzword as it has boomed in popularity. It is sometimes incorrectly used to describe what is really Artificial Intelligence (AI). In the same way, Deep Learning (DL) is sometimes used to describe what is ML. To clarify, ML is a subset of AI that uses statistical methods to enable machines to improve as they are exposed to more data, whereas AI is a much broader term that can be used to describe all techniques that mimic human behaviour. DL is a subset of ML where the models being used are based on artificial neural networks and have one or more intermediate layers between the input and the output layers. Additionally, the model parameters of the intermediate layers are learned using outputs of the preceding layers, instead of being learned directly from the features of the training data [23]. ML can be used for a wide range of real-world applications such as image processing, natural language processing, computational finance and more1.

When choosing which algorithm to use for a specific problem, it is important to know the difference between supervised and unsupervised learning as well as training and validation methods to be able to explore and prepare the data correctly.

2.7.1 Model Training and Validation

It is common to divide the data into three parts: training, validation and testing data sets. The training data is used to fit the initial model and the validation data is used to provide an unbiased metric of how well the model generalizes with some hyperparameters of choice, meaning that the validation data is used to select the model with the best hyperparameters. When the best model is trained and found, the testing data is used to evaluate how well the model generalizes using a data set that has not affected the model in any way. It is common to first split the data into two, where the first part contains both training and validation data and the second part is the testing data. The first part is then split according to the needed amounts of training and validation data. A model with many hyperparameters is more difficult to tune and will require a higher amount of validation data. The split ratio is dependent on the number of hyperparameters that the model has, as well as the amount of data available. As a rule of thumb, sklearn uses a default ratio of 75% training and 25% testing data2.

1 https://www.mathworks.com/content/dam/mathworks/tag-team/Objects/i/88174_
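As a minimal sketch of such a split in Spark, assuming a prepared DataFrame df with feature and label columns, randomSplit can produce the three parts in one pass; the 60/15/25 proportions below are illustrative only.

    # Split into training, validation and testing data in one pass.
    train, validation, test = df.randomSplit([0.60, 0.15, 0.25], seed=42)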

There are some important metrics for evaluating a prediction model, which are defined below. These are recurring in related studies and will be used throughout this thesis to benchmark model performance. TP, TN, FP and FN stand for true positive, true negative, false positive and false negative.

Sensitivity = \frac{TP}{TP + FN} = \frac{\text{Correct Positive Predictions}}{\text{All Positive Values}}

Specificity = \frac{TN}{TN + FP} = \frac{\text{Correct Negative Predictions}}{\text{All Negative Values}}

Precision = \frac{TP}{TP + FP} = \frac{\text{Correct Positive Predictions}}{\text{All Positive Predictions}}

Accuracy = \frac{TN + TP}{TN + TP + FN + FP} = \frac{\text{Correct Predictions}}{\text{All Values}}
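For reference, the four metrics can be computed directly from the confusion-matrix counts; the small helper below is an illustrative sketch and not part of the thesis implementation.

    def classification_metrics(tp, tn, fp, fn):
        """Return sensitivity, specificity, precision and accuracy from confusion-matrix counts."""
        return {
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
            "precision":   tp / (tp + fp),
            "accuracy":    (tp + tn) / (tp + tn + fp + fn),
        }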

2.7.2 Supervised Learning

Supervised learning can be used when each sample of the data set used to train the model has a set of known input values x as well as an output value y. The algorithm at hand is then used to find a mapping function f(x) = y. This is a common case for classification and regression problems, which constitute the majority of machine learning problems. Understanding the difference between these is key to classifying whether the machine learning task is a classification or a regression problem.

Classification problems consist of creating a mapping function such that it maps the input to a discrete or categorical value. Regression problems, on the other hand, create a mapping function to a continuous variable. With this in mind, predictive maintenance can be either a regression or a classification problem, depending on whether the output variable used to train the model is designed as a categorical value such as a warning level, or a continuous value such as estimated time to failure [24].

2.7.3 Unsupervised Learning

Unsupervised learning is used to make inferences from data when the output values for a given set of input values are unknown. Because the output data is unknown, applying regression directly is not possible. Using some technique, the input is interpreted and grouped by finding previously unknown patterns or underlying structures in the data. Common tasks include clustering and association used for exploratory analysis and dimensionality reduction [25]. The only unsupervised learning technique used during this thesis is Principal Component Analysis (PCA) and therefore the report will not describe any other unsupervised techniques more in depth.

2 https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_


2.8 Predictive Classifiers

This section will present the fundamental theory for each of the algorithms that have been considered for building predictive models in this thesis. Knowing fundamental theory about the algorithms being considered for the solution is key to selecting the best algorithm and evaluating it in a correct manner. The following information is gathered from The Hundred-page Machine Learning Book [23].

2.8.1 Decision Trees

A decision tree is built by repeatedly splitting a set of data into subsets and choosing the split that minimizes the entropy. The algorithm stops either when the tree reaches a configured maximum depth d, or when all possible splits reduce the entropy by less than e. It is necessary to be aware of and explore these parameters cautiously. An example of what can happen is that a very tall decision tree will model insignificant noise and overfit to the training data, meaning that it will perform very well on the training data, but perform poorly on future examples. The final result of the decision tree algorithm is an acyclic graph that can be used to make decisions by inspecting one feature at a time.
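For reference, the entropy minimized at each split can be written in its standard form (this definition is added here for clarity and is not quoted from the thesis), where $p_c$ is the proportion of samples in the subset $S$ that belong to class $c$ of the class set $C$:

    H(S) = -\sum_{c \in C} p_c \log_2 p_c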

2.8.2 Support Vector Machine

Support Vector Machines use the dot product between feature vectors to solve an optimization problem that consists of finding the hyperplane that has the greatest margin to the nearest point of any class. Finding the hyperplane with the largest margin is important as that contributes to how well the model will function for future examples. However, outliers may affect the SVM such that the data is not linearly separable. For those cases, a hinge function is used that introduces a trade-off between decreasing the margin size in order to classify the training data well and the ability to classify future examples well. For cases where the data is inherently non-linear, SVM can be extended by the use of kernels to make non-linear classification models. The final result is a model that can assign a feature vector to one of two categories.

2.8.3 K-Nearest Neighbors

K-Nearest Neighbors (KNN) produces a model which is a collection of all the training samples. It works by comparing the new data point to all the previously recorded samples. Once it has decided the K nearest samples, the new data point is assigned the label that the majority of those K samples has. The distance function for comparing the distance between data points needs to be chosen by the data analyst, however Euclidean distance is frequently used.

2.8.4 Neural Networks

Neural networks consist of an input layer, one or more hidden layers and an output layer. In each layer is a number of neurons and inside every neuron is a core with the assignment of using an activation function to produce an output given an input. There are many types of neural networks, however most are not relevant for the case of using tabular multivariate time series data for predictive maintenance. To further narrow it down, the only type of neural network provided by Spark MLlib is the Multilayer Perceptron Classifier.


Multilayer Perceptron

The Multilayer Perceptron Classifier (MLP) is a feedforward neural network that has been described as the classical type of neural network, or vanilla neural network [26]. MLPs are suitable for classification prediction problems as well as regression prediction problems, making them highly relevant for predictive maintenance. An MLP trains by solving an optimization problem where each neuron weight is optimized through gradient descent and back propagation techniques. The metric used for solving the optimization problem is the mean squared error between the model output and the target values.
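A minimal sketch of this classifier in Spark MLlib is shown below, assuming training and test DataFrames train and test with assembled "features" and "label" columns; the layer sizes and iteration count are illustrative.

    from pyspark.ml.classification import MultilayerPerceptronClassifier

    # layers = [number of input features, hidden layer sizes..., number of classes]
    mlp = MultilayerPerceptronClassifier(layers=[10, 16, 8, 2], maxIter=200, seed=42)
    mlp_model = mlp.fit(train)
    predictions = mlp_model.transform(test)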

2.9 Dimensionality Reduction

Dimensionality reduction can be used to speed up model training by allowing simpler models to be used. With modern techniques such as cloud computing and improved graphical processing units, dimensionality reduction is less important now compared to the past. Another, more common, use case nowadays is to visualize higher dimensional data [23]. One effect of dimensionality reduction is that it removes redundant or highly correlated features and thus removes noise in the data. Therefore, it can also be used to reduce confusion and improve model precision when working with complex data, as can be seen in an article by Meigarom Lopes [27].

2.9.1 Principal Component Analysis

PCA is an unsupervised method that is used to find the linear combination of variables that maximizes the variance in the variable space. This means that the new set of components will be of a lower dimension, while still retaining all or most of the information needed to find patterns in the data.
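A minimal sketch of PCA in Spark MLlib, assuming a DataFrame df with an assembled "features" vector column; k = 10 is an illustrative number of components.

    from pyspark.ml.feature import PCA

    pca = PCA(k=10, inputCol="features", outputCol="pca_features")
    pca_model = pca.fit(df)
    reduced = pca_model.transform(df)
    print(pca_model.explainedVariance)   # variance captured by each principal component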

2.10 Model Tuning

The prediction algorithms do not inherently tune hyperparameters. Instead, these parameters have to be tuned manually.

2.10.1 Grid Search

Grid search is a simple technique for hyperparameter tuning where a number of values are entered for each variable that requires optimization. All possible combinations of these variables are then tested in a parameter search and the best performing model is kept.

2.10.2 K-Folds Cross-Validation

In order to evaluate how well a certain model using the hyperparameters currently under test generalizes to an independent data set, k-fold cross-validation may be applied. The training data set is split into k partitions, where each partition is used once as validation data while the remaining partitions are used as training data. The result is the average evaluation of k models trained on different subsets of the data set. Combining cross-validation with grid search is a powerful technique for finding good hyperparameters that usually gives a more sound estimation of how well the parameters are performing, however it can be computationally expensive as every model is trained k times instead of once.
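The two techniques are typically combined; the sketch below wires a grid search into a five-fold cross-validation in Spark MLlib, assuming a DataFrame train with "features" and "label" columns. The estimator choice and grid values are illustrative only.

    from pyspark.ml.classification import DecisionTreeClassifier
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    dt = DecisionTreeClassifier()
    grid = (ParamGridBuilder()
            .addGrid(dt.maxDepth, [5, 10, 15])
            .addGrid(dt.impurity, ["gini", "entropy"])
            .build())

    cv = CrossValidator(estimator=dt,
                        estimatorParamMaps=grid,
                        evaluator=BinaryClassificationEvaluator(),
                        numFolds=5)
    best_model = cv.fit(train).bestModel   # model fitted with the best-performing grid point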

Figure 2.9: Bias variance trade-off

2.11 Ensemble Methods

This section will present ensemble methods as a concept and three different methods for producing an ensemble model: bagging, boosting and stacking. Ensemble methods are algorithms that are based on the idea that multiple models can be combined to create a single model that is better than any of the base models. Bagging and boosting consider homogeneous weak learners, meaning that the weak learners are different variants of the same algorithm. Stacking, however, often considers heterogeneous weak learners, meaning that it combines multiple models based on different learning algorithms.

The terms bias and variance will be frequently used in this section. A model with a high bias is prone to underfitting, while a model with a high variance is prone to overfitting. A model with a high bias will make incorrect predictions both on the training data and on the test data, whereas a model with a high variance will be very accurate on the training data but make incorrect predictions on the test data. As is illustrated in Figure 2.9, there generally exists an optimal balance between bias and variance. This is known as the bias variance trade-off. An explanation of why ensemble methods work is that each model can be viewed as a signal and some surrounding noise, meaning that a model usually has either a bias or a variance that is too high. Assuming that the noise is evenly distributed around the true value, averaging the models cancels out the noise and finds a balance. Ensemble methods are divided into parallel and sequential methods. Parallel methods combine base models that can be trained independently of each other, while sequential methods are limited to training one base model at a time as the current model relies on information that the previous model generates.

2.11.1 Bagging

Bagging, which stands for Bootstrap Aggregating, is a method used to improve stability and accuracy while at the same time decreasing the variance of a model by avoiding overfitting. Bagging does not reduce the bias of the models, however, and therefore base models with a low bias and a high variance should be selected. Bagging achieves a decrease in variance by averaging multiple base models that are trained using random subsamples of the training data. The first step in this process is the bootstrap sampling, which selects a number of samples at random for every base model. The second step aggregates the base models into a final model. Averaging or voting is used for aggregating the models in an optimal way.
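A minimal bagging sketch on top of Spark MLlib is shown below; it bootstraps the training DataFrame, fits one MLP per sample and combines the predicted labels by majority vote. All names (train, test, layer sizes, number of base models) are illustrative and not taken from the thesis implementation.

    from pyspark.ml.classification import MultilayerPerceptronClassifier
    from pyspark.sql import functions as F

    n_models = 5
    base_models = []
    for i in range(n_models):
        # Bootstrap sample: draw with replacement from the training data.
        sample = train.sample(withReplacement=True, fraction=1.0, seed=i)
        mlp = MultilayerPerceptronClassifier(layers=[10, 16, 2], maxIter=100, seed=i)
        base_models.append(mlp.fit(sample))

    # Collect one prediction column per base model, then take a majority vote per row.
    preds = test
    for i, model in enumerate(base_models):
        preds = (model.transform(preds)
                      .withColumnRenamed("prediction", f"p{i}")
                      .drop("rawPrediction", "probability"))
    vote = sum(F.col(f"p{i}") for i in range(n_models)) / n_models
    bagged = preds.withColumn("prediction", F.round(vote))   # 0/1 majority vote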


2.11.2 Boosting

Boosting has the objective of converting weak learners to strong learners by reducing the bias of the model, with the drawback that overfitting may be increased. Therefore, weak learners with a high bias and a low variance should be selected when using boosting. A weak learner can be represented by a single model, or a combination of models, but it must be better than random chance. The method is based on the question posed by Kearns and Valiant: "Can a set of weak learners create a single strong learner?" [28] Boosting does not take subsamples from the training data, but instead uses the complete data set together with a weighting system to make future models focus on classifications which were perceived as difficult by previous models. For each model being added, the data weights are readjusted to increase the weights for data that have been previously misclassified and decrease the weights for points that are easy to classify. Each weak learner is given a voting weight based on its accuracy, such that models with a higher accuracy have a stronger vote. This vote is later used when combining the weak learners into a more complex, strong learner.

2.11.3 Stacking

Stacking utilizes a learning algorithm, referred to as a meta learner, that learns how to combine base-level model predictions into a new model. As previously stated, the base-level models used with stacking are often heterogeneous, however one may use homogeneous models if desirable. As noted by Saso Džeroski and Bernard Ženko [29], one may have to experiment when deciding the number of base-level models and which algorithm to use for the meta learner. A good starting point, however, is to consider between three and seven base-level models. They propose a solution called multi-response linear regression for the meta learner and show that it outperforms other stacking approaches. One may have more than one layer in a stacking ensemble. In such a model with, for example, three layers, the base layer would be connected to a number of meta learners that are connected to a final meta learner. This is suggested by Rodolfo Lorbieski and Silvia Modesto Nassar [30] as the ensemble alternative that is most likely to increase the accuracy of the model, with the drawback that the computing time is heavily increased.
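A minimal stacking sketch is shown below: the base models' predicted labels are assembled into a feature vector and a logistic regression meta learner is trained on them. It assumes a list base_models of already fitted Spark classifiers and DataFrames meta_train and test; all names are illustrative and not taken from the thesis implementation.

    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import VectorAssembler

    def add_base_predictions(df, models):
        # Append each base model's predicted label as a new column p0, p1, ...
        for i, model in enumerate(models):
            df = (model.transform(df)
                       .withColumnRenamed("prediction", f"p{i}")
                       .drop("rawPrediction", "probability"))
        return df

    base_cols = [f"p{i}" for i in range(len(base_models))]
    assembler = VectorAssembler(inputCols=base_cols, outputCol="meta_features")

    meta_learner = LogisticRegression(featuresCol="meta_features", labelCol="label")
    meta_model = meta_learner.fit(assembler.transform(add_base_predictions(meta_train, base_models)))
    stacked = meta_model.transform(assembler.transform(add_base_predictions(test, base_models)))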

2.12 Homogeneous Ensemble Algorithms

This section presents the common algorithms random forest and gradient boosting, which are inherent to Spark MLlib and thus can be used and evaluated with no further implementation work required.

2.12.1 Random Forest

The random forest ensemble is a method that can be used for both classification and regression problems. It exists to solve the problem of overfitting that decision trees are prone to. Overfitting means that the model performs very well on the training data but does not generalize well. The reason why this can happen for a decision tree is that the tree is fitted too closely to the training data and thus ends up with branches that make strict decisions on very small amounts of data. A random forest is an aggregated model of a collection of trees that are trained on subsets of the training data. These subsets are created using the standard bagging method, with the small twist that only a subset of the features is selected for each tree. For tabular data, this means that a subset of rows is selected and then a subset of columns is picked from these rows. Apart from the random feature selection, the training process is identical to that of decision trees.
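For reference, this is how the random forest available in Spark MLlib could be configured and trained; the parameter values and column names are placeholder assumptions to be tuned for the data set at hand.

    # Illustrative Spark MLlib random forest configuration (placeholder values).
    from pyspark.ml.classification import RandomForestClassifier

    rf = RandomForestClassifier(
        featuresCol="features",            # assembled SCADA feature vector
        labelCol="label",                  # e.g. 1 = failure within the prediction window
        numTrees=40,                       # number of trees in the forest
        maxDepth=10,                       # depth limit per tree to curb overfitting
        featureSubsetStrategy="sqrt",      # random subset of features considered per split
        seed=42,
    )
    rf_model = rf.fit(train_df)            # train_df is the prepared training DataFrame
    predictions = rf_model.transform(test_df)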


2.12.2 Gradient Boosting

Like random forests, gradient boosting can be used for both classification and regression problems. Gradient boosting is typically used with shallow decision trees as its weak learners. Each tree t_i is trained using the training data and is then added to the model together with a weight w_i that represents the accuracy. The residuals [31] are used to update the weights and thereby configure what the next tree should focus on. When the configured maximum number of trees have been trained, they are combined and the ensemble model is returned.
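A corresponding sketch for the gradient-boosted trees in Spark MLlib is shown below; again, the parameter values and column names are assumptions.

    # Illustrative Spark MLlib gradient-boosted trees configuration (placeholder values).
    from pyspark.ml.classification import GBTClassifier

    gbt = GBTClassifier(
        featuresCol="features",
        labelCol="label",
        maxIter=50,        # number of boosting rounds, i.e. trees added sequentially
        maxDepth=3,        # shallow trees act as the weak learners
        stepSize=0.1,      # learning rate applied to each new tree's contribution
        seed=42,
    )
    gbt_model = gbt.fit(train_df)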

2.12.3 Neural Network

The greatest strength of neural networks is that they can find and predict complex non-linear relationships in data. The cost of this high flexibility is that they are highly sensitive to noise and errors in the training data, which often results in a high variance. An article by Lars Kai Hansen and Peter Salamon shows evidence that an ensemble of neural networks performs better than a single neural network when trained using bagged subsets of the training data [32], and ensembles consisting of MLP neural networks will therefore be considered in this thesis.
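A single Spark MLlib MLP base learner, which the bagging and stacking sketches above could wrap, might be configured as follows; the layer sizes are assumptions, and the first layer must match the length of the assembled feature vector while the last must match the number of classes.

    # Illustrative Spark MLlib multilayer perceptron configuration (assumed layer sizes).
    from pyspark.ml.classification import MultilayerPerceptronClassifier

    mlp = MultilayerPerceptronClassifier(
        featuresCol="features",
        labelCol="label",
        layers=[77, 64, 32, 2],   # input layer, two hidden layers, binary output
        maxIter=200,
        blockSize=128,
        seed=42,
    )
    mlp_model = mlp.fit(train_df)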

2.13 Related work

This section presents and discusses related research. The related work has influenced the algorithms taken into consideration and the methods used for benchmarking their performance. In addition, the related research has acted as a source of validation for the values obtained during the thesis work.

Olgun Aydin and Seren Guldamlasioglu [33] investigated how to implement the Keras library on the distributed clustering platform of Spark. For this purpose they used Elephas, an extension that allows deep learning models built in Keras to run on Spark. They implemented an LSTM model with the purpose of predicting engine condition. The time frame in this study was set to "200 epochs" and they received an accuracy of 85%. No other metrics are disclosed; however, they present the conclusion that Spark provides a large-scale distributed data processing environment that is suitable for this kind of research and that an LSTM model is a promising alternative when creating a prediction system.

Chinedu et al. [34] described a generalized solution for fault detection on complex systems using SCADA data. The article describes how to process the SCADA data from the data acquisition step through data preparation, model training and model validation in order to create an Artificial Neural Network (ANN) that can predict future readings from a target component. The importance of data preparation is emphasised. The authors choose to remove low variance features, impute missing values by calculating the neighbor mean value and filter outliers by using the Interquartile Range Rule for Outliers. The models were built using a four-fold cross-validation ensemble method for ANNs. The conclusion is that the cross-validation ensemble technique is superior to the classic ANN when comparing predictive ability. The authors hypothesize that better results could be achieved using an 8-12-fold cross-validation ANN, but do not mention why they chose four folds for their solution. Nevertheless, this is convincing evidence that using k-fold cross-validation should be considered when working with data from the complex systems that SCADA systems usually monitor.

Leahy et al. [35] investigated how to build a predictive system for wind turbine fault detection using SCADA data and support vector machines. The produced models are able to predict an error up to 12 hours in advance for a specific failure. The authors achieve a very high recall, but express their concern with the poor precision metric of the models.


Author          Accuracy  Sensitivity  Specificity  Time frame
Kusiak et al.   76.50%    77.60%       75.70%       5 hours
Canizo et al.   82.04%    92.34%       60.58%       1 hour

Table 2.1: Related work metrics

They discuss this low precision and propose that a future study should consider the costs of false negatives versus false positives to find an optimal balance between precision and specificity. Feature extraction will therefore be explored, as will the relationship between accuracy metrics and maintenance costs.

Canizo et al. [36] describe a complete solution for predictive maintenance using HDFS and Spark, showing that such a solution can be built using only the Hadoop framework. The only algorithm considered in that study is the random forest algorithm, achieving an accuracy of 82.04%. It is worth noticing that the specificity was significantly lower compared to the related work by Kusiak et al. [37]. The metrics found in these two studies are presented in Table 2.1 together with the time frame in which the predictor is designed to operate. In addition, Canizo et al. performed some experimentation on the number of trees (N_trees) and the depth of the trees (Max_depth), with the conclusion that N_trees = 40 and Max_depth = 25 results in an optimal random forest algorithm. This holds, however, only under the condition that the SCADA data is not only very similar but also preprocessed in the same way, and it should therefore only be used as a starting point for a parameter analysis. Lastly, they present a hypothesis that the accuracy of the predictive model could have been improved if the data set had been balanced during the preprocessing step.
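A parameter analysis of the kind suggested above could start from the reported values using Spark's cross-validation utilities; the grid values and column names below are assumptions for the sketch.

    # Sketch of a parameter analysis around the reported numTrees/maxDepth values.
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    rf = RandomForestClassifier(featuresCol="features", labelCol="label")
    grid = (ParamGridBuilder()
            .addGrid(rf.numTrees, [20, 40, 80])
            .addGrid(rf.maxDepth, [10, 25])
            .build())
    cv = CrossValidator(estimator=rf,
                        estimatorParamMaps=grid,
                        evaluator=BinaryClassificationEvaluator(labelCol="label"),
                        numFolds=4)
    cv_model = cv.fit(train_df)     # returns the model refit with the best parameters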


3 Method

This chapter describes the stages that were followed during the thesis and how each stage was carried out. Before anything else, a literature study was made to determine the feasibility of the research questions and the current state-of-the-art knowledge in each area of interest. After the literature study, the steps described in Figure 2.7 were followed to ensure that no premature decisions that would have negative consequences in a consecutive step were made. Therefore, a period of data gathering took place to determine if sufficient data was available, followed by a period of going back and forth between data preparation and model planning. When the quality of the data and how to process the data to make it useful had been determined and the analytical plan was defined, the models were built and evaluated.

3.1 Data Discovery

Data discovery is the first step in any big data project. To find suitable data, a number of companies were contacted and sources for open access datasets such as World Bank Open Data, EU Open Data Portal, Data.gov, Kaggle and more were searched. The datasets found were compared and explored with the goal of finding a strategy for using the dataset in question to build a classification or regression model.

3.1.1 EDP Data Set

Energias De Portugal (EDP) provided a dataset that was used by the competitors during a wind turbine themed hackathon held in May 2019 named Wind Turbine Failure Detection. This dataset is now open access and was found to be suitable for evaluating software, methods and models for this thesis. The distinction between this and many other datasets was the presence of identifiable, distinct error codes, making it possible to add labelling columns such as remaining useful life that the algorithms can use for training and validation.

The dataset contains measurements from five turbines. Measurements have been recorded every ten minutes over the course of two years, 2016 and 2017.


Component          F/C  F in training  F in testing  T1  T2  T3  T4  T5
Gearbox            4    2              2             1   1   0   2   0
Generator          7    5              2             0   5   1   0   1
Generator Bearing  6    4              2             0   0   2   4   0
Transformer        3    2              1             1   0   2   0   0
Hydraulic Group    8    2              6             0   2   2   1   3
Failures/T         28   16             12            2   8   7   7   4

Table 3.1: Turbine failure summary

The data from 2016 is used as training data and the data from 2017 is used for testing purposes. The measurements are composed of 81 variables derived from sensors that monitor 12 components and environmental aspects. For details about the variables, sensors and components, please see the EDP Open Data website. The datasets are available for registered users on the Data page, while the description of the data can be found under Challenges and Wind Turbine Failure Detection.
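The year-based split described above can be expressed in PySpark roughly as follows; the file path and the "Timestamp" column name are assumptions about how the EDP signal files are stored.

    # Minimal sketch of the 2016/2017 train-test split (assumed path and column name).
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("edp-failure-prediction").getOrCreate()
    signals = spark.read.csv("wind-farm-signals.csv", header=True, inferSchema=True)

    train_df = signals.filter(F.year(F.col("Timestamp")) == 2016)   # 2016 used for training
    test_df = signals.filter(F.year(F.col("Timestamp")) == 2017)    # 2017 used for testing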

Failure Data Analysis

The error codes provided by EDP cover five components: gearbox, generator, generator bearing, transformer and hydraulic group. Table 3.1 summarizes the failures (F) per component (C) and per turbine (T), how the failures are divided between the training and the testing data, as well as the number of failures per turbine.

These error codes can be translated reasonably well to the subsystem downtime per turbine presented in Figure 2.2 in Section 2.3. When evaluating the EDP dataset downtime coverage, two assumptions were made: first, that the transformer is the error-prone component of the Electric System; secondly, that the downtime for both the generator and the generator bearing is included in Generator. Under these assumptions, the five components that the models are trained to predict are responsible for 62% of the average downtime, which was deemed sufficient coverage to continue working with this dataset.

Assuming that the models are more precise the more training data they have, the three models for the gearbox, transformer and hydraulic group components would perform equally well, given that they have two failure occurrences each in the training data. Further, the generator bearing predictor would perform slightly better and the generator predictor would perform the best. However, if the performance of the model is indeed related to the specific turbine that it has been trained on, a model will be limited to only predicting component failures where it has been trained on failure data in that very same turbine. The cells marked with yellow in Table 3.2 illustrate training data that is expected to be of no use and testing data where the models are expected to fail, while the cells marked with green illustrate training data that is expected to be useful and testing data where the models are expected to predict an error with high precision, according to this theory. Therefore, to prove that a model is able to generalize predictions across turbines, failures marked with yellow must be predicted.

3.2 Data Preparation

Data preparation is the process of transforming the raw data such that it can be used with machine learning algorithms, and it is the last step where any deficiencies in the data should be corrected or removed before planning and building the models. Jupyter Notebooks, Pandas and Matplotlib were used to visualize and explore the data.


Component     Training T1  Testing T1
Gearbox       1            0
Generator     0            0
Gen. Bearing  0            0
Transformer   0            1
Hyd. Group    0            0

(a) Turbine 1 failures with respect to training and testing

Component     Training T2  Testing T2
Gearbox       0            1
Generator     4            0
Gen. Bearing  0            0
Transformer   0            0
Hyd. Group    1            1

(b) Turbine 2 failures with respect to training and testing

Component     Training T3  Testing T3
Gearbox       0            0
Generator     0            1
Gen. Bearing  1            1
Transformer   2            0
Hyd. Group    0            2

(c) Turbine 3 failures with respect to training and testing

Component     Training T4  Testing T4
Gearbox       1            1
Generator     0            0
Gen. Bearing  2            1
Transformer   0            0
Hyd. Group    0            1

(d) Turbine 4 failures with respect to training and testing

Component     Training T5  Testing T5
Gearbox       1            0
Generator     1            0
Gen. Bearing  1            0
Transformer   0            1
Hyd. Group    1            2

(e) Turbine 5 failures with respect to training and testing

Table 3.2: Baseline theory about turbine failures with respect to training and testing data

3.2.1 Cleaning

First, the data was analyzed with regard to outliers, null values and the variance of the features.

Outliers

The data was found to be of high quality with few apparent outliers. The available choices are to drop, keep or transform an outlier. Since an outlier in this dataset may be either the result of a temporary malfunction in a sensor or a correct value caused by a component malfunction, there is a chance that similar cases will occur in the test data. Therefore, all outliers were kept in the dataset for the first model evaluation.

Null Values

Null values have to be handled, since the algorithms do not know how to handle missing values and will throw errors if null values are present. Out of 521,784 measurements, 7 contained null values. Due to the large amount of available correct data, these measurements were removed from the dataset.
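In Spark, this can be done with a single call on the DataFrame holding the measurements; the sketch below assumes such a DataFrame has already been loaded as measurements_df.

    # Drop every row that contains at least one null value (measurements_df is assumed loaded).
    clean_df = measurements_df.na.drop()
    removed = measurements_df.count() - clean_df.count()
    print(f"Removed {removed} rows containing null values")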

Low Variance Features

Low variance features were removed. The standard deviation for each feature was compared against a threshold of 0.1 to determine which features to remove. Four features were removed from the dataset: Grd_Prod_CosPhi_Avg, Grd_Prod_Freq_Avg, Prod_LatestAvg_ActPwrGen2 and Prod_LatestAvg_ReactPwrGen2. By performing three PCA, creating 5, 25 and 50
