REAL-TIME PREDICTION OF SHIMS DIMENSIONS IN POWER TRANSFER UNITS USING MACHINE LEARNING

Academic year: 2021


Mälardalen University, Västerås, Sweden

Thesis for the Degree of Master of Science in Engineering - Robotics

30.0 credits

REAL-TIME PREDICTION OF SHIMS

DIMENSIONS IN POWER TRANSFER

UNITS USING MACHINE LEARNING

Rasmus Blomstrand

rbd12001@student.mdh.se

&

Daniel Jansson

djn13003@student.mdh.se

Examiner: Mobyen Uddin Ahmed

Mälardalen University, Västerås, Sweden

Supervisors: Shaibal Barua & Shahina Begum

Mälardalen University, Västerås, Sweden

Company supervisor: Manasi Jayapal, GKN Automotive Köping AB


Abstract

Since the creation of assembly lines at the beginning of the 20th century, easing the work done in the manufacturing process has been a constant goal for many factories. Coming into the 21st century, a further step has been taken towards automating manufacturing processes to minimise the risk of human error and of injuries in these factories. Among all the different areas of manufacturing, one company focuses on the manufacturing of Power Transfer Units. The Power Transfer Units are manufactured at a rate such that a fault rate of 1% would amount to up to 12000 faulty units a year. This thesis proposes a method to reduce the number of faulty Power Transfer Units by having Machine Learning methods predict, through regression, new values for the computation of the shims which align the gears in the Power Transfer Units, within the Real-Time requirements specified by the company. Additionally, this thesis proposes Machine Learning methods which can classify the given data in an attempt to help the company foresee faulty Power Transfer Units before they are assembled, so that the company can automate its assembly line. A Graphical User Interface is presented to make the models implemented in this thesis easier to use. This thesis presents evidence and validation that these methods can make increased automation achievable for the company, and it lays the groundwork for future research and development of automated manufacturing processes for Power Transfer Units.


Acknowledgements

The authors would like to extend their gratitude to the thesis supervisors, Shaibal Barua and Shahina Begum, and the thesis examiner, Mobyen Uddin Ahmed, for their guidance. The authors would also like to give a big thank you to GKN Automotive Köping AB for the opportunity to work with them on this project, and especially to the supervisor at the company, Manasi Jayapal. This thesis would not have been possible without any of the people mentioned, who gave their help and shared their knowledge to improve the quality of this thesis.


Abbreviations

AE Absolute Error. 17
ANN Artificial Neural Network. 7, 8, 10, 11, 13–15, 19, 29–33, 37–41, 54
AUC Area Under Curve. 16, 26–29, 34, 35, 41
AUROC Area Under the Receiver Operating Characteristic Curve. 10, 11, 16, 17, 26, 33
BN Bayesian Network. 11
CBFS Correlation-based Feature Selection. 6, 10
DT Decision Tree. 8, 10, 11, 14, 20, 24–27, 33, 34, 40–42, 52, 53, 56, 59
FCBF Fast Correlation-Based Filter. 6, 10
GUI Graphical User Interface. iii, 2, 21
IBL Instance-Based Learning. 11
kNN k-Nearest Neighbour. 8, 9, 11, 14, 20, 21, 24, 26, 28, 29, 33, 35, 36, 40–42, 52, 53
MAE Mean Absolute Error. 10, 17
MAPE Mean Absolute Percentage Error. 17
MATLAB Matrix Laboratory. 10, 12, 19, 29, 33, 40, 42
MCC Matthews Correlation Coefficient. 10, 11
ML Machine Learning. iii, 1–12, 14, 15, 19, 26, 42–44, 51
MSE Mean Square Error. 10, 11, 17, 18
NB Naive Bayes. 9, 11, 14, 52, 53
PCA Principal Component Analysis. 5, 6, 10, 11, 13, 43
PLS Partial Least Squares. 9, 15, 29, 31, 32, 37–39, 41
PTU Power Transfer Unit. iii, 1–4, 10, 18–22, 24, 26, 29, 33, 37, 40–44
R-T Real-Time. iii, 1, 4, 9, 14, 41, 44
RF Random Forest. 8, 10, 11, 14
RMSE Root Mean Square Error. 10, 17, 18
ROC Curve Receiver Operating Characteristic Curve. 27–29, 34–36
S1 Station considering pinion shim. 3, 18–26, 29–33, 39–42, 52, 53, 56–58
S2 Station considering tube shaft shims. 3, 19–22, 33–35, 37–42, 59–61
SMOTE Synthetic Minority Over-sampling Technique. viii, 6, 13, 19, 20, 23, 24, 33, 40, 43
SSE Sum of Square Error. 17, 18
SSR Sum of Squares due to Regression. 17, 18
SST Total Sum of Squares. 17, 18
SVM Support Vector Machines. 9–11, 14, 15, 19, 20, 24–27, 33–35, 40–42, 52–54, 57, 60
SVR Support Vector Regression Machines. 9–11, 15, 29–32, 37, 39, 41, 58, 61


Table of Contents

1 Introduction 1

1.1 Problem formulation . . . 1

1.1.1 Hypothesis . . . 1

1.2 Research questions . . . 2

1.3 Report content and structure . . . 2

2 Background 3

2.1 Preprocessing . . . 4

2.1.1 Data cleaning . . . 5

2.1.2 Feature selection . . . 5

2.1.3 Feature extraction . . . 6

2.1.4 Sampling fraction . . . 6

2.2 Classification & Regression . . . 6

2.2.1 Artificial Neural Networks . . . 7

2.2.2 Bagging and Boosting . . . 7

2.2.3 Decision Trees . . . 8

2.2.4 k-Nearest Neighbour . . . 8

2.2.5 Naive Bayes . . . 9

2.2.6 Partial Least Squares . . . 9

2.2.7 Support Vector Machines . . . 9

2.3 Validation and evaluation . . . 10

2.4 State Of The Art . . . 10

3 Method 12

3.1 Software . . . 12

3.2 Preprocessing . . . 13

3.2.1 Data Cleaning . . . 13

3.2.2 Feature Selection . . . 13

3.2.3 Feature Extraction . . . 13

3.2.4 Sampling fraction . . . 13

3.3 Classification . . . 13

3.3.1 Artificial Neural Network . . . 13

3.3.2 Decision Trees . . . 14

3.3.3 k-Nearest Neighbour . . . 14

3.3.4 Naive Bayes . . . 14

3.3.5 Support Vector Machine . . . 14

3.4 Regression . . . 14

3.4.1 Artificial Neural Network . . . 14

3.4.2 Partial Least Squares . . . 15

3.4.3 Support Vector Machine . . . 15

3.5 Validation and evaluation . . . 15

3.5.1 Classification . . . 15

3.5.2 Regression . . . 17

4 Implementation 19

4.1 Preprocessing . . . 19

4.2 Classification . . . 19

4.3 Regression . . . 20

4.4 GUI . . . 21


5 Evaluation and Results 22

5.1 Station considering pinion shim . . . 22

5.1.1 Preprocessing . . . 23

5.1.2 Classification . . . 24

5.1.3 Regression . . . 29

5.2 Station considering tube shaft shims . . . 33

5.2.1 Preprocessing . . . 33

5.2.2 Classification . . . 33

5.2.3 Regression . . . 37

6 Discussion 40

6.1 Implementation and Results . . . 40

6.2 Limitations . . . 41

6.3 Future Work . . . 42

7 Conclusion 43

7.1 Research questions . . . 43

7.2 Hypothesis . . . 44

8 Ethics 45

References 46

Appendix A Tables on Classification results 51

Appendix B Tables on Optimisation results 55

Appendix C Code 62

Appendix D GUI 65

4.1 Live . . . 65


List of Figures

1 Main housing of PTU . . . 3

2 Shim locations in PTU . . . 4

3 Figure depicting a simple version of an Artificial Neural Network . . . 7

4 Figure depicting a simple version of a Decision Tree . . . 8

5 Figure depicting the process of this project . . . 12

6 Confusion matrix . . . 16

7 Graph depicting an AUROC . . . 17

8 Training model . . . 20

9 Implementation model . . . 21

10 Scatter plots of inputs in S1 of good and bad PTUs . . . 22

10a All samples as good and bad . . . 22

10b Only good and samples with error = -1 . . . 22

10c Only good and samples with error = -2 . . . 22

10d Only good and samples with error = -3 . . . 22

11 Scatter plots of inputs in S1 with adjustment factor . . . 23

11a All samples as good and bad . . . 23

11b Only good and samples with error = -1 . . . 23

11c Only good and samples with error = -2 . . . 23

11d Only good and samples with error = -3 . . . 23

12 Scatter plot of original dataset and dataset with synthetic samples . . . 23

12a All samples as good and bad . . . 23

12b All samples after SMOTE . . . 23

13 Scatter plot of original dataset and dataset with synthetic samples, with adjustment added as a third dimension . . . 24

13a All samples as good and bad . . . 24

13b All samples after SMOTE . . . 24

14 Cross-validation loss for optimisation of ensemble methods on S1 . . . 25

15 Cross-validation loss for optimisation of SVM on S1 . . . 25

16 ROC showing the performance of Bagged DTs as classifier . . . 27

17 ROC showing the performance of SVM with Gaussian kernel as classifier . . . 27

18 ROC showing the performance of Fine kNN as classifier . . . 28

19 ROC showing the performance of Subspace kNN as classifier . . . 28

20 ROC showing the performance of Weighted kNN as classifier . . . 29

21 Error Histogram for the ANN . . . 29

22 Figure depicting the performance for the ANN model on the S1 data . . . 30

23 Figure showing the training state for the ANN model on the S1 data . . . 30

24 Cross-validation loss for optimisation of SVR on S1 . . . 30

25 Comparison of relationships between shim thickness difference based on bad, good and predicted, with the ANN model, adjustment values for S1 . . . 31

26 Comparison of relationships between shim thickness difference based on bad, good and predicted, with the PLS model, adjustment values for S1 . . . 31

27 Comparison of relationships between shim thickness difference based on bad, good and predicted, with the SVR model, adjustment values for S1 . . . 32

28 The cross-validation loss from each iteration of ensemble methods on S2 . . . 33

29 The cross-validation loss from each iteration of SVM on S2 . . . 34

30 ROC showing the performance of Bagged DTs as classifier . . . 34

31 ROC showing the performance of SVM with Gaussian kernel as classifier . . . 35

32 ROC showing the performance of Fine kNN as classifier . . . 35

33 ROC showing the performance of Subspace kNN as classifier . . . 36

34 ROC showing the performance of Weighted kNN as classifier . . . 36

35 Cross-validation loss for optimisation of SVR on S1 . . . 37

36 Comparison of relationships between shim thickness difference based on bad, good and predicted, with an ANN model, adjustment values in S2 . . . 38


37 Comparison of relationships between shim thickness difference based on bad, good and predicted, with a PLS model, adjustment values in S2 . . . 38

38 Comparison of relationships between shim thickness difference based on bad, good and predicted, with an SVR model, adjustment values in S2 . . . 39

39 The live tab of the program . . . 65

40 The data analysis tab of the program . . . 66


List of Tables

1 The five top performing algorithms on the synthesised training data in S1 . . . 24

2 Resulting optimal parameters for SVM and ensemble methods on the synthetic training data of S1 . . . 26

3 Results for the trained ML models on the test data from S1 . . . 26

4 Run-time of regression on S1 . . . 32

5 Resulting optimal parameters for SVM and ensemble methods on DTs for the synthetic training data of S2 . . . 34

6 Results for classifying algorithms on the test data from S2 . . . 35

7 Resulting optimal parameters for SVR on S2 . . . 37

8 Run-time of regression on S2 . . . 39

9 Results for classifying algorithms on S1 training data . . . 52

10 Results for classifying algorithms on synthesised S1 training data . . . 53

11 Results for ANN with different learning algorithms, and for fine Gaussian SVM when kernel scale mode is set to auto instead of manual . . . 54

12 Results from optimising Ensemble Method DTs on the synthetic training data from S1 . . . 56

13 Results from optimising SVM on the synthetic training set from S1 . . . 57

14 Results from optimising SVR on the training set from S1 . . . 58

15 Results from optimising Ensemble Method DTs on the synthetic training data from S2 . . . 59

16 Results from optimising SVM on the synthetic training data from S2 . . . 60


1 Introduction

In factories which focus on the manufacturing of products, there are often one or more assembly lines which assemble pieces of the product, resulting in the final product at the end of the line. In the time since the first moving assembly line, created by Henry Ford in 1913, the automation of assembly lines has become an increasingly desirable option for industries, since it removes human error and lowers the risk of injuries among workers, and it has led to new technological revolutions within industry ever since. The latest of these revolutions is known as ”The 4th industrial revolution”1. Within the 4th industrial revolution there is a concept called ”Industry 4.0”, which has become more and more popular among manufacturing factories. The concept of ”Industry 4.0” covers many different topics, but the one of interest for this project is ”smart manufacturing”[1]: having the processing phase automated, with each object in that phase interacting with the others, and with close to no human interaction. In order to bring a manufacturer into Industry 4.0, each step along the assembly line has to be automated[2]. For this to happen, one has to understand what happens at each part of the assembly line and what data is being produced. To understand the data generated on the assembly line, it has to be saved over time, and this leads to large amounts of data, also called ”big data”. To gain any information out of the data, different methods can be used. One example of extracting knowledge from the data is to mine it, as in ”data mining”[3]. Data mining is the process of discovering patterns in a big data set with the help of Machine Learning (ML), pattern recognition or statistical analysis, to mention a few.

In this report, a project has been conducted to help a company which manufactures drivelines for vehicles to classify good and bad Power Transfer Units (PTUs), based on the dimensions of the PTU housing and the adjustment value used for calculating the shim thickness. If a PTU is classified as bad, regression is used to predict a new adjustment value. If successful, this will help the company take a step towards automating its assembly line, so that the level of human interaction along these assembly lines is reduced, leading to less risk of human error.

1.1 Problem formulation

At the company, multiple contact sensors are used along the assembly line to check that the dimensions of their products are within tolerances. As of right now, the data from control points on the assembly line goes into a computer system and is used in mathematical formulas to calculate three different shim thicknesses, and physical shims are chosen accordingly. Occasionally the chosen shims will not be within the tolerated limits; a worker then has to manually alter a compensation factor to get a new shim thickness that is within the tolerated values. This complete manufacturing process has to be done within 62 seconds, otherwise the process is slowed down. If a shim is chosen and it turns out that the final PTU is faulty, then the PTU has to be disassembled and the correct shim has to be inserted. This creates some data that is incomplete and thus may have to be removed before the data is used to train and test the algorithms. Therefore, in this thesis different ML models are compared to see which one is best for the task of first classifying the given inputs to determine whether they will produce a correct or a faulty PTU. If the PTU is deemed faulty, a regression model predicts which compensation factor will give the right dimension of a shim, within the given time frame of 62 seconds. The given data contains both good and bad data, meaning both PTUs that are deemed good and those that are deemed faulty. To get the best results out of these models it is important for the data to have good quality, so preprocessing methods have to be developed as well.
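The 62-second cycle constraint can be made concrete with a simple timing check around a prediction call. The sketch below is illustrative only (Python here, while the models in this thesis are built in MATLAB), and `within_realtime_budget` is a made-up helper, not part of the company's system:

```python
import time

def within_realtime_budget(predict, x, budget_s=62.0):
    """Check that a single prediction call completes within the station's cycle budget."""
    start = time.perf_counter()
    predict(x)
    elapsed = time.perf_counter() - start
    return elapsed <= budget_s

# A trivial stand-in model; a real model would be a trained classifier/regressor
ok = within_realtime_budget(lambda x: sum(x), [1.0, 2.0, 3.0])
```

In practice the budget for the ML step would be only a fraction of the 62 seconds, since measuring and mechanical assembly consume most of the cycle.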

1.1.1 Hypothesis

The hypothesis is that it is possible to implement both a classification and a regression model based on ML that can compete with the currently used model, while still meeting the Real-Time (R-T) requirements. Accuracy and time are crucial evaluation metrics, since the system depends on the models being optimised for both.


1.2 Research questions

The research questions set out to be answered in order to test this hypothesis are as follows:

• RQ1: Which features are important from the measurement data of the PTUs?

• RQ2: Which ML methods are best suited to classify on the given data?

• RQ3: How could ML methods be implemented to predict the shim dimensions?

• RQ4: Can the ML methods predict a better adjustment value than what has previously been chosen?

1.3 Report content and structure

In this thesis, different ML algorithms are used to classify the condition of a PTU from the given data, and ML models are validated for the purpose of predicting a new adjustment value for the calculation of a new shim thickness. A Graphical User Interface (GUI) is also created to make the prediction program more intuitive to use. A method is presented on how the problem was tackled and how each model created for each task is validated and evaluated. At the end of the report, the results from both the classification and the prediction are presented, followed by a discussion and a conclusion revolving around the results and the thesis as a whole.


2 Background

Around the turn of the 19th to the 20th century, big steps were taken within industry as the technology moved on from Industry 1.0. This is known as the industrial revolution, a step that meant that industries went from hand production to mechanical machines[4]. Industry 2.0 was introduced by the beginning of the 20th century, and by the last decades of that century Industry 3.0 was introduced. The automation of factories necessitated sensors; over time these sensors can produce huge sets of data which can be used for optimisation, and thus several inventive ways of processing the data were implemented, among them ML algorithms[5].

At a company that manufactures and assembles drivelines for vehicles, multiple measurement points have been installed to collect data on each part, to ensure that these parts hold up to the expectations of their customers. With this data, relationships between measurements can possibly be found through analysis of the data.

One of the parts which the company assembles is the PTU; these are the units which, as the name suggests, transfer the power from the transmission in the front of the vehicle to the axle in the rear. This is made possible through two gears, a pinion and a tube shaft, which can be seen in Figures 1 and 2. In Figure 1 one can see the pinion sticking up from below. This gear is connected to the tube shaft so that the gears align, and depending on the position of both gears the efficiency of the PTU can differ. If not positioned correctly, the gears can cause vibrations and other unwanted problems which wear on the PTU and other parts of the car over time. To prevent this, shims are used to calibrate the position so that the gears align as well as possible.

Figure 1: Main housing of the PTU seen through the tube shaft, the pinion can be seen at the bottom. Courtesy of GKN Automotive Köping AB.

To get the right thickness of the shim, a mathematical formula has been derived which, with the help of measurements and tolerances, calculates which shim thickness is the optimal one for the parts being measured. At present, a compensation factor is manually input by an operator if the PTU is not within tolerances. This factor makes sure that the formula calculates the correct thickness of the shim. This procedure takes place at two different stages along the assembly line, at the Station considering pinion shim (S1) and at the Station considering tube shaft shims (S2), where the first shim changes the tolerances for the next shims. If a shim were chosen such that the final result lies outside the tolerances, the final product would be faulty and must thus be disassembled to be re-assembled at a later time. This leads to higher costs due to an increase in work hours and waste of material, e.g. if the PTU has been filled with oil, since it then has to be drained of the oil and the parts cleaned for later use. According to the company, about 2–3% of the PTUs have to be re-assembled each day, where a 1% fault rate can mean up to 12000 units a year.


Figure 2: Shim locations in PTU. Courtesy of GKN Automotive Köping AB.

The idea is to improve this process by having an ML algorithm aid the mathematical model in its calculations, so that the final outcome can be predicted at an earlier stage of production and this information can be used to automate the calculations. This is to prevent the occurrence of faulty products, or at least lower the rate of faulty products compared to the current state of the mathematical model. For the algorithm to be able to predict these outcomes, it has to process the data given from the assembly line to find a model which can predict the correct shim thickness, one which lies within the tolerances.

One way of using ML algorithms is to discover patterns in the data, i.e. data mining[3]. This is useful for predicting the outcome of a given set of data, which is important in an industry where cost, in the form of materials and work hours, and effectiveness play a huge role. If the computed model is accurate enough in its predictions, it can prevent faulty products and thus save the factory a lot of time and money which would otherwise go to either throwing away a batch of faulty products or disassembling them to re-assemble them at a later stage.

Predicting the outcome of PTUs from a set of data with high accuracy is not the only criterion a factory has. Since time is of the essence on an assembly line, the prediction should also be calculated within a certain time frame, so that the assembly line does not end up slower than it was before the implementation of the algorithm. Thus it is necessary for the algorithm to compute the prediction within the R-T requirements, so that no downtime occurs because of computation on the data.

2.1 Preprocessing

To get an ML algorithm to perform better, it is necessary to have clean data and to select or extract useful features from the dataset. This can improve efficiency and accuracy[6]. Feature selection can be done manually by people with expert knowledge of the given data[7]. It can also be done in an automated, algorithmic way, which will be presented in this section along with feature extraction and the differences between selection and extraction.


2.1.1 Data cleaning

Data read from datasets can contain a number of different faults or inconveniences, especially data collected with sensors, where dirt and other things can produce strange values. Thus, it is important that the data is clean enough that minor deviations in the data do not affect the ML algorithms too much.

Outliers

The data can have outliers, which are values that are unreasonably low or high and hence out of bounds. In [8] and [9], statistical methods are proposed for outlier detection: by checking the mean, standard deviation and range, out-of-bounds values can be identified. Such data can then be deleted or replaced with an average value. The papers also mention that clustering algorithms can be used to see which values fall outside the pattern, and that violations of association rules can be used as well.
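As an illustration of the mean/standard-deviation approach described above, the following Python sketch (illustrative only; the thesis implementation uses MATLAB, and the function name and readings are made up) flags values lying more than a chosen number of standard deviations from the mean:

```python
def flag_outliers(values, n_sigma=3.0):
    """Flag values lying more than n_sigma standard deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [abs(v - mean) > n_sigma * std for v in values]

# 55.0 stands in for a sensor glitch (e.g. dirt on a contact sensor)
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 55.0]
flags = flag_outliers(readings, n_sigma=2.0)
```

Flagged values can then be dropped or replaced with an average, as the cited papers suggest.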

Missing data

Missing data is another problem to take into consideration; when data is missing, empty fields will be encountered in the datasets, which is something not all ML algorithms can deal with[10]. The problem of missing data can be tackled in a few different ways. Two papers that go far in listing ways of handling missing data are [11,12]. The first method for solving this is to delete the sample with the missing value, which works with large datasets. There is also a pairwise deletion method, which excludes a sample only when its missing value would actually be used; as long as only the other variables in the sample are used, it is left untouched.

With model-based replacement methods, the missing values can be replaced with values predicted from the distribution of observations. Finally, ML models, both unsupervised and supervised, can also be used for dealing with missing data.
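The deletion and replacement strategies above can be sketched as follows (illustrative Python; `listwise_delete` and `mean_impute` are made-up names, with `None` standing in for an empty field):

```python
def listwise_delete(samples):
    """Drop any sample containing a missing (None) field."""
    return [s for s in samples if None not in s]

def mean_impute(samples):
    """Replace each missing field with the column mean of the observed values."""
    cols = list(zip(*samples))
    means = [sum(v for v in col if v is not None) / sum(v is not None for v in col)
             for col in cols]
    return [[means[j] if v is None else v for j, v in enumerate(s)] for s in samples]

data = [[1.0, 2.0], [None, 4.0], [3.0, None]]
```

Mean imputation is one of the simplest model-based replacements; distribution-based or ML-based imputation follows the same pattern with a better estimate than the column mean.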

Scales

Having scales that differ between variables can favour one variable over another; this is something commonly solved by normalisation or standardisation. In a paper written by Zhiqiang Ge et al.[13], normalisation of the values in the dataset is mentioned as an important step, at least in the case of Principal Component Analysis (PCA) modelling: if the variables have different scales, due to different absolute values, the model can become inclined towards one variable, meaning that the variables are not equally considered. However, this is also said to not be true for all cases; some models might need different scales of variables.
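A common way to put variables on a comparable scale is z-score standardisation, sketched below (illustrative Python, not the thesis implementation; the millimetre column is made up):

```python
def standardise(column):
    """Rescale a variable to zero mean and unit variance (z-score)."""
    n = len(column)
    mean = sum(column) / n
    std = (sum((v - mean) ** 2 for v in column) / n) ** 0.5
    return [(v - mean) / std for v in column]

# A hypothetical housing-dimension column in millimetres
mm = standardise([120.0, 121.0, 119.0])
```

After standardisation every variable contributes on the same scale, which is the property [13] calls for before PCA.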

Noise

When dealing with sensor data in a process environment, with data generated by electronic sensors, noise can be induced on the signal by the environment. There are mainly two types of filters that can handle this: model-based and data-driven filters. Among the model-based filters the Kalman filter is the most widely used, and the model-based category also includes e.g. digital filters[11]. In [12], binning, clustering and regression are named as methods for handling noisy data.
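A minimal example of a data-driven smoother is a moving average, sketched here in Python for illustration (the Kalman and digital filters mentioned above are more involved; the signal values are made up):

```python
def moving_average(signal, window=3):
    """Smooth a sensor signal with a simple moving-average filter.
    Edges use a truncated window so the output has the same length."""
    half = window // 2
    out = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out

# A noise spike (5.0) in an otherwise steady signal gets attenuated
smoothed = moving_average([1.0, 1.2, 5.0, 1.1, 0.9], window=3)
```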

2.1.2 Feature selection

Feature selection reduces dimensionality by finding the best subset of the original features; this can make the interpretation of an algorithm's outputs easier to understand than the transformed features brought up in the next section[6,14]. Feature selection is mainly done with a supervised learning algorithm, and the techniques used are divided into wrapper, filter and embedded approaches.

Wrappers

Wrapper methods choose sets of relevant features using the learning algorithm itself. They do this by searching for the feature subset that maximises the predictive accuracy of classifier models built on feature subsets in the feature space[6]. Different search strategies can be used in wrapper methods, some of which are hill-climbing, best-first, branch-and-bound and genetic algorithms[14]. The performance can be assessed with a validation set or cross-validation. Wrappers are said to achieve better predictive accuracy but are more computationally expensive than filters. The combinatorial possibilities of wrappers make them infeasible for use on large-scale problems[15].


Filters

Filters have gained support in many real-world applications due to the wrappers' high computational cost[15]. Filters try to remove irrelevant features before the application of learning algorithms[6]. The typical filtering algorithm can be divided into two steps. First, features are ranked on a criterion, evaluated in a univariate or multivariate scheme, where univariate means individual ranking and multivariate means ranking in batch[14]. In the second step, the highest-ranked features are used to induce classification models. Performance criteria used for filters are e.g. the Fisher score, information gain methods and ReliefF. In [6] the proposed techniques are the Chi-Square measure, Information Gain and Odds Ratio. Correlation-based Feature Selection (CBFS) and Fast Correlation-Based Filter (FCBF) are two filter techniques that are popular according to [15].
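A toy version of the univariate filter scheme ranks each feature by the absolute Pearson correlation with the target, one of many possible ranking criteria. The sketch below is illustrative Python; the feature names and data are made up:

```python
def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def rank_features(features, target):
    """Univariate filter step: rank features by |correlation| with the target."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)

feats = {"housing_depth": [1.0, 2.0, 3.0, 4.0],
         "noise_channel": [0.3, 0.1, 0.4, 0.2]}
order = rank_features(feats, target=[2.1, 4.0, 6.2, 7.9])
```

The second filter step would then train a classifier on only the top-ranked features.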

Embedded

Embedded methods fill the space between filters and wrappers[15] by combining the advantages of both: the wrappers' interaction with the classification model, and the filters' far lower computational cost[14].

2.1.3 Feature extraction

Feature extraction is part of the feature transformation family, where the original features are transformed into a new, reduced set of features with lower dimensionality[6,14]. Feature generation is the other part of this family, but since extraction is the most widely used in recent work, feature generation will not be included. Feature extraction is mainly done with unsupervised learning.

Principal Component Analysis (PCA)

PCA is an unsupervised learning algorithm that uses an orthogonal transformation in a statistical procedure to convert correlated variables into linearly uncorrelated variables (principal components)[13,16]. The first principal component has the highest variance; the succeeding components then have less and less variance, but each has the highest possible variance among the components orthogonal to the preceding ones.
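For two-dimensional data the first principal component can be computed directly from the 2x2 covariance matrix, using the closed-form orientation of the major axis of a symmetric 2x2 matrix. The sketch below is illustrative Python with made-up data (a general PCA would use an eigen- or singular-value decomposition):

```python
import math

def pca_2d(points):
    """First principal component of 2-D data: the unit direction of highest variance,
    from the 2x2 covariance matrix [[sxx, sxy], [sxy, syy]]."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Orientation of the leading eigenvector of a symmetric 2x2 matrix
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (math.cos(theta), math.sin(theta))

# Correlated data spread mainly along the line y = x
pc1 = pca_2d([(0, 0), (1, 1.1), (2, 1.9), (3, 3.05)])
```

Projecting the data onto `pc1` keeps the direction with the most variance, which is exactly the dimensionality reduction described above.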

2.1.4 Sampling fraction

When sampling data there is a possibility that one class gets sampled more than the other, and this makes classifier algorithms biased towards the class with more samples[17].

SMOTE

To avoid one class becoming overrepresented, a technique called Synthetic Minority Over-sampling Technique (SMOTE) can be used to make the ratio between the classes more even. This is done by taking the k nearest neighbours, where k is an integer, of a sample in the minority class and creating synthetic samples with attributes similar to the real sample. This procedure is repeated until an acceptable ratio is achieved[17].
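The neighbour-interpolation step of SMOTE can be sketched as follows (illustrative Python with a made-up minority class; a real implementation would also decide how many samples are needed to balance against the majority class):

```python
import random

def smote(minority, k=2, n_new=4, seed=0):
    """Minimal SMOTE sketch: pick a minority sample, pick one of its k nearest
    minority neighbours, and interpolate a synthetic sample between them."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        dists = sorted((sum((a - b) ** 2 for a, b in zip(base, other)), other)
                       for other in minority if other is not base)
        neighbour = rng.choice(dists[:k])[1]
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

minority_class = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3)]
new_samples = smote(minority_class)
```

Because each synthetic point lies on a segment between two real minority samples, it stays inside the region the minority class already occupies.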

2.2 Classification & Regression

Predicting the future is of interest in many areas, since if one knows the future, one can prevent accidents from occurring. This applies especially well in the realm of industry, where a single fault, whether in manufacturing or in personal safety, can lead to high costs for the company. With industries moving into Industry 4.0, prediction on the data generated at each point along the assembly line is even more critical. If one can predict the outcome at the first step of the assembly line, a lot of time and material can be saved.

In this section a number of ML methods will be presented, along with descriptions of how they are used for classification and regression.


2.2.1 Artificial Neural Networks

Artificial Neural Networks (ANNs) are algorithms structured like the human brain, where the most basic network consists of three layers: an input, an output and a hidden layer. Each layer consists of nodes, similar to the neurons in the brain. The connections between neurons are associated with weights simulating the neural pathways of the brain; the varying values of the weights express how strong the connections are. An example of an ANN can be seen in Figure 3, where the weights are denoted w1 to w6 and wh1.

The most common method for training these networks and assigning the correct weights is back propagation[15]. The idea is that the weights are first set to arbitrary values and the inputs are sent through the ANN. These values are combined through a certain transfer function in each node, and the result is sent on towards the output layer. After the output node, an error is calculated, and the weights closest to the output node are changed according to how much each of them contributed to the error. After these weights are changed, the weights attached to the hidden nodes are changed, and then the weights going into the hidden layer. This procedure repeats for a number of iterations chosen by the user, either a set number of iterations or iterating until the error falls under a certain threshold.
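A minimal back-propagation loop for a small 2-2-1 network (tanh hidden layer, linear output, squared error) might look like the Python sketch below. It is illustrative only, not the MATLAB-based networks used later in this thesis, and the training data is a made-up function of two inputs:

```python
import math, random

def train_ann(samples, epochs=200, lr=0.1, seed=1):
    """Tiny 2-2-1 network trained with backpropagation.
    Returns (initial_loss, final_loss, predict)."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(2)]  # input -> hidden
    b1 = [0.0, 0.0]
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(2)]                      # hidden -> output
    b2 = 0.0

    def predict(x):
        h = [math.tanh(w1[j][0] * x[0] + w1[j][1] * x[1] + b1[j]) for j in range(2)]
        return sum(w2[j] * h[j] for j in range(2)) + b2, h

    def loss():
        return sum((predict(x)[0] - t) ** 2 for x, t in samples) / len(samples)

    first = loss()
    for _ in range(epochs):
        for x, t in samples:
            y, h = predict(x)
            err = y - t
            for j in range(2):
                # Gradient flowing back through hidden node j (tanh derivative)
                grad_h = err * w2[j] * (1 - h[j] ** 2)
                w2[j] -= lr * err * h[j]          # output-layer weight update
                for i in range(2):
                    w1[j][i] -= lr * grad_h * x[i]  # hidden-layer weight update
                b1[j] -= lr * grad_h
            b2 -= lr * err
    return first, loss(), predict

# Made-up target: y = x0 + x1
data = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 2.0)]
before, after, net = train_ann(data)
```

The loop mirrors the description above: forward pass, error at the output, then weight corrections propagated backwards layer by layer, repeated for a fixed number of iterations.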

Figure 3: Figure depicting a basic version of an ANN with two input nodes and two hidden nodes.

2.2.2 Bagging and Boosting

Bootstrap aggregating (Bagging) and Boosting are two meta-algorithms which can be used to improve the accuracy and robustness of weak ML algorithms[18–20]. If the learning algorithm is already stable, using bagging might not always be desirable, since in some cases it might worsen the performance of the algorithm[21].

Bagging revolves around letting a learning algorithm train on a replacement set Di, where i is the iteration number, drawn from the complete dataset D with d tuples, and then predict the value of an unknown tuple X. Di can contain the same original tuples as D, so that Di = D, but it can also contain duplicates of some tuples instead of containing all individual tuples. When the algorithm has predicted X for all iterations, the predictions are counted, and the prediction with the most counts is considered the most accurate, so X gets that prediction[3]. In boosting, each training tuple is assigned a weight. These weights are updated after an algorithm has trained on them, so that the next algorithm trained can be more cautious about the tuples which the previous algorithm had problems classifying. The final algorithm then combines the predictions from each algorithm, since the weight of each algorithm's prediction is bound to its accuracy[3]. A popular form of boosting algorithm is the Adaptive Boosting (AdaBoost) algorithm[22,23].
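The bootstrap-and-vote procedure of bagging can be sketched as follows (illustrative Python; the 1-nearest-neighbour learner and the data are made up to keep the example small):

```python
import random
from collections import Counter

def bagged_predict(dataset, train, x, rounds=15, seed=0):
    """Bagging sketch: train one model per bootstrap replicate D_i of D,
    then take the majority vote of their predictions for x."""
    rng = random.Random(seed)
    votes = []
    for _ in range(rounds):
        boot = [rng.choice(dataset) for _ in dataset]  # sample with replacement
        model = train(boot)
        votes.append(model(x))
    return Counter(votes).most_common(1)[0][0]

# A deliberately weak learner: 1-nearest-neighbour on the bootstrap replicate
def train_1nn(samples):
    def model(x):
        return min(samples, key=lambda s: abs(s[0] - x))[1]
    return model

data = [(0.1, "good"), (0.3, "good"), (0.8, "bad"), (0.9, "bad")]
label = bagged_predict(data, train_1nn, x=0.85)
```

Each bootstrap replicate `boot` plays the role of Di above; the majority vote over the rounds is the final bagged prediction.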


2.2.3 Decision Trees

The structure of Decision Trees (DTs) is like that of an inverted tree, with the root at the top and the leaves at the bottom. The leaves at the end of the branches are the predicted values. The branches leading to the leaves are made up of a conjunction of features; which branch the decision tree traverses depends on the conditional probabilities. The three basic algorithms most often used are C4.5, CART and ID3[3,24]. One factor that differentiates CART from the others is that CART uses the Gini diversity index to split the outgoing branches, while the other two use information entropy to select the nodes that split branches as effectively as possible; they share this approach because C4.5 is an extension of ID3[25,26]. The trees are constructed from the top down, and to prevent overfitting they can be pruned from the bottom up[15]. The accuracy of DTs differs between applications, and in its basic form it is considered high by some and mediocre by others[13,15,26].

Figure 4: Figure depicting a basic version of a Decision Tree with three levels.
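The Gini diversity index used by CART measures how mixed the class labels in a node are; a split is chosen to minimise the weighted Gini of the resulting child nodes. A minimal illustrative sketch in Python (helper names are made up):

```python
def gini(labels):
    """Gini diversity index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_gini(left, right):
    """Weighted Gini of a candidate split into two child nodes."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# A pure node has index 0; a perfectly mixed binary node has index 0.5.
assert gini([1, 1, 1]) == 0.0
assert gini([0, 0, 1, 1]) == 0.5
# A split that separates the classes perfectly scores 0.
assert split_gini([0, 0], [1, 1]) == 0.0
```

Entropy-based algorithms such as ID3 and C4.5 would use information gain in place of `split_gini` when choosing the split.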

Decision Trees (DTs) are considered simple to implement in general, since every branch and leaf is based on probability. A major advantage decision trees have over other ML algorithms is that they are very easy to interpret, since they mirror human reasoning when making decisions, which makes them effective when how predictions are made has to be presented. Another advantage of decision trees is that they do not need any preprocessing, such as normalisation of the data, in order to work[27,28]. However, DT performance can be improved by preprocessing the data through discretization[29] or by taking an average and sum over a set time interval[30]. In [31] an ANN is used for preprocessing the data for the DT, and the accuracy was significantly improved. It is suggested that bagging or boosting should be implemented to get a stronger classification from DTs[15].

Another strength of DT algorithms is that if one tree is not good enough for a certain task, it can be combined with more DTs to create a so-called Random Forest (RF)[15,32], which is an ensemble method much like bagging and boosting (mentioned in Section 2.2.2). It is called "random" since it selects a subset of the feature space at random and then performs a conventional split selection procedure within the given subset[10]. RF has been shown to be significantly stronger than DTs at classification, even though it takes longer to train depending on the amount of data[33].

2.2.4 k-Nearest Neighbour

k-Nearest Neighbour (kNN) is considered one of the simplest ML algorithms[25]. It is a non-parametric method which, like the previously mentioned algorithms, can be used for both classification and regression[34], and the output differs depending on which. For regression the output will normally be an average over the values of the input's k nearest neighbours within a feature space, hence the name of the algorithm[13,26]. If one considers an instance within the n-dimensional space of feature vectors, the position of this instance can be defined through the distances between instances. To calculate these distances, different metrics can be used and the most notable ones


are: Canberra, Chebyshev, Euclidean, Kendall's rank correlation, Manhattan, Mahalanobis and Minkowski's metric[24,35].

kNN is fast to train with respect to the number of attributes and instances. To make the algorithm even more accurate, weights can be introduced so that closer neighbours have more impact on the output. A common way to do this is to assign each neighbour a weight of 1/d, where d is the distance to that neighbour.
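The inverse-distance weighting can be sketched as follows (illustrative Python with scalar features; a real implementation would use one of the distance metrics listed above on full feature vectors):

```python
def knn_regress(train, query, k=3):
    """kNN regression with inverse-distance (1/d) weighting.

    train: list of (x, y) pairs with scalar features; query: scalar x.
    """
    # Sort neighbours by distance to the query point and keep the k nearest.
    neighbours = sorted(train, key=lambda p: abs(p[0] - query))[:k]
    num, den = 0.0, 0.0
    for x, y in neighbours:
        d = abs(x - query)
        if d == 0:          # exact match: return its value directly
            return y
        w = 1.0 / d         # closer neighbours get more impact
        num += w * y
        den += w
    return num / den

train = [(1, 1.0), (2, 2.0), (4, 4.0)]
```

For classification the weighted average would be replaced by a weighted vote over the neighbours' class labels.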

2.2.5 Naive Bayes

The Naive Bayesian classifier (NB) is a simple form of Bayesian Network (BN) which applies Bayes' theorem. That is, it expresses the probability of different scenarios occurring, under the strong assumption that the features describing the objects to be classified are statistically independent[36]. Consequently, if the features are not independent, NB will not perform well. NB classifies through the calculation[37–39]:

v_NB = argmax_{v_j ∈ V} P(v_j) ∏_i P(a_i | v_j)    (1)

Where v_NB denotes the target value output, v_j denotes the targeted value in a set of m classes, called V, available in the dataset, P(v_j) is the probability of the given value v_j, and a_i are the individual attribute values. To predict the class label of a, the following equation is used[40]:

P(a | v_i)P(v_i) > P(a | v_j)P(v_j)   for 1 ≤ j ≤ m, j ≠ i    (2)

2.2.6 Partial Least Squares

Partial Least Squares (PLS) is a statistical modelling method based on finding a linear regression model by simultaneously projecting the predicted and observable variables to a new space[13]. It is fundamentally about describing as many multidimensional directions in the predicted outcome as possible using the multidimensional directions of the observable variables, making it a widely used method for linear regression.
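As an illustration of the projection idea, a single-component PLS1 fit (the NIPALS formulation) can be sketched in a few lines. This is an illustrative NumPy sketch, not the thesis implementation, which uses MATLAB:

```python
import numpy as np

def pls1_one_component(X, y):
    """Fit one PLS component: project X and y onto a common direction.

    Returns regression coefficients b such that yc_hat = Xc @ b on centred data.
    """
    Xc = X - X.mean(axis=0)            # centre predictors
    yc = y - y.mean()                  # centre response
    w = Xc.T @ yc
    w /= np.linalg.norm(w)             # weight vector: the projection direction
    t = Xc @ w                         # scores: data projected to the new space
    q = (yc @ t) / (t @ t)             # regression of y on the scores
    return w * q

# Collinear predictors (rank 1), so one component captures everything.
x1 = np.array([1.0, 2.0, 3.0, 4.0])
X = np.column_stack([x1, 2 * x1])
y = 3 * x1
b = pls1_one_component(X, y)
y_hat = (X - X.mean(axis=0)) @ b + y.mean()
```

A full PLS fit would deflate X and y and extract further components in the same way.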

2.2.7 Support Vector Machines

Support Vector Machines (SVM) is a supervised ML algorithm which revolves around creating a hyperplane that separates the data into two different classes in such a way that the closest instances of each class are as far from the divider as possible, which makes it a good algorithm for classification, novelty detection and regression analysis[16,26,41].

SVM generally performs best on binary classification of data, but it can also produce promising results for multiclass classification by using a set of hyperplanes instead of only one[32]. A strength of SVM is that it can use the kernel trick to tackle non-linear classification. With these kernel tricks, SVM becomes an algorithm which can adapt to the different requirements that may arise[10]. With some kernel functions, SVM can be applied to work in infinite-dimensional space[42]. The most used kernel functions for SVM are the linear, polynomial, radial basis and two-layer kernel functions[43]. The radial basis function (RBF) kernel is documented to be the most used kernel for nonlinear data, since it is strong at nonlinear mapping[44–50].
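The RBF kernel mentioned above scores similarity by squared distance; a minimal sketch (γ is the kernel width parameter, tuned in practice):

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    """Radial basis function kernel: exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

# Identical points map to 1; similarity decays with distance.
assert rbf_kernel((1.0, 2.0), (1.0, 2.0)) == 1.0
assert rbf_kernel((0.0, 0.0), (3.0, 0.0)) < rbf_kernel((0.0, 0.0), (1.0, 0.0))
```

Replacing inner products with such a kernel is what lets the SVM separate data that is not linearly separable in the original feature space.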

Since SVM is often considered to have a long computation time when training on big data[24,25,28], it can be optimised for R-T analysis by using Least Squares Support Vector Machines (LS-SVM)[51]. Instead of solving a quadratic programming problem involving inequality constraints, LS-SVM only solves linear equations[52,53]. As a result, LS-SVM is superior to regular SVM in terms of training time[54].
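The linear system that LS-SVM solves can be written down directly. The following NumPy sketch uses the standard LS-SVM regression formulation (a bordered kernel system in the dual variables α and the bias b); it is an illustration with made-up data, not the thesis implementation:

```python
import numpy as np

def lssvm_fit(X, y, gamma=1e4, kernel=lambda u, v: float(u @ v)):
    """Solve the LS-SVM dual: [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y]."""
    n = len(y)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K + np.eye(n) / gamma   # regularisation 1/gamma on the diagonal
    rhs = np.concatenate([[0.0], y])
    sol = np.linalg.solve(A, rhs)       # one linear solve, no quadratic program
    return sol[0], sol[1:]              # bias b, dual variables alpha

def lssvm_predict(X_train, b, alpha, x, kernel=lambda u, v: float(u @ v)):
    return sum(a * kernel(xi, x) for a, xi in zip(alpha, X_train)) + b

# Toy data on the line y = 2x + 1, fitted with a linear kernel.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
b, alpha = lssvm_fit(X, y)
```

Swapping the linear kernel for an RBF kernel gives nonlinear LS-SVM regression with the same single linear solve.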

An SVM version for regression, called Support Vector Regression Machines (SVR), was introduced in 1996[55]. These algorithms are used, as the name suggests, to create a regression function for estimating relationships between the features of the data[56,57]. With the help of the kernel tricks mentioned above it can be applied to multidimensional data, but it struggles with large-scale training samples and with multi-classification problems[43].


2.3 Validation and evaluation

When a ML model has been trained on a set of data, it has to be tested on a set of data which is independent of the data it was trained on, since the model is considered biased towards the training data. To know whether the model is good at classifying or predicting on independent data, the result of the classification/regression is evaluated. This evaluation determines whether the model is fit to predict on further data independent of the training data; if this is the case, the model is said to have a good fit. This type of testing of the model is called validation.

Since there is a demand for the algorithms to be precise, the accuracy of each algorithm has to be evaluated. When evaluating an algorithm, two key difficulties arise which must be handled[37]:

1. Bias in the estimate. When an algorithm is trained on a certain set, its accuracy on this set is biased, which means that the accuracy over that set is a poor estimator of its accuracy on other sets not identical to the training set. To prevent this, the accuracy is evaluated on a test set which the algorithm has not been trained on.

2. Variance in the estimate. Even when the accuracy is evaluated on a set independent of the training set, the measured accuracy can still differ from the true accuracy. The smaller the test set, the greater the expected variance.

2.4 State Of The Art

Even though there is no current research revolving around ML algorithms predicting outcomes of PTUs, there are papers on ML algorithms doing predictions in industries and assembly lines. According to [10], the research problem does not have to be in the exact same domain; the interesting part is the dataset and how the ML algorithms handle multivariate and high-dimensional data.

The paper by Karl Hansson et al.[15] presents different supervised ML techniques and serves as a guideline for a paper mill's implementation of a model for predicting the bleaching time of paper pulp. The proposed techniques are Multilayer Perceptron (MLP), SVM, DT and RF. If bound by old computer hardware, MLP and SVM are recommended; otherwise tree-based ensemble learners are concluded to be the best, with SVM second best. For preprocessing, the feature selection methods wrappers, filters and embedded methods are presented. Wrappers are good at finding the best feature sets but are not feasible for large datasets. Filters are not as computationally costly as wrappers and are good at finding generalising feature sets, but can lack somewhat in predictiveness for complex feature interactions. Two popular filter techniques are CBFS and FCBF. Two of the most popular embedded methods are SVM with recursive feature elimination (SVM-RFE) and Regularised Trees.

In [57], SVR with a Gaussian RBF kernel is implemented through MATLAB to predict the thickness of dielectric layers deposited onto a metallization layer of the manufactured wafer. To validate the algorithm, 5-fold cross validation was used with Mean Square Error (MSE), Root Mean Square Error (RMSE), Coefficient of Variation of RMSE (cv(RMSE)) and R2 as evaluation metrics. Normalising or scaling the data as preprocessing is considered, to get a better performing SVR algorithm. The algorithm can also be improved by optimising C and γ via grid search.
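Grid search over C and γ simply evaluates every pair on a grid and keeps the best one. A schematic Python sketch with a stand-in scoring function (in practice the score would be the cross-validated performance of the SVR model; everything here is illustrative):

```python
import itertools

def grid_search(c_grid, gamma_grid, score):
    """Return the (C, gamma) pair with the highest score on the grid."""
    best_params, best_score = None, float("-inf")
    for C, g in itertools.product(c_grid, gamma_grid):
        s = score(C, g)
        if s > best_score:
            best_params, best_score = (C, g), s
    return best_params, best_score

# Stand-in score peaking at C=10, gamma=0.1; a real score would run
# k-fold cross validation of the SVR model for each parameter pair.
toy_score = lambda C, g: -((C - 10) ** 2 + (g - 0.1) ** 2)
params, _ = grid_search([1, 10, 100], [0.01, 0.1, 1.0], toy_score)
```

Logarithmically spaced grids for C and γ are common, since both parameters span several orders of magnitude.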

This algorithm is mentioned in [47], where it is compared with the results of ANN and M5'. Unlike the previously mentioned report, cv(MAE) is also used for evaluating the algorithms. It is shown that the SVR algorithm is slightly better in terms of accuracy and M5' is better in terms of computational speed. However, it is still concluded that all three of these algorithms can be used for prediction within manufacturing.

Another comparison between algorithms is done in [33]. Here, the algorithms' ability to predict failures in production lines is tested and evaluated with the metrics Area Under the Receiver Operating Characteristic Curve (AUROC), Matthew's Correlation Coefficient (MCC) and training time. As preprocessing, all features were converted into binary, so that those with values were set to 1 and those with missing values were set to 0. PCA was used for reducing the dimensionality of the features, and then a k-means algorithm was used to divide all the data into clusters. The methods in focus are: DT, gradient boosting, which is a version of boosting mentioned in


Section 2.2.2, logistic regression, NB, and RF, which is mentioned in Section 2.2.3. Three-fold cross validation was used to train the algorithms, and RF was optimised through grid search; the models were then evaluated on a test set disjoint from the other sets. These methods were implemented through the Scikit-Learn package. The simpler algorithms were outperformed by the two ensemble methods in terms of AUROC and MCC. In terms of training time, however, the ensemble methods were considerably slower, which shows that they are not suitable for data with over 100000 observations.

PCA is also used in [16] to reduce the dimensionality of the data set. Besides dimensionality reduction, outliers are removed by normalising all the data to the interval [0,1]. The algorithms used for regression here are AdaBoost with DT, RF and SVR with an RBF kernel. These algorithms were used in the area of ultrasonic crimping to predict the withdrawal force of crimped connections. To evaluate these algorithms, 7-fold cross validation was used, with the average of R2 and the standard deviation as evaluation metrics. SVR was shown to perform best when evaluated on R2, but both AdaBoost and RF performed almost as well, with more deviation in AdaBoost than in RF.

ML techniques in the context of manufacturing are also studied by Thorsten Wuest et al.[10]. Supervised learners are recommended for pattern recognition in data collected in manufacturing, since such data most often contains labels. The supervised learners presented are BN, Instance-Based Learning (IBL), ANN, SVM and ensemble methods. For training and validation of the ML model, 70% of the data is proposed for training, 20% for validation and the last 10% for testing. The advantages of BNs are low storage requirements, the ability to handle missing data and outputs that are easy to understand. IBL was said to mostly be based on kNN and was excluded because of its difficulty handling little-known domains, its complicated calculations on large datasets and its tendency to overfit on noisy data. ANNs have been applied in a wide range of applications and are good at handling multivariate and high-dimensional data; their disadvantages are the need for large datasets, time-consuming training, risk of overfitting, complex models and no tolerance for missing data. SVM is presented as an algorithm gaining attention in recent years; its advantages are high performance, high accuracy, the ability to adapt to different problems with different kernels and the ability to handle high-dimensional data. Ensemble methods are described as a committee of base learners, with AdaBoost and RF mentioned as popular examples; when ensemble methods are appropriate to use and when not is not discussed. The overall conclusion of the paper is that ML is a powerful tool in manufacturing and that its importance will increase in the future.

Han, Ji-Hyeong et al.[56] present ML methods for keeping product costs as low as possible and increasing productivity in manufacturing, which are requirements for staying competitive in the globalised market. The ML algorithms are proposed to deduce proper decisions based on data in order to predict. The data used in the work is from an industry which makes various shafts for automobiles. Simple linear regression and SVR with different kernels were chosen for a comparison, with learning time and MSE as comparison metrics. Simple linear regression had the fastest learning times but the biggest MSE, while SVR with a normalised polynomial kernel had the most reasonable learning time and mean absolute error. As the degree of the polynomial increases, an increase in learning time and a decrease in MSE were reported. The conclusion was that for tool wear prediction, SVM with a higher degree of normalised polynomial should be chosen, where the degree should be decided based on the size of the dataset. An open source library called Weka was used to implement the algorithms for the comparison.

In a paper by Ming Luo, Ying Zheng and Shujie Liu[58], kNN is used on historical data to predict the output in semiconductor manufacturing. In single-product manufacturing, the mean, variance and MSE of the actual and predicted outputs were very close when subjected to no disturbance. When tested with disturbance, the mean and variance were still close and the MSE still small. The same result was achieved in multi-product manufacturing, where multiple semiconductors are manufactured in the same tool at the same time. The predicted value was then compared with actual values in a fault alarm in the manufacturing control system.


3 Method

This section presents the methods this thesis uses for implementation; the methodology behind each method is explained in its respective subsection. The methodology behind these methods is based on books, public research and surveys.

Figure 5: Figure depicting the process of this project.

The process of this thesis is illustrated in Figure 5.

At first, a literature study was done to gather knowledge on projects revolving around the combination of manufacturing and ML. From there the research continued around ML and preprocessing, which is of interest due to the possibility of optimising ML algorithms. A number of reports were gathered, and each of the thesis authors individually assigned each report a score based on its quality and its relevancy to this thesis, so that only those deemed to be within the scope of the project were prioritised. Once proper knowledge about how the algorithms work was achieved, the data received from the company was analysed with this knowledge in mind. From there a hypothesis was created and relevant research questions were composed for the task of trying to prove the hypothesis. Once the hypothesis and research questions were established, brainstorming commenced on how to fulfil the hypothesis, answer the research questions mentioned in Section 1, and which software to use for the project. After analysing the data and determining a software, a design for how to implement the algorithms was discussed, including whether already implemented tools were desirable for classification/regression or whether there were other ways to make the implementation of the algorithms easier. After deciding on the design of the implementation, the building and test phase followed, where the algorithms were implemented and the trained models were tested and evaluated to see how good a fit to the data could be accomplished and how good the predictions given by the regression models were. If the outcome was not good enough, steps back would be taken to establish how better performance could be achieved from the trained models.

3.1 Software

In [59], a number of tools were evaluated for the task of preprocessing data. One of those tools was MATLAB, although it did not make the top 4 and go through their hands-on testing. The reasons MATLAB was not chosen in that test were the lack of a user-friendly GUI, the coding it requires, its somewhat higher memory usage and it supposedly not being an easy tool to learn. At least three of these disadvantages do not apply in this thesis, since both writers have major experience with MATLAB compared with the other software. Thus, MATLAB has been chosen as the tool for implementation.

3.2 Preprocessing

Since the received data contains more variables than deemed necessary, some preprocessing of the data had to be done. In this section, each category mentioned in Section 2.1 is gone through and the methodology for each topic is presented. This is done to answer the first research question and possibly ease the work that has to be done in trying to prove the hypothesis.

3.2.1 Data Cleaning

Since the raw data from the assembly line has missing values, the data has to be preprocessed to handle the issue of missing data. Based on the research presented in Section 2.1, the samples with missing data can be removed without losing too much information[11]; since the samples with missing data are such a minority, the training will not be affected if they are removed. Since the data contains more variables than necessary, a subset of variables was selected as the important ones and the rest were removed. This was done with the help of expert knowledge of the data.
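The cleaning step described above, keeping only the expert-selected variables and removing samples with missing values, might look as follows. This is an illustrative Python sketch; the column names are made up and the thesis implementation is in MATLAB:

```python
def clean(samples, keep):
    """Keep only the selected variables, then drop samples with missing values."""
    reduced = [{k: row.get(k) for k in keep} for row in samples]
    return [row for row in reduced if all(v is not None for v in row.values())]

# Hypothetical raw assembly-line records; None marks a missing measurement.
raw = [
    {"pinion_dim": 1.2, "shim": 0.5, "temp": 21, "operator": "A"},
    {"pinion_dim": None, "shim": 0.6, "temp": 22, "operator": "B"},
    {"pinion_dim": 1.3, "shim": 0.4, "temp": 20, "operator": "A"},
]
cleaned = clean(raw, keep=["pinion_dim", "shim"])
```

Dropping the variables first and the incomplete rows second keeps samples whose missing values lie only in discarded columns.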

3.2.2 Feature Selection

Due to the data cleaning, the dimensionality of the data was reduced significantly; in fact, so much that none of the methods mentioned in the feature selection section were considered for this thesis.

3.2.3 Feature Extraction

According to the research presented, PCA is the most widely used method for reducing data dimensionality in industry[13]. However, as mentioned above, since the number of variables used in this thesis is low, PCA is not considered. Besides the low number of variables, PCA will not be used because, by request from the company, the data should be presented in an intuitive way, which PCA does not do since it transforms the data.

3.2.4 Sampling fraction

In case one of the correct or faulty classes in the data is overrepresented, SMOTE was considered and used to even out the ratio between the two classes. This is because it uses oversampling to increase the number of faulty samples instead of decreasing the number of correct samples.
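SMOTE oversamples the minority class by interpolating between a minority sample and one of its nearest minority-class neighbours. A simplified illustrative sketch of that idea in Python, using scalar features (the real algorithm interpolates full feature vectors):

```python
import random

def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority samples by interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x (excluding x itself).
        neighbours = sorted((m for m in minority if m != x),
                            key=lambda m: abs(m - x))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()              # random point on the segment x -> nb
        synthetic.append(x + gap * (nb - x))
    return synthetic

# Hypothetical minority-class (faulty) measurements.
faulty = [0.9, 1.0, 1.1, 2.0]
new_samples = smote(faulty, n_new=6)
```

Because the synthetic samples lie between existing minority samples, they enlarge the minority class without discarding any correct samples.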

3.3 Classification

For classification of the data, the algorithms in this section have been considered, given the research presented in Section 2.4, to try to answer the second research question and prove the hypothesis stated in Section 1.1.1.

Explanations of why the given algorithms are chosen or not, in the form of strengths and drawbacks, are presented in the following subsections. How each of the chosen algorithms was used in the project is presented in Section 4, where the implementation of this project is presented.

3.3.1 Artificial Neural Network

ANN is one of the most used algorithms for data mining in the process industry, because it is an adaptable algorithm which can serve many purposes thanks to its modelling ability[13]. ANN is a very good algorithm for classification, even on nonlinear data, because it is considered very fast at classifying without suffering any reduction in accuracy because of it[24]. This is a desired attribute in this project, since every classification has to be done in R-T while still having a very high percentage of correct classifications. However, since more algorithms than ANN are in focus, not much time can go into the depth of how the ANN is constructed, so it might suffer in some parameters. If it shows promising results, going more in depth to create a better ANN model might be reconsidered in future work.

3.3.2 Decision Trees

DT is a simple algorithm which is very easy to interpret, since it mirrors the human way of reasoning. The algorithm alone is not always good for classification, but its strength in numbers, introduced as RF in Section 2.2.3, makes RF a very good classifier, at the cost of taking longer to train than a single DT[33]. It has also been shown, besides the use of bagging, that RF can be implemented to reduce the effect of overfitting the data[60]. Thus, both DT and RF are considered for classification moving into implementation, with more focus on RF.

3.3.3 k-Nearest Neighbour

kNN is not considered a fast classifier[24], and it also requires a lot of storage space for training and equally much, if not more, for execution. Since kNN uses the neighbours of a sample, it can be hard for it to distinguish between good and bad samples due to the similarity between good and bad data samples. Due to the large storage space kNN might require, it is not one of the primarily focused algorithms for developing a classification model.

3.3.4 Naive Bayes

NB is used in many areas for prediction because of its efficiency, simplicity and short computational training time[61], but it does not perform well on datasets with complex attribute dependencies[62]. NB performance improves the more data, especially high-dimensional data, it has to train on; with more training data the algorithm grows more certain, and the probabilities used for classification improve[63]. Even though this attribute might backfire if irrelevant samples are present in the data, NB is generally considered to have poor accuracy compared to other algorithms[24,28]. Thus, NB is not a primary focus of this project, but it is still considered, given the plausibility of good accuracy on this particular set of data, since it does not contain any irrelevant samples.

3.3.5 Support Vector Machine

Since SVM is fundamentally built upon separating data, its performance on binary classification is undisputed. SVM is also considered to be as fast as ANN while being more accurate in its classifications[24], which might be the reason it is the only algorithm mentioned more often than ANN in fault classification according to research done in the process industry[13]. Since the data in this project is divided into a correct and a faulty class, SVM is considered one of the primary algorithms for this thesis.

3.4 Regression

A couple of ML algorithms were considered for the task of predicting a new adjustment value through regression, to try to answer the last two research questions and prove the last part of the hypothesis in Section 1.1.1, with the help of the research presented in Section 2.2.

3.4.1 Artificial Neural Network

As mentioned in the classification section above, ANN is an algorithm which is very adaptable to different tasks. It is the most used algorithm in the process industry and the second most used for quality prediction[13]. It can predict with high accuracy without being too heavy in terms of computational time, which makes it suitable for prediction on the data in this project.


3.4.2 Partial Least Squares

PLS is the second most dominant method used in the process industry[13]. Since it is based on trying to explain as many predicted variables as possible with the observable variables, it has found success in many of the areas mentioned in [13], and it works well on nonlinear data.

3.4.3 Support Vector Machine

For SVM to be used in regression, a version of SVM called SVR is used, as mentioned in Section 2.2.7. The idea of using SVR for nonlinear regression is the same as having SVM classify nonlinear data: with the help of kernel tricks it can make predictions[47]. This makes SVR an adaptable algorithm which is used a lot in the process industry, surpassed only by ANN and PLS[13].

3.5 Validation and evaluation

This section covers how to validate the ML models and which metrics are of interest when evaluating them. To know how well a ML model works on its problem, its performance has to be evaluated in the form of a parameter from the model. To get such an evaluation, the model first has to be validated.

Validation of a model can be divided into two different scenarios: external validation and internal validation. External validation is when sample data can be gathered from the same population or from a similar population[64]. Internal validation is when the data is divided into a training and a testing set.

Cross-validation

A popular way of internal validation is cross-validation. It is fundamentally built on first training the algorithm on the training set for a number of iterations and then testing it on the test set; from there a comparison can be done by looking at the accuracy (correct predictions) of the result or at the number of faulty predictions from each algorithm[37]. The two most common procedures for cross-validation are exhaustive data splitting and partial data splitting[65]. Examples of exhaustive data splitting are leave-one-out and leave-p-out; in these methods a data point, or a subset of the data, is left out for validation, and the procedure iterates so that all data is used for validation.

Partial data splitting is, as the name suggests, when the data is split up partially, with some used for training and some for testing. Examples of partial data splitting are: k-fold cross validation, the holdout method and repeated learning-testing. K-fold cross validation and the holdout method are often used for evaluation. As an example of the holdout method: given a set of data, one can use 60%, or any arbitrary percentage, of the data as training data; the rest, in this case 40%, can then be used as a test set to validate the algorithm[61].
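The holdout method above can be sketched as a shuffle followed by a split at an arbitrary ratio (illustrative Python, made-up names):

```python
import random

def holdout_split(data, train_fraction=0.6, seed=42):
    """Shuffle the data and split it into a training set and a test set."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# 60% training data, 40% test data, as in the example above.
train, test = holdout_split(list(range(10)), train_fraction=0.6)
```

k-fold cross validation would instead partition the shuffled data into k folds and rotate which fold serves as the test set.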

There are many questions revolving around ML algorithms, but the three questions which are important when it comes to evaluation[37,66] are:

1. Given the observed accuracy of an algorithm over a limited set of data, how well does this estimate its accuracy over new samples?

2. Given that one algorithm outperforms another over some sample of data, how probable is it that this algorithm is more accurate in general?

3. When data is limited what is the best way to use this data to both learn an algorithm and estimate its accuracy?

3.5.1 Classification

Considering a set of sample data, the observed value from the outcome variable of the classification model can be divided into four different outcomes: true positive (TP), false positive (FP), true negative (TN) and false negative (FN). True positives are predictions which are predicted to be positive and are correctly predicted, while false positives are predicted positive but are actually negative. True negatives are predictions which are predicted to be negative and are correctly predicted, and false negatives are predicted negative but are actually positive. These four outcomes are best described through the confusion matrix, as can be seen in Figure 6.

Figure 6: Image depicting a confusion matrix.

From the confusion matrix one can derive different metrics to evaluate the classifier. The metrics focused upon in this project are: accuracy, fall-out (false positive rate), precision (positive predictive value), recall (true positive rate), specificity (true negative rate), F1 score and balanced accuracy. These metrics are calculated through the following equations:

Accuracy = (TP + TN) / (TP + FP + TN + FN)    (3)

Fall-out = FP / (FP + TN)    (4)

Precision = TP / (TP + FP)    (5)

Recall = TP / (TP + FN)    (6)

Specificity = TN / (TN + FP)    (7)

F1 score = 2 × (Precision × Recall) / (Precision + Recall)    (8)

Balanced Accuracy = (Recall + Specificity) / 2    (9)

The F1 score in Equation (8) describes the ratio between precision and recall in a classification, and the balanced accuracy in Equation (9) shows how well a model classifies both positive and negative values correctly. Besides the confusion matrix, the most important classification metric is the AUROC. The AUROC is a curve which shows the relation between a classifier's fall-out and recall. This relation creates a curve, as can be seen in Figure 7, where the area under the curve is the Area Under Curve (AUC). The AUC shows how well a classifier is able to distinguish between two different classes: 1 means it classifies every class 100% correctly, and 0 means that it classifies correct classes as faulty and vice versa. If the AUC is 0.5, the classifier cannot distinguish between the correct and faulty classes at all.


Figure 7: Graph depicting an AUROC.
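The classification metrics in Equations (3) to (9) follow directly from the four confusion matrix counts; an illustrative Python sketch with made-up counts:

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the metrics of Equations (3)-(9) from confusion matrix counts."""
    precision = tp / (tp + fp)          # positive predictive value, Eq. (5)
    recall = tp / (tp + fn)             # true positive rate, Eq. (6)
    specificity = tn / (tn + fp)        # true negative rate, Eq. (7)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),       # Eq. (3)
        "fall_out": fp / (fp + tn),                        # Eq. (4)
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": 2 * precision * recall / (precision + recall),          # Eq. (8)
        "balanced_accuracy": (recall + specificity) / 2,              # Eq. (9)
    }

# Example counts for a classifier evaluated on 20 samples.
m = classification_metrics(tp=6, fp=2, tn=8, fn=4)
```

Sweeping the decision threshold of a classifier and plotting recall against fall-out at each threshold is what produces the AUROC curve in Figure 7.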

3.5.2 Regression

There are two dimensions which can be used to determine an algorithm's prediction performance: calibration and discrimination. Calibration is a measurement of how close the predicted probabilities are to the observed rate of the positive outcome in the test data. Discrimination is a measurement of how good the algorithm is at distinguishing between outputs with positive and negative outcomes. If a model's discrimination is good, the algorithm can be recalibrated, but if the discrimination is bad, there is no adjustment that could correct it. These two dimensions cannot both exist in a perfect state; maximising discrimination comes at the expense of calibration and vice versa, so to get a good algorithm one has to find the best trade-off between the two[64].

A way of evaluating the performance of a model's predictions is to look at the observational error, that is, the error generated by comparing the model's estimated value with the true value. In chapter 7 of [67], observational error is divided into two categories:

1. Random error, the difference between the estimated and expected value.

2. Systematic error, the difference between the model prediction and the true value.

A random error can be caused by factors which cannot be controlled, while an example of a systematic error is a biased model. There are many ways to take both of these errors into consideration, among them Absolute Error (AE), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), Sum of Square Error (SSE), MSE, RMSE, R2, Sum of Squares due to Regression (SSR) and Total Sum of Squares (SST). Equations (10) to (18) show how these error summations are calculated from a model's predictions, chapter 9 in [67] and [68]:

AE = Σi |εi|    (10)

MAE = (1/n) × AE    (11)

MAPE = (100/n) × Σi |εi / yi|    (12)


SSE = Σi εi²    (13)

MSE = (1/n) × SSE    (14)

RMSE = √MSE    (15)

SSR = Σi (yi* − ȳ)²    (16)

R2 = 1 − SSE/SST    (17)

SST = SSE + SSR    (18)

Where n is the number of predictions, εi is the error generated from the ith prediction, yi* stands for the ith predicted value and ȳ stands for the mean of all observations yi of the corresponding response variable.
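The error summations of Equations (10) to (18) can be computed in a few lines of NumPy. This is an illustrative sketch rather than code from the thesis; note that the identity SST = SSE + SSR holds exactly only for least-squares fits with an intercept:

```python
import numpy as np

def regression_errors(y_true, y_pred):
    """Compute the error summations of Equations (10)-(18)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true                               # epsilon_i
    n = len(err)
    ae = np.abs(err).sum()                              # Absolute Error
    mae = ae / n                                        # Mean Absolute Error
    mape = (100.0 / n) * np.abs(err / y_true).sum()     # Mean Abs. Percentage Error
    sse = (err ** 2).sum()                              # Sum of Square Error
    mse = sse / n                                       # Mean Square Error
    rmse = np.sqrt(mse)                                 # Root Mean Square Error
    ssr = ((y_pred - y_true.mean()) ** 2).sum()         # Sum of Squares due to Regression
    sst = sse + ssr                                     # Total Sum of Squares
    r2 = 1 - sse / sst                                  # coefficient of determination
    return {"ae": ae, "mae": mae, "mape": mape, "sse": sse, "mse": mse,
            "rmse": rmse, "ssr": ssr, "sst": sst, "r2": r2}
```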

However, evaluation of regression models by calculating the generated error cannot be used in this project. There are no correct values to compare the predictions against, since, according to an expert at the company, the data of the corresponding PTU is not saved after it is corrected. A more relevant evaluation of a regression model's predictions in S1 is therefore to look at the relationship between a measurement regarding the pinion and the difference between the actual and the calculated shim for the pinion. This creates a linear relationship which tells how much the difference between the chosen and the calculated shim thickness actually changes the measurement regarding the pinion. By comparing the inclination of the line created by the prediction algorithm with the inclinations of the lines created by the actual values taken from the data, an evaluation can be made of how well the model's predictions relate to the good and bad data. This relationship was pointed out by an expert at the company, who also considered this evaluation to be the most relevant way to present the model's performance with respect to how it is done along the assembly line.
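The slope comparison described above amounts to a least-squares line fit per group. The sketch below is hypothetical (variable names and the comparison criterion are illustrative, not taken from the thesis), using `numpy.polyfit` for the fit:

```python
import numpy as np

def fitted_slope(measurement, shim_diff):
    """Least-squares slope (inclination) of shim_diff as a function of measurement."""
    slope, _intercept = np.polyfit(measurement, shim_diff, deg=1)
    return slope

def slope_gap(measurement, shim_diff_actual, shim_diff_predicted):
    """Absolute difference between the inclination of the line from the
    recorded shim differences and the one from the model's predictions."""
    return abs(fitted_slope(measurement, shim_diff_actual)
               - fitted_slope(measurement, shim_diff_predicted))
```

A small `slope_gap` would then indicate that the model's predictions follow the same linear relationship as the recorded good data.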

A similar relationship exists between the backlash generated by the PTU and the difference between the actual and calculated thickness of the house shim. Just like for the pinion, this creates a linear relationship which tells how much backlash is created depending on the difference between the actual and the calculated shim thickness, and thereby also evaluates how accurate the predictions made by the models are.


4 Implementation

4.1 Preprocessing

The saved data is read from an Excel document, where each column corresponds to one variable; the variables of interest for the work in this paper are the measurements. A variable showing the date and time of creation of the PTU is used for sorting the data according to time of creation. Another variable used for preprocessing shows error codes. The error code tells where along the assembly line a PTU has been deemed faulty. These error codes are used to group the data into four different groups: data for good PTUs is marked with 1, and the rest is marked with -1, -2 or -3 depending on where along the assembly line the fault has been detected. The next step of the preprocessing is manual variable selection; the variables selected for further use have been marked by an expert and consist of 30 variables out of a total of 155. Code for how the preprocessing was done can be seen in Algorithm 1 in appendix C.
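The sorting, error-code grouping and manual variable selection can be sketched with pandas. Column names ("Timestamp", "ErrorCode") and the code-to-group mapping are assumptions for illustration; the thesis's actual procedure is given in Algorithm 1 in appendix C:

```python
import pandas as pd

# Hypothetical mapping from error codes to groups: 0 means a good PTU,
# the other codes mark where along the assembly line the fault was detected.
CODE_TO_GROUP = {0: 1, 101: -1, 102: -2, 103: -3}

def preprocess(df, selected_columns):
    """Sort by time of creation, label each PTU with its group,
    and keep only the expert-selected measurement variables."""
    df = df.sort_values("Timestamp")
    df = df.assign(Group=df["ErrorCode"].map(CODE_TO_GROUP))
    return df[selected_columns + ["Group"]]
```

In practice the DataFrame would be loaded with `pd.read_excel(path)` and `selected_columns` would hold the 30 expert-marked variables out of the 155 available.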

The data has an imbalance toward good PTUs, with a majority of about 95% good against 5% bad. This problem was detected during implementation of the classifier, as the imbalance seriously affects classifiers negatively: they become biased toward predicting PTUs as correct. To solve this, SMOTE was applied to the training data. With SMOTE, synthetic data is created to increase the total number of faulty samples in the data. This is done by taking a faulty PTU's k nearest neighbours and creating new, synthetic data points that represent PTUs with attributes similar to the faulty ones, until the data has an equal amount of correct and faulty PTUs.
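A minimal SMOTE-style oversampler can be sketched as below. This is not the thesis's implementation; it only illustrates the idea of interpolating new minority samples between a faulty PTU and one of its k nearest faulty neighbours:

```python
import numpy as np

def smote(X_minority, n_new, k=5, seed=None):
    """Generate n_new synthetic minority samples by linear interpolation
    between each picked sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X_minority, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        # Euclidean distances from sample i to all other minority samples
        d = np.linalg.norm(X - X[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]     # skip the sample itself
        j = rng.choice(neighbours)
        gap = rng.random()                      # interpolation factor in [0, 1)
        synthetic.append(X[i] + gap * (X[j] - X[i]))
    return np.vstack([X, np.array(synthetic)])
```

Calling `smote` with `n_new` equal to the difference between the majority and minority class sizes yields the balanced training set described above.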

As there are two stations where shims are chosen for the PTU, two models had to be implemented, and the data had to be divided into two sets of input-output relations. The station where a shim is chosen for the pinion height is the first of the two stations and is thus called S1. S1 has two measurements as input and, as output, either 1 and -1 marking good and bad PTUs for classification, or the compensation factor as used for regression. The second station, where shims are selected for the backlash and preload of the tube shaft, is called S2. The output of S2 takes the same form as for S1 and depends on whether a classifier is to be trained or regression is to be done. Pseudocode for how the data sets were created for S1 and S2 can be seen in Algorithm 4 in appendix C.

4.2 Classification

In the first station, the house of the PTU is measured and then a pinion is chosen with a matching shim to adjust the height of the pinion. To calculate the dimension of the shim, the company uses a mathematical model on a set of two measurement variables plus an adjustment value which is set manually. The result of the mathematical model is then used to pick a suitable dimension among the available shims. After a shim has been chosen, the house of the PTU, the pinion and the shim are passed on to determine whether the pinion is at a good height compared to the tube shaft. If the height is good the unit gets passed on; if it is bad, the unit gets rejected. By knowing where along the assembly line a PTU is deemed faulty, all PTUs with faults that revolve around S1 can be isolated from those which are deemed faulty after the control station of S1. The PTUs which are deemed faulty after S1 are also removed from the data set regarding S2.

The three inputs used for S1 are the two measurement inputs used in the first step along the assembly line and the adjustment value. These inputs are used since they appear in the mathematical formula and thus are the three factors which determine the correct dimension of the pinion's shim. Once these inputs were chosen out of the data set for S1, and all PTUs faulty only in S2 were relabelled as correct in S1, the balance changed to approximately 97% good PTUs and 3% bad PTUs, as presented in Figure 13a in Section 5. This data was then separated into a 70% training set and a 30% test set for determining how well an algorithm classifies this data.
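One simple way to realise the 70/30 split is a random permutation of the sample indices. The thesis does not specify its splitting procedure, so the following is only an assumed sketch:

```python
import numpy as np

def train_test_split(X, y, train_frac=0.7, seed=0):
    """Randomly split (X, y) into a train_frac training part and the rest as test."""
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))         # shuffle so the split is not ordered by time
    cut = int(train_frac * len(X))
    train, test = idx[:cut], idx[cut:]
    return X[train], y[train], X[test], y[test]
```

Note that SMOTE, as described in Section 4.1, would then be applied to the training part only, so that no synthetic samples leak into the test set.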

The two ML methods in focus in this thesis, mentioned in Section 3, are ANN and SVM. To find the best kernel for SVM, MATLAB's classification application was used. Beside finding the best kernel, all other classifiers in said application were tested for finding the best performing classifier.

