
LiU-ITN-TEK-A--20/057-SE

Using Machine Learning as a Tool to Improve Train Wheel Overhaul Efficiency

Thesis work carried out in Medieteknik at Tekniska högskolan at Linköpings universitet.

Oskar Gert
Norrköping, 2020-10-15

Department of Science and Technology
Linköping University
SE-601 74 Norrköping, Sweden

Institutionen för teknik och naturvetenskap
Linköpings universitet
601 74 Norrköping

Abstract

This thesis develops a method for using machine learning in an industrial process. The implementation of this machine learning model aimed to reduce costs and increase the efficiency of train wheel overhaul in partnership with the Austrian Federal Railways, Oebb. Different machine learning models as well as category encodings were tested to find which performed best on the data set. In addition, differently sized training sets were used to determine whether the size of the training set affected the results. The implementation shows that Oebb can save money and increase the efficiency of train wheel overhaul by using machine learning, and that continuous training of prediction models is necessary because of variations in the data set.

Contents

1 Introduction
  1.1 Project Description
  1.2 Aim

2 Theory
  2.1 Machine Learning
    2.1.1 Supervised Learning
    2.1.2 Unsupervised Learning
    2.1.3 Reinforced Learning
    2.1.4 Machine Learning Applications
    2.1.5 Machine Learning Work Flow
    2.1.6 Machine Learning Algorithms
    2.1.7 XGBoost Classifier
  2.2 Category Encoders
    2.2.1 One hot encoding
    2.2.2 Sum Encoder
  2.3 Sliding Window
  2.4 Cost Function
  2.5 Cross Validation
  2.6 Hyperopt

3 Method
  3.1 Data Description
    3.1.1 Pre-processing
    3.1.2 Data Storage
  3.2 The Process
  3.3 Data Exploration
    3.3.1 Error Codes
    3.3.2 Zuführungs Codes
    3.3.3 Wheel set components
    3.3.4 Measurements
    3.3.5 Pearson Correlation
  3.4 Script Implementation Overview
  3.5 Data Filtration
    3.5.1 Approach 1: Paths
    3.5.2 Approach 2: Stations
    3.5.3 Approach 3: Predict only at first station
  3.6 Pre-processing methods
    3.6.1 Data Cleaning
    3.6.2 Category Encodings
    3.6.3 Feature Engineering
    3.6.4 Custom Cost Function
    3.6.5 Custom Cross Validation
  3.7 Hyperopt Implementation
    3.7.1 Search Space
    3.7.2 Trial Object
    3.7.3 Training Models
  3.8 Machine Learning Models
  3.9 Data Visualization
    3.9.1 Results Visualization
    3.9.2 Process Visualization
  3.10 Calculating the final result

4 Results
  4.1 All classifiers, One hot encoding
  4.2 XGBoost, All encodings
  4.3 Feature Encodings
  4.4 Model Per Year
  4.5 Importance of different features
  4.6 Performance in correlation to the training data window size
  4.7 Oebb Predictions and Actual Savings

5 Discussion
  5.1 Continuous training or one model per year
  5.2 Sliding window size and adaptability
  5.3 Critical features
  5.4 Feature Encodings
  5.5 Hyperopt implementation
  5.6 Oebb Business Value

6 Future Work
  6.1 Oebb
  6.2 Internal

7 Conclusion

A MongoDB collection descriptions
  A.1 Process Instances
  A.2 Wheelsets
  A.3 Components

List of Figures

1.1 Wheel sets outside the factory in Knittelfeld
2.1 Simple illustration of a classic machine learning workflow
2.2 Illustration of a linear regression
2.3 Illustration of k nearest neighbor
2.4 Illustration of a decision tree
2.5 Illustration of a fully connected neural network
2.6 Illustration of how the sliding window algorithm can work
2.7 Illustration of K-fold cross validation with four folds
2.8 Illustration of Time Series Split with 4 + 1 folds
3.1 Number of wheel sets and scrap wheel sets per month in the data
3.2 The distribution of error codes over the different stations
3.3 Error codes which are mostly given to scrap wheel sets
3.4 The 20 most common error codes and their scrap distribution
3.5 The 20 most common Ausbau Zuführungs codes and their scrap distribution
3.6 The 20 most common Befundung Zuführungs codes and their scrap distribution
3.7 The 20 most common bearing types and their scrap distribution
3.8 The 20 most common axle types and their scrap distribution
3.9 The 20 most common work groups (Tauschgruppen) and their scrap distribution
3.10 Wheel set axle illustration depicting where wheels and brake discs are placed
3.11 The Pearson correlation for all features over time
3.12 Pearson correlation over time for features with a mean smaller or larger than +-0.1
3.13 Visualization of the trained machine learning models
3.14 Visualization showing all paths occurring more than 100 times in the data set
3.15 Subset of the process visualization with extra information enabled
4.1 Results per month when hyperopt had 8 different classifiers to choose from
4.2 Results per month for the XGBoost classifier when hyperopt had 7 different category encoders to choose from
4.3 The 60 most important features for all XGBoost models
4.4 Feature importance over time for the 10 features with highest mean importance
4.5 Score per encoder for two different timestamps
4.6 Results per month for XGBoost with one hot encoded category features
4.7 Results from three different hyperopt sessions with different feature space
4.8 Prediction score dependent on which training data was used
4.9 True and false positives dependent on which training data was used
4.10 True precision dependent on which data was used to train the model
4.11 Prediction score dependent on training data, with training data size limited to six months
4.12 True and false positives dependent on training data, with training data size limited to six months
4.13 True precision dependent on training data, with training data size limited to six months
4.14 Comparison of prediction scores between unlimited training data and training data limited to six months
4.15 Actual savings per year

List of Tables

3.1 Displaying the path distribution
3.2 Displaying the sub path distribution
3.3 Displaying the mapping between stations and stages
3.4 Displaying the stage path distribution
3.5 Displaying the stage sub path distribution
3.6 Cost Model Presented by Oebb
3.7 Which predictors were chosen by hyperopt and how many times
4.1 Statistics about the results, for hyperopt session with all classifiers and one hot encoding
4.2 Statistics for the results from the hyperopt session with XGBoost and seven different category encoders
4.3 Usage of different category encoders
4.4 Statistics about the results
4.5 Statistics about the results, for three different hyperopt sessions with different feature sets
4.6 Oebb Predictions Confusion Matrix

Glossary

CSV: Abbreviation that stands for Comma Separated Values; a file format where the data in the file is separated by commas.

data leakage: When the machine learning model gets information it should not and will not have in production. An example of this is information which is gathered after the prediction is made; in that sense it is information from the "future".

ETL: Abbreviation that stands for Extract, Transform, Load. A term used in machine learning for when data from one source is extracted, transformed into a machine-learning-friendly format, and then loaded into a database.

false negative: When the model predicts false but the correct prediction is true; in our case, when a wheel set is predicted not scrap but it is scrap.

false positive: When the model predicts true but the correct prediction is false; in our case, when a wheel set is predicted scrap but it is not scrap.

hyperopt session: A hyperopt session is when the entire data set is trained and predicted with a set prediction interval and training window size.

Oebb: The Austrian Federal Railways.

precision: Calculated for the true and false classes separately; the share of the predictions in a class that were correct. True precision is calculated as true positives / (true positives + false positives), and false precision as true negatives / (true negatives + false negatives).

recall: Calculated for the true and false prediction classes separately; the share of the correct predictions found in that class. True recall is calculated as true positives / (true positives + false negatives), and false recall as true negatives / (true negatives + false positives).

true negative: When the model predicts false and the correct prediction is false; in our case, when a wheel set is predicted not scrap and it is not scrap.

true positive: When the model predicts true and the correct prediction is true; in our case, when a wheel set is predicted scrap and it is scrap.

Chapter 1. Introduction

1.1 Project Description

The aim of this thesis was to evaluate machine learning techniques in the context of industrial process optimization. The Austrian Center for Digital Production (CDP) is working with the Austrian Federal Railways, Oebb, at their facilities in Knittelfeld, Austria, where train wheel sets are inspected and repaired. This process takes place at different stations throughout the factory, and the activities and measurements performed on the wheel sets are documented at each step. Each wheel set takes a path through the factory that evolves depending on the data collected at each station. Our goal is to predict as early as possible in the process what type of treatment a wheel set will need. In particular, we are interested in estimating the probability of having to discard the wheel set. The models will be trained on four years of historical data provided by the company. An image showing what wheel sets can look like is seen in figure 1.1.

Figure 1.1: Wheel sets outside the factory in Knittelfeld.

1.2 Aim

In a business-client relationship with Oebb, this project was performed at the Austrian Center for Digital Production (CDP) in order to use machine learning algorithms to accurately predict whether wheel sets from trains at a wheel overhaul facility should be scrapped or repaired. At the outset, Oebb provided CDP with historical data collected on previous wheel sets that had gone through this facility. This data served as the basis for supervised machine learning predicting whether wheel sets were scrap or not. This was done to determine whether predictions made by our machine learning algorithm could be used to save the company money in wheel maintenance costs. The goals for the project were as follows:

• Can the machine learning implementation generated for this data set be used in production as a support tool for scrap prediction?
• Can the implementation outperform the current human experience-based predictions?
• Does continuous training of machine learning models result in better predictions than the traditional approach in which one model is trained and used for all subsequent predictions?
• When using a sliding window for the training data, what is the optimal window size for good performance and adaptability?
• Which data features are critical for accurate scrap prediction?
• Can the model selection pipeline handle changes in the data when new data is collected? How quickly can a machine learning pipeline adjust to these changes?
• Given that a large part of the data is in string format, which category-encoding methods yield the best performance?

Chapter 2. Theory

2.1 Machine Learning

Machine learning is the field in which computers are taught to recognize patterns and use these patterns to categorize new data [27, 26]. While machine learning is not a new field, its popularity has increased enormously in the last ten years, largely because of the great increases in computing performance during the same time [26]. The idea is to teach a computer to do a task which could be done by a human, but both faster and more accurately. Examples of such tasks include recognizing patterns in road signs for autonomous vehicles, predicting the weather from meteorological data or, as done in this project, deciding whether an item in production is faulty and should be discarded. The most common way this is done is by feeding a machine learning algorithm data from which it can learn patterns [27, 26]. These patterns can be stored and later applied to new data. Feeding a machine learning algorithm data is what is called training the model; there are three major ways this can be done, as discussed below.

2.1.1 Supervised Learning

Supervised learning is when the machine learning algorithm is fed data together with the correct answer for what should be predicted. The algorithm should find a pattern in the data that agrees with the given prediction answer. This pattern can then be applied to new data, for which the prediction is unknown [27]. Since the training data contains the correct predictions, a measure such as accuracy (how well the model did in its predictions) can be calculated [27]. Neural networks, decision trees and linear regression, among others, are commonly used supervised algorithms [27].

2.1.2 Unsupervised Learning

Unsupervised learning differs from supervised learning by not providing the correct answers for the predictions in the training data. Instead, patterns are found in the data by sorting and grouping similar data points. Since no correct answers are provided, there is no simple or direct way to benchmark the performance of an unsupervised algorithm. Common unsupervised algorithms are K-means clustering, neural networks and K nearest neighbours clustering.

2.1.3 Reinforced Learning

Reinforcement learning is more similar to how humans learn. The machine receives data on which it is then asked to perform an action; this action is rewarded if it was the desired action. This is analogous to training a dog with a food reward: when it follows the trainer's command, it receives a treat. The same situation reappears several times and, based on the memory from the previous situations, the machine learning algorithm learns which actions it should perform in which situation [28]. Common reinforcement algorithms are Markov decision processes and Q-learning.

2.1.4 Machine Learning Applications

Different ways of doing machine learning exist to solve different kinds of problems. The most common problem types, as found in Google's course Introduction to Machine Learning Problem Framing, are listed below [10].

• Classification: The algorithm tries to predict to which class a certain piece of data belongs. For example, in image recognition of animals, it may want to predict whether there is a cat or a dog in the image. It then classifies the data according to what animal is detected in each image.
• Regression: The model tries to predict a number, such as the sales price of a house. The house price can be predicted based on features such as the number of rooms, total area, and surrounding neighbourhood.
• Clustering: The goal of the algorithm is to group similar data points together, such as when grouping certain customers together in order to offer them products in which they might be interested as opposed to ones they would not buy.
• Association Rule Learning: The model finds association patterns in the data, such as grouping together people who watched a particular movie and have also watched another particular movie, as opposed to a different one.
• Structured Output: The model predicts a more complex object, rather than just a number. An example would be the structure of a molecule.

• Ranking: As the name suggests, the model ranks results by a scale or a status. A popular example of this is Google's search engine.

2.1.5 Machine Learning Work Flow

A typical machine learning workflow is illustrated in figure 2.1. The provided data goes through a pre-processing step. This pre-processing is usually composed of several sub-tasks; two common sub-tasks are data cleaning and feature engineering, discussed below. After pre-processing, the data is split into a training set and a test set. The training set is used for training the machine learning model, while the test set is used for testing it. After training, the trained model makes predictions on the test set, which are used for evaluating the accuracy of the model.

Figure 2.1: Simple illustration of a classic machine learning workflow.

2.1.6 Machine Learning Algorithms

Existing machine learning algorithms are diverse, and new algorithms are implemented regularly. Most algorithms can be grouped by functional similarity, with many containing overlapping features such that they could be grouped into more than one class. Instead of focusing on the algorithms themselves, these general groups of algorithms will be presented. That some of these groups overlap with the list of applications presented earlier in chapter 2.1.4 is no coincidence, since many of those problems can be addressed with the discussed classes of algorithms.
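Before turning to the individual algorithm groups, the sketch below makes the workflow from figure 2.1 concrete using scikit-learn. It is an illustration only: the file name, the is_scrap label column and the choice of a decision tree are placeholder assumptions, not the pipeline used in this project.

```python
# Minimal sketch of the workflow in figure 2.1 (pre-process, split, train, evaluate).
# File name, column names and the classifier are illustrative assumptions only.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

data = pd.read_csv("wheel_sets.csv")      # hypothetical input file
data = data.dropna()                      # simple pre-processing: drop incomplete rows

X = data.drop(columns=["is_scrap"])       # features
y = data["is_scrap"]                      # prediction label

# Split into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = DecisionTreeClassifier()
model.fit(X_train, y_train)               # training step

predictions = model.predict(X_test)       # predictions on the test set
print("Accuracy:", accuracy_score(y_test, predictions))
```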

Regression algorithms

A regression model tries to find a relationship between the input variables and the prediction labels, or correct predictions. This is done using a cost function, where the goal of the model is to minimize the cost. The cost is a measure of how far from the correct value the prediction was. Many different types of regression exist, but one of the simplest to understand, linear regression, is depicted in figure 2.2.

Figure 2.2: Illustration of a linear regression

Instance-based algorithms

Instance-based algorithms make predictions based on similarity to the data points in the training set. This is accomplished by putting all the data into an n-dimensional space and calculating the distance from the prediction data to the training data. The training data surrounding the prediction data will determine what value to predict [14]. Different distance measures can be used, such as Euclidean distance. K nearest neighbour is an instance-based algorithm where k is the finite number of the closest neighbours to take into consideration when making the predictions. This algorithm is illustrated in figure 2.3.

Figure 2.3: Illustration of k nearest neighbor

Decision tree algorithms

A decision tree is, as its name suggests, an algorithm that creates a tree-like decision path. It starts at the root with a decision rule which divides the data into two or more subsets. This procedure is repeated n times until a prediction for each subgroup can be made [21]. The aim for the decision rules is to find the one which gives the cleanest split in relation to the prediction label. A simple decision tree is illustrated in figure 2.4, where the dark node is the root and the white nodes are leaves.

Figure 2.4: Illustration of a decision tree

Neural network algorithms

Neural networks are intended to mimic the way the human brain works. A neural network is commonly depicted as a directed and weighted graph; this representation can be seen in figure 2.5, where the grey nodes are input nodes, the white ones are output nodes, and the light grey ones in between are the neuron nodes [11]. Between the nodes are edges. Each edge is given a weight, which is multiplied by the output values from the previous node to create the input value for the current node [11]. In each neuron node, an activation function is called on the input value; the result from this activation function is the output value of the node. The weights between the nodes are adjusted iteratively until the network performs as desired.

Figure 2.5: Illustration of a fully connected neural network

Bayesian algorithms

Bayesian algorithms apply Bayes' theorem, shown in equation 2.1. The theorem assumes independence among the predictors, which is seldom the case [13]. Even so, these algorithms can still perform well, especially on multi-class predictions.

P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}    (2.1)

2.1.7 XGBoost Classifier

The XGBoost classifier played a significant role in this project. XGBoost is an advanced tree-based classifier that uses the concept of boosting to increase its prediction accuracy. XGBoost is not alone in using boosting, but what makes it stand out is not only how it builds its regression trees, but also how efficient and fast it is. The concept of boosting involves combining multiple weak models to create a strong one. This is done by training several trees in a series, where the output predictions from each tree create a weight that becomes part of the input data for the next tree. For XGBoost, this weight is the residual: the difference between the predicted and actual values or, in the case of classification, the difference between the predicted probability of a class and the actual class. There are several ways of doing boosting, and XGBoost implements the following three: gradient, stochastic gradient, and regularized gradient boosting. All three implementations use gradient descent, an optimization technique in which small steps are taken in the direction of the gradient to find its maximum or minimum value. This enables higher accuracy of the subsequent predictions by scaling the result of each prediction tree in the boosting by a learning rate

(the size of the small steps). The resulting prediction is then the sum of all predictions in the boosting, where all of them have been scaled by the learning rate. The biggest benefit of using boosting is that it is usually much simpler to create many weak models than it is to create one strong one.

As mentioned above, the way XGBoost builds its regression trees is one of its main advantages. XGBoost uses a unique loss function, see equation 2.2, and more specifically, the regularization part of it, see equation 2.3 [6, 2].

L_{xgb} = \sum_{i=1}^{N} L(y_i, F(x_i)) + \sum_{m=1}^{M} \Omega(h_m)    (2.2)

\Omega(h_m) = \gamma T + \tfrac{1}{2} \lambda \lVert \omega \rVert^2    (2.3)

In these equations, T represents the number of leaves, γ controls the amount of gain needed to create a new node, λ is the regularization penalty, and ω is the output scores of the leaves. The γ value helps to keep the trees simple by encouraging pruning of nodes in the tree; thus a higher gamma value leads to more leaves being pruned. The λ variable helps with preventing over-fitting of the data. Equation 2.2 can be solved using a second order Taylor approximation and simplified for the optimal output value, equation 2.4 [6]. In this equation, where g and h are the first and second derivatives of the loss function, respectively, the numerator is simply the sum of the residuals squared, while the denominator is the number of residuals plus λ, the regularization penalty.

\mathrm{OptimalOutputValue} = -\frac{1}{2} \sum_{j=1}^{T} \frac{\left(\sum_{i \in I_j} g_i\right)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T    (2.4)

Equation 2.2 can, in a similar manner as 2.4, be solved for what is called the similarity score, which describes the similarity of the values being grouped together in each node of the tree. The similarity score is shown in equation 2.5.

\mathrm{SimilarityScore} = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}    (2.5)

The similarity score is used for finding the optimal split value when splitting a node into two new ones. This is done by adding the similarity scores of the two new nodes and subtracting the similarity score of their parent node, yielding the gain. The split which yields the highest gain is the best split. The equation for calculating the gain is displayed in 2.6, where it can also be seen that the variable γ regulates whether a new node is created or not, such that if the sum of the nodes' similarity scores is greater than γ, then a node is created [6].

\mathrm{Gain} = \frac{1}{2}\left[\frac{\sum_{i \in I_L} g_i}{\sum_{i \in I_L} h_i + \lambda} + \frac{\sum_{i \in I_R} g_i}{\sum_{i \in I_R} h_i + \lambda} - \frac{\sum_{i \in I} g_i}{\sum_{i \in I} h_i + \lambda}\right] - \gamma    (2.6)
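As a rough orientation, the sketch below shows how the quantities above map onto the hyperparameters exposed by the xgboost Python package: gamma corresponds to γ, reg_lambda to λ, and learning_rate to the step size used in the boosting. The values are illustrative only and not the settings used in this project; X_train, y_train and X_test are assumed to be an already prepared data set.

```python
# Sketch: mapping the symbols above onto the xgboost Python API.
# Hyperparameter values are illustrative, not the project's actual settings.
from xgboost import XGBClassifier

model = XGBClassifier(
    n_estimators=200,    # number of boosted trees (weak models)
    learning_rate=0.1,   # scales the contribution of each tree (the "small steps")
    gamma=1.0,           # minimum gain required to create a new split (gamma above)
    reg_lambda=1.0,      # L2 regularization penalty (lambda above)
    max_depth=6,         # limits how deep each regression tree may grow
)
model.fit(X_train, y_train)                  # assumes a prepared training set
probabilities = model.predict_proba(X_test)  # per-class probabilities for new data
```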

XGBoost implements two different algorithms to iterate over the possible splits and find the best one [6]:

• Basic exact greedy: Iterates over all possible splits and chooses the best one. This is a simple implementation which naturally is not very efficient. It is also not possible to use when all data does not fit in memory or when used on a distributed system.
• Approximate algorithm: Chooses splitting points based on the statistics of the features in the data. This is done by looking at the percentiles of the feature distribution, dividing the continuous features into blocks, and then aggregating these. From the aggregated blocks, the approximate best split can be found.

What makes the implementation of XGBoost so effective is that finding the optimal split values has been parallelized. In addition, it has a cache-aware system that uses separate threads for pre-fetching data into the cache to decrease the number of cache misses. It also implements solutions for when the data is too large to fit in memory, as well as solutions which enable the system to run efficiently on a cluster.

2.2 Category Encoders

In the field of machine learning, category encoding is a diverse topic and many different models exist. The encoding method to use depends somewhat on what type of category it is. Two classes of categories exist: nominal and ordinal. Nominal categories have no quantitative value, meaning that they cannot be ordered on any scale. Examples of such categories are car types [sedan, pickup, van, SUV, station wagon] or states [Wisconsin, New York, Texas, California] [23]. Ordinal categories, on the other hand, can be ordered on a scale, but the "value" between the categories cannot be decided. Examples of ordinal categories are [cold, warm, hot] or [good, bad] [23]. In this project, one hot encoding and the sum encoder were the most prominent category encoders and provided the best accuracy of the encoders used.

2.2.1 One hot encoding

One hot encoding (or dummy encoding) is one of the most widely used techniques for encoding categorical features [23]. Here, one hot encoding was the first method used to encode the categorical features. This encoding simply takes all categorical values and changes them to a binary encoding, meaning that a feature with n categorical values is turned into n features with a binary value, where a value of 1 means that the category is present, while 0 indicates that it is not [1, 23].
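A minimal illustration of one hot encoding with pandas is shown below; the column and its values are made up and not taken from the project data.

```python
# One hot encoding sketch; the data frame is a made-up example.
import pandas as pd

df = pd.DataFrame({"car_type": ["sedan", "van", "SUV", "sedan"]})

# A feature with n categorical values becomes n binary columns,
# one per category, marking whether that category is present.
encoded = pd.get_dummies(df, columns=["car_type"])
print(encoded)
```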

2.2.2 Sum Encoder

The sum encoder encodes variables by comparing the mean of the prediction values for a certain variable to the mean of the prediction values for all variables. The difference between these means is taken as the encoded value. This is done for each variable, thus generating an encoded value for each one [24]. This is best explained with an example. We have a column with different car brands. There are four different brands in the column: Volvo, BMW, Saab, and Ferrari. The value to predict is the sales price of the car. Each brand is encoded by comparing the mean price for that particular brand to the mean price of the other brands. This is done in such a way that all the brands are cross-compared with each other. The encoded value is then the difference in mean price between the brands.

2.3 Sliding Window

The sliding window algorithm is most commonly used for machine learning with time series or different kinds of image recognition. The fundamental idea is that only a subset of the data is considered for training a model; the subset that is used is the one inside the "window" [8]. As the window slides, the subset is updated. An illustration of this can be seen in figure 2.6, where the window has a size of six boxes and slides in steps the width of one box. The benefit of using this approach is that, for example, in object detection in images, each window can be predicted to be interesting or not, thereby eliminating large parts of the data that would not aid in recognizing the subject of the image, such as the background. The size of the sliding window over the training data determines how robust versus agile the model is. A larger window means more data and a more stable model, but such a model will have a harder time detecting small changes in the data than a model with a small window. However, a window that is too small can lead to overfitting and instability in the model.
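The sketch below illustrates the sliding-window idea on time-ordered training data: only the rows inside a fixed-length window that ends where the prediction period begins are used for training. The column name and the window length are assumptions made for illustration, not the project's configuration.

```python
# Sliding window sketch over time-ordered data; the "date" column and the
# 24-month window length are illustrative assumptions.
import pandas as pd

def training_window(df, prediction_start, window_months=24):
    """Return the rows inside the window that ends where the prediction period begins."""
    window_start = prediction_start - pd.DateOffset(months=window_months)
    in_window = (df["date"] >= window_start) & (df["date"] < prediction_start)
    return df[in_window]

# Example: train on the two years preceding the month to be predicted
# train_df = training_window(data, pd.Timestamp("2019-06-01"))
```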

Figure 2.6: Illustration of how the sliding window algorithm can work. The green boxes represent the data in the window; the white boxes are outside the window.

2.4 Cost Function

The role of a cost function (also known as a loss function) is to evaluate how well the machine learning model performed. This is done by comparing the predicted values to the actual values provided in a supervised machine learning approach. Different cost functions have different strategies for how to quantify the difference between the predicted and actual values. Two major groups of cost functions exist: bilateral and unilateral. It has been shown that bilateral cost functions are more suited for regression tasks, while unilateral ones are better for classification [20]. As a developer, it is also possible to implement your own cost function, which is useful for problems with unequal error costs [19].

2.5 Cross Validation

Cross validation is a common technique in machine learning which is used primarily to prevent overfitting and hence to train a more generalized model [25]. This is done by splitting the training data into different training and validation sets when training the model. There are a number of different ways to split the data. The idea is that instead of training the model only once on the training and test set, the model is trained several times, using smaller subsets of the data. This technique has both pros and cons. One advantage is that it decreases overfitting, because the model is tested on several different sets and has to perform well on all of them to achieve a high score. On the other hand, if the training data set is small to begin with, it will be divided into even smaller parts and the overall model might be less general because of it. One of the most common forms of cross validation is k-fold. This divides the training data into k folds and then uses one fold at a time as the validation set and the rest of the folds as the training set [25].

This is illustrated in figure 2.7.

Figure 2.7: Illustration of K-fold cross validation with four folds. The blue boxes contain data for the validation set, while the green boxes contain data for the training set. Each row represents one training set for the model.

An alternative to the common k-fold is the time series split. As the name suggests, it is used for cross validation of data which has a time series attribute. An illustration of how it works is seen in figure 2.8. The biggest difference between these two ways of splitting is that the time series split does not predict all of the folds. This is because the time series split assumes a relationship between the previous folds, which means that the training data must occur before the validation data. If the same relationship exists when using k-fold, it would be data leakage. The time series split also leaves data out for this reason.
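The difference between the two splitting strategies can be seen in a small scikit-learn sketch with toy data: in the time series split, the training indices always precede the validation indices, whereas k-fold may validate on "past" data.

```python
# K-fold vs. time series split on 20 time-ordered toy samples.
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)

# K-fold: every fold is used as the validation set exactly once,
# regardless of where it lies in time.
for train_idx, val_idx in KFold(n_splits=4).split(X):
    print("k-fold   train:", train_idx[:3], "... validate:", val_idx)

# Time series split: training indices always come before validation indices,
# and the earliest data is never used as a validation set.
for train_idx, val_idx in TimeSeriesSplit(n_splits=4).split(X):
    print("ts-split train:", train_idx[:3], "... validate:", val_idx)
```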

Figure 2.8: Illustration of Time Series Split with 4 + 1 folds. The blue boxes contain data for the validation set, the green boxes contain data for the training set and the white boxes represent data which was not used in the run. Each row represents one training set for the model.

2.6 Hyperopt

The machine learning implementation of this thesis is centered around the Python library hyperopt. hyperopt is a library for hyperparameter tuning and model selection, where several machine learning models with different pre-processing algorithms and unique hyperparameters can be defined in a search space [3]. This search space, together with a cost function, can then be given to hyperopt's fmin function, which will run a search over the search space and return the model and hyperparameters which gave the best score according to the given cost function. Each parameter in the search space has to be restricted to a distribution, which can be done using several different distribution models [3]. Another common hyperparameter tuning library is grid search [15]. The biggest difference between grid search and hyperopt is that grid search tries all possible combinations of hyperparameters in the search space and returns the best one, whereas hyperopt tries to avoid going through all of these combinations by converging the search towards the area of the search space where the best results are generated.
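A minimal sketch of a hyperopt search is shown below, using an XGBoost classifier and a small search space. The chosen parameters, the number of evaluations and the use of cross-validation accuracy as the score are illustrative assumptions, not the project's actual setup; X_train and y_train are assumed to be a prepared training set.

```python
# Sketch of hyperparameter search with hyperopt's fmin; the search space and
# scoring are illustrative, not the project's actual implementation.
from hyperopt import fmin, tpe, hp, Trials, STATUS_OK
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

space = {
    "max_depth": hp.choice("max_depth", [3, 5, 7, 9]),
    "learning_rate": hp.uniform("learning_rate", 0.01, 0.3),
    "gamma": hp.uniform("gamma", 0.0, 5.0),
}

def objective(params):
    model = XGBClassifier(**params)
    score = cross_val_score(model, X_train, y_train, cv=3).mean()
    return {"loss": -score, "status": STATUS_OK}  # fmin minimizes, so negate the score

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50, trials=trials)
print(best)  # hyperparameters of the best configuration found
```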

Chapter 3. Method

3.1 Data Description

The data for this project was provided through unlabeled CSV files. With the help of domain experts, some have been labeled, but unfortunately not all. Of the received CSV files, four describe the process, that is, a wheel set's path through the factory. This includes the stations a certain wheel set has visited, as well as what activities were performed at each station and a time stamp indicating when the activity was done. The rest of the CSV files were labeled with the name of the station from which they contained data. Only the stations which collect data on the wheel sets as they pass through have a corresponding CSV file.

3.1.1 Pre-processing

The data from the labeled CSV files went through the following process. First, mapping files are auto-generated to help with mapping the data from each file into the right collection and the right format for final storage. Only a foundation for the mapping is generated, while the rest has to be filled out manually. The next step is to import all files into an SQL database for better query possibilities. Each CSV file is imported in its raw form into a table in the SQL database. Up until now, all the data is intact, but now the ETL process starts. The data will be stored in three MongoDB collections that are discussed in 3.1.2. The previously mentioned mapping files dictate which value gets mapped into which Mongo collection. For each collection, an import script is executed, and all of them follow a similar pattern. Each table containing data for the collection is cleaned through several different functions. For example, these functions convert columns into the correct type, replace empty fields with default values (when such exist), and discard invalid data, such as rows with missing types and values. When all tables have been cleaned, they are merged together, transformed into the right format, and inserted into their corresponding collection.
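The sketch below illustrates the shape of such an import step for a single station file, going from a raw CSV via a few cleaning operations into a MongoDB collection. The file name, column names, database name and connection string are placeholders, not the project's actual configuration.

```python
# ETL sketch: read one station CSV, clean it, and load it into MongoDB.
# All names and the connection string are placeholder assumptions.
import pandas as pd
from pymongo import MongoClient

df = pd.read_csv("station_110.csv")

# Cleaning: enforce types, fill defaults, drop rows missing the primary key
df["radsatznummer"] = df["radsatznummer"].astype(str)
df = df.dropna(subset=["radsatznummer"])
df = df.fillna({"comment": ""})  # "comment" is a hypothetical column with a default value

client = MongoClient("mongodb://localhost:27017")
collection = client["overhaul"]["wheelsets"]
collection.insert_many(df.to_dict(orient="records"))
```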

3.1.2 Data Storage

The storage of choice is a Mongo database. This database contains three collections, named Process Instances, Wheelsets, and Components. The schemas for these collections are found in appendix A. The primary key in all of the collections is the "radsatznummer" (wheel set number).

Process Instances

The process instances collection contains information about the process. This includes information such as which stations a wheel set has been to, what activities were done, and what measurements were taken. The process is what contains the bulk of the information.

Wheelsets

This collection contains information which is specific to a wheel set; all of this information is known before the wheel set arrives at the factory. The three most important values in this collection are "wellentyp" (axle type) and "lagerbauart" (bearing), which are the components the wheel set is built from, and "tauschgruppe" (work group), which depends on the first two.

Components

The components collection holds more detailed information about the three values mentioned above, such as sub-components and their reference and tolerance values. The axle, for example, has a tolerance value for the minimum diameter it is allowed to have.

3.2 The Process

During a previous project, the CDP created a process model for the Oebb wheel overhaul facility. This process model was developed before the data used in this project was available and is thus deduced from information provided by domain experts. A simplified analysis of the process was also done after receiving the data. Both the domain model and the data model describe the process as complicated. The process contains different paths, which consist of several stations. A path thus describes which stations a wheel set has been to and in which order it has visited them. Table 3.1 shows the distribution of the paths in the data. In total, 13199 unique paths exist among the 83586 wheel sets.

Table 3.1: Displaying the path distribution.

  Occurrences     Number of paths
  1               8856
  2               1529
  3 - 100         2689
  101 - 999       118
  >1000           7
  Total: 83586    13199

The overhaul process consists of many loops, meaning that a wheel set can visit the same station several times during the process. At each station, one or more activities can be executed, thereby generating data. Because of the complex nature of the data structure, several ways of filtering the data have been explored. These will be discussed further in chapter 3.5.

3.3 Data Exploration

The first step was to explore the data and gain domain knowledge. This was done through data exploration, by creating plots of the different features and looking at their statistics. This was important not only to get a better understanding of the data, but also to find potential data leakage. An example of such data leakage is that a value in this data set was changed for every scrap wheel set even though it was not the label for scrap. There are, as mentioned in chapter 3.2, 83586 wheel set entries in the data set. A subset of these, 9132, are recurring wheel sets, meaning that they have been overhauled more than once. Out of these 83586 data entries, 14594 (17.5%) are labeled as scrap. The distribution of wheel sets per month as well as scrap per month is shown in figure 3.1. It can be seen in these figures that the number of wheel sets overhauled per month decreases over time, while the number of scrap wheel sets increases over time.

Figure 3.1: Number of wheel sets and scrap wheel sets per month in the data. (a) Number of wheel sets per month. (b) Number of scrap wheel sets per month.

3.3.1 Error Codes

An error code describes either a wheel set characteristic or a problem with the wheel set. There are 189 unique error codes in the data set. A wheel set gets on average 6.6 error codes during the process and can be given an error code at any station. However, wheel sets are more likely to receive an error code at some stations than at others. On average, 2.3 of the 6.6 error codes are given at the first station (station 110), where most error codes are given, as seen in figure 3.2.

Figure 3.2: The distribution of error codes over the different stations

Of the 189 possible error codes, less than 30% have been given to more than 1000 wheel sets. In figure 3.3, the number of times the 20 most common error codes were given is shown. The figure also shows how many of the wheel sets with the given code were in fact scrap.

Figure 3.3: Graph showing error codes which are mostly given to scrap wheel sets. The blue part of the bar represents wheel sets that are not scrap and the red part wheel sets that are scrap. The number of scrap wheel sets and the total number of wheel sets in the bar are printed above the bar.

Out of the 189 error codes, only 3 codes designate a wheel set as scrap, and these are used by Oebb in their own "predictions". These are "RSAUS", "WDM zk" and "KO AUS". There are also other codes which seem to be given only to scrap wheel sets, as seen in figure 3.4, but these are, despite being associated with scrap wheel sets, not labels for scrap themselves.

Figure 3.4: Graph showing the 20 most common error codes and their scrap distribution.

3.3.2 Zuführungs Codes

The Zuführungs codes describe what work needs to be done on a wheel set. These codes come in two sets, the "Ausbau" (Extension) and the "Befundung" (Diagnosis) Zuführungs codes. The extension codes are given by the owner of the wheel set, while the diagnosis codes are given by the worker at the first station in the factory. Not all wheel sets receive a Zuführungs code. This can be observed in both figure 3.5 and figure 3.6, where the "NaN" bar is by far the highest.

Figure 3.5: Graph showing the 20 most common Ausbau Zuführungs codes and their scrap distribution. The blue part of the bar represents wheel sets that are not scrap and the red part wheel sets that are scrap. The percentage of wheel sets that were scrap is printed above the bar.

Figure 3.6: Graph showing the 20 most common Befundung Zuführungs codes and their scrap distribution. The blue part of the bar represents wheel sets that are not scrap and the red part wheel sets that are scrap. The number of scrap wheel sets and the total number of wheel sets in the bar are printed above the bar.

3.3.3 Wheel set components

Each wheel set consists of two main components, the Wellen (axle) and the Lagerbauart (bearing); the combination of these two defines to which Tauschgruppe (work group) the wheel set belongs. There are 62 different bearing types in the data, and the 20 most commonly used ones are displayed in figure 3.7. In the same figure, it can be seen that components from an older generation, such as bearings R87 and R2, generally have a higher failure rate.

Figure 3.7: Graph showing the 20 most common bearing types and their scrap distribution. The blue part of the bar represents wheel sets that are not scrap and the red part wheel sets that are scrap. The number of scrap wheel sets and the total number of wheel sets in the bar are printed above the bar.

Similar patterns can be observed for the axle types in figure 3.8, where the share of scrap for VRS and VRS-G exceeds 40% and 25%, respectively.

Figure 3.8: Graph showing the 20 most common axle types and their scrap distribution. The blue part of the bar represents wheel sets that are not scrap and the red part wheel sets that are scrap. The number of scrap wheel sets and the total number of wheel sets in the bar are printed above the bar.

In addition, figure 3.9 shows that certain work groups (Tauschgruppen) are over-represented in the scrap class.

Figure 3.9: Graph showing the 20 most common work groups (Tauschgruppen) and their scrap distribution. The blue part of the bar represents wheel sets that are not scrap and the red part wheel sets that are scrap. The number of scrap wheel sets and the total number of wheel sets in the bar are printed above the bar.

3.3.4 Measurements

The data set contains more detailed information about the components, some of which are physical measurements. This measurement information comes from certain stations in the process where a specific measurement can be taken. Examples of such measurements are different diameters of an axle. An axle can have several places where other parts are mounted, such as wheels and brake discs; these places have a different diameter than the rest of the axle, as shown in figure 3.10 for reference. Most measurements also come with corresponding tolerance values which are set according to domain rules and are not to be exceeded. Unfortunately, the mapping between a measurement and its tolerance value provided in the data was unclear, and for this reason, among others, the tolerance values were not used in the project.

Figure 3.10: Wheel set axle illustration depicting where wheels and brake discs are placed.

3.3.5 Pearson Correlation

As part of the data exploration, certain statistical analyses were performed on the data to highlight important features within the data set. This was accomplished by running the data through a script which calculated these statistics

about the features in the set. One such analysis is a Pearson correlation. The script calculates the correlation between each feature and the label for scrap. This gives broad guidance as to which features might be more important for predicting a wheel set as scrap. In figure 3.11 the Pearson correlation for each subset can be seen. It is worth noting that the size of the subsets is not constant. This is due to the fact that the script uses an expanding window for the first 12 months and a sliding window after that. Thus, the subsets for the first 12 months are smaller than the rest of the subsets. Despite this, there is a wide band of features which, according to the Pearson correlation, are not correlated with being scrap. These are all the features with a Pearson correlation value near 0.0.

Figure 3.11: The Pearson correlation for all features over time.

These features have been filtered out in figure 3.12, which shows only features with a high Pearson correlation value. However, it must be noted that most of the features that appear correlated with being scrap decrease in importance over time.
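A sketch of this per-window correlation computation is given below: for the rows inside one data window, the Pearson correlation between every numeric feature and the scrap label is computed. Column names are placeholders and the snippet is not the project's actual script.

```python
# Pearson correlation of each numeric feature with the scrap label for one
# window of data; "is_scrap" and the feature columns are placeholder names.
import pandas as pd

def scrap_correlation(window_df, label_col="is_scrap"):
    label = window_df[label_col].astype(float)
    numeric = window_df.select_dtypes("number")
    corr = numeric.apply(lambda col: col.corr(label))  # Series.corr is Pearson by default
    return corr.drop(label_col, errors="ignore").sort_values(ascending=False)

# Repeating this for every window (expanding for the first 12 months, sliding
# afterwards) gives correlation-over-time curves like those in figure 3.11.
```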

Figure 3.12: Pearson correlation over time for features with a mean smaller or larger than +-0.1.

3.4 Script Implementation Overview

An outline of the execution steps in the scrap prediction script is seen below. The following chapters follow the same structure to give the reader some reference as to where in the process we are. The steps below are executed once for every set of predictions that is made. In our case, the prediction set is one month; thus, the steps are executed 47 times for the 48 months to be predicted, since no predictions can be made for the first month because no training data exists for it.

• Query Data
• Filter Data, according to the different filter methods described above
• Clean Data
  – Clean duplicate columns
  – Clean columns that have a high correlation with each other (carry the same data)
  – Clean columns with low variance (most rows have the same value)
• Encode categorical features, if this is not done in the pipeline with the encoder as a hyperparameter
• Create new features through feature engineering
• Split data into training, verification, and prediction sets

• Create cross validation object for the training set
• Use Hyperopt to train models
  – Create search space
  – Retrain old models with new data, update their score and probability limit
  – Train new models based on historical models
  – Train new models; if the best one is better than the historical models, add it to the historical set
• Predict data in verification set
• Score verification predictions
• If a score lower than 0 is received, tune the probability limit based on the verification set
• Predict data in prediction set
• Score prediction set predictions
• Store result into database

3.5 Data Filtration

As mentioned in 3.2, the data structure from this process is so complex that it might necessitate training several models on different subsets of the data. The three different approaches used to divide the data into these subsets are discussed in the following chapters.

3.5.1 Approach 1: Paths

In this approach, the models were divided by which path the wheel set had taken. This was done by first extracting all unique paths in the data set, which are summarized in table 3.1. After this, all unique sub-paths were extracted. Examples of sub-paths for the path [421, 110, 130, 520] would be [421], [421, 110], [421, 110, 130], [421, 110, 130, 520]. The distribution of these sub-paths is summarized in table 3.2 and shows an enormous increase in unique paths when dividing them into sub-paths.
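The prefix extraction itself is straightforward; the short sketch below reproduces the example above.

```python
# Sub-path (prefix) extraction for a wheel set's station path.
def sub_paths(path):
    """Return every prefix of the station path, shortest first."""
    return [path[:i] for i in range(1, len(path) + 1)]

print(sub_paths([421, 110, 130, 520]))
# [[421], [421, 110], [421, 110, 130], [421, 110, 130, 520]]
```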

Table 3.2: Displaying the sub path distribution.

  Occurrences      Number of paths
  1                54389
  2                9311
  3 - 100          18124
  101 - 999        1065
  >1000            153
  Total: 1304287   83042

If we were to train models only for the paths that had 1000 wheel sets or more, this would translate into training 153 different models. To avoid this, grouping was done, such that stations that belong together in the process model, mentioned in chapter 3.2, were grouped into a stage. The mapping between stations and stages is displayed in table 3.3.

Table 3.3: Displaying the mapping between stations and stages.

  Stage   Stations
  A       421, 110
  B       130
  C       140, 150
  D       410
  E       510, 520, 535, 530, 320, 550
  F       490, 480, 430, 595, 170, 506, 430
  H       190
  I       320, 340, 630
  K       680
  J       640

By doing this and transforming the paths into stage paths, new distributions of these paths could be generated. These are summarized in tables 3.4 and 3.5 and show that if models were trained for all stage sub-paths with over 1000 wheel sets, only 30 models would have to be trained.
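The sketch below shows one way such a station-to-stage transformation could look, using a subset of the mapping in table 3.3. Whether consecutive identical stages are merged is an assumption made here for illustration; the thesis does not state this detail.

```python
# Collapsing a station path into a stage path using part of the mapping in
# table 3.3. Merging consecutive identical stages is an assumption.
STATION_TO_STAGE = {421: "A", 110: "A", 130: "B", 140: "C", 150: "C", 410: "D", 520: "E"}

def to_stage_path(path):
    stages = []
    for station in path:
        stage = STATION_TO_STAGE.get(station)
        if stage is not None and (not stages or stages[-1] != stage):
            stages.append(stage)
    return stages

print(to_stage_path([421, 110, 130, 520]))  # ['A', 'B', 'E']
```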

Table 3.4: Displaying the stage path distribution.

    Occurrences      Number of paths
    1                14
    2                6
    3 - 100          33
    101 - 999        18
    >1000            17
    Total            88 paths (83586 path occurrences)

Table 3.5: Displaying the stage sub path distribution.

    Occurrences      Number of paths
    1                17
    2                1
    3 - 100          44
    101 - 999        20
    >1000            30
    Total            112 paths (385072 path occurrences)

Of these 30 paths, models were trained only for the paths where more than 10% of the wheel sets were classified as scrap. Thus, for this data set, 9 models were trained.

3.5.2 Approach 2: Stations

The second approach filtered the data by training one model per station. There are a total of 40 stations in the factory, but some of them are duplicates (carrying out the same task); after grouping these duplicates, 29 unique stations exist. The stations were filtered so that models were trained only for the ones where more than 5% of the wheel sets were classified as scrap. For this data set, 13 models were therefore trained. Filtration by station means that the data set is scanned for all wheel sets that have visited a particular station. For each such wheel set, its data is then filtered to contain only data from the stations it has visited before and the current station at which it is located. This is done to avoid training the model on data which would be collected from the "future" (later stations), since this would be considered data leakage.
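A minimal sketch of this filtering idea is shown below, again with illustrative column names ('wheel_set_id', 'station', 'timestamp'); it is an assumption about how the filtering could be expressed, not the thesis' actual code.

import pandas as pd

def data_for_station(visits: pd.DataFrame, station: int) -> pd.DataFrame:
    # keep only wheel sets that actually visited the given station ...
    at_station = visits[visits["station"] == station]
    cutoff = at_station.groupby("wheel_set_id")["timestamp"].min().rename("cutoff")
    subset = visits[visits["wheel_set_id"].isin(cutoff.index)]
    # ... and drop every record stamped after that visit ("future" data)
    subset = subset.merge(cutoff, left_on="wheel_set_id", right_index=True)
    return subset[subset["timestamp"] <= subset["cutoff"]]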

3.5.3 Approach 3: Predict only at first station

The third approach was decided on after presenting the results from the approach discussed in chapter 3.5.2 to the customer. In this meeting, it was discussed that even if a wheel set were predicted as scrap in the middle of the process, it would be difficult to remove it, since the process follows a relatively set path and wheel sets cannot simply be taken out of it. It would therefore be more beneficial for the customer if the wheel sets could be predicted as scrap or not scrap before even entering the factory. This is merely a simplification of the approach in 3.5.2, since it uses the same implementation, but instead of training models for every station, only models for the first station are trained.

The aim of this approach was to increase the performance of the first station's model. To do this, a very general economic model, which was used when training the new models, was supplied by Oebb. The model, which is visualized in table 3.6, simply says that every false positive will cost them 2000 € and every true positive will save them 250 €. The true negatives as well as the false negatives will neither cost nor save them any money, because this is the current state in the factory.

Table 3.6: Cost Model Presented by Oebb.

             Predicted Ok    Predicted Scrap
    Ok       0 €             -2000 €
    Scrap    0 €             250 €

This model heavily punishes false positives and requires a precision of more than 88% for the predicted scrap class in order to break even. Threshold tuning was used to achieve this ratio. Threshold tuning means adjusting the probability threshold at which a value is predicted as one class or the other, in this case scrap or not scrap. The standard threshold is 0.5: a probability of 0.5 or greater equals scrap, while less than 0.5 equals not scrap. By adjusting this threshold, the aggressiveness of the model can be adjusted. A higher threshold leads to fewer scrap predictions in general, but it removes more false positives than true positives, based on the fact that the true positives tend to have higher predicted probabilities than the false positives.

The tuning was done in two steps. The first step is performed on the results from the cross validation: a probability limit between 0 and 1 is iterated over with a 0.01 step size, and for each candidate limit the resulting predictions are sent to the custom cost function described in chapter 3.6.4 with the weights from the economic model mentioned above. The limit that gives the best result (the highest savings) is saved with the model. The second step takes place when the trained model predicts the validation set. If the result on the validation set is negative (money would be lost), the same iterative process is executed on the validation set and the best limit found there is used for the actual predictions. If the result is positive, the limit from the first step is used for the predictions.
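As an illustration only, such a sweep could look like the minimal sketch below. The function name mirrors the tune_probability_limit helper called in the cross validation listing in section 3.6.5, but the body is an assumption rather than the thesis' actual implementation; cost_function is any callable that scores true labels against predicted labels, such as the weighted accuracy of section 3.6.4.

import numpy as np

def tune_probability_limit(probabilities, true_values, cost_function):
    # sweep the decision threshold from 0 to 1 in steps of 0.01 and keep the
    # limit that yields the highest value of the custom cost function
    best_score, best_limit = None, 0.5
    for limit in np.arange(0.0, 1.0, 0.01):
        predictions = np.asarray(probabilities) > limit
        score = cost_function(true_values, predictions)
        if best_score is None or score > best_score:
            best_score, best_limit = score, limit
    return best_score, best_limit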

3.6 Pre-processing methods

Pre-processing was performed before training the model, as the raw data usually contains empty fields, bad values, strings, etc. The raw features might also not be enough for the model to make good predictions; thus, feature engineering (discussed below) can be done to improve this.

3.6.1 Data Cleaning

Data cleaning is done to remove redundant features, such as features that contain the same or almost the same information. In our case, this was done in three steps. The first step is to remove duplicate columns, i.e., columns whose values are exactly the same for every wheel set; such columns hold exactly the same information, and keeping both will not increase the performance of the model. The second step is to remove columns that have a high correlation with another column; the threshold for how high the correlation needs to be for one of them to be removed was set to 0.8 in our case. The third step is to remove columns with low variance, i.e., columns in which almost all values are the same. A threshold was applied here as well: all columns with a variance lower than 0.01 were removed.
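A minimal sketch of these three steps, assuming the features are numeric and stored in a pandas DataFrame, could look as follows; the helper name and the way the thresholds are passed as defaults are illustrative choices, not the thesis' code.

import numpy as np
import pandas as pd

def clean_features(df: pd.DataFrame,
                   corr_threshold: float = 0.8,
                   var_threshold: float = 0.01) -> pd.DataFrame:
    # step 1: drop exact duplicate columns
    df = df.loc[:, ~df.T.duplicated()]
    # step 2: for every pair of highly correlated columns, drop one of them
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > corr_threshold).any()]
    df = df.drop(columns=to_drop)
    # step 3: drop near-constant columns
    variances = df.var()
    return df.drop(columns=variances[variances < var_threshold].index)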

3.6.2 Category Encodings

Large parts of the data consist of categorical features, which must be encoded because the machine learning algorithms used can only handle numerical values. Most of the categorical features in the provided data are of the nominal kind. The encodings which have been used in this project are: one hot encoding [24, 5, 12], sum encoding [24], hashing encoding [5, 12], ordinal encoding [12], target encoding [5], weight of evidence encoding [9], and leave one out encoding [12].

3.6.3 Feature Engineering

Feature engineering is when new features are created from already existing ones. This is very common in machine learning and can lead to big improvements in prediction quality. The goal is to engineer features which help the model to distinguish between the prediction classes more easily. For this thesis, several features were engineered; these are listed below.

• Duration: how long a wheel set spent in a station or activity
• Number of error codes
• Number of Zuführungs codes
• Error code Inhalstaltungstuffe: derive an Inhalstaltungstuffe from the error codes
• Zuführungs Inhalstaltungstuffe: derive an Inhalstaltungstuffe from the Zuführungs codes
• Work type Inhalstaltungstuffe: derive an Inhalstaltungstuffe from the work type

In addition, statistical values such as the mean, maximum, minimum, and standard deviation of the data have also been calculated.

3.6.4 Custom Cost Function

Since wheel sets are very expensive parts, it is of the customer's highest interest that none of the wheel sets are predicted as false positives, i.e., that no good wheel sets are classified as scrap. This trumps even having better overall prediction accuracy. Because of this, two custom cost functions have been implemented, which are discussed below.

Weighted Accuracy

As the name suggests, this is a weighted version of sklearn's accuracy score¹. It is calculated based on the values given by the model's confusion matrix. The function can take an argument in the form of a list of weights; this is where the cost model discussed in chapter 3.5.3 is entered. The values of the weights naturally alter which outcomes are punished and which are preferred. The weights are multiplied by the values from the confusion matrix as shown in equation 3.1.

\[
\begin{pmatrix} \text{true negatives} & \text{false negatives} & \text{false positives} & \text{true positives} \end{pmatrix}
\times
\begin{pmatrix} \text{weight}_{\text{true negative}} \\ \text{weight}_{\text{false negative}} \\ \text{weight}_{\text{false positive}} \\ \text{weight}_{\text{true positive}} \end{pmatrix}
\tag{3.1}
\]

¹ https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html
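A sketch of such a weighted score, using sklearn's confusion_matrix and the weights from the cost model in table 3.6 as default values, might look as follows; the function name, the assumption that scrap is encoded as 1, and the weight ordering are assumptions rather than the thesis' actual code.

from sklearn.metrics import confusion_matrix

def weighted_accuracy(y_true, y_pred, weights=(0, 0, -2000, 250)):
    # confusion_matrix(...).ravel() yields (tn, fp, fn, tp) for binary labels;
    # the weights follow the (tn, fn, fp, tp) order used in equation 3.1
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    w_tn, w_fn, w_fp, w_tp = weights
    return tn * w_tn + fn * w_fn + fp * w_fp + tp * w_tp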

3.6.5 Custom Cross Validation

The hyperopt implementation is paired with cross validation, as mentioned in chapter 3.7. This cross validation takes an argument that either decides how many folds to use or describes which folds to use; a fold is a subset of the data. If a number is given, the default StratifiedKFold² function from sklearn is used, but other fold/split functions can be specified. Because of the special data structure, where the process data has a time series element (time stamps from stations and activities), it is important not to test wheel sets on a model trained on data that was collected after the wheel set came through the factory, since the model would then have information about the future. There is an implementation from sklearn called TimeSeriesSplit³ which does just this; however, this implementation did not work at first. Because of this, a custom time series split was implemented, which is discussed below.

After it was decided to predict the probability of a wheel set being scrap, rather than a hard scrap/not-scrap label, in order to enable the probability limit tuning mentioned in chapter 3.5.3, sklearn's cross_validate function could no longer be used. Instead, sklearn's cross_val_predict⁴ function was used. However, this function can handle neither the custom time series split nor sklearn's TimeSeriesSplit⁵, because it demands that every data point in the set receives a prediction. Because of this, a simple custom cross validation function was implemented and used whenever probability tuning was combined with a cross validation object that uses a time-based split.

² https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html
³ https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html
⁴ https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_predict.html
⁵ https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html

Custom Cross Validation

The custom cross validation function shown below works as follows: for each fold in the cv object, the corresponding training and test subsets are extracted and used to train the pipeline and predict the data for that fold. The predictions, together with their correct values, are stored in a list for later use. When the loop over all folds is completed, the probability limit is tuned using all predictions. The best probability limit is then used to convert the stored probability predictions into Boolean values; this is done for all folds, and each fold's score is added to a score list. Finally, all interesting information is stored in a dictionary, which is returned.

def custom_cross_validation(self, pipeline, X, y, cv_object, cost_function):
    pipeline.fit(X, y)
    return_dict = {}
    probability_predictions = []
    true_values = []
    score_list = []
    probability_list = []
    fold_probabilities = []
    fold_true_values = []
    fold_max_scores = []

    # train and predict every fold, collecting the predicted probabilities
    for train_indexes, test_indexes in cv_object:
        train_X = X.iloc[train_indexes, :]
        test_X = X.iloc[test_indexes, :]
        train_y = y.iloc[train_indexes]
        test_y = y.iloc[test_indexes]

        pipeline.fit(train_X, train_y)
        temp_predictions = pipeline.predict_proba(test_X)
        temp_probabilities = temp_predictions[:, 1]

        probability_predictions += list(temp_probabilities)
        true_values += list(test_y)
        fold_probabilities.append(temp_probabilities)
        fold_true_values.append(test_y)

    # tune the probability limit on all predictions, then score each fold
    score, best_probability = self.tune_probability_limit(
        probability_predictions, true_values, cost_function)

    for probabilities, true_value in zip(fold_probabilities, fold_true_values):
        predictions = probabilities > best_probability
        score_list.append(cost_function(true_value, predictions))
        fold_max_scores.append(cost_function(true_value, true_value))

    return_dict['test_score'] = score_list
    return_dict['test_probability'] = probability_list
    return_dict['max_score'] = fold_max_scores
    return_dict['total_score'] = score
    return_dict['best_probability'] = best_probability
    return_dict['probability_predictions'] = probability_predictions
    return_dict['true_values'] = true_values
    return return_dict

3.7 Hyperopt Implementation

As mentioned in chapter 2.6, the Python library hyperopt is a central part of the machine learning implementation in this thesis. Further information about it and its usage is therefore explained below.
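Before going through the individual parts, the following minimal sketch shows how a pipeline-based search space, a Trials object, and fmin fit together. It is illustrative only and not the thesis' actual search space; X_train and y_train stand for the current training subset, and the hyperparameter ranges are arbitrary examples.

from hyperopt import Trials, fmin, hp, tpe
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

# hypothetical hyperparameter space for a single pipeline
space = {
    "max_depth": hp.choice("max_depth", [3, 5, 7]),
    "learning_rate": hp.uniform("learning_rate", 0.01, 0.3),
}

def objective(params):
    pipeline = Pipeline([
        ("scale", StandardScaler()),
        ("model", XGBClassifier(**params)),
    ])
    score = cross_val_score(pipeline, X_train, y_train, cv=3).mean()
    return -score  # hyperopt minimises, so the score is negated

trials = Trials()  # stores every evaluated configuration for later reuse
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=50, trials=trials)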

3.7.1 Search Space

The search space in our implementation is filled with pipelines from the sklearn library⁶. Each of these pipelines describes the execution space for a specific model. An example structure for a pipeline could be a feature scaling algorithm as the first step, followed by a dimensionality reduction algorithm, ending with the machine learning model. Each step in the pipeline can have several hyperparameters which need to be optimized.

3.7.2 Trial Object

One of the most important parts of hyperopt is the Trial object [3]. The trial object is where all pipelines from the search space that have already been trained are stored. The trial object enables continued searching through an already partly explored search space. This is especially useful if the implementation crashes, since the history will have been saved.

3.7.3 Training Models

Since hyperopt, as discussed in chapter 2.6, is not meant for continuous training in an environment where the data changes, some workarounds had to be implemented. These are explained below.

Retraining Models

Because the training data in our script changes as the prediction model iterates over the data set, a model that got a certain score for the last prediction will not have the same score now. Therefore, the score of all models must be updated to reflect the score on the current training data. If this were not done, a model that was very good for one subset of the training data would be used for several months afterwards, because no new model could get a better score. However, that model might only have had a good result because its training data was easier to predict than that of the later models. Because of time constraints, retraining all models is not within the current scope. Thus, to limit the time spent while still retraining the best models, a scheme was implemented in which only the n best models are retrained, with n being a user-defined number.

Training Models with Trial Object

The next step, after updating the scores of the old models stored in the trial object, is to add more models to the trial. This is how hyperopt was intended to be used and means that it simply continues the search where it stopped the last time, based on the trials stored in the object. Here, compromises have been made to save time, such that only the hundred best runs are stored in the trial object. This is partly because of limitations in MongoDB, namely that a document's maximum size is 16 MB; therefore, we are not able to store thousands of trials in MongoDB. This would be possible, however, if the trial object were stored in a file.

⁶ https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

Training Models without Trial Object

A hyperopt object without a trial object is also trained for each data subset. This ensures that the hyperopt object with its corresponding trial object does not get stuck in a local maximum. Training a hyperopt object from scratch means that it will, without any prior knowledge, try to find the best parameters for this data set. If it finds a model and parameters which score better than the best one in the hyperopt object with a trial object, then this model is added to that trial object.

3.8 Machine Learning Models

The hyperopt space is filled with pipelines, as explained in chapter 2.6. Each of these pipelines has a machine learning model in charge of making predictions. The hyperopt space for this project was filled with the following predictors: RandomForestClassifier [17], DecisionTreeClassifier [17], XGBoost [6], LogisticRegressionCV [22], MLPClassifier [18], GaussianNB [4], MultinomialNB [16] and KNeighborsClassifier [7]. After training and predicting on the entire data set, only three of these predictors were selected by hyperopt. These predictors and how many times they were selected are displayed in table 3.7.

Table 3.7: Which predictors were chosen by hyperopt and how many times.

    Predictor                 Used Number Of Times
    XGBoost                   35
    MLPClassifier             6
    RandomForestClassifier    5

These results, coupled with an interest in increasing performance, led to narrowing the search space. Initially, only the predictors in the table were used (XGBoost, MLPClassifier, and RandomForestClassifier). Later, this was restricted to using only XGBoost, largely because the MLPClassifier is a neural network that takes an especially long time to train. RandomForestClassifier performed similarly to XGBoost, yet was only selected 5 times by hyperopt and was also discarded.

3.9 Data Visualization

A visualization tool was implemented to display and explore the trained models. The tool is web-based and was developed using Python Flask for the back end and JavaScript D3 for the visualisation.

This generates an interesting visualization with many interactive features.

3.9.1 Results Visualization

The visualization of the prediction results consists of two bar plots and one confusion matrix. The bar plot on top shows the latest model trained for each station. The height of the bar and the number above it represent the model's score. The second bar plot shows the historical data for the station that was selected in the first bar plot. Currently, it shows all models ever trained for this station, but this can be limited to a set number. The confusion matrix simply shows the confusion matrix for the bar clicked in the second bar plot. Hovering over the different classes in the confusion matrix will display the recall for that class. A screenshot showing an intermediate implementation of this visualization can be seen in figure 3.13.

Figure 3.13: Visualization of the trained machine learning models.

3.9.2 Process Visualization

The second visualization shows the different paths and their distribution of scrap, starting from the first station in the process. It was used when exploring the paths approach described in chapter 3.5.1. Using this visualization, it was possible to follow the scrap through the process and see which paths were most likely to contain bad wheel sets. It also served as a way to grasp the complexity of the process, since numbers, although large, can be somewhat abstract. The visualisation, shown in figure 3.14, consists of nodes and edges connecting them. The nodes are interactive, and clicking one will fold/unfold all nodes after it. The web page also contains two filter sliders which fold/unfold nodes depending on their values. The sliders control values for the number of occurrences on the path and the number of scrap wheel sets on the path.
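As a rough illustration of the back-end pattern described above (and not the thesis' actual code), a Flask route serving model scores to the D3 front end could look like the sketch below; the route, field names, and hard-coded example values are all hypothetical placeholders.

from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/api/models/<station>")
def models_for_station(station):
    # in the real tool the scores would come from the database; hard-coded here
    scores = [{"trained": "2020-01", "score": 0.83},
              {"trained": "2020-02", "score": 0.79}]
    return jsonify(station=station, models=scores)

if __name__ == "__main__":
    app.run(debug=True)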

References
