DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2014

Machine Learning Methods for Fault Classification

MARKUS FELLDIN

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION


Abstract

This project, conducted at Ericsson AB, investigates the feasibility of implementing machine learning techniques in order to classify dump files for more efficient trouble report routing. The project focuses on supervised machine learning methods and in particular Bayesian statistics. It shows that a program utilizing Bayesian methods can achieve prediction accuracy well above random. It is therefore concluded that machine learning methods may indeed become a viable alternative to human classification of trouble reports in the near future.


Referat (Swedish abstract)

Maskininlärningsmetoder för felklassificering

This degree project, conducted at Ericsson AB, investigates whether machine learning techniques can be used to classify dump files for more efficient fault identification. The project focuses on supervised learning, and Bayesian classification in particular. The work shows that a program utilizing Bayesian classification can achieve accuracy well above chance, and indicates that machine learning techniques may well become useful alternatives to human classification of dump files in the near future.


Contents

1 Introduction
1.1 Introduction
1.2 Background
1.3 Thesis Objective
1.4 Current Workflow
1.5 Constraints
1.6 External Factors
1.7 Intended Audience
1.8 Ethical Considerations
1.9 Choice of Methodology
1.10 Evaluation
1.11 Related Research
1.12 Implications of Related Research

2 Machine Learning
2.1 Machine Learning
2.2 Types of Learning
2.3 Perceptrons and Neural Networks
2.4 Support Vector Machines
2.5 Expectation Maximization
2.6 Compound Classifiers
2.7 Naïve Bayes

3 Method
3.1 Data Analysis
3.2 Feature Selection
3.3 Bayesian Methods
3.4 Classification
3.5 Viability of Disregarded Algorithms

4 Implementation
4.1 Data
4.2 Data Sources
4.3 Feature Extraction
4.4 Feature Selection
4.5 Remarks on Feature Extraction
4.6 Deviations from Bayesian Methods
4.7 Data Processing

5 Results
5.1 Independent Results of Classifiers
5.2 Combined Results and Comparisons
5.3 Comments on the Results

6 Conclusion
6.1 Discussion
6.2 Conclusion
6.3 Recommendations
6.4 Future Work

Bibliography


Preface

I would like to extend my gratitude to my superb supervisors at Ericsson AB: Elina Meier, Per-Olof Gatter and Johnny Carlbaum. Leif Jonsson, the resident machine-learning guru, also deserves a big thanks for paving the way with his research and providing easier access to the data required for my research. I would also like to thank the remaining members of my team at Ericsson AB for their help and support throughout the duration of this project: Fredrik Tengblad, Tomas Jonek, Peter Kerwein, Michael Hedström and Stefan Sundin. Finally, I would like to thank my supervisor at KTH, Olov Engwall, my examiner, Olle Bälter, and my guidance team members Jens Arvidsson, Olle Hassel, Anton Lindström and David Nilsson.


Glossary

ASIC   Application-specific integrated circuit
Bugs   Software errors
CDA    Crash dump analysis
DSP    Digital signal processor
EM     Expectation maximization
OS     Operating system
MHO    An Ericsson design unit
ML     Machine learning
RBS    Radio base station
SVM    Support vector machine
TR     Trouble report
P(A)   The probability of an event A occurring
P(A|B) The probability of an event A occurring given that the event B has occurred


Chapter 1

Introduction

This chapter serves as an introduction to the goals and intents of this degree project.

It also provides background on why this research is being conducted and describes the conditions under which it was carried out.

1.1 Introduction

As one of the world’s largest telecommunication companies, Ericsson AB, among other things, develops, sells and maintains a large number of base stations across the globe. When faults occur within one or more of these, a large amount of text must be analyzed in order to determine the cause. This process is currently performed manually, which can be both slow and imprecise, and must be performed for every new trouble report. Part of Ericsson’s vision is that there will be 50 billion devices, each connected to a global network, by the year 2020. The sheer amount of data that could potentially be generated by these devices means that manual error identification may very well become a severe bottleneck in the very near future. If an automated initial classification could be performed based on the text files, the amount of manual labor required per trouble report could be greatly reduced. This in turn could facilitate both quicker debugging for developers and faster customer support for clients.

1.2 Background

In today’s reality the capacity to store and produce data is vastly in excess of what the human mind is able to decipher quickly. There has been a great deal of research in the fields of AI, machine learning and data mining. This research is more interesting now than ever before, as computational power has reached a level where such methods may finally be applicable in an increasing number of real-world scenarios. As stated in the introduction, Ericsson, among others, currently relies heavily on non-automated error identification, a process that becomes more and more costly as the amount of data that needs to be analyzed increases. Ericsson has conducted some research in the field but has not yet reached an applicable solution.

There are several concurrent research projects within the organization that are looking at different methods of incorporating some degree of automation within the trouble report (TR) routing process. Each of these projects differs in terms of data sources or methodologies. To date, the most successful method has utilized human TR descriptions together with a compound machine learning approach. Other approaches are, for example, looking at changes in the codebase between errors.

A successful project could lead to improved trouble report routing within Ericsson, and may be incorporated as a part of a more complex TR routing tool.

A failure to implement algorithms may also help illustrate various difficulties with log file analysis as a basis for classification, which may serve as a reference for future attempts. Furthermore, since automatic error identification from log files could be applicable to almost any system that has crash-persistent logging, there will likely be external interest in the project as well.

1.3 Thesis Objective

The primary objective of this project is to test the viability of automatically routing TRs using machine-learning techniques based on the data generated by fatal errors in Ericsson’s radio base stations. The viability of the methods found by this project will be determined by their performance in relation to random classification. A method is therefore deemed viable if it can be shown that it performs better than what would be statistically probable for a random classifier. The approach utilized, if deemed viable, will serve as a baseline for future development with the intent of constructing a system that is able to perform at human levels of TR classification.

Internal empirical data has shown that humans perform TR classification with 77% accuracy.

1.4 Current Workflow

At the moment there is no easy way to classify software errors based on causality. Instead, errors must be dealt with without prior knowledge pertaining to their causes. Initial classification can only be achieved through the intuition of the individual responsible for handling or reporting the error. At the moment errors are most often handled in one of two ways:

1. The error is noticed during testing/debugging of code internally by the developers. In this case the developers will often be tasked with debugging their own code with assistance from colleagues. This scenario relies heavily on the experience and intuition of the developers involved.


2. The error occurs post-delivery and is reported to a customer service technician. In this case the technician files a TR based on their own experience and the information supplied by the customer. The TR is then assigned to a team of developers who have to debug the product based on the information given in the TR.

Both of these processes could see great reductions in complexity if some degree of error pre-classification could be performed.

1.5 Constraints

This project will only test the viability of supervised machine learning techniques on static log files from Ericsson’s business unit networks (BNET) that have been generated by Ericsson’s radio base stations (RBS) and processed by Ericsson’s Crash Dump Analysis (CDA) tool. Therefore the format and contents of the data cannot be controlled or changed within the scope of this project. Furthermore, due to the complexity of the system responsible for generating the dump files, it is very difficult to recreate a statistically representative spread of issues artificially. Therefore no new data can be artificially generated for the purpose of this project.

1.6 External Factors

Insight into the contents and meaning of the dump files will also be limited by the amount of documentation and knowledge that is available within Ericsson.

1.7 Intended Audience

In order to gain insight from this report one should have some degree of understanding of algorithm design as well as working knowledge of basic calculus and statistics. The intended readers are primarily Ericsson employees working with TR routing and software developers working with similar problems. The project may also be of interest to researchers working with feature extraction and data aggregation.

1.8 Ethical Considerations

This project will, in all likelihood, not cause any ethical dilemmas, as it is a project delving into theoretical problems. However, if the project is successful, it could potentially reduce the amount of human labor required to perform TR routing in the future. This in turn could lead to a reduction in the amount of support staff hired by Ericsson. However, this is not a likely outcome as TR routing is only one of their responsibilities. Furthermore, the purpose of this project is merely to result in a tool used as an aid to increase the throughput of the TR routing process, not to replace the process entirely.

A higher throughput in the TR routing process may, however, increase efficiency, which may, in turn, result in shorter downtimes. Shorter downtimes would improve the quality of service and utility of Ericsson’s radio base stations. Shorter downtimes and higher quality of service of cellular networks would have positive socioeconomic impacts.

1.9 Choice of Methodology

The approach should be divided into different phases. First and foremost the dump files should be deconstructed and the current method of identification inspected. Then available research, both internal and external, should be studied for possible approaches. Additionally, test cases must either be found or constructed. After this, a baseline should be constructed or found as a source for performance comparison. A program composed of appropriate methods should then be constructed based on speed and accuracy. This program should be improved upon, based on performance in relation to the baseline and previous attempts.

1.10 Evaluation

The accuracy should be evaluated based on comparisons with both a random algorithm and the current manual system. The current accuracy of manual classification of trouble reports is approximately 77%. Approaching this accuracy would be considered a great success, but is not a requirement. Instead the focus lies on demonstrating that classification is possible through machine learning algorithms, which would be evident through a statistically significant improvement over random classification. The accuracy of both random classification and the prototype classification programs will be based on results of runs on a validation set with known answers.
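As an illustration of what a "statistically significant improvement over random classification" can mean in practice, the following minimal sketch applies a one-sided binomial test to an observed accuracy. It assumes SciPy is available (binomtest requires SciPy 1.7 or later); the sample counts are hypothetical placeholders, not results from this project.

```python
# Hypothetical example: is an observed accuracy significantly better than
# random guessing over k classes? (One-sided binomial test; SciPy assumed.)
from scipy.stats import binomtest

n_validation = 46   # hypothetical number of validation dumps
n_correct = 27      # hypothetical number of correct predictions
k_classes = 17      # number of classes in the validation set

p_random = 1.0 / k_classes
result = binomtest(n_correct, n_validation, p_random, alternative="greater")
print(f"observed accuracy: {n_correct / n_validation:.2f}")
print(f"p-value against random guessing: {result.pvalue:.2e}")
```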

1.11 Related Research

Ericsson has undertaken previous research into implementing machine learning algorithms in order to classify trouble reports. However, these attempts have focused on different stages of the trouble reporting process and no previous attempts have been made to classify dump files without any manual pre-processing. These attempts have had various degrees of success but have in general been deemed too error prone to yield reliable results. Most attempts have been made at sorting the reports filed by customer service technicians based on keywords. The usage of human reports introduces an additional level of entropy, as the customer service representatives’ individual idiosyncrasies can make general classification more difficult.


Another thesis worker at Ericsson, Weixi Li, recently explored utilizing machine learning techniques for automatic log file analysis [11]. Whilst her findings are not entirely applicable to this research they are very relevant as the data comes from a related source with similar content. Her research dealt with automatically differentiating between abnormal and normal logs. She was able to show that machine learning techniques yielded a better than random prediction percentage.

Outside of Ericsson there has been an increasing interest in utilizing machine learning algorithms for software debugging, noticeable through an increased number of publications on the subject as of late. Research by Alice X. Zheng [3] demonstrated that algorithms could be used to cluster program data in order to identify underlying bugs. The results of Zheng’s research [3] illustrated a key problem with classifying software errors: there may be more than one underlying cause for any one failure, which often results in an ambiguous classification of that particular data point.

The research paper titled “How Bayesians Debug” by Chao Liu, Zeng Lian, and Jiawei Han looked at how Bayesian classifiers could facilitate software debugging [9]. They found that their particular implementation of a Bayesian classifier was indeed able to make relatively accurate predictions based on code input, i.e. given a section of code the algorithm returns either true (there is a bug) or false (there is not a bug) for that particular section of code. Whilst not entirely comparable to the intended research subject of this paper it is still a promising result.

As for text file analysis in general, the term ‘data mining’ is nearly part of everyday vocabulary nowadays, in large part due to the amount of research conducted in the field. Understandably, Google is among the pioneers in this research, and papers like “Experience Mining Google’s Production Console Logs” [10] show that machine learning techniques can be successfully applied even to console log data.

1.12 Implications of Related Research

Previous research, like that by Chao Liu, Zeng Lian, and Jiawei Han [9], indicates that the chosen method has been successfully applied to problems of a similar nature with positive results. Whilst not all of their findings were applicable due to slightly different data and goals, many of the key concepts remain the same. Research like that by Weixi Li [11] shows that patterns can be found in data that is closely related to that used for the purpose of this project. Her paper also highlights some of the problems that she faced during her research that are similar to those faced by this project during the feature extraction phase.


Chapter 2

Machine Learning

This chapter provides theoretical background for the remainder of the report and serves as a basis for the concepts utilized throughout this project.

2.1 Machine Learning

Machine learning is, as the name suggests, a term used to describe the field associated with learning algorithms. Learning in and of itself is a broad term that is hard to define. In the field it is normally defined as the ability of a particular algorithm or system to acquire knowledge or skill through analysis of related data. A machine learning system will typically consist of several distinct phases: a training phase, a validation phase and actual operation. During the training phase the system is given test data with known associations so that it can identify patterns in order to correctly make predictions within that set. The validation phase is meant to grade the performance of the system using related but disjoint data, also with previously known answers. Finally, if the program is deemed to be up to specifications, it can be used on real data with unknown answers.

The idea of machine learning originates in trying to make computers able to make predictions based on what may seem like random data. This is something that humans are very good at whilst computers remain rather inept. One of the reasons for this is that we have a large library of preexisting knowledge. Examples of this include speech and handwriting recognition; within these fields humans can make predictions with a high degree of accuracy based on both previous knowledge and the ability to collate scattered data quickly. The difficulty lies in describing how this process actually works for humans; if the process cannot be broken down it is hard to represent it using a conventional algorithm.

2.2 Types of Learning

Machine learning is a diverse field in computer science and algorithms can be divided into several different groups. First they can be divided based on the type of problem they are designed to solve. Some of the most common problem types are classification, clustering, regression, and anomaly detection [1]. Then they can be categorized based on their learning methodology. Some of the more common learning methods are supervised learning, unsupervised learning, reinforcement learning, and association rule learning [1].

The primary focus of this project will be researching and implementing supervised classification algorithms, as the degree of success of an algorithm depends heavily on selecting the right approach for the data at hand. In this particular case the goal is to classify system dumps into different types of errors based on their content.

2.3 Perceptrons and Neural Networks

The perceptron algorithm is one of the very first machine learning algorithms [1].

It aims to roughly emulate the way neurons work in a biological entity in order to linearly classify an input vector. In much the same way as neurons in our own brain work, the perceptron algorithm takes an input vector, let us call it X, multiplies this vector by a weight vector, let us call it W, and if the resulting value exceeds the threshold value a signal is sent. The size of the input vector depends on the dimensionality of the available data. Each dimension is assigned a weight, i.e. how much values from said dimension impact the results in the data set. In its most basic configuration the perceptron algorithm is a linear classifier; in short, this means that the space of data will be separated by a linear function that denotes whether or not the input vector for each point exceeded the threshold.

Figure 2.1: A graphical illustration of the perceptron algorithm1.

The perceptron algorithm is a form of supervised learning and is trained by changing the weights iteratively so that the algorithm will achieve the correct result for a given input. There are different methods of learning, but the general idea is to iterate over the training data and, when the algorithm gets it wrong, manipulate the weights according to a predetermined method until it gets it right. Of course, if

1 By Mayranna (own work), CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0), via Wikimedia Commons.


the input data cannot be separated by a hyperplane, there is no perfect solution.

For linearly separable data, however, convergence can be guaranteed using the following method [1] (a minimal code sketch of this procedure follows the list).

1. Set the weights and threshold values. The weights should be small, either zero or close to it.

2. For each data point in our training set, the following steps should be performed:

a) Determine the current output (the dot product of the input and weight vectors).

b) Update the weights based on the chosen learning rate. The learning rate is a multiplicative constant applied to the weight update.
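The following is a minimal sketch of this training procedure, assuming NumPy is available; the array names and parameter values are illustrative and not taken from this project.

```python
import numpy as np

def train_perceptron(X, y, learning_rate=0.1, epochs=100):
    """X: (n_samples, n_features) array, y: labels in {0, 1}."""
    w = np.zeros(X.shape[1])   # step 1: small (zero) initial weights
    b = 0.0                    # threshold handled as a bias term
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):                       # step 2: iterate over the data
            output = 1 if np.dot(w, x_i) + b > 0 else 0  # step 2a: current output
            error = y_i - output
            w += learning_rate * error * x_i             # step 2b: weight update
            b += learning_rate * error
    return w, b
```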

2.4 Support Vector Machines

Support vector machines (SVM) are supervised learning models in which the algorithms are designed to analyze data in order to perform pattern recognition for classification or regression [7]. The original model is a linear classifier that aims to find a separating hyperplane. Later research has introduced non-linear SVMs through usage of kernel functions [7]. The general idea behind the model is to maximize the shortest distance between any data point and the separating hyperplane; this distance is often called the functional margin [1].
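For illustration only, the sketch below trains a linear SVM on a toy data set using scikit-learn, which is assumed to be available; SVMs were not part of the implementation in this project.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (illustrative placeholders).
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])

clf = SVC(kernel="linear")   # fit a separating hyperplane with maximal margin
clf.fit(X, y)
print(clf.predict([[0.1, 0.0], [1.0, 0.9]]))   # expected: [0 1]
```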

Figure 2.2: A graphical depiction of the goal of SVM algorithms2.

2 By Cyc (own work), CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0), via Wikimedia Commons.


2.5 Expectation Maximization

Expectation maximization (EM) is an iterative algorithm used to find maximum likelihood estimates of parameters. Its primary usage is when the model depends on unobserved variables [7]. The name EM is derived from the two ‘steps’ of the algorithm. Each iteration consists of the following (a minimal numerical sketch follows the list):

1. An expectation step. This step calculates the expected value of the log likelihood function. The log likelihood function is a statistical method for comparing two models.

2. A maximization step. This step computes parameters that maximize the log likelihood function from step 1.

3. The resulting parameter estimates are used to determine the distribution of the unobserved or latent variables that will be the starting point for the subsequent E step.
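As a minimal numerical sketch, the function below runs EM for a two-component one-dimensional Gaussian mixture, assuming NumPy is available; the initialization and data are illustrative and unrelated to this project.

```python
import numpy as np

def em_two_gaussians(x, iterations=50):
    """x: 1-D NumPy array of observations."""
    mu = np.array([x.min(), x.max()])            # initial means
    sigma = np.array([x.std(), x.std()]) + 1e-6  # initial standard deviations
    pi = np.array([0.5, 0.5])                    # mixing weights
    for _ in range(iterations):
        # E step: responsibility of each component for each point
        dens = np.stack([
            pi[k] * np.exp(-0.5 * ((x - mu[k]) / sigma[k]) ** 2) / sigma[k]
            for k in range(2)
        ])
        resp = dens / (dens.sum(axis=0) + 1e-12)
        # M step: parameters that maximize the expected log likelihood
        nk = resp.sum(axis=1)
        mu = (resp * x).sum(axis=1) / nk
        sigma = np.sqrt((resp * (x - mu[:, None]) ** 2).sum(axis=1) / nk) + 1e-6
        pi = nk / len(x)
    return mu, sigma, pi
```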

Figure 2.3: A graphical illustration of the expectation maximization algorithm.

3 By Chire (own work), CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0), via Wikimedia Commons.


2.6 Compound Classifiers

Compound classification utilizes at least two distinct classifiers in order to make predictions [7]. The idea behind this is that utilizing different classification methods together can reduce the error rate of the system as a whole, by attempting to mask the impact of failures of individual classifiers with the successes of others. There are many different ways of doing this: voting ensembles let each classifier place a weighted vote on the predicted value for a given input and select the prediction with the most votes, while cascading classifiers run algorithms in series and produce one final prediction.
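A minimal sketch of weighted majority voting is shown below; the classifier objects and their predict method are assumed, illustrative interfaces, not part of this project.

```python
from collections import Counter

def vote(classifiers, sample, weights=None):
    """Return the class receiving the largest (weighted) number of votes."""
    weights = weights if weights is not None else [1.0] * len(classifiers)
    ballot = Counter()
    for clf, w in zip(classifiers, weights):
        ballot[clf.predict(sample)] += w   # each classifier casts one weighted vote
    return ballot.most_common(1)[0][0]
```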

2.7 Naïve Bayes

Naive Bayes is a type of supervised learning algorithm used for classification of data. It is based on Bayes’ theorem, which describes the relationship between the probabilities of A and B and their respective conditional probabilities. The reason that it is called naïve is that it inherently assumes that all feature variables are independent of one another, which is very seldom the case in reality [1].
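For reference, in the notation of the glossary Bayes’ theorem can be written as follows:

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
```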

Figure 2.4: The underlying principle of a naïve Bayesian classifier4.

4 Saed Sayad (own work), CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0).


Chapter 3

Method

3.1 Data Analysis

The initial data selection was conducted based on the existing human expertise of Ericsson employees experienced with crash dump analysis and TR routing. This was followed by empirical analysis of the selected sources in order to assess their respective independence as well as the type of data they contained. Methods used for analysis were counting unique entries associated with each data source and MHO, as well as adjacency matrices and plots used to indicate their viability in classification.

3.2 Feature Selection

Once data sources are selected, features must be selected from the data. This is one of the most important parts of the project. Features should minimize the loss of significant information whilst maximizing the removal of noise. Anything that is part of a given data source without providing added significance can be considered noise.

The selection of meaningful features was achieved through consultation with Ericsson employees together with empirical analysis: testing the accuracy of classifiers with different features and counting the number of unique entries produced by the given feature implementation. The viability of features was also evaluated based on their performance as data sources for individual classifiers.

3.3 Bayesian Methods

The selected machine learning method for this project is the Bayesian classifier.

The features and practical application of this method will be based on empirical analysis in combination with existing theory on the subject. The underlying idea of Bayesian statistics is utilizing the joint historical probability of a variety of features occurring for any given class. The statistical basis for Bayesian methods makes it a good choice for a proof of concept design for a variety of reasons. First of all, it is a widely used and proven method [7] and has been used for similar projects [9]. It is also rather simple to implement, which allows for more readily available testing.
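Concretely, under the naïve independence assumption such a classifier picks the class c that maximizes the prior probability of the class times the joint probability of the observed features f_1, ..., f_n; a generic statement of the decision rule (not specific to this project’s implementation) is:

```latex
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(f_i \mid c)
```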

3.4 Classification

The set of data utilized throughout this project contains twenty distinct classes, each represented by at least one crash dump. Implementing Bayesian statistics for this many classes is slightly atypical [7], but the statistical basis for the methods remains unchanged.

3.5 Viability of Disregarded Algorithms

Perceptrons and neural networks are viable alternatives to Bayesian classifiers for similar projects. However, they require more data than was available for this particular project to achieve viable performance [7]. Support vector machines have been implemented in similar projects [2] but their practical application is slightly less trivial than a Bayesian classifier, especially when there is not an abundance of data.

Compound classifiers are, in essence, several classifiers working symbiotically to make predictions and are, as such, more of a potential path for future development than an initial attempt.


Chapter 4

Implementation

This chapter describes the data utilized for the experiments as well as the various methods used in order to achieve the results of this project.

4.1 Data

This project uses dump file data generated by Ericsson’s base stations during a fatal error. These files are then run through internal software called CDA (crash dump analysis) which generates a structure of HTML documents. These documents contain a vast amount of data pertaining to the state of the system at the point of failure, not all of which will be utilized for learning. Only the most pertinent sections of data will be included in the feature vectors, as including more would raise the dimensionality beyond what is reasonable for the amount of training data available. There are 210 data sets available for training and validation, with 20 classes represented by at least one dump. The dump files used for training and testing are from cases that have already been solved and thus have known routes.

4.2 Data Sources

Features were extracted from a subset of the various HTML documents available from each dump. The subset was chosen based on recommendations from Ericsson employees familiar with TR routing and some rudimentary testing.

The following documents were chosen:

• LPPShowFatalError.html – A short human-readable error message describing the fatal error that caused the crash.

• StackTraceUnwind.html – A snapshot of the contents of each DSP stack at the point of failure. Each entry contains an address and a source code reference.


• LPPShowSem.html – A semaphore table where each entry contains a semaphore ID as well as the name of the program that had it reserved at the point of failure.

• LPPZipLog.html – A complete event list per DSP.

The content of each of these documents has been parsed and the data aggregated in order to make their respective contents comparable. Comparability and consistency of data were tested independently by plotting each feature in MatLab¹.

4.3 Feature Extraction

Feature extraction is primarily performed in order to reduce the dimensionality of the problem and to reduce the amount of noise. The dimensionality of the problem is simply the number of rows in the feature vector, or the number of features extracted per element of data. Every piece of data that does not correlate with classification is considered noise and should be removed so that it does not negatively impact performance or accuracy. In addition to selecting relevant features and extracting them from their sources, they must be processed into something interpretable by the chosen machine learning approach. Often this means reducing the data to a normalized numerical value.

One potentially relevant data source from LPPShowFatalError.html is the error message, which is displayed in the form of a human readable descriptive sentence. For the purpose of this type of analysis the actual meaning of the sentence has very little importance; instead the sentence must be reduced into representative data that clearly differentiates different messages from one another. One way of comparing strings quickly is calculating checksums, which results in a unique value for every unique string. The checksums can then be counted and assigned a numerical value; the trivial way of doing this is simply incrementing a counter by one for each new unique hash value. Each unique hash is then assigned a probability of occurrence per MHO.
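A minimal sketch of this idea is given below: each message is reduced to a checksum, checksums are mapped to incrementing integer IDs, and occurrence counts are kept per MHO. The function names, the use of CRC32 as the checksum, and the data layout are illustrative assumptions, not the project’s actual implementation.

```python
import zlib
from collections import defaultdict

message_ids = {}                                         # checksum -> incrementing ID
counts_per_mho = defaultdict(lambda: defaultdict(int))   # MHO -> ID -> count

def message_feature(error_message, mho):
    """Map an error message to a numeric ID and count its occurrence for an MHO."""
    checksum = zlib.crc32(error_message.encode("utf-8"))
    if checksum not in message_ids:
        message_ids[checksum] = len(message_ids)         # next free incrementing ID
    feature_id = message_ids[checksum]
    counts_per_mho[mho][feature_id] += 1
    return feature_id

def message_probabilities(mho):
    """Relative frequency of each message ID for the given MHO."""
    total = sum(counts_per_mho[mho].values())
    return {fid: n / total for fid, n in counts_per_mho[mho].items()}
```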

1 MatLab – Software for numerical analysis and graphical representation of data.


Figure 4.1: The amount of unique fatal error messages for each MHO.

Figure 4.2: An adjacency plot where a line between two nodes indicates at least one shared error message.

Some of the MHOs in figure 4.1, as well as in the following diagrams, will show zero unique values for certain features. This is due to the fact that not all logs are completely consistent in content, and sometimes parts of the logs contain corrupt data which is omitted during classification.

Each LPPShowFatalError.html file also contains an error code which is a kind of error ID. The code is a short hexadecimal entry that is typically not used for human classification unless two logs with the same error message are being compared. In that case the error code can be used to differentiate between the two. Much like with the error messages each unique code can be assigned an incrementing value where each code is given a numerical ID. Once each error code has been assigned an ID the probability of each error code occurring per MHO can be calculated.

Figure 4.3: The amount of unique fatal error codes for each MHO.


Figure 4.4: An adjacency plot where a line between two nodes indicates at least one shared error code.

As can be seen in figure 4.4, these error codes can be very powerful for identifying the source of an error, as the error codes are linked almost exclusively to one MHO. There is, however, a large number of unique codes, which means that, unless more data becomes available at a later stage, there is a significant probability that a code occurring during classification will not have been encountered during training.

Finally, each LPPShowFatalError.html contains the line of code at which the fatal error was triggered. This is presented in the form of a path and a line number. Testing showed that the specific line at which the code failed had little impact on the performance of the classifiers, so it was omitted when identifying unique error sources. The error sources were identified in the same manner as above, given an incrementing ID and a probability of occurrence per MHO.


Figure 4.5: The amount of unique fatal error sources for each MHO.

Figure 4.6: An adjacency plot where a line between two nodes indicates at least one shared error source.

StackTraceUnwind.html contains a per-processor list of active processes. Each entry has a process ID, an associated memory address, a processor ID and a path to the source code. Everything except the source path is intended to be assigned on a priority basis and is thus not a deterministic source of identification. Each unique source, with the specific line omitted, was given an incrementing ID and its frequency of occurrence per MHO was calculated.

Figure 4.7: The amount of unique stack trace entries for each MHO.


Figure 4.8: An adjacency plot where a line between two nodes indicates at least one shared stack trace entry.

As is made apparent by the adjacency plot in figure 4.8, every MHO shares at least one entry with every other MHO. Ideally, there should be as little correlation between features of different classes as possible, since these differences are what is being used for classification. Since the diagram above only shows whether there is a correlation, and not how strong it is, using these features may still be viable. However, this data is a strong indicator that their individual performance should be scrutinized further.

LPPShowSem.html contains a list of the semaphores reserved at the time of the fatal error. A semaphore is a construct used to reserve a portion of memory for a specific process, and the allocation of these can give rise to a variety of errors if done improperly. Examples of such occurrences are race conditions and deadlocks. Each row of the file is assigned an ID and its frequency of occurrence per MHO is calculated.


Figure 4.9: The amount of unique semaphore entries for each MHO.

Figure 4.10: An adjacency plot where a line between two nodes indicates at least one shared semaphore entry.

LPPZipLog.html contains a complete list of runtime events per processor. Each entry contains a processor ID, a process ID, a buffer number, the microsecond of occurrence and the event itself. Each event is assigned an incrementing ID and its frequency of occurrence per MHO is calculated.

Figure 4.11: The amount of unique zip log entries for each MHO.

Figure 4.12: An adjacency plot where a line between two nodes indicates at least one shared zip log entry.


4.4 Feature Selection

Based on the figures in this section and the results of the independent classifiers, the human readable error messages and semaphores were omitted from the final (combined) classifier, as they only served to decrease its accuracy under the existing circumstances. This means that the combined classifier utilized the following features in order to make predictions: zip log entries, stack trace entries, error sources and error codes.

4.5 Remarks on Feature Extraction

The selection of features is heavily based on recommendations from personnel with experience of manual classification and extensive knowledge about their content and origin. In addition to this, features were tested individually for prediction accuracy, entropy and MHO adjacency. These tests, conducted by counting unique entries per MHO and feature and drawing adjacency plots, were the basis for deciding how to best reduce noise and select representative data. Noise is data that contains no significant information about which class the dump is from and thus dilutes the quality of information. The adjacency plots show that almost all data sources have significant overlap between different classes; there is, however, a difference in frequency of occurrence, as well as features that are distinct in nearly every case.

4.6 Deviations from Bayesian Methods

The traditional application of Bayesian statistics would be to treat each unique entry in the zip log, stack trace and semaphore files as an independent feature. However, this would make for very large feature vectors in relation to the amount of data, and testing showed a significant drop in performance utilizing this approach compared to utilizing aggregated features. Instead of treating each entry independently, an average data file is constructed for each MHO, which is then compared to that of the unclassified crash dump. If more data were to become available this aspect could be revisited in order to reevaluate its performance.

The average data file is created by counting the occurrence of each type of entry per MHO, and subsequently dividing these counters by the number of times that particular type of MHO was encountered during training. The segment of the feature vector that represents each of these data sources is thus a series of counters, one per unique type of entry that has been encountered during training.

Additionally, some features encountered during classification will not have been seen during training, which would mean that their historical probability of occurrence is 0%. Since the number of unique entries is large in relation to the dataset, these cases are reassigned a very low positive probability (0.005%). This ensures that viable results are not omitted due to a small portion of data being new to the program.
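The sketch below illustrates the aggregated approach described above: an average entry profile is built per MHO during training, entries never seen in training receive the small floor probability, and an unclassified dump is scored against each profile. The names, the log-probability scoring, and the data layout are assumptions made for illustration; the report does not specify the exact comparison used.

```python
import math
from collections import Counter, defaultdict

FLOOR = 0.00005   # 0.005 % floor for entries never seen during training

def build_profiles(training_dumps):
    """training_dumps: iterable of (mho, list_of_entry_ids) pairs."""
    entry_counts = defaultdict(Counter)
    dump_counts = Counter()
    for mho, entries in training_dumps:
        dump_counts[mho] += 1
        entry_counts[mho].update(entries)
    # divide each counter by the number of dumps seen for that MHO (section 4.7)
    return {
        mho: {e: n / dump_counts[mho] for e, n in counts.items()}
        for mho, counts in entry_counts.items()
    }

def classify(entries, profiles):
    """Score an unclassified dump against every MHO profile and return the best."""
    scores = {
        mho: sum(math.log(profile.get(e, FLOOR)) for e in entries)
        for mho, profile in profiles.items()
    }
    return max(scores, key=scores.get)
```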


4.7 Data Processing

The frequency of occurrence for each unique entry is calculated per MHO. When all data in the training set has been processed, the count per entry is divided by the number of MHOs of that type. This results in a probabilistic distribution of entries per MHO.


Chapter 5

Results

The results in this section were generated by randomly dividing the available data into two distinct subgroups based on a predetermined ratio of 1:3, one subgroup for validating the results and one for training the classifier. The randomization can be performed on demand but is not changed between runs, and comparisons between classifiers are always based on their results on the same test data. There was a total of 210 available dump hierarchies, which were split into 78% training data and 22% validation data. The discrepancy between the desired ratio and the actual ratio is due to a portion of dumps being corrupt; these were disregarded upon detection. The answers to the training data are known to the classifiers whilst the answers to the validation data are unknown. Performance is gauged by comparing the predictions of the classifier to the correct answers.
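A minimal sketch of such a fixed random split is shown below, assuming NumPy is available; the seed and the 75/25 ratio are illustrative stand-ins for the procedure described above.

```python
import numpy as np

def split_dumps(dumps, train_ratio=0.75, seed=0):
    """Shuffle once with a fixed seed so the same split is reused between runs."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(dumps))
    cut = int(train_ratio * len(dumps))
    train = [dumps[i] for i in order[:cut]]
    validation = [dumps[i] for i in order[cut:]]
    return train, validation
```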

Three of the twenty MHOs were omitted from the tests due to insufficient data (fewer than three crash dumps per MHO), meaning that there was a total of seventeen classes during the tests. The random baseline is assumed to be 1/17 (approx. 5.9%) and the human baseline of 77% prediction accuracy is based on anecdotal evidence from Ericsson’s own internal research.

Figure 5.1: A simple diagram depicting the ratio of training and validation data used for obtaining the results of this project.


5.1 Independent Results of Classifiers

This section shows the results of running Bayesian classification with only one type of data source. Each diagram has one entry per MHO in the validation set. The values are either one, which indicates a correct classification, or zero, which indicates an incorrect classification.

Figure 5.2: Accuracy of predictions per MHO in the validation set based on the human readable error messages.

Figure 5.3: Accuracy of predictions per MHO in the validation set based on the error codes.


Figure 5.4: Accuracy of predictions per MHO in the validation set based on the error sources.

Figure 5.5: Accuracy of predictions per MHO in the validation set based on the contents of the semaphore lists.


Figure 5.6: Accuracy of predictions per MHO in the validation set based on the contents of the zip logs.

Figure 5.7: Accuracy of predictions per MHO in the validation set based on the contents of the stack traces.

5.2 Combined Results and Comparisons

The following section contains comparisons between the accuracy and predictions of the respective classifiers as well as the results of the final Bayesian classifier utilizing the combination of stack traces, zip logs, error sources and error codes.


Figure 5.8: Accuracy of predictions per MHO in the validation set for the combined classifier.

Figure 5.9: The prediction accuracy of each classifier.


Figure 5.10: The hits and misses of each respective method of classification per dump in the validation set.

Removing all entries that existed for more than one MHO was attempted as a result of the findings during the data analysis phase. This attempt showed that classifiers with sparse adjacency matrices performed nearly identically with exclusive features, whilst those with dense adjacency matrices showed a significant decrease in prediction accuracy.

Figure 5.11: The prediction accuracy of each classifier with exclusive features.

5.3 Comments on the Results

The resulting Bayesian classifier had a 59% prediction accuracy on the validation set for its first guess and an additional 16% on its second guess. The best performing individual classifiers were those based on stack trace entries (30%), zip log entries (38%), error codes (35%), and error sources (30%). The individual classifiers with the poorest performance were those based on semaphore entries (8.1%) and human readable messages (2.7%).


Chapter 6

Conclusion

6.1 Discussion

The methods implemented during the course of this project show great potential and exceeded the initial goal of better than random accuracy. In fact, the final classifier’s 59% prediction accuracy is approaching human levels of accuracy. However, given the small size of the dataset, it is difficult to predict exactly how well this accuracy will hold up over time. The best performance was achieved by utilizing error sources, error codes, zip logs and stack traces as data sources. Adding either the human readable error messages or the semaphores yielded a measurable decrease in performance. This does not come as a surprise given their poor individual performance illustrated in the results section above.

Some of the individual classifiers performed very well by themselves and it is possible that further improvements could be made through further research into different methods of utilizing their contents. Furthermore, a portion of the log files remain unexplored as a result of the time constraints of the project and lack of insight into their content. A more exhaustive study may try to utilize some of this content to further improve prediction accuracy.

The results indicate that the human readable messages are, at least from a machine learning perspective, extremely poor data sources for classification. This is strange as they are intended to be meaningful to human classifiers. This result may, on its own, impact the way these error messages are written and how much trust they are given upon crash dump analysis.

6.2 Conclusion

The Bayesian classifier implemented as a result of this project far surpassed the random baseline of 5.9%, but was unable to reach human levels of accuracy. The limited amount of data is likely to have negatively impacted the prediction accuracy for certain classes where only a few historical examples were available for training.

Despite this, the results of this project clearly show that machine learning techniques are a feasible alternative to human processing of trouble reports and may be able to replace or aid human classification of trouble reports in the very near future.

6.3 Recommendations

This particular approach has not yet reached human levels of prediction accuracy, but it could be used as an aid to point out probable error sources which can then be analyzed by a human. In order to increase accuracy, and for this approach to eventually reach human levels of prediction accuracy, more historical data should be stored and utilized for training.

As is indicated by this project, machine learning is indeed a viable approach for TR routing and could, in all likelihood, be improved further by implementing more advanced classifiers.

6.4 Future Work

Increasing the amount of training data would enable the addition of more features, which may increase the prediction accuracy. It would also improve the accuracy of existing methods by providing a larger base of data for each class. The methods presented in this project could also be implemented as part of a compound classifier utilizing several different machine learning methods; such an approach may yield a more robust classifier with greater accuracy. Finally, the methods and resulting program could be used as a supplement to human classification, pointing out probable routes which could then be evaluated by a human, thus potentially reducing the burden on employees responsible for TR routing.


Bibliography

[1] Marsland, Stephen. Machine Learning: An Algorithmic Perspective. Boca Raton: Chapman & Hall/CRC, 2009. Print.

[2] Fronza, Ilenia, Alberto Sillitti, Giancarlo Succi, Mikko Terho, and Jelena Vlasenko. "Failure Prediction Based on Log Files Using Random Indexing and Support Vector Machines." The Journal of Systems and Software 86 (2013): 2-11. Web.

[3] Zheng, Alice Xiaozhou. Statistical Software Debugging. Diss. (Ph.D. Engineering–Electrical Engineering and Computer Sciences)–University of California, Berkeley. Berkeley: University of California, 2005. Print.

[4] Noorian, Mahdi, Ebrahim Bagheri, and Weichang Du. "Machine Learning-based Software Testing: Towards a Classification Framework." University of New Brunswick, Fredericton, Canada. Print.

[5] Roychowdhury, Shounak. "Ensemble of Feature Selectors for Software Fault Localization." IEEE International Conference on Systems, Man, and Cybernetics, COEX, Seoul, Korea (2012). Department of Electrical and Computer Engineering, The University of Texas at Austin. Web.

[6] Rish, Irina. An Empirical Study of the Naive Bayes Classifier. Tech. no. RC 22230. Yorktown Heights: IBM Research Division, 2001. Print.

[7] Bishop, Christopher M. Pattern Recognition and Machine Learning. New York: Springer, 2006. Print.

[8] Abu-Mostafa, Yaser S., Malik Magdon-Ismail, and Hsuan-Tien Lin. Learning from Data: A Short Course. United States: AMLBook.com, 2012. Print.

[9] Liu, Chao, Zeng Lian, and Jiawei Han. How Bayesians Debug. University of Illinois-UC, Brigham Young University. Print.

[10] Xu, Wei, Huang, Ling, Fox, Armando, Patterson, David, and Jordan, Michael. Experience Mining Google’s Production Console Logs. University of California at Berkeley, Intel Labs Berkeley. Print.

[11] Li, Weixi. "Automatic Log Analysis Using Machine Learning." Thesis. Uppsala Universitet, 2013. Print.

