
Modeling Trouble Ticket Resolution Time Using Machine Learning

Master Thesis in Statistics & Machine Learning

Asad Enver


Abstract

Telecommunication companies store huge volumes of data in their databases in the form of trouble tickets created from connectivity-related issues. A trouble ticket is an electronic record that contains information about a network disturbance. Whenever there is a network outage, customers face interruptions in the company's services and do not know how long it will take before their issue is resolved. This leads to uncertainty and dissatisfaction on the customer's part. The data is so voluminous, and stored in such a raw, semi-structured format, that manual analysis is not feasible. Hence, there is a need to develop fast, robust, efficient and easy-to-implement models that can automate the handling of trouble tickets.

This thesis work, conducted at Telenor Sweden, aims to build a model that accurately predicts the resolution time of Priority 4 trouble tickets (the tickets that are generated most often, i.e. in the highest volumes per month). It explores and investigates the possibility of applying machine learning and deep learning techniques to trouble ticket data to find an optimal solution that performs better than the current method in place (explained in Section 3.5). The model would be used by Telenor to inform end users of when the networks team expects to resolve the issues affecting them.


Acknowledgements

I would like to take this opportunity to thank Telenor for giving me the opportunity to carry out this thesis. I am extremely grateful to my supervisor at Telenor, Jaume Rius, for guiding, motivating and helping me throughout the duration of this thesis through weekly reviews and discussions. I would also like to extend my thanks to the entire Network Operation Center (NOC) team for always being very responsive and taking time out of their schedules to assist me whenever I needed their support.

I would also like to express my deepest gratitude to my supervisor at Linköping University, Filip Ekström, whose advice and guidance proved extremely valuable in helping me carry out my thesis.


Contents

Abstract
Acknowledgements
1. Introduction
   1.1 Problem Statement and Research Question
   1.2 Scope & Limitations
   1.3 Related Work
2. Theory
   2.1 Machine Learning basics
       2.1.1 Regression
       2.1.2 Bias-Variance tradeoff
       2.1.3 Regularization
   2.2 Handling Missing Data
       2.2.1 Bayesian Ridge Regression
       2.2.2 Extra Trees Regressor
       2.2.3 K Neighbors Regressor
   2.3 Feature Importance & Selection
       2.3.1 Sequential Forward Selection (SFS)
       2.3.2 Sequential Backward Selection (SBS)
   2.4 XGBoost: Gradient Boosted Decision Trees
       2.4.1 Regularisation of Ensemble Model
   2.5 Artificial Neural Network
       2.5.1 Architecture
       2.5.2 Activation Functions
       2.5.3 Backpropagation
   2.6 Evaluation Methods
       2.6.1 Mean Absolute Error (MAE)
       2.6.2 Root Mean Square Error (RMSE)
3. Method
   3.1 Dataset
   3.2 Data Cleaning & Preprocessing
       3.2.1 Removing Outliers
       3.2.2 Imputing Numeric Values
       3.2.3 Imputing Categorical Values
       3.2.4 One-hot encoding
       3.2.5 Geospatial features
   3.3 Dimensionality Reduction
   3.4 Feature Importance & Selection
   3.5 Baseline
   3.6 Model Training
       3.6.1 Train-Validate-Test Split
       3.6.2 Hyperparameter Optimization
       3.6.3 Training XGBoost
       3.6.4 Training Neural Network
       3.6.5 Implementation Details
4. Results
   4.1 Imputation Techniques for Model Selection
   4.2 Model Performance without hyperparameter optimization
   4.3 Model Performance with hyperparameter optimization
   4.4 Feature Selection
       4.4.1 Sequential Forward Selection
       4.4.2 Sequential Backward Selection
   4.5 Optimal Model
5. Discussion & Conclusion
   5.1 Discussion
   5.2 Future Work
   5.3 Conclusion


1. Introduction

The Network Operation Center (NOC) at a telecommunications company is responsible for the surveillance and monitoring of networks to keep the communication flow up and running. The network architecture includes hundreds of hardware and software components that detect abnormal behavior in the systems. The agents at the NOC constantly monitor the network, analyze incoming alarms and tickets, and schedule maintenance and error-correction activities into the time schedules of field technicians so that day-to-day problems are solved quickly and efficiently. Incident monitoring and management is typically done in a trouble ticket system (TTS), which tracks and manages all incidents. When a network outage occurs or any kind of anomaly is detected in the system, trouble tickets containing information about the affected network are generated automatically, and agents are informed about the disturbance. The ticket moves through multiple stages where agents work in parallel and log their actions in the corresponding ticket [1]. When the issue is resolved, the ticket is classified as 'Solved'. A solved ticket means that the issue has been resolved from the company's end and now awaits the approval of the end customer. Upon receiving positive feedback from the customer, the ticket is archived and labeled as 'Closed'.

The underlying motivation for this master thesis is to investigate and explore whether the solved tickets from previous years contain correlations and hidden patterns between ticket attributes and the resolution time that could be mined using machine learning to expedite the troubleshooting process, optimize the allocation of resources within the NOC and improve the workflow [2]. The business outcome would be improved customer satisfaction and decreased customer churn.


1.1 Problem Statement and Research Question

The goal is to first understand the life cycle of a trouble ticket in order to identify the key attributes that have the greatest impact on the resolution time [2]. The next step is to understand the systems from which the data will be extracted. Some features, like weather data and electrical disturbances, have to be fetched from external third-party sources. Once the data is extracted and transformed, the thesis will investigate suitable algorithms to accurately predict the resolution time of trouble tickets. Some data is present as free-form text, which requires applying text mining and Natural Language Processing techniques to extract relevant information from such fields.

The fundamental question that this thesis aims to address is:

“Based on past historical data about trouble tickets, can we leverage machine learning, deep learning and NLP algorithms to predict the trouble ticket resolution time (TTR) of network disturbances in order to improve customer experience?”

The research question could be further broken down into two main areas:

• Identifying the most important attributes of a trouble ticket. This involves feature selection as well as domain knowledge about the telecom industry to be able to extract the relevant features

• Identifying the most relevant regression models that would result in the highest accuracy and predictive performance. Since we are going to predict the resolution time, which is a continuous variable, we will experiment with different regression models, from traditional machine learning techniques to more complex algorithms, including neural networks, to find out which model is best suited to our purpose.

1.2 Scope & Limitations

The thesis is limited to building a regression model that learns from past historical data and makes predictions about trouble ticket resolution time using only the features that are available at the time of ticket creation. The problem could also be modeled using classification techniques and a comparison drawn between regression and classification results, but no such attempt is made here. The root cause is unknown at the time of ticket creation and is only entered into the system once the ticket moves further in its journey. Hence, no attempt is made to study the root cause, although it could serve as an extremely useful feature, as tickets with a similar root cause tend to have similar resolution times. Trouble tickets are grouped into five categories, ranging from 1 to 5, depending on their severity level and the number of customers impacted. In this thesis, only Priority 4 trouble ticket data is analyzed, as these tickets are generated in the highest volume. Priority 4 tickets have a standard Service Level Agreement (a service-level agreement is a commitment between a service provider and a client about particular aspects of the service, such as quality, availability, and responsibilities) of 8 hours (480 minutes).

Moreover, the company's internal systems from which the data was extracted and the company's processes will not be described in detail. The sample data set included is for illustration purposes only and does not contain any actual values.


1.3 Related Work

There have been a few studies conducted and some research papers written about using machine learning to address the challenges faced by personnel working at the Network Operation Center (NOC). However, most studies have focused on predicting the root cause of the trouble ticket and hence the nature of the problem has been that of classification. Earlier, rule-based learning or expert systems were used to diagnose the trouble tickets automatically [3].

If a ticket is generated automatically, the information is often present in the form of free text, to which machine learning cannot be applied directly since the models can only take numeric or categorical data as input. Temprado et al. used classification as well as text mining techniques such as stemmer algorithms, entity-relationship models, frequency recount, and stop lists to extract information from free text fields to predict whether a technician was needed on-site to resolve the issue and whether a ticket would face escalation during its lifetime [4]. Symonenko et al. evaluated the tickets by analyzing them manually with n-gram analysis and leveraging contextual mining. The resulting model was able to categorize each ticket into its respective root cause category with a 1.4% error rate [5]. Medem et al. built Trouble Miner to classify trouble tickets according to their root cause. The findings revealed that most tickets are generated due to disturbances in network cables and routers [6].

Kenneth tried to predict resolution times by building regression and classification models. Fields that contained text data were removed, since the text was entered by a human every time. In this thesis there is also a text field, but the text is generated by a machine rather than a human; hence, the text field is used to extract useful information. With classification, the resolution time was bucketed into three classes and the resulting model achieved an accuracy of about 74.5%, whereas for regression the artificial neural network achieved the lowest MAE of 24.8 hours [7].

Löfgren applied data mining and machine learning techniques to predict the root cause of network problems, which could be used to give recommendations to engineers and hence save time in the troubleshooting process. The model could predict the root cause with an accuracy of up to 90% for the most common root cause and 70% when classifying between up to 20 root causes [8].


2. Theory

This chapter gives an overview of the theory related to the machine learning algorithms that have been used and also presents the necessary terminology and formulas. A brief introduction of machine learning is presented, followed by a more detailed discussion of the specific methods and an explanation of the evaluation methods used to compare the performance of models.

2.1 Machine Learning basics

Machine learning is a branch of computer science that enables computers to learn complex and hidden patterns from data without being explicitly programmed. It can be further broken down into supervised, unsupervised and reinforcement learning. This thesis, however, focuses only on supervised machine learning methods.

Supervised learning is a type of learning that uses labeled data to tell the computer what to find. We have ground-truth knowledge in the form of input variables (𝑥) and output/target variable (𝑦). The goal for supervised learning is to learn a mapping function from input 𝑥 to output 𝑦.

$$y = f(x) \tag{1}$$

Instead of trying to create the function $f(x)$ manually, supervised machine learning algorithms learn from the training data set to infer the function automatically. The objective is to approximate the mapping function so well that whenever new input data is available, the output can be predicted accurately [8].


Supervised learning can be further broken down into:

• Regression: a regression problem is when the target variable is a real or continuous value such as temperature

• Classification: a classification problem is when the target variable is a category such as gender (male or female)

2.1.1 Regression

Regression uses a set of independent variables 𝑋 to estimate one or more dependent variables 𝑌 . The independent variables serve as an input to the regression model and the dependent variable is the output.

The regression model $f(X)$ uses past historical data comprising observations of both the independent variables and one dependent variable, where $x_i \in X$ and $y_i \in Y$ form a pair $(x_i, y_i)$. The model predicts the dependent variable $\hat{Y}$ using the set of independent variables:

$$\hat{y}_i = E(y_i \mid x_i) = f(x_i) \tag{2}$$

For each observation, the prediction error $\hat{e}_i$ is given by:

$$\hat{e}_i = \hat{y}_i - y_i = f(x_i) - y_i \tag{3}$$

The goal is to minimize the prediction errors. Suppose there are N observations in the data set, then the total prediction error 𝐸 can be written as:

$$E = \frac{1}{N}\sum_{i=1}^{N} |\hat{e}_i| \tag{4} \qquad \text{or} \qquad E = \frac{1}{N}\sum_{i=1}^{N} \hat{e}_i^{\,2} \tag{5}$$

Hence, the optimal regression model is the one that minimizes this error:

$$\min_{f} \sum_{i=1}^{N} (f(x_i) - y_i)^2 \tag{6}$$

The above methodology is known as the method of least squares [9]. There are different techniques for building regression models; the ones used in this thesis are gradient boosted regression trees and artificial neural networks.

2.1.2 Bias-Variance tradeoff

In machine learning, a good model is one that has a low bias and a low variance. Bias is a measure of how close the central tendency of a learner is to the true function 𝑓. If the learner learns the true function 𝑓 over the training set 𝑆, then the learner is unbiased [10]. For some 𝑥 ∼ 𝐷, the bias is given by:

$$\mathrm{Bias}(h_S) = E_S[h_S(x)] - f(x) \tag{7}$$

A model with high bias oversimplifies the underlying relationship, which means it cannot make accurate predictions on either the training data or the test data. This is known as under-fitting. Conversely, a model with very low bias typically fits an overly complex function: it makes accurate predictions on the training data but fails to generalize well to the test data. This is known as over-fitting.

Variance measures the fluctuations of a learner around its central tendency. The fluctuations are a result of the different sampling of the training set [10]. For some 𝑥 ∼ 𝐷, the variance is:

$$\mathrm{Var}(h_S) = E_S[(h_S(x) - E_S[h_S(x)])^2] \tag{8}$$

A model with high variance has memorized what is in the training data but does not generalize well to test data. In practice, a model cannot simultaneously achieve arbitrarily low bias and low variance: decreasing one tends to increase the other. This is known as the bias-variance trade-off [8]. Therefore, the right balance needs to be struck between bias and variance to prevent the model from over-fitting, so that it generalizes well to unseen data.

2.1.3 Regularization

Regularization in machine learning refers to techniques that prevent a model from over-fitting. They achieve this by adding a penalty term to the objective function that puts a constraint on the estimated coefficients. This helps to reduce the variance of the model.

The two most common regularization techniques are $L_1$ regularization and $L_2$ regularization. In $L_2$ regularization, large weight values are penalized through the introduction of the term $\lambda \|w\|^2$. The $\lambda$ term adjusts the trade-off between keeping the weight values small and minimizing the training loss. This penalty constrains the size of the coefficients, such that the only way a coefficient can increase is if there is a comparable decrease in the sum of squared errors (SSE). In $L_1$ regularization, instead of penalizing the square of the weight values, their absolute values are penalized [8]. $L_1$ regularization is a common feature selection technique, since it pushes the weight coefficients responsible for high variance exactly to zero, whereas in $L_2$ regularization they are only shrunk but never become zero.
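To make the contrast concrete, here is a minimal scikit-learn sketch (illustrative only, not code from the thesis): Lasso ($L_1$) drives the coefficients of uninformative features exactly to zero, while Ridge ($L_2$) only shrinks them.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only 3 of the 10 features are informative.
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

ridge = Ridge(alpha=10.0).fit(X, y)   # L2: shrinks coefficients toward zero
lasso = Lasso(alpha=10.0).fit(X, y)   # L1: sets some coefficients exactly to zero

print(np.sum(ridge.coef_ == 0))  # typically 0: nothing is exactly zero
print(np.sum(lasso.coef_ == 0))  # several exact zeros: implicit feature selection
```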

2.2 Handling Missing Data

One of the major obstacles in building efficient statistical models arises when the data contains many missing values. The amount of missing data has a significant impact on model performance and can result in biased predictions [14].

The easiest strategy is to drop rows that have any missing values. However, the drawback is losing information that could be valuable for the model to learn from. Moreover, if some critical information is not accounted for, the model will not be very useful for solving real-world business problems.

The second approach is to use imputation algorithms. One type of imputation algorithm is univariate: it imputes values in the i-th feature dimension using only the non-missing values in that same dimension, for example by replacing missing values with the mean, mode or median of that feature. However, univariate techniques do not always give a true representation of the missing data. By contrast, multivariate imputation algorithms use the entire set of available feature dimensions to estimate the missing values. One technique that has become quite popular is to leverage an iterative imputation model. In iterative imputation, each feature is modeled as a function of the other features, and the features are imputed sequentially, one after the other; values imputed earlier are then used by the model when predicting the missing values of subsequent features [15]. By default, the feature with the fewest missing values is imputed first and the one with the most missing values last.
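As an illustration of this mechanism, here is a minimal sketch using scikit-learn's IterativeImputer (the toy data is made up; this is not the thesis code):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [5.0, 6.0],
              [np.nan, 8.0]])

# Each feature with missing entries is modeled as a function of the other
# features; the imputation is repeated for max_iter rounds.
imputer = IterativeImputer(estimator=BayesianRidge(), max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```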

Three techniques were used to impute the numeric missing values. They are briefly discussed below to give the reader a high-level overview of how the algorithms work.

2.2.1 Bayesian Ridge Regression

Bayesian regression techniques tune the regularization parameters by introducing uninformative priors over the hyperparameters of the model. The $L_2$ regularization used in ridge regression is equivalent to finding a maximum a posteriori estimate under a Gaussian prior over the coefficients $w$ with precision $\lambda^{-1}$. $\lambda$ and $\alpha$ are treated as random variables that are estimated from the data.

The output 𝑦 is assumed to have a Gaussian distribution around 𝑋𝑤:

$$p(y \mid X, w, \alpha) = \mathcal{N}(y \mid Xw, \alpha) \tag{9}$$

In Bayesian ridge regression, the prior for the coefficients $w$ is given by a spherical Gaussian distribution:

$$p(w \mid \lambda) = \mathcal{N}(w \mid 0, \lambda^{-1} I_p) \tag{10}$$

𝑤, 𝛼 and 𝜆 are estimated jointly when the model is fit and the parameters 𝛼 and 𝜆 are estimated by maximizing the log marginal likelihood [11].

2.2.2 Extra Trees Regressor

Extra Trees Regressor is an ensemble of randomized decision trees. A decision tree is explained in Section 2.4 below; a single decision tree exhibits high variance, which leads to overfitting, so the goal here is to introduce randomness in order to decrease the variance of the estimators. At each split, a random subset of candidate features is selected, and the split thresholds are also drawn at random for each candidate feature; the best of these random thresholds is then chosen as the splitting rule. The Extra Trees Regressor combines several diverse trees and averages their predictions, lowering the variance at the cost of an increase in bias [12]. However, the reduction in variance is usually more significant than the increase in bias, which results in an overall better model.

2.2.3 K Neighbors Regressor

The K Neighbors Regressor identifies the nearest neighbors of the observation with the missing value and averages these nearby points to fill in the value. The result depends on two factors: the choice of k (the number of neighbors to consider) and how the distance between observations is measured. K is an integer specified by the user. The most naive implementation involves brute-force computation of distances between all pairs of points in the dataset: for $N$ samples in $D$ dimensions, this approach scales as $O(DN^2)$ [13].


2.3 Feature Importance & Selection

Given a set of $n$ features, feature selection selects a subset of $d$ features ($d < n$). Feature selection is performed to improve computational efficiency and to reduce the generalization error of the model by removing irrelevant features and noise [18].

Feature selection can be done using either filter or wrapper methods. Filter methods find a subset of features that have a high correlation with the target variable. The problem with filter methods is that they may return redundant features, since they do not take into account the correlations among the features themselves. Wrapper methods overcome this problem by taking into account the correlations among the features in the subset [16].

The Sequential Feature Selector, available in the scikit-learn library, was used to perform feature selection. It adds or removes features from a candidate subset while evaluating an objective function. It has two variants: forward selection and backward selection [19].

2.3.1 Sequential Forward Selection (SFS)

• The algorithm starts with a null model and, using an objective function, finds the single best feature and adds it to the model

• Using the remaining features, the next best feature is selected to form a pair of features

• Next, triplets of features are formed using one of the remaining features and these two best features, and the best triplet is selected.


• This process continues until the desired number of features, as specified by the user, has been selected

Mathematically, it can be described as:

1. Start with the empty set $Y_0 = \emptyset$

2. Select the next best feature $x^+ = \underset{x \notin Y_k}{\arg\max}\; J(Y_k + x)$

3. Update $Y_{k+1} = Y_k + x^+$; $k = k + 1$

4. Go to step 2 and repeat until the desired number of features has been added

2.3.2 Sequential Backward Selection (SBS)

• The algorithm computes the objective function for all 𝑛 features

• Each feature is thereafter deleted one at a time, the objective function is calculated for all subsets with n-1 features and the worst feature is removed

• Next, each feature among the remaining n-1 is deleted one at a time, and the worst feature is discarded to form a subset with n-2 features

• This process is repeated until the desired number of features, as specified by the user, is left

Mathematically, it can be described as:

1. Start with the full set $Y_0 = X$

2. Remove the worst feature $x^- = \underset{x \in Y_k}{\arg\max}\; J(Y_k - x)$

3. Update $Y_{k+1} = Y_k - x^-$; $k = k + 1$

4. Go to step 2 and repeat until the desired number of features has been removed
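Both procedures are available in scikit-learn's SequentialFeatureSelector, which is used later in this thesis. A minimal sketch on synthetic data (illustrative only):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=20, random_state=0)

# Forward selection: start from the empty set and greedily add features.
sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=5,
                                direction="forward").fit(X, y)

# Backward selection: start from the full set and greedily remove features.
sbs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=5,
                                direction="backward").fit(X, y)

print(sfs.get_support())  # boolean mask of the features kept
print(sbs.get_support())
```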

2.4 XGBoost: Gradient Boosted Decision Trees

Among the machine learning methods used in practice, gradient tree boosting is one technique that shines in many applications; it is highly successful in reducing both the bias and the variance of the model [17].

A decision tree divides the training data into distinct, non-overlapping regions; each sample belongs to exactly one region. A greedy algorithm known as binary recursive splitting is used to find these non-overlapping regions. The algorithm selects the feature that leads to the best split (highest information gain) at each node. Once the best split and feature are identified, all samples whose value for the selected feature is less than or equal to the split point are assigned to the left branch, whereas all samples whose value is greater end up in the right branch of the tree.

In boosting, the trees are built sequentially such that each subsequent tree learns from its predecessors and updates the residual errors. Hence, the tree that grows next in the sequence will learn from an updated version of the residuals. The base learners in boosting are weak learners with high bias. Each of these weak learners contributes some vital information for prediction. Hence, this enables the boosting technique to produce a strong learner by effectively combining the predictions of the weak learners. Thus, the final strong learner brings down both the bias and the variance of the model [15].


2.4.1 Regularisation of Ensemble Model

For a given data set $D$ with $n$ observations and $m$ features, where $D = \{(x_i, y_i)\}$, a tree ensemble model uses $K$ additive functions to predict the output $y_i$, where $F$ is the space of regression trees and $\phi$ represents the ensemble that uses $x_i$ to predict $y_i$:

$$\hat{y}_i = \phi(x_i) = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F \tag{11}$$

Each $f_k$ corresponds to an independent tree structure $q$, where $T$ represents the number of leaves and $w$ the leaf weights. Each leaf weight contains a continuous score, and the final prediction is calculated by summing the scores of the leaves a sample falls into. The regularized loss function to be minimized is:

$$L(\phi) = \sum_i l(\hat{y}_i, y_i) + \sum_k \Omega(f_k), \quad \text{where} \tag{12}$$

$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \|\omega\|^2 \tag{13}$$

$L(\phi)$ is the loss function that measures the difference between the prediction $\hat{y}_i$ and the target $y_i$. The second term $\Omega$ penalizes the complexity of the model with regularization parameters $\gamma$ and $\lambda$ to avoid over-fitting [20].

During training, the XGBoost model is trained additively, meaning one tree is optimized at a time. Let $\hat{y}_i^{(t)}$ be the prediction value at iteration $t$; the additive procedure is:

$$\hat{y}_i^{(0)} = 0$$
$$\hat{y}_i^{(1)} = f_1(x_i) = \hat{y}_i^{(0)} + f_1(x_i)$$
$$\hat{y}_i^{(2)} = f_1(x_i) + f_2(x_i) = \hat{y}_i^{(1)} + f_2(x_i)$$
$$\dots$$
$$\hat{y}_i^{(t)} = \sum_{k=1}^{t} f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i)$$

The loss function can be rewritten as:

𝐿(𝑡) = 𝑛 ∑

𝑖=1

𝑙(𝑦𝑖, ̂𝑦𝑖𝑡−1+ 𝑓𝑡(𝑥𝑖)) + Ω(𝑓𝑡) (14)

A second-order approximation can be used to further simplify the objective function and then determine how good a tree structure is [21].
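As a concrete illustration of the additive ensemble described above, here is a minimal sketch using the xgboost library (the data and parameter values are illustrative, not those used in the thesis):

```python
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X, y = rng.random((200, 5)), rng.random(200)

# n_estimators is the number K of additive trees; gamma and reg_lambda
# correspond to the penalty terms gamma*T and lambda*||w||^2 in equation (13).
model = XGBRegressor(n_estimators=100, learning_rate=0.1, max_depth=4,
                     gamma=1.0, reg_lambda=1.0)
model.fit(X, y)
predictions = model.predict(X)
```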

2.5 Artificial Neural Network

The information present in a trouble ticket is continuously updated. Since data is continuously being added, static algorithms would not prove very effective. Neural networks can model the continuously growing data in trouble ticket systems well and, thanks to their capacity for batch training, can yield better results [21].


2.5.1 Architecture

A neural network can be thought of as a composed chain of functions $f(x) = f^{(4)}(f^{(3)}(f^{(2)}(f^{(1)}(x))))$, where the individual functions are called layers. Each layer has two parts: a linear part and a non-linear part. Combining the two results in a very robust function approximator that is capable of learning complex structures and patterns hidden in the data. Each layer consists of neurons, and each neuron has an associated weight. The weights are learned during the training process to drive the output $f(x)$ close to the real output/ground truth $f^*(x)$ [8]. Any neural network with more than one hidden layer is considered a deep neural network.

Figure 1: Illustration of a simple feedforward fully connected neural network with an input layer, three hidden layers of 4 neurons each, and an output layer. The input x is fed into the network from the left and is transformed into the output y.


The computation can be described by the following set of equations:

$$a_1 = g(W_1 x + b_1) \tag{15}$$
$$a_2 = g(W_2 a_1 + b_2) \tag{16}$$
$$a_3 = g(W_3 a_2 + b_3) \tag{17}$$
$$y = g(W_4 a_3 + b_4) \tag{18}$$

$x$ is the input feature vector, $g$ is the activation function (it need not be the same for each layer), $W_i$ are the weight matrices, $b_i$ are the bias terms and $a_i$ are the activation vectors corresponding to the outputs of the respective layers. The output is represented by $y$.

Each layer takes the output of the previous layer as input, multiplies it with a matrix of weight values $W$, adds a bias term $b$ and then applies an element-wise non-linear activation function to generate the activation values represented by the vectors $a_i$. Since we have four features as input, $x$ is a column vector of size four, which means $W_1$ is a matrix of dimensions 4 × 4, where each row corresponds to the weight values from the input to a single neuron. The weights and the bias terms are all learnable parameters that are learned during the training process by an optimization algorithm such as gradient descent (described below) to create a function approximator. The number of hidden layers, the number of neurons in each hidden layer, the optimization method, the learning rate and the activation function are all hyperparameters that need to be specified before the learning process begins. The optimal hyperparameters can be found using grid search or random search: while grid search builds a model for every combination of hyperparameter values, random search, as the name suggests, tries random combinations and consequently takes drastically less time. Adding more hidden layers can help approximate more complex functions, but it can also reduce performance due to over-fitting, since it increases the number of parameters that the model has to learn [22].
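The forward computation of equations (15)-(18) is compact to write out. The following is a minimal NumPy sketch with the layer sizes of Figure 1 and ReLU as the activation (illustrative only; random weights stand in for learned ones):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)

# Layer sizes matching Figure 1: 4 inputs, three hidden layers of 4 neurons,
# and a single output.
sizes = [4, 4, 4, 4, 1]
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((m, 1)) for m in sizes[1:]]

def forward(x):
    """Apply a_l = g(W_l a_{l-1} + b_l) layer by layer (equations 15-18)."""
    a = x
    for W, b in zip(weights, biases):
        a = relu(W @ a + b)
    return a

x = rng.standard_normal((4, 1))  # one input feature vector
y_hat = forward(x)
```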

2.5.2 Activation Functions

Activation functions introduce non-linearity into a neural network, making it powerful enough to learn complex patterns. They affect both the generalization ability of the model and the numerical properties of the learning procedure.

The standard choice of activation function $g$ is the rectified linear unit (ReLU), computing the function $g(z) = \max\{0, z\}$. ReLU has good properties for optimization and also helps prevent the model from over-fitting [8]. Other common activation functions include the sigmoid function, given by:

$$g(z) = \frac{1}{1 + e^{-z}} \tag{19}$$

and the hyperbolic tangent (tanh) function:

$$g(z) = \frac{2}{1 + e^{-2z}} - 1 \tag{20}$$

2.5.3 Backpropagation

In a neural network, the final output is compared with the ground truth. This is done by a loss/cost function, and the objective is to minimize this function, since it estimates the error between the predicted and actual values. Neural networks minimize their loss function by updating the weights associated with each neuron using an optimization algorithm such as Stochastic Gradient Descent (SGD). The error is propagated backwards from the output layer through the whole network, and the gradients with respect to each parameter are calculated using the chain rule. Applying the chain rule to find the gradients of the weights and biases of a fully connected dense neural network with $L$ layers produces the following expressions [8]:

$$\delta^L = \nabla_{a^L} J \odot g'(z^L) \tag{21}$$
$$\delta^l = \left((W^{l+1})^T \delta^{l+1}\right) \odot g'(z^l) \tag{22}$$
$$\frac{\partial J}{\partial b^l} = \delta^l \tag{23}$$
$$\frac{\partial J}{\partial W^l} = \delta^l (a^{l-1})^T \tag{24}$$

The index $l$ denotes the layer number. $W^l$ and $b^l$ are the weights and biases in layer $l$, and $z^l$ is an intermediate quantity denoting the weighted input to the layer, such that the corresponding output activation of that layer is given by $a^l = g(z^l)$, where $g$ denotes the activation function used. The $\odot$ operator is an element-wise multiplication of vectors. The quantities $\delta^l$ are intermediate quantities, called error signals, that simplify the calculation of the gradients [8].

An optimization algorithm such as SGD finds the partial derivatives of weights in the network with respect to the loss. The weights are updated as follows:

$$W_{new} = W_{old} - \alpha \frac{\partial J}{\partial W_{old}} \tag{25}$$

The weights are initialized randomly at the beginning of the learning process; $\alpha$ is the learning rate. If the partial derivative is negative, the optimization algorithm increases the weight of that particular neuron in order to decrease the loss; on the contrary, if the partial derivative is positive, the weight of that neuron is decreased to reduce the loss.
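A minimal NumPy sketch of backpropagation and the gradient-descent update for a one-hidden-layer network follows (illustrative only; the linear output layer and squared-error loss are assumptions for the sketch, not the thesis's exact setup):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_prime(z):
    return (z > 0).astype(float)

rng = np.random.default_rng(0)

# One hidden layer: 4 inputs -> 4 hidden units -> 1 linear output.
W1, b1 = rng.standard_normal((4, 4)) * 0.1, np.zeros((4, 1))
W2, b2 = rng.standard_normal((1, 4)) * 0.1, np.zeros((1, 1))
alpha = 0.01  # learning rate

x = rng.standard_normal((4, 1))
y = np.array([[1.0]])

for _ in range(100):
    # Forward pass, keeping the intermediate quantities z_l and a_l.
    z1 = W1 @ x + b1
    a1 = relu(z1)
    y_hat = W2 @ a1 + b2                       # linear output for regression
    # Backward pass: error signals of equations (21)-(22) for J = 0.5*(y_hat-y)^2.
    delta2 = y_hat - y
    delta1 = (W2.T @ delta2) * relu_prime(z1)
    # Gradient-descent updates, equations (23)-(25).
    W2 -= alpha * delta2 @ a1.T
    b2 -= alpha * delta2
    W1 -= alpha * delta1 @ x.T
    b1 -= alpha * delta1
```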

2.6 Evaluation Methods

This section describes methods used to evaluate the performance of our models.

When building regression models, the general approach is to measure the difference between the predicted output and the actual output. Here, we denote the actual output by $y$ and the prediction by $\hat{y}$. The mean absolute error (MAE) and the root mean squared error (RMSE) are the two metrics used to describe the performance of the regression models.

2.6.1 Mean Absolute Error (MAE)

MAE is calculated by taking the absolute values of all errors and dividing by the total number of estimates $N$. It calculates the residual (the difference between the actual value and the predicted value) for every data point, takes the absolute value of each so that negative and positive residuals do not cancel out, and then averages all these residuals. Each residual contributes proportionally to the total amount of error, meaning that larger errors contribute linearly to the overall error [2]. MAE describes the typical magnitude of the residuals.

$$MAE = \frac{1}{N}\sum_{i=1}^{N} |\hat{y}_i - y_i| \tag{26}$$

2.6.2 Root Mean Square Error (RMSE)

RMSE is the most commonly used evaluation metric for regression models. It measures the error between a prediction $\hat{y}$ and the actual value $y$ by calculating the difference between the two and squaring the resulting value; this difference is known as a residual. The mean is calculated by taking the sum of the squared residuals and dividing by the number of data points, and the square root of this mean is then taken [2]. The purpose of the squaring in RMSE is to punish large errors.

$$RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (\hat{y}_i - y_i)^2} \tag{27}$$
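Both metrics are straightforward to compute; a small NumPy sketch with made-up numbers:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error, equation (26)."""
    return np.mean(np.abs(y_pred - y_true))

def rmse(y_true, y_pred):
    """Root mean square error, equation (27)."""
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

y_true = np.array([480.0, 120.0, 60.0])
y_pred = np.array([450.0, 130.0, 120.0])
print(mae(y_true, y_pred))   # ~33.3 minutes
print(rmse(y_true, y_pred))  # ~39.2: the squared term punishes the 60-minute error
```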


3. Method

This section describes the data set that was used for implementing the models, as well as the methods used for preprocessing the data and conducting the experiments. The data about trouble tickets was gathered from multiple internal systems of the company after detailed discussions with the Network Operations Center (NOC) team about which features are present at the time of ticket creation. 95% of the Priority 4 trouble tickets were created automatically, whereas only 5% were created through human intervention. This thesis focuses only on the automatically generated tickets; the manually generated tickets were filtered out. Only those attributes that were present at the time of creation were used for modeling. As a ticket moves through its lifecycle and event logs get updated, more information about the ticket is added to the systems, but this was not used, since the aim is to predict the resolution time of a ticket at the time of its creation.

3.1 Dataset

The data set consists of attributes of solved trouble tickets from previous years and their resolution times in minutes. Each ticket has several attributes, both numeric and categorical, such as the SiteID, longitude and latitude coordinates, postcode, city, hardware equipment, LAN, 2G/3G/4G, month of creation, day of creation, related alarms and the case description. Some of the attributes were numerical, most were categorical, and the case description column contained text data that was created by a machine when a disturbance was detected. The format of the case description field depended on whether it was a fixed ticket (LAN) or a mobile ticket (2G/4G had a similar pattern, whereas 3G tickets had a different pattern). Each categorical feature has a different number of unique values.

Table 1: Data Structure of a trouble ticket

Feature                        Data_Type
Created Date                   categorical
Created Year                   categorical
Created Day Name               categorical
Created Month Name             categorical
Case Description               text
Technology (fixed vs mobile)   categorical
Host Name                      categorical
No of Childs                   numerical
Longitude                      numerical
Latitude                       numerical
Number of customers affected   numerical
Resolution Time                numerical


Figure 2: Histogram showing the distribution of resolution time of trouble tickets

3.2 Data Cleaning & Preprocessing

Data cleaning is the process of identifying incorrect, irrelevant or incomplete parts of a data set and then taking measures to 'clean' it so that it is accurate, relevant, consistent and uniform, which helps the model learn and predict better.


3.2.1 Removing Outliers

An outlier is a data point that is significantly different from the other data points. Real-world data sets can contain extreme values that are outside the range of what is expected and unlike the other data. The predictive performance of machine learning models can be significantly improved by removing such outliers.

Whenever a ticket is generated, the system enters 8 hours (480 minutes) as the standard time for it to be resolved. There were some trouble tickets with extremely high resolution times that were not in line with the resolution times of 90% of the solved tickets; a likely reason is that their information was not manually logged into the systems in time. Analyzing the histogram in Figure 2, we can see that the vast majority of tickets are resolved within 8 hours. Therefore, all tickets whose resolution time was greater than 480 minutes were removed, as the focus was to build a robust, efficient model that makes reliable predictions for tickets that fall within the 8-hour range.
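A minimal pandas sketch of this filtering step (the frame and column name are hypothetical; the real ticket schema is internal to Telenor):

```python
import pandas as pd

# Hypothetical resolution times in minutes.
tickets = pd.DataFrame({"resolution_time": [35, 480, 2900, 120, 7000]})

# Keep only tickets within the 8-hour (480-minute) SLA window.
tickets = tickets[tickets["resolution_time"] <= 480]
```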

3.2.2 Imputing Numeric Values

Since the data had to be extracted from multiple internal systems and then joined on the ticketID column, there were many missing values in different columns. Different regression algorithms can be applied to impute the missing values. Each algorithm yields different results, so a comparison had to be made and the imputer with the best results selected.

The following estimators were implemented for missing values’ imputation:

• Bayesian Ridge: a Bayesian ridge regression model was fitted using the scikit-learn implementation, and the regularization parameters lambda and alpha were fine-tuned

• Extra Trees Regressor: a number of randomized decision trees were fit on various subsamples of the data and their predictions averaged to enhance the predictive accuracy, using the scikit-learn implementation

• K-Neighbors Regressor: a k-nearest neighbors regression model was fitted on the data using the scikit-learn implementation. The missing values are imputed by local interpolation of the targets associated with the nearest neighbors.

3.2.3 Imputing Categorical Values

The most frequent category was used to replace the missing values in each column. There were some categorical features with a very high number of unique values, and hence only the most common classes were considered, to reduce the number of distinct classes.

3.2.4 One-hot encoding

Humans can understand all kinds of data, be it numbers, strings, letters or text of any kind. Machines, no matter how fast and intelligent, can only understand numbers. Hence the need arises to convert textual data into numeric form so it can be processed by machines. Categorical data are variables that contain label values rather than numeric values. They can be either nominal, such as gender (male, female), or ordinal, such as education level (Bachelor, Master, Ph.D.).


One-hot encoding is a technique for converting categorical variables into numeric variables. The categorical variable is encoded as a vector in which only one element is 'hot' (non-zero). With one-hot encoding, a categorical feature becomes an array whose size is the number of possible categories for that feature.

Many categorical features had a huge number of classes. To reduce the number of distinct classes, only the most common classes were considered for such features, and the rest were assigned to a separate category, 'Others'. Examples outside the most common classes were not discarded, since those classes will also be present in unseen data and discarding them would give an inflated picture of model performance (see the sketch after Table 2).

Table 2: Unique values in categorical features

Feature               Unique_Values
SiteID                14385
Postcode              3825
Model                 48
Hardware              18654
Location              918
Case classification   18
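As referenced above, here is a minimal pandas sketch of bucketing rare classes into 'Others' and then one-hot encoding (the miniature data is made up; real features such as Postcode have thousands of classes, see Table 2):

```python
import pandas as pd

# Hypothetical miniature example.
tickets = pd.DataFrame({"Postcode": ["111", "111", "222", "333", "111", "222", "444"]})

# Keep the 2 most common classes and bucket the rest as 'Others'.
top = tickets["Postcode"].value_counts().nlargest(2).index
tickets["Postcode"] = tickets["Postcode"].where(tickets["Postcode"].isin(top), "Others")

# One-hot encode: every remaining class becomes a binary indicator column.
encoded = pd.get_dummies(tickets, columns=["Postcode"])
print(encoded)
```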

3.2.5 Geospatial features

The data contains geospatial features represented as longitude and latitude. These features were transformed by converting them to radians so that they have the same range and neither dominates the other when fed into a machine learning model.
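A minimal sketch of this transformation (the coordinates are made up):

```python
import numpy as np
import pandas as pd

# Converting degrees to radians puts latitude and longitude on the same scale.
coords = pd.DataFrame({"latitude": [59.33, 57.71], "longitude": [18.07, 11.97]})
coords_rad = np.radians(coords)
```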


3.3 Dimensionality Reduction

Trouble tickets have many attributes, and not all of them contain useful information that the model can learn from. For instance, features like ticketID serve as unique reference numbers that do not contain any valuable information, as they are generated automatically. The Case Description column contained textual data about the trouble ticket. The text was generated by a machine at the time of creation of the ticket and followed a similar pattern depending on whether it was a fixed network ticket (related to fixed services such as broadband) or a mobile network ticket (related to mobile services such as 2G/3G/4G). It contained valuable information such as the city, postcode, longitude, latitude and access switch. One option was to use a technique such as TF-IDF or Word2Vec to convert the text into numeric vectors and use them as features in the model, but that would have resulted in a very high-dimensional sparse data set containing hundreds of numeric features. Therefore, regular expressions were used to extract the desired information instead. A regular expression is a special sequence of characters that matches or finds strings or sets of strings using a specialized syntax held in a pattern. Python's regular expression (regex) library was used to identify the sequences in the case description column and extract the relevant information.

Table 3 lists the features that were extracted from the Case Description column: city name, postcode, longitude, latitude, site ID and access switch. Using this method helped reduce the feature space significantly (an illustrative extraction sketch follows Table 3).


Table 3: Features extracted from Case Description column using Regex

Feature                          Data_Type
Technology (LAN/DSL/2G/3G/4G)    categorical
Postcode                         categorical
Longitude                        numerical
Latitude                         numerical
Hardware                         categorical
SiteID                           categorical
Access switch                    categorical
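The sketch below illustrates the idea with a hypothetical case description format; the real machine-generated format is internal to Telenor and is not reproduced here:

```python
import re

# Hypothetical case description string, for illustration only.
description = "ALARM LAN site=ST1234 postcode=11122 lat=59.33 lon=18.07 switch=ASW-07"

pattern = re.compile(
    r"site=(?P<site_id>\w+)\s+postcode=(?P<postcode>\d+)\s+"
    r"lat=(?P<latitude>[\d.]+)\s+lon=(?P<longitude>[\d.]+)\s+switch=(?P<switch>[\w-]+)"
)
match = pattern.search(description)
if match:
    features = match.groupdict()
    # {'site_id': 'ST1234', 'postcode': '11122', 'latitude': '59.33',
    #  'longitude': '18.07', 'switch': 'ASW-07'}
```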

3.4 Feature Importance & Selection

The Sequential Feature Selection method available in scikit-learn was used; its parameter K sets the desired number of features for the model. Different values of K were tried for both forward selection and backward selection. The selected features were then used to build models, and their performance was compared with models built using the entire feature set to determine whether feature selection led to an improvement in results.

3.5 Baseline

Currently, the company has no system in place that calculates the resolution time of a ticket upon its generation. Whenever a ticket is generated automatically, the system reports 8 hours as its resolution time to the user. If the ticket is not resolved within 8 hours, the system simply adds another 8 hours.

3.6 Model Training

3.6.1 Train-Validate-Test Split

The data was randomly partitioned into a training set (80%), a validation set (10%) and a test set (10%). The splitting was done randomly to reduce interference from potential patterns in the data.

The training set is used to train the model; when building a neural network, for example, the model uses the training set to adjust the weights of the neurons. The validation set, also known as the holdout set, is used to fine-tune the model hyperparameters. In the case of neural networks, the validation set would be used to tune the number of hidden layers, the learning rate, etc. One of the major reasons for using a validation set is to ensure the model does not overfit to the data in the training set. Finally, the test set is used to get an unbiased estimate of the model performance after the hyperparameters have been tuned.
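A minimal sketch of the 80/10/10 split using scikit-learn (placeholder data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.random.rand(1000, 10), np.random.rand(1000)

# First hold out 20% of the data, then split that half-and-half into
# validation and test sets, giving an 80/10/10 partition overall.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.2, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)
```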

3.6.2 Hyperparameter Optimization

The performance of a machine learning model can be significantly improved by tuning its hyperparameters. A hyperparameter is a parameter that must be set manually before learning begins, i.e. its value cannot be estimated from the data.

Hyperparameter optimization refers to the process of finding the combination of values that yields the best model performance. The best values can never be known in advance, so a heuristic approach is needed to find good estimates. The more hyperparameters a model has, the more time it takes to fine-tune them.

The two most common strategies are grid search and random search. Grid search tries out every possible combination of the hyperparameters, whereas random search tries random combinations. Random search was used to optimize the hyperparameters (a search sketch follows Table 4). The hyperparameters and the ranges of values tested are presented in Table 4 and Table 5 below.

3.6.3 Training XGBoost

The XGBoost model was trained in two steps. First, we used the training data to build a baseline model with the default hyperparameter settings to evaluate the performance of the model in general. Then a second model was built by tuning various hyperparameters of XGBoost, and the results were compared with the baseline model.

Initially, a higher learning rate is chosen and the hyperparameters are tuned. Then the learning rate is lowered and the model is trained again to identify the optimal number of estimators [15].

Hyperparameters for XGBoost:

1. no of estimators: the number of rounds for which the model is boosted.

2. learning rate: controls the shrinkage at each boosting iteration. The shrinkage parameter controls the penalization of each newly added tree.

3. max_depth: the maximum depth of a tree. It is used to prevent the model from overfitting; increasing max_depth makes the model more complex.

4. min_child_weight: the minimum sum of instance weights required in a child node. The algorithm gives up further partitioning when a split would result in a leaf node whose sum of instance weights is less than min_child_weight.

5. gamma: the minimum loss reduction required to make a split. It controls regularization, since a node is split only when the resulting split leads to a positive reduction in the loss function.

6. subsample: the fraction of observations to be randomly sampled for each tree. Lower values make the algorithm more conservative and prevent overfitting, but values that are too small can lead to under-fitting.

7. colsample_bytree: the fraction of columns to be randomly sampled for each tree.

Following are the hyperparameters of the model and the values which were tested:

Table 4: Hyperparameters and their values for XGBoost

Hyperparameter     Values
no of estimators   100, 150, 200, 250, 300, 350, 400, 450, 500
learning rate      0.01, 0.10, 0.20, 0.30, 0.50
max depth          3, 4, 5, 7, 9
min child weight   1, 3, 5, 7, 9
gamma              1, 2, 3, 4, 5
subsample          0.6, 0.7, 0.8, 0.9, 1.0
colsample_bytree   0.6, 0.7, 0.8, 0.9, 0.1
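As mentioned above, here is a minimal sketch of such a random search over the Table 4 grid using scikit-learn's RandomizedSearchCV (placeholder data; illustrative only, not the thesis code):

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X_train, y_train = rng.random((500, 10)), rng.random(500)  # placeholder data

# Search space taken from Table 4 (the xgboost parameter names differ
# slightly from the table labels).
param_distributions = {
    "n_estimators": [100, 150, 200, 250, 300, 350, 400, 450, 500],
    "learning_rate": [0.01, 0.10, 0.20, 0.30, 0.50],
    "max_depth": [3, 4, 5, 7, 9],
    "min_child_weight": [1, 3, 5, 7, 9],
    "gamma": [1, 2, 3, 4, 5],
    "subsample": [0.6, 0.7, 0.8, 0.9, 1.0],
    "colsample_bytree": [0.6, 0.7, 0.8, 0.9, 0.1],
}

search = RandomizedSearchCV(XGBRegressor(), param_distributions, n_iter=20,
                            scoring="neg_mean_absolute_error", random_state=0)
search.fit(X_train, y_train)
print(search.best_params_)
```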

3.6.4 Training Neural Network

Different model architectures with varying numbers of hidden layers, numbers of neurons per layer, activation functions and dropout were evaluated on the validation set to find a good model (a training sketch follows Table 5).

(41)

Table 5: Hyperparameters and their values for Neural Network

Hyperparameter           Values
batch size               32, 64, 128, 256, 512
epochs                   100, 500, 1000, 1500
activation function      ReLU
optimizer                Adam, SGD
learning rate            0.01, 0.10, 0.30, 0.50
hidden layers            1, 2, 3, 4
nodes per hidden layer   64, 128, 256, 512
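A minimal Keras sketch of one such configuration from Table 5 (placeholder data; illustrative only, not the thesis code):

```python
import numpy as np
from tensorflow import keras

X_train, y_train = np.random.rand(500, 10), np.random.rand(500)  # placeholder data

# One configuration from Table 5: 1 hidden layer, 512 nodes, ReLU, Adam.
model = keras.Sequential([
    keras.layers.Input(shape=(X_train.shape[1],)),
    keras.layers.Dense(512, activation="relu"),
    keras.layers.Dense(1),  # linear output for regression
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.01), loss="mae")
model.fit(X_train, y_train, batch_size=256, epochs=10,
          validation_split=0.1, verbose=0)
```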

3.6.5 Implementation Details

The experiments were implemented in Python using pandas, seaborn and matplotlib for data manipulation, analysis and visualization, scikit-learn for the machine learning algorithms, and Keras running on the TensorFlow backend for the neural networks. The Python library NLTK was also used to understand the case description column, which contains textual data.


4. Results

This chapter presents the results obtained from the experiments. First, the results of the various imputation techniques for missing values are discussed for both XGBoost and the neural network. Then the model performance before and after hyperparameter optimization is presented. After identifying the best model, feature selection (forward and backward) is carried out for different values of K, for that model only, to see if it can further improve the results. Lastly, the optimal model is presented along with its tuned hyperparameters.

4.1 Imputation Techniques for Model Selection

XGBoost

The missing values were imputed using Bayesian Ridge, Extra Trees Regressor and K Neighbors Regressor, and then an XGBoost model with the default hyperparameter settings was fit on each imputed dataset. The results are shown in Table 6 and visualized in Figure 3. BayesianRidge delivers the model with the lowest MAE as well as the lowest RMSE; ExtraTreesRegressor performs the worst, resulting in the highest MAE and RMSE.

Consequently, for the XGBoost model, Bayesian Ridge is chosen as the impu-tation technique since it results in the lowest error.


Table 6: Results for XGBoost with various imputation techniques

                      MAE     RMSE
BayesianRidge         15.07   48.47
ExtraTreesRegressor   72.54   129.35
KNeighborsRegressor   58.08   113.31

Figure 3: Results for XGBoost with various imputation techniques

Neural Network (1 hidden layer, 512 neurons, batch size = 256, epochs = 1500, activation = ReLU)

The missing values were imputed using Bayesian Ridge, Extra Trees Regressor and K Neighbors Regressor, and then a neural network with the hyperparameter settings stated above was fit on each imputed dataset. The results are shown in Table 7 and visualized in Figure 4. BayesianRidge delivers the model with the lowest MAE as well as the lowest RMSE; ExtraTreesRegressor performs the worst, resulting in the highest MAE and RMSE.

Consequently, for the neural network model also, Bayesian Ridge is chosen as the imputation technique since it results in the lowest error.

Table 7: Results for Neural Network with various imputation techniques

                      MAE     RMSE
BayesianRidge         23.94   68.25
ExtraTreesRegressor   87.83   129.42
KNeighborsRegressor   81.23   110.73


The results for both models, XGBoost and the neural network, show that BayesianRidge is a highly effective method for imputing the numeric missing values in this data set. There is a significant difference in the error values across the imputation techniques, with Bayesian Ridge yielding far superior results.

4.2 Model Performance without hyperparameter optimization

An XGBoost model and a neural network were fitted on the dataset imputed using Bayesian Ridge. For XGBoost the default hyperparameters were used, and for the neural network the hyperparameters stated in 4.1 were used. XGBoost performs better than the neural network, yielding a considerably lower MAE and RMSE.

Without doing any hyperparameter tuning, XGBoost outperforms the neural network in estimating the resolution time of trouble tickets.

Table 8: Model comparison without hyperparameter optimization

          MAE     RMSE
XGBoost   15.07   48.47


Figure 5: XGBoost & Neural Network comparison before hyperparameter optimization

4.3 Model Performance with hyperparameter optimization

Fine-tuning the hyperparameters resulted in a slight performance improvement for XGBoost but a drastic improvement for the neural network. For XGBoost, the MAE came down from 16.45 to 13.70 and the RMSE from 49.15 to 46.27. For the neural network, the MAE improved from 27.47 to 21.25 and the RMSE from 69.37 to 49.14.

Even after hyperparameter tuning, XGBoost still outperforms the neural network in estimating the resolution time of trouble tickets, yielding a lower MAE and RMSE.

Table 9: Model comparison with hyperparameter optimization

                 MAE     RMSE
XGBoost          13.70   46.27
Neural Network   21.25   49.14

Figure 6: XGBoost & Neural Network comparison after hyperparameter optimization

(48)

4.4 Feature Selection

In 4.3 we identified XGBoost as yielding better results than the neural network, with a lower MAE and RMSE. Hence, feature selection was performed for XGBoost only, with the value of K (the number of features to include in the model) incremented in steps from 10 to 100, to discover whether it could improve the model performance. The XGBoost model used is the one whose hyperparameters had been fine-tuned.

4.4.1 Sequential Forward Selection

The lowest MAE occurs at K = 100, whereas the lowest RMSE occurs when the complete data set is used. Forward selection does not improve the model performance, but it can reduce model complexity by yielding almost similar results with a reduced feature space. It is also much easier to explain the results of a model with fewer features.

Table 10: Sequential Forward Selection for XGBoost

                MAE     RMSE
K=10            15.78   50.23
K=30            14.34   48.07
K=50            14.59   48.27
K=70            15.09   48.04
K=100           13.59   47.09
Full data set   13.70   46.27

Figure 7: Sequential Forward Selection for XGBoost

4.4.2 Sequential Backward Selection

The lowest MAE and the lowest RMSE occur when the complete data set is used. Again, backward selection does not improve the model performance, but it can reduce model complexity by yielding almost similar results with a reduced feature space, and it is much easier to explain the results of a model with fewer features.


Table 11: Sequential Backward Selection for XGBoost

                MAE     RMSE
K=10            19.86   54.65
K=30            16.31   49.11
K=50            16.37   48.69
K=70            16.71   49.24
K=100           16.38   48.53
Full data set   13.70   46.27


4.5 Optimal Model

From 4.2 and 4.3, we conclude that XGBoost delivers better results than the neural network, since it yields a lower MAE and RMSE both before and after the hyperparameters are tuned. XGBoost is therefore our model of choice for estimating the resolution time of trouble tickets, with the following optimal configuration:

Table 12: Hyperparameter values for optimized XGBoost

Hyperparameter     Value
no of estimators   400
learning rate      0.1
max depth          7
min child weight   3
gamma              5
subsample          0.6


5. Discussion & Conclusion

This chapter summarizes the objective of the thesis, the methodology followed to extract, manipulate and understand the data and to build statistical models, the results obtained, how the scope of the work could be extended in the future to further improve the results, and some recommendations for standardizing the data ingestion process.

5.1 Discussion

The fundamental objective of this thesis work at Telenor Sweden was to determine whether machine learning, deep learning and data mining techniques could be leveraged to analyze past historical trouble ticket data and build predictive models that accurately estimate resolution times. An effective model would help Telenor maintain user satisfaction by letting customers know how long it will take to resolve the issue they are facing.

The results show relevant correlations and hidden patterns in the data, which were harnessed to build regression models using gradient boosted decision trees and deep neural networks. The model could be deployed to give users an estimate of the resolution time, as it is much more informative than the current system in place, which enters a standard 8 hours for every ticket that is generated. XGBoost achieves better results than the deep neural network, yielding a lower Mean Absolute Error as well as a lower Root Mean Square Error.

Approximately 38,000 tickets were analyzed and evaluated for the regression models that were built. The data was extracted from multiple internal systems to enrich the tickets. However, a large portion of the gathered data had missing values and other inconsistencies which had to be taken into account. Hence, the amount of data was adequate, but the quality of the data could be improved by ensuring consistency across all fields. Upon analyzing the distribution of the resolution times, it was discovered that the vast majority of tickets get resolved within 8 hours, and hence only those tickets were taken into consideration after a discussion with the NOC team. One of the biggest challenges was presented by missing data, for which various iterative imputation techniques were tested on both the XGBoost and neural network models to evaluate their performance. Bayesian Ridge delivered the lowest RMSE for both models and was hence selected as the imputation technique for the analysis.
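The imputation comparison described above can be sketched as follows, assuming scikit-learn's IterativeImputer with the three candidate estimators from Section 2.2; the synthetic matrix with injected missingness is a stand-in for the ticket data.

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.neighbors import KNeighborsRegressor

# Stand-in numeric matrix with roughly 10% missing values.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
X[rng.random(X.shape) < 0.1] = np.nan

# Candidate estimators for iterative imputation (Section 2.2);
# Bayesian Ridge yielded the lowest RMSE and was selected.
estimators = {
    "bayesian_ridge": BayesianRidge(),
    "extra_trees": ExtraTreesRegressor(n_estimators=10, random_state=0),
    "k_neighbors": KNeighborsRegressor(n_neighbors=5),
}

imputed = {}
for name, estimator in estimators.items():
    imputer = IterativeImputer(estimator=estimator, max_iter=10, random_state=0)
    imputed[name] = imputer.fit_transform(X)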

The trouble ticket information gets updated as the ticket moves through its lifecycle and completes its journey. However, the focus of this thesis is to use only those ticket attributes that are available at the time of creation, since Telenor is interested in giving customers an estimate as soon as a disturbance is detected in the networks. In the first stage, the most significant ticket attributes were identified from the data set. This required domain knowledge and was made possible by holding multiple meetings with the networks team at Telenor Sweden. Most of the features were categorical and had to be one-hot encoded. One-hot encoding resulted in a high-dimensional feature space, since some attributes contained many distinct values. To reduce the feature dimension, only the most frequently occurring classes were retained and the rest were grouped together as 'Others', as sketched below.
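A minimal sketch of this rare-category grouping, assuming a pandas DataFrame of tickets; the column name node_type, the stand-in data and the cutoff are hypothetical.

import pandas as pd

# Stand-in ticket data; the real attributes come from Telenor's systems.
tickets = pd.DataFrame({"node_type": ["A"] * 50 + ["B"] * 30 + ["C"] * 2 + ["D"] * 1})

def group_rare(series: pd.Series, top_n: int = 2) -> pd.Series:
    # Keep the top_n most frequent classes and relabel the rest as
    # 'Others' to bound the one-hot encoded feature dimension.
    keep = series.value_counts().nlargest(top_n).index
    return series.where(series.isin(keep), "Others")

tickets["node_type"] = group_rare(tickets["node_type"])
encoded = pd.get_dummies(tickets, columns=["node_type"])  # one-hot encode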

In the second stage, the most relevant machine learning algorithms were explored to find those with the highest accuracy and predictive performance. XGBoost and neural networks were investigated, as both tend to perform well on large data sets. The XGBoost model results in lower RMSE as well as MAE.


Hyperparameter optimization was done to fine-tune the models and improve their predictive performance. With the XGBoost model, it is possible to lower the learning rate and significantly increase the number of estimators to get a slight improvement over the current results, but that would increase the computational complexity of the model. Hence, there is a trade-off between performance and complexity. Overall, the results demonstrate that deploying a model to accurately estimate resolution times is feasible.

5.2 Future Work

The thesis studied the attributes for fixed tickets and mobile tickets in the same data set. To further improve the results, the data could be segmented and separate machine learning models built for fixed tickets and mobile tickets. There is also weather data and access switch data that has not yet been explored. The data for access switches is available in Telenor's internal systems, but the weather data needs to be fetched from the Swedish Meteorological and Hydrological Institute (SMHI) website through an API, as sketched below. In the future, this additional data could be explored and new features incorporated into the models to see if they enhance performance.
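As a hedged illustration of the weather data fetch, a sketch against SMHI's public open data service is given below; the endpoint structure, parameter id, station id and response layout are assumptions based on the documented metobs API and should be verified against its documentation before use.

import requests

# Hypothetical request for hourly air temperature (parameter 1) from an
# illustrative station; check SMHI's API documentation for valid ids.
url = ("https://opendata-download-metobs.smhi.se/api/version/latest/"
       "parameter/1/station/98230/period/latest-months/data.json")
response = requests.get(url, timeout=30)
response.raise_for_status()
observations = response.json().get("value", [])  # list of dated readings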

5.3 Conclusion

Machine learning and deep learning approaches could be used to build predictive models to estimate the resolution time of trouble tickets generated automatically due to network outages. The two models investigated were XGBoost and neural networks, and both deliver results that can help Telenor build a model and deploy it into production to make predictions in real time.


Experiments were conducted to test different algorithms for missing value imputation and to optimize the hyperparameters of both models. Feature selection was carried out only for XGBoost, once it was determined that it yielded better results than neural networks. Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) were used as the evaluation metrics to compare and evaluate the performance of the models.

The inconsistencies in the data highlight the importance of defining standards so all the tickets generated have the same attributes which can be used for modeling purposes. This would improve the quality of data, thus leading to models with better performance.


References

1. Lindberg, M., 2013. Decision Support Systems: Diagnostics and Explanation methods: In the context of telecommunication networks.

2. Laurentz, H., 2016. Feasibility of using network support data to predict risk level of trouble tickets.

3. Lewis, L. and Dreo, G. 1993. Extending Trouble Ticket Systems to Fault Diagnostics, IEEE Network, 7(6), 44–51.

4. Temprado, Y., Molinero, F. J., García, C. and Gómez, J. 2008. Knowledge Discovery from Trouble Ticketing Reports in a Large Telecommunication Company, 2008 International Conference on Computational Intelligence for Modelling Control and Automation, CIMCA 2008, 37–42.

5. Symonenko, S., Rowe, S. and Liddy, E. D. 2006. Illuminating Trouble Tickets with Sublanguage Theory, Proceedings of the Human Language Technology Conference of the NAACL.

6. Medem, A., Akodjenou, M. I. and Teixeira, R. 2009. Trouble Miner: Mining Network Trouble Tickets, 2009 IFIP/IEEE International Symposium on Integrated Network Management - Workshops, 113–119.

7. Sample, K.R., Lin, A.C., Borghetti, B.J. and Peterson, G.L., 2018, May. Predicting Trouble Ticket Resolution. In The Thirty-First International Flairs Conference.

8. Löfgren, J., 2017. Data Mining of Trouble Tickets for Automatic Action Recommendation.


9. Ekström, L., 2018. Estimating fuel consumption using regression and machine learning.

10. Neal, B., 2019. On the bias-variance tradeoff: textbooks need an update. arXiv preprint arXiv:1912.08286.

11. Bishop, C.M., 2006. Pattern recognition and machine learning. Springer. (Bayesian Ridge)

12. Geurts, P., Ernst, D. and Wehenkel, L., 2006. Extremely randomized trees. Machine learning, 63(1), pp.3-42. (extra trees)

13. Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D. and Altman, R.B., 2001. Missing value estimation methods for DNA microarrays. Bioinformatics, 17(6), pp.520-525. (K Neighbors Regressor)

14. Mostafa, S.M., Eladimy, A.S., Hamad, S. and Amano, H., 2020. CBRL and CBRC: Novel Algorithms for Improving Missing Value Imputation Accuracy Based on Bayesian Ridge Regression. Symmetry, 12(10), p.1594.

15. Zeng, Y., 2011. A study of missing data imputation and predictive modeling of strength properties of wood composites.

16. Gupta, C., 2019. Feature Selection and Analysis for Standard Machine Learning Classification of Audio Beehive Samples.

17. Chen, T. and Guestrin, C., 2016, August. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785-794).


18. Arif, A. and Wang, Z., 2018, June. Distribution network outage data analysis and repair time prediction using deep learning. In 2018 IEEE International Conference on Probabilistic Methods Applied to Power Systems (PMAPS) (pp. 1-6). IEEE.

19. Kumar, V. and Minz, S., 2014. Feature selection: a literature review. SmartCR, 4(3), pp.211-229.

20. Ahlin, M. and Ranby, F., 2019. Predicting Marketing Churn Using Machine Learning Models.

21. Salam Patrous, Z., 2018. Evaluating XGBoost for user classification by using behavioral features extracted from smartphone sensors.

22. Gómez, J., Temprado, Y., Gallardo, M., García, C. and Molinero, F.J., 2009. Application of neural network to predict adverse situations in trouble ticketing reports. Intelligent Systems and Agents 2009, p.204.
