
IT 18 055

Examensarbete (degree project) 30 credits, October 2018

Understanding Customer Problems through Text Categorisation

Fatimah Ilona Asa Sabsono


Teknisk-naturvetenskaplig fakultet, UTH-enheten

Visiting address: Ångströmlaboratoriet, Lägerhyddsvägen 1, Hus 4, Plan 0
Postal address: Box 536, 751 21 Uppsala
Telephone: 018 – 471 30 03
Fax: 018 – 471 30 00
Website: http://www.teknat.uu.se/student

Abstract

Understanding Customer Problems through Text Categorisation

Fatimah Ilona Asa Sabsono

Handling customer problems is a common task for any company that provides support to its customers. The abundance of data produced makes it inefficient to handle manually, which makes machine learning a promising approach to the problem. This project identified a suitable approach for classifying customer problems using text categorisation. For this particular dataset, the best results were obtained by generating features with Term Frequency-Inverse Document Frequency and one-hot encoding and using Logistic Regression as the classifier. Three measurement metrics, the F1 weighted score, the Geometric Mean, and the Index Balanced Accuracy, were used to evaluate performance on this imbalanced dataset.

Subject reader: Christian Rohner
Supervisor: Melodi Nergis Demirag


Acknowledgement

I would like to thank everyone involved in this master's thesis project, especially my supervisor Melodi Nergis Demirag, my discussion partner Victor Chima, and the others from the company where this thesis was conducted. I would also like to thank my reviewer, Christian Rohner, who has been patient and meticulous in helping me shape this report through his constructive feedback. None of this would have happened if I had not studied at Uppsala University, which was fully supported by the Swedish Institute through the Swedish Institute Study Scholarship (SISS) programme. Therefore, my deep gratitude goes to the Swedish Government and the Swedish Institute's staff. I am also very grateful for my friends who have been with me throughout the past two years in Sweden: Jody Handoko, Nadhira Seraphine, Ruth Priscila, Michael Wijaya, Rahmanu Hermawan, Nurudin Kamil, Sujata Tamang, Rahul Setty, Fillipos Lanaras, Suraj Murali, Nayada, and Aleksandra Obeso. Last but not least, I am very grateful for my family, whose support is not lessened by the distance: my mother Evi Rivayanti, my father Sabsono Ananto, and my siblings Ibnu Ramadhan and Lathifah Indah.

Finally, I thank my God, Allah SWT for all the blessing.


Contents

1 Introduction
  1.1 Problem Definition
  1.2 Objective and Goals
  1.3 Delimitation
  1.4 Methodology
  1.5 Thesis Structure

2 Text Categorisation
  2.1 Variety of Noise
  2.2 Imbalanced Data

3 Theory
  3.1 Pre-Processing
    3.1.1 Lemmatisation
    3.1.2 Stemming
  3.2 Feature Engineering
    3.2.1 Term Frequency (TF)
    3.2.2 Term Frequency-Inverse Document Frequency (TF-IDF)
    3.2.3 One-Hot Encoding
    3.2.4 Word Embedding
  3.3 Classifier
    3.3.1 Naive Bayes
    3.3.2 Logistic Regression
    3.3.3 Support Vector Classifier (SVC)
    3.3.4 Random Forest
    3.3.5 Stacked Generalisation
  3.4 Hyperparameter Tuning
    3.4.1 Parameter Search
  3.5 Evaluation Metrics
    3.5.1 K-fold Cross-Validation
    3.5.2 Confusion Matrix
    3.5.3 Classification Metrics
    3.5.4 Imbalanced Metrics

4 Method
  4.1 Data Analysis
    4.1.1 Available Data
    4.1.2 Data Volume
    4.1.3 Imbalanced Data
    4.1.4 Class Noise
    4.1.5 Data Labelled by the Company
    4.1.6 Unlabelled Data
  4.2 Feature Engineering
    4.2.1 TF-IDF workflow
    4.2.2 TF-IDF and One-Hot workflow
    4.2.3 Doc2Vec workflow
    4.2.4 Doc2Vec and One-Hot workflow
  4.3 Training Classifier
  4.4 Classify Unlabeled Data

5 Result
  5.1 Performance Measurement
  5.2 Predicting Company Labelled Data
  5.3 Classifying Unlabeled Data

6 Discussion
  6.1 Optimum Features
  6.2 Company Labelled Data
  6.3 Consistency of Measurement
  6.4 Best Approach for Classification

7 Summary

8 Future Research

Appendices

Appendix A Data Volume Analysis

Appendix B Result Data
  B.1 Customer Labeled Data
  B.2 Company Labeled Data
  B.3 Classify Unlabeled Data


List of Figures

1.1 Traditional customer support service workflow [20]
2.1 Steps of text categorisation
3.1 Random Forest workflow
3.2 Stacked classifier workflow
3.3 Illustration of cross validation
3.4 Confusion matrix for binary classification
4.1 General workflow
4.2 Data distribution per class
4.3 Change company class value into customer class value
4.4 Testing the model on company labeled data
4.5 Amount of unlabelled data
4.6 Workflow of TF-IDF
4.7 Workflow of TF-IDF and One-hot encoding
4.8 Workflow of Doc2Vec
4.9 Workflow of Doc2Vec and One-hot encoding
5.1 F1 weighted score comparison
5.2 Geometric mean score comparison
5.3 Index Balanced Accuracy comparison
5.4 F1 weighted score comparison
5.5 Geometric mean score comparison
5.6 Index Balanced Accuracy comparison
5.7 Result of classifying unlabeled data
6.1 Confusion matrix for description as feature
6.2 Confusion matrix for description and category as feature
A.1 Logistic Regression classifier: bias versus variance
A.2 Linear SVC: bias versus variance
A.3 Multinomial Naive Bayes: bias versus variance
A.4 Random Forest: bias versus variance
A.5 Stacked classifier: bias versus variance
B.1 Screenshot of performance table with F1 as the metric, using dataset A and dataset B as train set and test set
B.2 Screenshot of performance table with Gmean as the metric, using dataset A and dataset B as train set and test set
B.3 Screenshot of performance table with IBA as the metric, using dataset A and dataset B as train set and test set
B.4 Screenshot of performance table with F1 as the metric, using dataset A and dataset B as train set and company labeled data as test set
B.5 Screenshot of performance table with Gmean as the metric, using dataset A and dataset B as train set and company labeled data as test set
B.6 Screenshot of performance table with IBA as the metric, using dataset A and dataset B as train set and company labeled data as test set

List of Tables

4.1 Illustration of the data. It is not the real value of the data. All values were made up to give a general view.
B.1 A detailed data distribution after unlabelled data was classified, ordered from the largest amount of data to the smallest.


List of Acronyms

IBA Index Balanced Accuracy

Gmean Geometric Mean

TF Term Frequency

TF-IDF Term Frequency - Inverse Document Frequency

tp true positive

tn true negative

fp false positive

fn false negative

SVC Support Vector Classifier


Chapter 1

Introduction

Customer support is an area where various communication media are used for explaining and describing customer problems and feedback about a company's service. It is very common for service providers or companies to provide customer support through a phone line or email at the minimum. Nowadays, more companies seek better and more efficient ways of channelling these problems. They build online chat platforms that could help solve problems faster, and contact form submissions embedded on their websites. If the company grows fast, the amount of such feedback grows too. This master's thesis project was conducted at a company that provides customer support as a service supporting its primary business activity.

Figure 1.1 illustrates a traditional customer support service as a hot-line phone communication between the customer and a service engineer. The service engineer solves the problem by searching for the right advice in the Advisory System and passes it on to the customer [20]. This is a typical and straightforward structure of a customer support service, consisting of people with the skills to advise customers, a knowledge base system that they can use, and a hot-line as a medium for capturing customer problems.


Figure 1.1: Traditional customer support service workflow [20]

Previous research showed that the traditional way of handling and serving customer support is neither scalable nor efficient. Thus, the use of data mining or machine learning was proposed. Researchers from Nanyang Technological University utilised data mining to extract knowledge from day-to-day customer data to help company activities in decision support and machine fault diagnosis [20]. A machine learning approach to classifying customer problems through speech-to-text was implemented in order to build a real-time voice classifier [40].

Large leading companies are using machine learning in different aspects [13]. Uber uses machine learning to estimate arrival times, pick-up locations and meal deliveries [29]. Facebook uses machine learning as part of facial recognition in suggested tagging for pictures [9]. Amazon and Lyst use machine learning to build good recommendation systems by matching customers' previous data to suggest related items [45]. Spotify and Netflix build personalised playlists and suggestions based on customers' previous streaming activities [13] [34]. Paypal uses machine learning to detect fraud in transactions [22].

Ocado1, an online-only grocery supermarket based in the United Kingdom, has been using machine learning to improve their customer support service. They provide a call centre, email, and social media as the medium of communication between customers and employees in order to be able to support the customers appropriately.

It is deemed necessary by them to integrate machine learning into the process of delivering customer support, especially in handling email data. They built a model that could identify the content of an email and tag each email to help contact centre workers determine the priority of each email [19].

1 https://www.ocado.com

1.1 Problem Definition

The company collaborating on this thesis project has data similar to what was described for Ocado, but it has a different way of acquiring it. The available data was acquired by collecting what customers submitted through the company's website, where they need to go through several problem identification steps and pick a specific topic beforehand. The difficulty of picking the right topic for their problem resulted in customers picking unfitting topics. Thus, a significant amount of data was collected without a proper topic attached to it.

This condition is problematic for the company because it takes them time to understand the problem. The company must manually analyse each problem description in order to give a proper solution.

Text categorisation is a technique of grouping text based on its topic or category. It has long been used to manage documents more efficiently, as mentioned in [39]. This project will utilise text categorisation to group the available data based on their problem topic and to find the best approach for doing so. This is a multiclass classification problem, i.e. one with three or more classes. Good performance is usually based on measuring accuracy, but this project used a combination of precision and recall. A detailed description of the measurements used in this project is given in Section 3.5.

1.2 Objective and Goals

The objective of this master's thesis project is to understand customer problems by utilising machine learning to categorise this feedback. The first and most important step in this project is to understand what kind of data we have on the most used platform. Thus, this project will achieve its objective through these goals:

• Find the most suitable approach to get a good performance.

• Find the most suitable classifier for this problem.


• Evaluate the trained classifier with suitable metrics.

1.3 Delimitation

The data used for this project is limited to the English language only. This project has many challenges, of which two are critical: the possibility of class noise and imbalanced data. The data described in this report are generalised to avoid exposing the real company data. The source code also cannot be shown in this report, as it is the company's property. A previous experiment was done internally within the company, and some of its aspects are used in this project: a process to clean the dataset, the ratio between training set and testing set, and an F1 score value used as a baseline. That experiment itself cannot be described in this report due to its confidential nature.

1.4 Methodology

This project consists of a literature study to gather the knowledge available from related research and from the previous experiment that had been done. Data gathering and analysis were done in parallel with the literature study and continued for the whole project duration. The next step was designing the workflow and implementing it. This process went through evaluation and improvement continuously.

1.5 Thesis Structure

This report consists of several chapters that start with the background of this project. The problem is defined and a few goals are set as part of the introduction in Chapter 1. Chapter 2 contains related work done by others that serves as a reference for this project. Chapter 3 is a general theory of standard machine learning techniques for classification, while Chapter 4 is a detailed explanation of the implementation. Chapter 5 presents the results gained from this project, followed by Chapter 6, which discusses them further. The summary and future research are presented in Chapters 7 and 8.


Chapter 2

Text Categorisation

Text categorisation is an ever-growing field, with many researchers generalising the problem as multiclass classification. Some previous researchers showed comparisons of different classification models [39] or even identified the most suitable model for text classification and the reason behind it [21]. There is always a need to be better at organising textual documents, especially in an enterprise environment. The process of using machine learning for this kind of problem starts with preprocessing the text, generating features (measurable characteristics of an item) from the text into a vector, training the classifier by feeding it the features and labels, and evaluating the performance of the trained model using suitable metrics. Figure 2.1 shows those steps.

Figure 2.1: Steps of text categorisation

Text pre-processing is the first step in working with textual data in machine learning. Simple tokenisation of words is enough for most datasets, and consistency in preprocessing the test set is important [8]. It is also essential to take into account that proper feature selection and representation of the text will improve the performance of a classifier, which has been shown to work for the Support Vector Machine (SVM) [16]. It is also important to care more about text representation than about the configuration of a classification model. E. Leopold and J. Kindermann show that changing the way words are weighted impacts performance more than changing the kernel configuration of the model [26]. A combination of words is also a suitable way of having a different representation of a text. Using a combination of two words, called a bi-gram, for text representation results in good performance for sentiment analysis with two classes as the label, while it performs worse when used for a multiclass classification problem [42].

The transformation of text into a vector in this process is usually handled using natural language processing, for example using Doc2Vec2, where the context of a word is the key to placing the word in the vector space, or based on the frequency of the word by implementing TF-IDF as the term-weighting scheme. Both techniques will be explained in section 3.2. A continuous vector representation of words gives good accuracy using the proposed models named Continuous Bag-of-Words (CBOW) and Continuous Skip-gram in previous research [31] [30]. A term-weighting approach is more suitable for text retrieval compared to a more elaborate text representation, based on this research [38].

Pre-processing and feature generation are the crucial parts of text classification, since the information in the text itself is essential. If the way of representing the text in the vector space is suitable for the dataset and problem, then the classifier can work better in learning the trend from the features. One study mentioned that removing stop words and stemming could decrease the quality of the word representation [36], while another showed that different types of word representation could improve classification in the patent domain [12].

All of the steps above will be worthless if the measurement metrics are not suitable. An incorrect metric would not give a proper measurement of a model, because each problem is unique and bound to the data currently being used. Different characteristics of data will change the model that is trained on it. Accuracy would be enough as a measurement metric if the data only needed to be classified into two classes, while for classifying more than two classes it is more common to use the F1 score as a measurement metric. The F1 score describes the performance by combining precision and recall, and thus gives more insight into how the model performs for each class. Beyond that, the confusion matrix is a measurement that gives an even more detailed view of the performance for each class. These metrics are more useful for gaining insight into the performance of a model trained on imbalanced data with a possibility of class noise. Section 3.5 will explain these metrics.

2 https://radimrehurek.com/gensim/models/doc2vec.html

2.1 Variety of Noise

Noise is a common problem in real-world data used for machine learning. The degree to which noise affects model performance needs to be minimised, since noise is not a good thing to have in this project. A higher amount of noise in the data causes the model to learn in the wrong way. Noise in text categorisation can be defined in two different ways: feature noise and class noise. Feature noise is noise that appears within the text and distorts the content [1]. This kind of noise can explode the feature dimension. As an example, the word email is a potential feature that could relate to an email problem, but it could also appear in other documents as e-mail. This condition makes those two words different features, hence the feature dimension grows. It can lead to the curse of dimensionality: a high number of dimensions makes it difficult to detect groups with similarity because the data become sparse.

Another type of noise is called class noise. There are two different kinds of class noise: instances that have the same feature values but different label values, as described in [17], and instances that have the wrong label value (mislabeling), as in [47]. There has been research measuring the impact of noise in the data on model performance, to recognise which kind of noise has the most impact and which type of classifier is most robust in handling noisy data. The result of that research is that Naive Bayes is the most robust among the tested classification models; the impact of noise varies, but noise in the training dataset generally has the most impact [32].

2.2 Imbalanced Data

Imbalanced data in classification is a condition where the data has an uneven distribution over the classes. There are dominant classes with a high amount of data compared to the others. Imbalanced data is also one of the most common problems in real-world data. The model usually tends to perform better for the dominant classes compared to the less represented classes. The critical part of working with imbalanced data is not only how the data used for training the model is prepared, but also the type of metric that is used for measuring the model. Related research points out that oversampling or under-sampling to make the data more balanced could result in overlearning or information loss [46]; therefore this project did not use any resampling method. A few previous works suggest that using many measurements to evaluate the performance is better than relying on one or a few common metrics. One study mentioned that accuracy as a measurement metric is less reliable and biased toward the dominant class, while discussing several other metrics and their focus of evaluation [18]. Another proposed a metric called weighted-AUC to measure imbalanced data in a cost-bias situation [43]. Yet another mentioned a combination of measurements and graphical performance assessment [3], which inspired this project to use the F1 score, Geometric Mean, and Index Balanced Accuracy as measurement metrics.

Chapter 3

Theory

This chapter contains a general description of the techniques available for text categorisation, particularly those relevant to this project. It starts with relevant pre-processing techniques for the text, then covers engineering features from the available data, explanations of various classifiers, the basic technique of tuning a classifier, and relevant evaluation metrics.

3.1 Pre-Processing

Pre-processing is the process of making the data ready to be fed into the model. It is a good start to convert all characters into lowercase before applying any other function, to avoid case-sensitivity problems. After that, unimportant elements such as numbers, symbols and stop words can be removed. Stop words are words that do not carry a specific meaning or important information; for example, the word is does not have any significant meaning by itself. There is also a process of correcting misspelled words, for example log-in or log in are corrected into login, and e-mail or e mail are corrected into email, as well as a process of anonymising email addresses into the generic phrase email address.

All of those steps are usually good enough to produce clean text data, but there are at least two more types of pre-processing that could be done. Lemmatisation and stemming are more elaborate techniques for preparing words.


3.1.1 Lemmatisation

Lemmatisation is converting a word into its canonical (dictionary) form, based on the use of vocabulary and morphological analysis of the word [27]; it therefore depends heavily on having a vocabulary and a morphological analyser. This project used lemmatisation with the NLTK WordNet Lemmatizer3 and the list of stop words from the NLTK corpus [4].
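As a minimal sketch of the pre-processing described above (lowercasing, removing numbers, symbols and stop words, normalising spellings, and lemmatising), the following Python snippet uses NLTK. It is illustrative only: it assumes the NLTK data packages are downloaded, and the spelling-normalisation rule is an invented example rather than the project's actual cleaning code.

import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Requires the NLTK data packages 'punkt', 'stopwords' and 'wordnet'.
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess(text):
    text = text.lower()                            # avoid case-sensitivity problems
    text = re.sub(r'e[- ]mail', 'email', text)     # example spelling normalisation
    text = re.sub(r'[^a-z ]', ' ', text)           # drop numbers and symbols
    tokens = word_tokenize(text)
    tokens = [t for t in tokens if t not in stop_words]
    return [lemmatizer.lemmatize(t) for t in tokens]

print(preprocess("Help! I cannot change my e-mail address, the page keeps loading."))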

3.1.2 Stemming

Stemming is a heuristic process of cutting off the end of a word in the hope of getting a consistent word out of words with affixes. Stemming depends heavily on the language, because it is based on how each language changes a word when affixes are applied.

3.2 Feature Engineering

Feature engineering is also known as feature extraction or feature generation. It is the process of extracting information from raw data into a specific form and constructing features, which may be new and may be constructed from several components of the data [5]. In this master's thesis project, the available raw data are textual data and categorical data, and each form has several feature generation techniques. Those techniques are TF, TF-IDF, one-hot encoding, and word embedding, which will be explained in this section.

3.2.1 Term Frequency (TF)

Term frequency gives each term a weight based on how often it appears in the data. A term can be defined as a word in the data. Below is an example of how term frequency works.

Document 1 (D1) = ”a blue book on the table”

Document 2 (D2) = ”a blue pen, a brown book, and a table”

3 https://www.nltk.org/index.html


Based on the term frequencies of each word, the documents would be represented as below:

Term   a   blue   book   on   the   table   pen   brown   and
D1     1   1      1      1    1     1       0     0       0
D2     3   1      1      0    0     1       1     1       1
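The table above can be reproduced with a few lines of plain Python; this toy snippet only counts word occurrences and is not part of the project's code.

from collections import Counter

d1 = "a blue book on the table".split()
d2 = "a blue pen a brown book and a table".split()   # punctuation already stripped

tf_d1, tf_d2 = Counter(d1), Counter(d2)
vocabulary = sorted(set(d1) | set(d2))

# Term frequency per document over the shared vocabulary (0 when a term is absent).
for term in vocabulary:
    print(term, tf_d1[term], tf_d2[term])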

3.2.2 Term Frequency-Inverse Document Frequency (TF-IDF)

Term frequency favours the terms with the highest frequency and risks losing meaningful information that probably lies in the less frequent terms. A weighting scheme named Term Frequency-Inverse Document Frequency (TF-IDF) is a way to give a better value to each term. It takes into account the importance of a term within the documents, which results in a higher value for a frequent term that appears in few documents and a lower value for a term that is frequent across many documents or infrequent within a document [28].

tfidf(t, d) = tf(t, d) \cdot idf(t)    (3.1)

The equation of TF-IDF is shown in equation 3.1, with tf(t, d) as the term frequency of term t in document d and idf(t) as the inverse document frequency of term t. The inverse document frequency itself is given in equation 3.2, with N as the total number of documents and df(d, t) as the document frequency, i.e. the number of documents d that contain term t.

idf(t) = \log\left(\frac{1 + N}{1 + df(d, t)}\right) + 1    (3.2)
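A minimal scikit-learn sketch of this weighting scheme is shown below. TfidfVectorizer combines the term counting of CountVectorizer with the smoothed inverse document frequency of equation 3.2 and l2 normalisation; the two toy documents are the ones from section 3.2.1.

from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "a blue book on the table",
    "a blue pen a brown book and a table",
]

# Term counts weighted by inverse document frequency, then l2-normalised.
vectorizer = TfidfVectorizer(norm='l2')
tfidf_matrix = vectorizer.fit_transform(documents)

print(vectorizer.vocabulary_)     # term -> column index
print(tfidf_matrix.toarray())     # one TF-IDF vector per document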

3.2.3 One-Hot Encoding

A categorical value used as a feature needs to be extracted with a specific technique to preserve its information without confusing it with an ordinal value. A categorical value is a discrete, unordered value [15] that should not be given a numerical representation in which category A is higher than category B. One-hot encoding works as a mapper that transforms each categorical value into an array of binary values with at most one value activated [7]. An example of transforming categorical values into vectors is displayed below.


Color = ['Blue', 'Green', 'Red']
oneHot(Color) = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
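The same transformation can be done with scikit-learn's LabelBinarizer, which is also the transformer used later for the category feature (section 4.2.2); the colour values are just the toy example above.

from sklearn.preprocessing import LabelBinarizer

colors = ['Blue', 'Green', 'Red', 'Green']

encoder = LabelBinarizer()
one_hot = encoder.fit_transform(colors)   # one binary column per distinct category

print(encoder.classes_)   # ['Blue' 'Green' 'Red']
print(one_hot)            # [[1 0 0] [0 1 0] [0 0 1] [0 1 0]]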

3.2.4 Word Embedding

Word embedding is a set of language modelling techniques that captures information reflecting the structure of the words [33]. Word2Vec is one method that maps each word into a vector based on its context. It uses Continuous Bag-of-Words (CBOW) or Skip-gram as the model that transforms a word into a vector. CBOW is a model that predicts a word based on context words as input, while Skip-gram predicts the context of a word [37]. Those models focus on the word level, while another model called Doc2Vec focuses on a higher level of the text, either a sentence, paragraph or document. Doc2Vec extends the functionality of Word2Vec by adding one feature vector that is unique to each document. This model can transform the whole document as a unit and relate it to other documents through each document-unique vector [24].

It maps each document into a single vector and each word into another vector. The document vector is represented by a column in matrix D and each word vector by a column in matrix W. Those vectors are then concatenated, averaged, or summed and classified using softmax or hierarchical softmax to predict the next word in the same context. This algorithm is called Paragraph Vector - Distributed Memory (PV-DM). There is another algorithm that ignores the context and produces a set of word vectors based on sampled documents. This algorithm is called Paragraph Vector - Distributed Bag of Words (PV-DBOW), with the advantage of storing less data compared to PV-DM: PV-DM stores softmax weights and word vectors, while PV-DBOW only needs to store softmax weights [24]. There is not much difference in performance between the two techniques, but this project used PV-DBOW with summed vectors because it needs less memory while running and it was used for transforming text into a vector rather than predicting the next word in the same context.

3.3 Classifier

A classifier is a model or algorithm that is trained by consuming curated labelled data and learning from it, in order to be able to classify unlabelled future data. These classifiers have their own ways of learning and classifying based on the implemented algorithm. A classifier could be suitable for solving a specific problem and wrong for another; there is no single model that can solve every problem. Naive Bayes (section 3.3.1), Logistic Regression (section 3.3.2), Support Vector Machine (section 3.3.3), Random Forest (section 3.3.4), and Stacked Generalisation (section 3.3.5) are the ones that were used in this project.

3.3.1 Naive Bayes

Naive Bayes is a supervised algorithm that is based on the Bayes theorem with the naive assumption that every pair of features is independent [44]. It measures the probability of a class based on the values of the features. Equation 3.3 shows that the probability of class c, P(c|x), depends on the probability of feature x appearing, P(x|c), together with the prior probabilities P(x) and P(c). It requires a small amount of training data and is extremely fast compared to more sophisticated algorithms. It is well known to perform well on document classification and spam detection problems. This project specifically used Multinomial Naive Bayes and Gaussian Naive Bayes, which are classifiers that implement Naive Bayes and can work on a multiclass classification problem. A classifier that relies on probabilistic values has a problem when it encounters unknown features, which results in a probability of zero. Multinomial Naive Bayes resolves this problem through a smoothing algorithm, either Laplace smoothing or Lidstone smoothing [11].

P(c \mid x) = \frac{P(x \mid c) P(c)}{P(x)}    (3.3)
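A small scikit-learn sketch of Multinomial Naive Bayes on TF-IDF features is shown below; the texts and labels are invented toy examples, and alpha=1.0 corresponds to the Laplace smoothing mentioned above.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["cannot log in to my account", "invoice charged twice this month",
         "please reset my password", "refund for the double payment"]
labels = ["login", "billing", "login", "billing"]

features = TfidfVectorizer().fit_transform(texts)

# alpha=1.0 is Laplace smoothing, which avoids zero probabilities for terms
# that never occur together with a class in the training data.
classifier = MultinomialNB(alpha=1.0)
classifier.fit(features, labels)
print(classifier.predict(features[:1]))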

3.3.2 Logistic Regression

Logistic Regression is a linear classifier, usually also called logit regression, maximum-entropy classification, or the log-linear classifier. The model is implemented using the logistic sigmoid function shown in equation 3.4. It can use different types of learning/solvers with optional regularisation. There are several learning algorithms, such as Stochastic Average Gradient (SAG), liblinear with coordinate descent, Limited-memory BFGS (LBFGS), Newton-CG, and Stochastic Gradient Descent (SGD) [10]. SAG, LBFGS, and Newton-CG can learn a true multiclass model and are able to use L2 regularisation. Liblinear can use L1 or L2 regularisation, but it can only use one-versus-rest to learn a multiclass model. L1 regularisation will produce a sparse weight vector, while L2 will produce a dense one. Hyperparameter tuning is used to determine the best value of the regularisation strength.

\sigma(a) = \frac{1}{1 + e^{-a}}    (3.4)

Logistic regression for binary classification can be defined as in equation 3.5, which applies the logistic sigmoid function 3.4. Equation 3.5 models the probability of class C_1 given \phi (a feature vector) as \sigma(\cdot) (the logistic sigmoid function). The output is set to 1 if the instance belongs to that class and 0 if it belongs to the other class [44].

p(C_1 \mid \phi) = y(\phi) = \sigma(w^{T}\phi)    (3.5)

3.3.3 Support Vector Classifier (SVC)

The Support Vector Classifier is a model that optimises the way a linear model works in order to handle non-linear problems. The main component of this concept is constructing a hyperplane that separates classes based on their support vectors. A support vector is a data point that has the closest distance to the nearest class; the hyperplane is built by maximising the distance between them, and it is called the maximum-margin hyperplane [44]. A hyperplane could be described as:

x = w_0 + w_1 a_1 + w_2 a_2    (3.6)

with a_1 and a_2 as the attribute values, while w_0, w_1, and w_2 are weights. The maximum-margin hyperplane can be written as below:

x = b + \sum_{i=1}^{l} \alpha_i y_i \, a(i) \cdot a    (3.7)

In this equation, y_i is the class value of training instance a(i), while b and \alpha_i are numeric parameters that have to be determined by the learning algorithm [44].

The classifier has different ways of learning a multiclass problem; some use one-vs-one and some use one-vs-rest. One-vs-one builds n_{class} \cdot (n_{class} - 1) / 2 classifiers, each learning from the data of two different classes, while one-vs-rest builds n_{class} classifiers, each learning one class compared to all other classes. SVC optimisation using SGD was also implemented in this project.

3.3.4 Random Forest

Random forest is an ensemble classifier that expands on the decision tree classifier. It builds trees where each of them comes from sampling the training set with replacement (bootstrap), which means that each tree handles the same size of dataset with different values [6]. Each tree gives its prediction of a label, and all predictions are averaged to get the final label. The randomness introduced through resampling with replacement slightly increases the bias, while averaging the results decreases the variance, which usually makes the model better overall.

Figure 3.1: Random Forest workflow

3.3.5 Stacked Generalisation

Stacked generalisation is another ensemble classifier structure in which a few base classifiers produce intermediate results and pass them to a final classifier (the meta-classifier) [2]. Each base classifier is trained on the full training set and produces a prediction, which can be either a final label or a class probability. This model can use any classifier as a base classifier or as the meta-classifier, which gives flexibility in building the stack.
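A sketch of this structure is shown below, assuming the mlxtend library (one common implementation of stacking; whether this matches the implementation cited as [2] is an assumption). The chosen base classifiers and the toy data are illustrative, not the project's configuration, which is described in section 4.3.

from mlxtend.classifier import StackingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Base classifiers produce intermediate predictions; the meta-classifier
# (here Logistic Regression) learns from those predictions.
stack = StackingClassifier(
    classifiers=[GaussianNB(), RandomForestClassifier(n_estimators=100)],
    meta_classifier=LogisticRegression(solver='liblinear'),
    use_probas=True,    # pass class probabilities instead of hard labels
)

X = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.1], [0.8, 0.9]]
y = [0, 1, 0, 1]
stack.fit(X, y)
print(stack.predict([[0.85, 0.75]]))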


Figure 3.2: Stacked classifier workflow

3.4 Hyperparameter Tuning

Hyperparameter tuning is a method for finding the best configuration of an estimator's parameters, whether the estimator is a classifier, a feature vectorizer, or a feature selector. The main focus of this step in this project is to get the best configuration for the classifier. The training set was used in this process, divided into different partitions using K-fold cross-validation, while searching through combinations of values from the parameter grid.


3.4.1 Parameter Search

Hyperparameter tuning requires a certain set of values for each parameter; the search either goes through all combinations using grid search or randomly chooses a certain number of combinations using random search. A parameter is an argument or variable that configures the classifier, while the parameter grid is a dictionary of predefined values for each parameter.
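A sketch of a grid search with 3-fold cross-validation (the setting used in this project, section 3.5.1) is shown below; the synthetic data and the grid values are illustrative assumptions, not the project's actual grid.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=20, n_classes=3,
                           n_informative=5, random_state=0)

# Parameter grid: a dictionary of candidate values for each parameter.
param_grid = {'C': [0.01, 0.1, 1.0, 10.0], 'penalty': ['l1', 'l2']}

search = GridSearchCV(LogisticRegression(solver='liblinear'),
                      param_grid, scoring='f1_weighted', cv=3)
search.fit(X, y)
print(search.best_params_, search.best_score_)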

3.5 Evaluation Metrics

An evaluation metric is a way to measure the trained model's performance. A trained model here covers everything from pre-processing up to training the classifier. Each time the implementation changes, the trained model should be measured again. Many different metrics could be used in this project, but only the relevant metrics are explained in this section.

3.5.1 K-fold Cross-Validation

K-fold cross-validation is a method of validating the result of tuning a classifier's parameters by using only the training set. K represents how many partitions, and thus iterations, of the dataset are used to validate a certain configuration. One partition is used as a validation set while the rest is used to train the classifier. The validation set changes from one partition to another on each cross-validation run, as illustrated in figure 3.3. This project uses 3-fold cross-validation for each configuration produced by combining the available parameter values.


Figure 3.3: Illustration of cross validation

3.5.2 Confusion Matrix

The confusion matrix is a matrix that maps out the classification errors made by the model. It shows the amount of correctly classified and incorrectly classified data, together with the distribution of which classes the data was misclassified into. It contains the number of true positives (tp), true negatives (tn), false positives (fp), and false negatives (fn) as measurement terms. True positives and true negatives are the amounts of correctly classified data for the positive and negative class respectively. False positives and false negatives are the amounts of incorrectly classified data for the positive and negative class respectively.
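A small scikit-learn sketch with toy predictions is shown below; note that scikit-learn places the true classes on the rows and the predicted classes on the columns.

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels 0/1 the layout is:
# [[tn, fp],
#  [fn, tp]]
print(confusion_matrix(y_true, y_pred))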


Figure 3.4: Confusion matrix for binary classification

3.5.3 Classification Metrics

Classification metrics are common measurements for classification problems, whether binary, multiclass, or multi-label. Accuracy is the most common metric used to measure the performance of a model, as it measures how close the predicted values are to the true values. Precision measures how many of the values predicted for a class are actually correct. Recall measures how many of the true values of a class are correctly predicted. The F1 score is the harmonic mean of precision and recall, and the F1 weighted score is an extended version of it that takes class imbalance into account by using the number of true instances as a weight for each class. All of the measurements above can easily be described for binary classification by equations 3.8, 3.9, and 3.10, where the highest score is 1 and the lowest is 0.

precision = \frac{tp}{tp + fp}    (3.8)

recall = \frac{tp}{tp + fn}    (3.9)

F_1 = \frac{2 \cdot precision \cdot recall}{precision + recall}    (3.10)

A multiclass problem needs to be measured with all classes taken into account. The common way to measure multiclass classification is to compute the binary measurement for each class and average them to get the total score. It is also important to take the amount of data for each class into account if the dataset has an imbalanced distribution. Thus the measurement is weighted by the support w of each class c from the list of classes C. Equations 3.11, 3.12, and 3.13 describe those measurements for multiclass classification.

precision_{weighted} = \frac{\sum_{c \in C} w_c \frac{tp_c}{tp_c + fp_c}}{\sum_{c \in C} w_c}    (3.11)

recall_{weighted} = \frac{\sum_{c \in C} w_c \frac{tp_c}{tp_c + fn_c}}{\sum_{c \in C} w_c}    (3.12)

F1_{weighted} = \frac{\sum_{c \in C} w_c F_c}{\sum_{c \in C} w_c}    (3.13)

3.5.4 Imbalanced Metrics

The most crucial measurement for imbalanced data is the F1 score, which considers the mean of precision and recall and can also take the amount of data in each class into account. Using a wider variety of measurements makes it easier to capture the true performance of a model, hence the geometric mean and index balanced accuracy are also used.

Geometric Mean (Gmean)

The geometric mean (Gmean) is a mean that uses the product of values to determine their central tendency. It can measure different types of variables with different ranges of values without favouring the one with the higher range. In binary classification it is the square root of the product of specificity and sensitivity [3]. Sensitivity is another name for recall; it is a measurement of correctly classifying the positive class. Specificity is a measurement of correctly classifying the negative class. Specificity and the geometric mean for binary classification are shown in equations 3.14 and 3.15 respectively.

specificity = \frac{tn}{tn + fp}    (3.14)

Gmean = \sqrt{specificity \cdot recall}    (3.15)

There are different ways of computing the geometric mean for measuring imbalanced data in multiclass classification [25], but it is easier to compute it for each class using equation 3.15 and then average them using the support as weight, as shown in equation 3.16.

Gmean_{averaged} = \frac{\sum_{c \in C} w_c \, Gmean_c}{\sum_{c \in C} w_c}    (3.16)

Index Balanced Accuracy (IBA)

Index balanced accuracy is a measurement that balances a global performance measurement (Gmean^2) with the dominance of a class (dom). A high value of IBA is achieved if all classes have a high, balanced performance. Index balanced accuracy is implemented so that it can wrap any scoring function and prepare it to handle imbalanced data [25]. This project used the geometric mean as the scoring function for computing index balanced accuracy. IBA introduces a new index called Dominance (dom). It measures the prevalence of the dominant class over the other classes and can be computed from recall and specificity, as shown in equation 3.17.

dom = recall - specificity    (3.17)

The Dominance value has a range between -1 and +1, where +1 means the classifier works perfectly for the positive class and fails for the negative class, and -1 means the classifier works perfectly for the negative class and fails for the positive class. A perfect classifier would then have a Gmean score of 1 and a Dominance of 0, which leads to the equation for IBA shown in 3.18.

IBA = (1 + dom) \cdot Gmean^2    (3.18)

It is also vital to introduce \alpha, which can take a value in the range 0 \le \alpha \le 1, to weight the value of Dominance, as shown in equation 3.19. The purpose of this variable is to make IBA more robust, as explained in [14], where \alpha = 0.1 gives the most suitable result for the experiment.

IBA_{\alpha} = (1 + \alpha(recall - specificity)) \cdot (recall \cdot specificity)    (3.19)

Thus, IBA for multiclass classification is described in equation 3.20.

IBA_{averaged} = \frac{\sum_{c \in C} w_c \, IBA_c}{\sum_{c \in C} w_c}    (3.20)


Chapter 4

Method

This project implemented a workflow that consists of pre-processing, feature engineering, training, and evaluating the model, as shown in figure 4.1. All implementation was done using Python in Jupyter Notebook [41], as it is well suited to displaying the analysis and exploring different approaches. These approaches were based on the literature studies mentioned in chapters 2 and 3. There is a cleaning process that happens outside of this project (represented by a dashed line), producing dataset B, which is used in this project. It filters out instances with a null description, takes instances that are in English with a score below a threshold, limits the date range, and takes only those instances with a character length within a specific range. All of those configurations come from the previous internal experiment within the company. Another dataset, called dataset A, was produced after preprocessing. Data analysis is always a part of the whole workflow.

Figure 4.1: General workflow

This chapter will describe the data analysis (section 4.1), the different ways of generating features (section 4.2), the configuration of each classifier (section 4.3), and the approach to classifying unlabelled data (section 4.4).

4.1 Data Analysis

This section consists of an illustration of the available data, a way to measure the data volume needed for this project, the data distribution for each class, an explanation of the suspected class noise, a specific dataset with labels assigned by professionals, and the unlabelled data that needs to be classified.

4.1.1 Available Data

Each instance consists of features and a class. The features are description and category, while the class is the topic. The description is a textual explanation of the problem, while the category is a standard generic label used as a higher level of topic. The available dataset is a subset of the total data available in the company and contains descriptions, four categories and 21 classes. The dataset was divided into a training set and a testing set with a 70:30 ratio, with the intent of matching the ratio used in the previous experiment mentioned in section 1.3. A generalised illustration of the data is shown in table 4.1.

Description                                                                           | Category   | Class
Help! I need to change my personal information.                                       | Category D | Label 1
Something is wrong. I already paid the bill but I keep getting an invoice.            | Category C | Label 2
I have a discount and I should have pay less, but I still get charged the same. Why?  | Category D | Label 2

Table 4.1: Illustration of the data. It is not the real value of the data. All values were made up to give a general view.

4.1.2 Data Volume

There is a large amount of data available to be used, but it needs to go through pre-processing, which would take a long time if everything were processed. Thus arises the need to know whether the amount of data in the dataset is enough. This was analysed by comparing the error on the testing set and the training set while increasing the data size. This method is usually called bias versus variance analysis, and the error measurement used is 1 - F1 score. The analysis was done for each classifier, as shown in appendix A, figures A.1 to A.5.

A condition where the two lines seem to meet at some point is called underfitting, which means feeding more data to the classifier will help to increase the performance. The opposite condition, where both lines seem to continue without the possibility of ever meeting each other, is called overfitting, which means adding more data will not increase the performance of the classifier. The main objective is to see whether the lines that represent the training set and testing set are likely to overlap. As shown in appendix A, figures A.1 to A.5, all of the classifiers overlearn the data, including learning from noise. All classifiers show that increasing the data results in overfitting, which leads to the conclusion that increasing the data is not likely to help the classifiers perform better.
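The same kind of analysis can be sketched with scikit-learn's learning_curve, which computes train and validation scores at increasing training-set sizes; the synthetic data and classifier below are stand-ins, since the project's data cannot be shown, and the error is taken as 1 - F1 as described above.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, n_features=30, n_classes=3,
                           n_informative=6, random_state=0)

sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(solver='liblinear'), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=3, scoring='f1_weighted')

# Error = 1 - F1; comparing the two columns shows bias versus variance.
train_error = 1 - train_scores.mean(axis=1)
test_error = 1 - test_scores.mean(axis=1)
for n, tr, te in zip(sizes, train_error, test_error):
    print(n, round(tr, 3), round(te, 3))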


4.1.3 Imbalanced Data

There are 21 classes with a different amount of data in each class. The data distribution per class is shown in figure 4.2: there is one class with almost 20% of the data and one class that contains less than 1% of the total. This condition of imbalanced data affects how classification metrics are used. The accuracy metric can only tell how well the data is classified correctly and leaves out the incorrectly classified data. It works best if the data has the same amount of false positives and false negatives. The imbalanced data used in this project implies that the ratio between false negatives and false positives is most likely imbalanced too. Since it is not enough to rely on the accuracy metric for measuring the performance of a model trained on imbalanced data, more measurements need to be used. Thus, this project used the F1 weighted score, geometric mean, and index balanced accuracy, as mentioned in section 3.5.

Figure 4.2: Data distribution per class

Each class has a different amount of data, and some classes have a large amount compared to the others, which is a problem of imbalanced data. This problem makes the model tend to classify data toward the dominant classes more than the other classes. It makes the model's objective of recognising all classes harder, and the model performance will be skewed towards the class with the most data. Thus, it is essential to be able to correctly measure the performance using metrics appropriate for imbalanced data. The way of measuring imbalanced data was discussed in section 3.5.

4.1.4 Class Noise

The dataset has 21 different label values, which are used as the target classes. These labels were assigned by the customer along with the description of the feedback when it was submitted to the system. This is achieved through a process where the customer selects 1 of 21 predefined values in the system. There is a possibility of mislabeling, which results in the problem of class noise.

There is also a possibility of a label being used for different types of problems. This happens because the problem coverage of each label is not the same; some labels are broader than others. The work done in this project will discuss how to handle the class noise.

4.1.5 Data Labelled by the Company

There is a small amount of data that was classified manually by professionals within the company; it has the same structure, containing a description and a category. Although it has the same structure, the class has different values. If the classes assigned by the customer have the values A, B, C, and so on, the classes assigned by the professionals have the values 1, 2, 3, 4, and 5. These five classes were assigned by professionals who relied heavily on the description and disregarded the category. Their advanced knowledge of the problem domain makes it possible to work in that way. The limited range of problems handled by the professionals also makes a difference. Thus, these five classes of professionally assigned values only cover three classes of the customer-assigned values.

This particular data is a way of making sure that the class noise mentioned earlier is not distorting the model's performance. It needs to be mapped onto customer-assigned values to make it usable for this project. There are 5 classes, named Company label 1 to Company label 5, which can be mapped onto 3 classes of customer-assigned values. This mapping process was based on professional knowledge within the company.

Figure 4.3 illustrates the conversion of value.


Figure 4.3: Change company class value into customer class value

Unfortunately, this dataset is small. It contains three classes out of the 21 classes, due to the manual process it requires. This dataset becomes another test set for measuring the model performance as well as measuring the impact of class noise. The model was trained using customer-labelled data and evaluated using the company-labelled data, as shown in figure 4.4.

Figure 4.4: Testing the model on company labeled data


4.1.6 Unlabelled Data

Among the data submitted to the system, there is data without a label assigned by the customer, and the amount of it is quite significant. For the same period as the dataset used for training and testing the model, this unlabelled data is almost half as big as the labelled dataset itself, as shown in figure 4.5. This condition happens because the customer does not know what the topic of their problem is, or the possible topics presented in the system are not sufficient for their problem. Thus, re-classifying as much of this data as possible into the available labels using the model could help identify customer problems.

Figure 4.5: Amount of unlabelled data

4.2 Feature Engineering

The features used in this project come from the description and the category. The category is a categorical value transformed into a vector through one-hot encoding. The description is freeform text transformed into a vector based on either the frequency of words or the context of words, as described below. The description went through the pre-processing steps mentioned in chapter 3 before being transformed into features.

This project compares two different ways of transforming text into a vector: a word frequency-based algorithm implemented through the TF-IDF vectorizer, and word embedding based on the context of the word implemented through Doc2Vec. Those components make four different workflows of feature generation that were compared in this project. These workflows are represented by figures 4.6, 4.7, 4.8, and 4.9, which illustrate each of them.

4.2.1 TF-IDF workflow

The first workflow generates features by transforming the description into a vector using TF-IDF, as illustrated in figure 4.6. It was easily implemented through CountVectorizer [7] and TfidfTransformer [7]. Both are combined as TF-IDF and trained using the descriptions from the training set. The trained transformer can then transform a description from the training set or testing set into a vector for each document. CountVectorizer was configured to remove ASCII accents, make each term a combination of 1 or 2 words, remove terms that occur in fewer than ten documents, remove stop words if any remain, and remove terms that occur in all documents. TfidfTransformer was configured to use the cosine (Euclidean) norm for normalisation, represented by norm='l2'. This workflow produced features with a dimension from 10577 up to 12081.
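A sketch of this configuration in scikit-learn is shown below. The parameter values mirror the description above (strip ASCII accents, 1-2 word terms, minimum document frequency of ten, dropping terms that occur in essentially all documents, l2 normalisation); any value not stated in the text, such as the exact max_df cut-off, is an assumption, and train_descriptions is a placeholder for the project's pre-processed training descriptions.

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

tfidf_workflow = Pipeline([
    ('counts', CountVectorizer(
        strip_accents='ascii',     # remove ASCII accents
        ngram_range=(1, 2),        # terms of 1 or 2 words
        min_df=10,                 # drop terms occurring in fewer than ten documents
        max_df=0.99,               # drop terms occurring in (almost) all documents
        stop_words='english')),    # remove stop words, if any remain
    ('tfidf', TfidfTransformer(norm='l2')),
])

# feature_matrix = tfidf_workflow.fit_transform(train_descriptions)
# test_features = tfidf_workflow.transform(test_descriptions)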


Figure 4.6: Workflow of TF-IDF

4.2.2 TF-IDF and One-Hot workflow

The second workflow generates features by adding the category to the feature, as shown in figure 4.7. It works exactly like the previous workflow for the description, but the category uses one-hot encoding as the transformer. One-hot encoding was implemented using LabelBinarizer instead of OneHotEncoder from scikit-learn. This was because the OneHotEncoder needed to be combined with LabelBinarizer, and the two did not work well when combined in a FeatureUnion. FeatureUnion is a pipeline that can combine different feature transformers. It is more straightforward and more reliable to use than transforming each component with a different transformer and combining the results in an array; the array form of the feature would limit the dimension and the amount of data that could be processed. Thus, LabelBinarizer was used, and it could easily be combined inside a FeatureUnion. A trained FeatureUnion can transform the description and category from the training set or testing set. This workflow produced features with a dimension from 10575 up to 11865.
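The project combined the two transformers inside a FeatureUnion; the sketch below illustrates the same idea in a simpler way, transforming the description and the category separately and stacking the resulting matrices. The toy descriptions and categories are invented.

from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelBinarizer

descriptions = ["cannot log in to my account", "invoice charged twice",
                "please reset my password", "refund the double payment"]
categories = ["Category D", "Category C", "Category D", "Category C"]

description_features = TfidfVectorizer().fit_transform(descriptions)
category_features = LabelBinarizer().fit_transform(categories)

# Final feature matrix: TF-IDF columns followed by one-hot category columns.
combined = hstack([description_features, category_features])
print(combined.shape)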

Figure 4.7: Workflow of TF-IDF and One-hot encoding

4.2.3 Doc2Vec workflow

The next workflow generates features by using word embedding to build a feature vector from the description, as shown in figure 4.8. It was implemented using Doc2Vec [35], where each description is treated as a document. Each document needs to be tagged with a unique value, and its words are made into an array. Those tagged documents were used to build a vocabulary for Doc2Vec, while the process of training Doc2Vec only uses the arrays of words from the documents. It took any available word, produced a feature vector of 500 dimensions, randomly downsampled words with a frequency higher than 1e-3, used 8 as the maximum distance between the current and predicted word within a sentence, iterated 100 times through the data, used 5 threads to speed up the training process, and implemented PV-DBOW. The maximum distance, the length of the feature vector, the number of threads and the number of iterations are the best values gained from trying different configurations, while the other parameters were left at their default values. A trained Doc2Vec model can transform an array of words from a training set or testing set into features. This workflow produced features with 500 dimensions.
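A gensim sketch of this workflow with the parameters described above (PV-DBOW, 500 dimensions, window 8, downsampling threshold 1e-3, 100 epochs, 5 worker threads) is shown below; the documents are toy stand-ins, and the parameter names are taken from the gensim documentation.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

raw_documents = ["cannot log in to my account",
                 "invoice charged twice this month",
                 "please reset my password"]

# Each description becomes a TaggedDocument: an array of words plus a unique tag.
tagged = [TaggedDocument(words=text.split(), tags=[i])
          for i, text in enumerate(raw_documents)]

model = Doc2Vec(tagged,
                dm=0,             # PV-DBOW
                vector_size=500,  # length of the document feature vector
                window=8,         # maximum distance between current and predicted word
                sample=1e-3,      # downsample very frequent words
                min_count=1,
                epochs=100,
                workers=5)

# A trained model can infer a feature vector for new or existing descriptions.
vector = model.infer_vector("i cannot access my account".split())
print(vector.shape)   # (500,)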

Figure 4.8: Workflow of Doc2Vec

4.2.4 Doc2Vec and One-Hot workflow

The last workflow generates features by adding the category while still using Doc2Vec to transform the description into a feature vector, as shown in figure 4.9. The category and the description were transformed using different transformers, and the feature vectors they produce were concatenated to build the full feature vector. The two transformers could not be implemented in the same pipeline in this project, so the category and the description are transformed separately. This workflow produced features with a length of 504 dimensions.

Figure 4.9: Workflow of Doc2Vec and One-hot encoding

4.3 Training Classifier

This section explains how cross-validation was used and the configuration of each classifier. Cross-validation was used in the hyperparameter tuning process to find the optimum values for a classifier's parameters, either by using GridSearchCV or RandomizedSearchCV. It was run once, using the first workflow of feature generation, to shorten the amount of time, instead of doing it for each workflow of feature generation. Some classifiers were not run through hyperparameter tuning because it takes too much time to train them iteratively, and the default values were used for those classifiers. Only Naive Bayes, Logistic Regression, Linear SVM, and Random Forest went through hyperparameter tuning.

Naive Bayes was implemented through Multinomial Naive Bayes and Gaussian Naive Bayes. Multinomial Naive Bayes was used for features that contain only positive values; thus it was only used when TF-IDF was used. Gaussian Naive Bayes was used when Doc2Vec was used, because it can process the negative values produced by Doc2Vec. Multinomial Naive Bayes uses alpha = 1.0 and fit_prior = False. This configuration means that it uses Laplace smoothing (alpha) and a uniform prior (fit_prior). Gaussian Naive Bayes uses a prior adjusted based on the data.

Support Vector Machine was implemented through SVC with a linear kernel and one-versus-rest as the decision function shape. The linear kernel was chosen because it performed best among the available kernels and was the fastest. One-versus-rest was chosen because it trains much faster than one-versus-one. Hyperparameter tuning for the linear SVM resulted in the default penalty error C = 1.0, a trade-off between the size of the margin and the accumulated error sum: a small C gives a large margin and a large C gives a small margin. Logistic Regression and Random Forest use the best values obtained from hyperparameter tuning. Logistic Regression uses liblinear as the solver, L2 regularisation, and 1.0 as the regularisation strength. Random Forest uses two different configurations: the first, based on hyperparameter tuning, was used for workflows 1 and 2, while the remaining workflows used an untuned configuration. The first configuration uses 100 trees, Gini impurity as the measure of split quality, at most 3 features per split, nodes expanded until all leaves are pure or contain fewer than 10 samples, a minimum of 10 samples for splitting an internal node, a minimum of 1 sample per leaf node, an unlimited number of leaf nodes, and bootstrap sampling when building the trees. The second configuration is the same except that it uses 10 trees, a minimum of 2 samples for splitting, and the square root of the number of features as the maximum number of features per split.
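Reconstructed as scikit-learn estimators, these configurations correspond roughly to the sketch below; the mapping of the prose to exact parameter names is an interpretation, not verified code from the project.

    from sklearn.svm import SVC
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier

    svm = SVC(kernel="linear", C=1.0, decision_function_shape="ovr")
    logreg = LogisticRegression(solver="liblinear", penalty="l2", C=1.0)

    # First Random Forest configuration (workflows 1 and 2), from hyperparameter tuning.
    rf_tuned = RandomForestClassifier(n_estimators=100, criterion="gini",
                                      max_features=3, max_depth=None,
                                      min_samples_split=10, min_samples_leaf=1,
                                      max_leaf_nodes=None, bootstrap=True)

    # Second configuration (remaining workflows), close to the scikit-learn defaults.
    rf_default = RandomForestClassifier(n_estimators=10, criterion="gini",
                                        max_features="sqrt", min_samples_split=2,
                                        min_samples_leaf=1, bootstrap=True)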

The last three classifiers are combinations or optimisations of the previous classifiers. The linear-kernel SVC and Logistic Regression were optimised through SGD, and the combination of classifiers was implemented as a stacked classifier [2]. The stacked classifier uses Naive Bayes, Logistic Regression, Random Forest, and Logistic Regression optimised by SGD as the base classifiers, and Logistic Regression as the meta-classifier. This configuration was used with TF-IDF features; with Doc2Vec features the Naive Bayes classifier was removed, because adding it lowered the performance.
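One way to express this ensemble is with mlxtend's StackingClassifier, as sketched below. Whether mlxtend is the library actually used in the project is an assumption, and the base-classifier settings shown are illustrative.

    from mlxtend.classifier import StackingClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression, SGDClassifier
    from sklearn.naive_bayes import MultinomialNB

    base_classifiers = [
        MultinomialNB(alpha=1.0, fit_prior=False),  # removed when Doc2Vec features are used
        LogisticRegression(solver="liblinear"),
        RandomForestClassifier(n_estimators=100),
        SGDClassifier(loss="log"),  # Logistic Regression optimised by SGD ("log_loss" in newer scikit-learn)
    ]

    stacked = StackingClassifier(classifiers=base_classifiers,
                                 meta_classifier=LogisticRegression(solver="liblinear"))
    # stacked.fit(X_train, y_train); stacked.predict(X_test)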


4.4 Classify Unlabeled Data

Based on a previous internal experiment within the company, it is likely that the current set of labels will not cover all of the data. Therefore, classifying data into the 21 available classes requires a threshold. The previous internal experiment indicated that these 21 classes cover the most frequent problems and suggested 0.5 as the threshold, where the threshold represents the minimum predicted probability a class must reach to be assigned. The best approach found in this project was used together with the suggested threshold.
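The thresholding step can be sketched as follows; the helper function and the use of None to mark unclassified tickets are assumptions made for illustration.

    import numpy as np

    def classify_with_threshold(model, X, threshold=0.5):
        """Assign a class only when its predicted probability reaches the threshold."""
        proba = model.predict_proba(X)
        best = proba.argmax(axis=1)
        labels = model.classes_[best].astype(object)
        labels[proba.max(axis=1) < threshold] = None  # None marks "unclassified"
        return labels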


Chapter 5 Result

This chapter compares each classifier combined with the different ways of generating features. The measurement metrics used are the F1 weighted score, the Geometric Mean (Gmean), and the Index Balanced Accuracy (IBA), visualised as charts. Each chart compares the different ways of generating features, the different datasets used for training, and the various classifiers.
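For reference, the three metrics can be computed with scikit-learn and imbalanced-learn roughly as below. The label arrays are placeholders, and the IBA parameters shown are the imbalanced-learn defaults, not necessarily the exact settings used in this project.

    from sklearn.metrics import f1_score
    from imblearn.metrics import geometric_mean_score, make_index_balanced_accuracy

    # Placeholder labels standing in for the true and predicted classes.
    y_true = [0, 1, 1, 2, 2, 2]
    y_pred = [0, 1, 2, 2, 2, 1]

    f1_weighted = f1_score(y_true, y_pred, average="weighted")
    gmean = geometric_mean_score(y_true, y_pred)

    # IBA is built by wrapping a base metric, here the geometric mean.
    iba_score = make_index_balanced_accuracy(alpha=0.1, squared=True)(geometric_mean_score)
    iba = iba_score(y_true, y_pred)
    print(f1_weighted, gmean, iba)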

5.1 Performance Measurement

The results of the 4 different workflows, two different datasets, and eight different classifiers described in chapter 4 are presented here. The datasets used were dataset A and dataset B with customers' labels; A and B refer to dataset A and dataset B respectively, as described in chapter 4. For example, TF-IDF A means that TF-IDF was used for feature generation on dataset A, while TF-IDF B means that TF-IDF was used for feature generation on dataset B.

The baselines for this project are a previous work within the company, called previousClassifier, and a random classifier. previousClassifier only has an F1 score; thus the line representing it only appears in figure 5.1, as a flat black line across every condition. The random classifier is represented by a dark blue line at the bottom of figures 5.1, 5.2, and 5.3. In each figure, the Random Forest classifier is represented by a light blue line, Linear SVC by a red line, the Stacked Classifier by a green line, Logistic Regression by an orange line, Linear SVC with SGD by a pink dashed line, and Logistic Regression with SGD by a purple dashed line.

Figure 5.1: F1 weighted score comparison


Figure 5.2: Geometric mean score comparison

Figure 5.3: Index Balanced Accuracy comparison


The best performance was achieved when using TF-IDF to transform the description and one-hot encoding to transform the category into features. With this setup there is no significant difference between the various classifiers. All classifiers performed better than the random classifier, but not always better than previousClassifier, as shown in figure 5.1. All classifiers performed worse than previousClassifier when only the description was used as the feature and transformed with Doc2Vec. Logistic Regression, Linear SVC, the Stacked Classifier, and Linear SVC with SGD show similar performance across the different feature engineering workflows and datasets.

Doc2Vec performs worse than TF-IDF, contrary to what was proposed by Le and Mikolov [24]. As mentioned in section 3.2.4, Doc2Vec learns from the sampled data and stores the softmax weights to transform unseen data. The weak performance is caused by too little data being fed into Doc2Vec, so it could not capture good representations of the data. This unexpected outcome is discussed by Lau and Baldwin [23], who conclude that Doc2Vec performs robustly when trained on large external corpora (large collections of textual data) and improves further when pre-trained word embeddings are used. Doc2Vec also affects Random Forest the most, as the features do not contain enough information to be learned. Adding extra dimensions through one-hot encoding resulted in better performance, because the small, fixed set of category values makes the data less scattered: the vectors are effectively pulled in a specific direction based on the category, so data points with the same category value end up closer to each other and the classifier separates them more easily.

5.2 Predicting Company Labelled Data

This section visualises the performance of the same models when predicting company labelled data, across the different feature engineering workflows, classifiers, and training datasets. The previous experiment was never evaluated on this dataset, hence there is no comparison with previousClassifier; the baseline is a random classifier, represented by a dark blue line. In figures 5.4, 5.5, and 5.6, the Random Forest classifier is represented by a light blue line, Linear SVC by a red line, the Stacked Classifier by a green line, Logistic Regression by an orange line, Linear SVC with SGD by a pink dashed line, and Logistic Regression with SGD by a purple dashed line.


Figure 5.4: F1 weighted score comparison

Figure 5.5: Geometric mean score comparison


Figure 5.6: Index Balanced Accuracy comparison

This part of the experiment showed varying performance for the different classifiers.

The company labelled dataset shows a different trend compared to the customer labelled dataset: performance is similar whether the description alone or the combination of description and category is used as the feature, whereas the customer labelled dataset showed an improvement when description and category were combined. This difference is mainly caused by how the classes were assigned in each dataset; the company labelled dataset is labelled mainly based on the description, while the customer labelled dataset considers the category first and then the description.

The Random Forest and Naive Bayes classifiers seemed most affected by the choice of feature transformation. They performed best when TF-IDF was used and dropped significantly when Doc2Vec was used, while the rest of the classifiers performed similarly across the different transformations. All measurement metrics show that Random Forest and Naive Bayes fail to achieve a good score with Doc2Vec, and they perform the same or even worse when the category is added as a feature. They fail to classify the classes correctly, which means they have low recall, a component of the F1 weighted score, Gmean, and IBA.


Naive Bayes assumes independent features; it learns each feature without considering the connections between them. This is the opposite of how Doc2Vec works, where the components of the description vector are related to each other, so the two techniques do not perform well when combined. Random Forest also showed incompatibility with Doc2Vec, as explained in the previous section.

5.3 Classifying Unlabeled Data

The unlabeled data in this project was classified using the stacked classifier with description and category as the features and dataset B as the training set. In total, 80% of the unlabeled data was classified into the available classes. The resulting data distribution is shown in figure 5.7, with a more detailed breakdown in table B.1. This result is higher than in the previous experiment done within the company.

Figure 5.7: Result of classifying unlabeled data


Chapter 6 Discussion

6.1 Optimum Features

A simple classifier that detects patterns in the description vector alone is not enough in this situation, because the information is stored not only in the description but also in the category. Some data points share the same class label but have different categories, which means their descriptions differ as well. The description vectors of instances in the same class can therefore be spread across the vector space. The classifier tries to learn every instance, and when the instances are spread out, a single class covers a large area, making it hard to isolate a problem by relying on the description alone and hard to find a group of data points that belong to the same class.


Figure 6.1: Confusion matrix for description as feature

Figure 6.1 is the confusion matrix for the stacked classifier when only the description is used as the feature, and figure 6.2 is the confusion matrix when both description and category are used. Figure 6.2 is better than figure 6.1, shown by larger values on the diagonal and smaller values elsewhere.

Both figures are drawn as heat maps, where a darker colour represents a larger amount of data and a lighter colour a smaller amount. Label 5, label 6, and label 18 are classes that share two different category values among them, which means each of these classes contains different descriptions and therefore sparse data. This makes it difficult to find a group of instances that belong to the same class. After the category was added as a feature, those classes became distinct from one another, as shown in figure 6.2, and the confusion matrix in figure 6.2 shows that using description and category improved all classes.
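A heat-map confusion matrix of this kind can be produced along the following lines; this is a sketch with placeholder labels, not the exact plotting code used for figures 6.1 and 6.2.

    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.metrics import confusion_matrix

    # Placeholder true labels and stacked-classifier predictions.
    y_true = ["Label 1", "Label 1", "Label 8", "Label 8", "Label 5"]
    y_pred = ["Label 1", "Label 8", "Label 8", "Label 1", "Label 5"]

    cm = confusion_matrix(y_true, y_pred)
    classes = sorted(set(y_true))  # confusion_matrix orders labels alphabetically
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
                xticklabels=classes, yticklabels=classes)
    plt.xlabel("Predicted label")
    plt.ylabel("True label")
    plt.show()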

The improvement can also be seen in the number of instances confused between similar classes. In the real data, Label 1 and Label 8 are similar based on their (unmasked) class names. Figure 6.1 shows that 123 instances belonging to Label 1 were predicted as Label 8, and 109 instances belonging to Label 8 were predicted as Label 1. In figure 6.2 the two classes are fully distinguishable: no instances are misclassified between them.

References
