Automation of support service using Natural Language Processing: - Automation of errands tagging

(1)

Engineering Degree Project

Automation of support service using Natural Language

Processing

- Automation of errands tagging

Author: Kristoffer Haglund Supervisor: Jonas Lundberg Semester: VT/HT 2020 Subject: Computer Science

(2)

Abstract

In this paper, Natural Language Processing and classification algorithms were used to create a program that automatically can tag different errands that are connected to Fortnox (an IT company based in Växjö) support

service. Controlled experiments were conducted to find the best classification algorithm together with different Bag-of-Word pre-processing algorithms to find what was best suited for this problem. All data were provided by Fortnox and were manually labelled with tags connected to it as training and test data.

The result of the final algorithm was 69.15% correctly/accurately predicted errands using all original data. When looking at the data that were incorrectly predicted a pattern was noticed where many errands have identical text attached to them. By removing the majority of these errands, the result was increased to 94.08%.

Keywords: Natural Language Processing, Naïve Bayes, Support Vector Machine, Neural Network, Pre-processing

(3)

Preface

I would like to thank Fortnox that provided me with this interesting thesis project. I have learned a lot throughout this project because Natural Language Processing was a new field that I never worked with before I conducted this project. I would especially like to thank Johan Hagelbäck that has guided and helped me through this project.

(4)

1.1.1 Machine learning______________________________________5 1.1.2 Naïve Bayes__________________________________________6 1.1.3 Support Vector Machine________________________________ 6 1.1.4 Neural Network_______________________________________ 7 1.1.5 Natural Language Processing_____________________________7 1.1.6 Document classification_________________________________8 1.2 Related work __________________________________________ 8

1.3 Problem formulation ____________________________________ 9 1.4 Motivation ____________________________________________ 9 1.5 Objectives _____________________________________________ 9 1.6 Scope/Limitation _____________________________________ 10 1.7 Target group _________________________________________ 10 1.8 Outline ______________________________________________ 10 2 Method __________________________________________________ 12

2.1 Approach ____________________________________________ 11 2.2 Reliability and Validity _________________________________ 13 2.3 Ethical Considerations____________________________________ 13

3 Implementation ___________________________________________ 14 3.1 Scripts_________________________________________________ 13

3.1.1 Process.py__________________________________________ 13 3.1.2 Machine_learning.py_________________________________ 13 3.1.3 Evaluate.py_________________________________________ 13 4 Results and Analysis ________________________________________ 15 4.1 First prototype ___________________________________________ 14 4.2 Second prototype_________________________________________ 16 4.3 Third prototype__________________________________________ 17 4.4 Forth prototype___________________________________________19 5 Discussion and Conclusion __________________________________ 21

6 Future work ____________________________________________ 21 References ___________________________________________________ 22 7 Appendix___________________________________________________25

(5)

1 Introduction

Fortnox is an IT-company based in Växjö that delivers accounting programs.

Their main target customers are small business owners. Many companies today including Fortnox have a support service connected to their business.

Fortnox support service handles many errands annually that are connected to the programs that Fortnox is selling. The support personnel manually assign tags to errands that tells which program an errand is about. This information is later used to evaluate how many errands each program is generating and can give insights into where improvement needs to be made. The problem is that this process is done manually today which causes 30 % of the errands is left without a tag, which is over 100 000 errands annually. Natural Language Processing (NLP) is a subfield of linguisticsthat can be applied within machine learning that makes it possible to classify different texts [1]. By using NLP and classification algorithms that automatically assign program tags to errand, the number of errands left without a tag could hopefully be reduced.

1.1 Background

For this Engineering Degree Project, the focus has been on the Naïve Bayes, Support Vector Machine, and Multi-Layer Perceptron Neural Networks (MLP) as the classification algorithms. These algorithms have been proven to perform well on document classification [2][3]. Also, these algorithms can easily be implemented using machine learning library Scikit-learn in Python.

1.1.1 Machine learning (ML)

Machine learning is a part of AI (Artificial intelligence). Machine learning can be divided into three subfields: supervised learning, unsupervised learning, and reinforcement learning.

Supervised learning uses labelled or classified data to train models.

These models can then be used to classify or label new data.

Unsupervised learning is used on unlabelled data, i.e. data without ground truth. Therefore, unsupervised learning is used to explore the data and see if there are hidden patterns or structures that can be used to describe the data.

In reinforcement learning, the system interacts with its environment and uses penalties and rewards to improve the system. That way it automatically maximizes its performance on a task.

Machine learning can be summarized in that it uses data to look for patterns and learn from it and improve in order to make better decisions. The main purpose of this is to allow computers to automatically learn without

(6)

human intervention or assistance [4]. In this project supervised learning has been used, hence the data that Fortnox is providing contains a ground truth.

1.1.2 Naïve Bayes

Naïve Bayes algorithm is a powerful algorithm for classification and is suited for NLP [5]. Naïve Bayes is based on Bayes theorem to predict the probability for each class the new data is believed to belong to.

• P(A|B) = posterior

• P(B|A) = likelihood

• P(A) = prior

• P(B) = marginal likelihood

• Bayes theorem = P(A|B) = (P(B|A) P(A)/ P(B)

The data is then given the class that has the highest probability. This is called Maximum A Posteriori (MAP) [6]. Naïve Bayes simple design, speed, reliability and accuracy of applications of NLP makes it one of the most popular classification algorithms for this type of classification problems [7][8].

1.1.3 Support Vector Machine (SVM)

Support Vector Machine is a linear classification/regression algorithm. The goal for SVM is to find a hyperplane in a N dimensional space that distinctly separate the different classes of the data points. Different classes that are not linearly separable in one space, can be linearly separable in a higher- dimensional space using a technince called kernel trick. The kernel finds a suitable space and transforms the data points to the new space, and it may then be possible to separate the classes with a line [9].

Figure 1.1: Shows how Support Vector Machine separate different classes

(7)

1.1.4 Neural Network (NN)

A Neural Network (NN) is inspired by how the human brain works when it processes information.

NN consist of neurons that are formed into layers. Each neuron receives an input and produces an output that is passed on to the next layer of neurons.

It continues likes this until it has reached the last layer that gives the final output [10]. The NN used in this project is Multi-Layer Perceptron Neural Networks (MLP). The neurons are built of an input, weights, bias, activation function, and an output. The input comes from the previous layer in the network. The weight is often initialized to small random values such as values in the range 0-0.3. Every neuron also has a bias that always is set to 1, this also must be weighted. The summed weights are passed through the activation function that is a threshold at which the neuron is activated. This makes it possible to train non-linear problems [11][12].

1.1.5 Natural Language Processing (NLP)

Natural Language Processing is a subfield of linguistics that can be applied within ML and makes intervention between humans and computers possible.

The main goal of NLP is to read, decipher, understand, and make sense of the human language [13].

Because computers do not understand text data, the list of words needs to be encoded into an integer or float values. One way to solve this is to use a Bag-of-Words representation. The “bag” contains every word that occurs at least once in the dataset. This is then represented as a vector with all the words (Word count). Since each text is likely to only contain a small subset of the available words, the word vectors are typically very sparse (containing many 0). The drawback of this method is that longer texts will have a higher average

Figure 1.2: The structure of a NN

(8)

count of words than shorter texts. This can be solved by using relative frequencies called Term Frequency - Inverse Document Frequency (TF-IDF).

This works by counting the number of times a word occurs in the dataset divided by all the words, this is called Term Frequency (TF). Inverse Document Frequency (IDF) is then used to calculate how informative a word is, for example, a word that occurs often is considered as less informative than a word that occurs less frequently [14]. Both techniques have been tested in this project.

1.1.6 Document classification

Document classification is a field within NLP that uses a training set with labeled data [15]. The difference between NLP and document classification is that NLP is more focused on understanding, analyzing, and processing of texts while Document classification is focused on classifying text by using NLP [16].

1.2 Related work

For document classification, there are many different classification algorithms, Bag-of-Word representations, and pre-processing options of the text.

In an article by S.L. Ting, W.H. Ip, Albert H.C. Tsang the authors describe how they classify 4000 documents and apply different pre-processing steps on the text. Their goal was to prove that Naïve Bayes is the best classification algorithm to solve their problem. In order to prove this, they also compared their result with several other different classification algorithms in order to find the fastest and most accurate algorithm to classify different documents. They concluded that Naïve Bayes was the best and fastest algorithm [17].

L. Manevitz and M. Yousef created a One-class document classification program by using a NN. Their goal was to make a filter that could be used to examine a corpus of documents and pick out those of interest. They also discuss the different pre-processing algorithms [18].

T. Joachims argues in his article that SVM is a well-suited classification algorithm for text classification. He gives several theoretical arguments about why SVM is promising for this classification problem. When he is pre- processing the text, he is removing unnecessary words like stop words and transforms it into a TF-IDF Bag-of-Word representation. To prove that SVM is a suitable classification algorithm he is comparing its performance with Naïve Bayes, Rocchio, C4.5, and k-NN [19].

(9)

1.3 Problem formulation

More than 100 000 errands annually are left without a tag, because of this Fortnox is missing out on valuable information that is needed to improve their programs. The errands are provided in text format and therefore it is possible to use NLP in order to classify them. The data provided by Fortnox are errands that already have been assigned program tags which will be used as training data.

By investigating different NLP and pre-processing algorithms this study will investigate the possibility to create a program that can connect the errands with the right tag.

1.4 Motivation

Many companies have a support service connected to their business. The system described in this paper can be used to automate parts of the support services. With this automatization, errands will be assigned to a program or a specific problem. This way, companies will be provided with important information in order to improve their business.

1.5 Objectives

O1 Prepare the data provided by Fortnox to be used for NLP O2 Implement different classification algorithms starting with

Naïve Bayes, Support Vector Machine and Neural Network (MLP) and compare the speed and accuracy

O3 Evaluate the data and test different pre-processing methods to see if that increase the result

O4 Try other classification algorithms to see if the result can be improved

O5 Evaluate all results and decide which pre-processing method that works best and which algorithms to use

O6 Make a prototype

(O1) Prepare so that the data can be used for NLP, which means investigate which data that provides information that can be used to classify the errands.

After that, the data needs to be pre-preprocessed in a format that can be used by NLP. (O2) See which algorithm performs best. (O3) Try different pre- processing techniques such as removing stop words, special character to see if results can be improved, also try different Bag-of-Word. (O4) If the algorithms tested in O2 cannot provide a result within a scope that is set (90%), other

(10)

classification algorithms will be tested. (O5) (O6) decide which pre-processing method and algorithm to use for the prototype.

The result of this program will rely heavily on the data that Fortnox provides.

1.6 Scope/Limitation

The possibility to create a fully working and optimized program that can be intergraded into Fortnox systems is not possible within the scope of this thesis project. Therefore, limit the number of tags that can be classified is necessary.

The focus will be on three classification algorithms explained in section 1.1.

These algorithms have been proven to be good for this type of classification problem. In order to see if the accuracy and speed of the algorithm can be improved, the text will be pre-processed in different ways and steps.

A challenge is a fact that Fortnox has more than 300 different tags for errands. Due to the time limit and data quantity that is needed, it is not possible to create a program of that scale within the scope of this project. The accuracy of the program must be around 90% in order to be considered useful in production. Therefore, the focus will be to reduce the number of tags and create a proof-of-concept system using the top ten most used tags. This is necessary since the number of errands assigned to tags below the top ten is greatly reduced, which has a negative impact on the trained models.

1.7 Target group

The target group is companies that has a support service connected to the business and wants to connect a problem or programs to the errands that comes in.

1.8 Outline

The chapter Method describe which methods and approaches that were used to create and evaluate the accuracy of the program.

The chapter Implementation describe the different scrips that were implemented for the program.

The chapter Result and Analysis will present the results and the analysis for the different prototypes that were created during the development of the program.

(11)

The chapter Discussion and Conclusions will contain a general discussion about the result and the conclusions that can be drawn and about this thesis project.

The chapter Future work will present the steps that can and will be implemented in the future work of this thesis project.

.

(12)

2 Method

Experiments were conducted with different amounts of data and pre- processing of the text. The different classification algorithms, explained in Section 1.1, were then tested to find the most suitable one for this problem.

Stemming and different n-grams were applied on the dataset but were showed to affect the result in a negative way (1% on average) and therefore was removed in the pre-processing steps.

2.1 Approach

Before the data can be applied to any classification algorithm it needs to be pre-processed. Pre-processing a text means that it is translated into a form that can be used by a classification algorithm. The pre-processing task was done in different steps:

1. Make the entire text into lower case. Otherwise, “Fortnox” and

“fortnox” would be considered as two different words when generating the Bag-of-Word presentation.

2. Tokenize the text i.e. separate words so that each word is a single entity. For example, the text [“institute for computer science”] will be [“institute”, “for”, “computer”, “science”] after the tokenization process.

3. Cleaning the text from stop words. This means that words such as

“att”, “och”, ”så”, etc. that are frequently used in many texts and therefore have low information value will be removed.

4. More cleaning of the text by removing other non-informative words for example “gärna”, ”själv”, ”honom”, ”utan”, etc.. By removing stop word and other none informative words the program can focus more on the words that are more important [20]. The process in this step is done by manually investigating the data to find words to remove and add them to a word removal list. Removing words also have the benefit that the model will be faster to train [21].

5. Removal of special characters. This entails the same impact and benefits as in step 1. For example, “#fortnox!!” and “fortnox” would also be considered as two different words, if special characters are not removed.

(13)

These different pre-processing steps were applied in different steps during the development of the system in order to evaluate if they had a positive impact on the result.

To find the best parameters to use for each classification algorithm, grid search tests were conducted. These tests were done every time there were changes in the data. Training data and test data was truncated with different random seeds to see if the result converged. Cross-validation was performed using cross_val_predict library in Scikit-learn instead of dividing the set into a validation set [22]. Cross-validation is to prefer when the dataset is small in order to have a larger trainset [23]. This was performed on all the prototypes for consistency.

2.2 Reliability and Validity

In order to reproduce the same result as in this project, the same pre-processing steps and algorithms must be used. Note that the amount of data can have a major impact on the result. With more data, the pre-processing of the text had a bigger positive impact than with fewer data. Also, the data is created daily and can therefore differ from the data that was used in this project.

2.3 Ethical Considerations

The data that was used as training data contains names and addresses. This sensitive data will be removed if used in a production system.

(14)

3 Implementation

The system was written in Python using the data and machine learning libraries Scikit-learn and NumPy. Scikit-learn has lots of excellent tutorials on how to use different classification algorithms such as Naïve Bayes, SVM, and MLP on different types of data, for example, text data [24]. NumPy provides a fast and efficient implementation of N-dimensional arrays and useful linear algebra operations useful in machine learning projects [25].

3.1 Scripts

This part describes the different scrips that were created for the system.

3.1.1 Process.py

The script reads the data from a JSON file that was provided by Fortnox. It takes out the columns that contain the information that is needed in order to classify the errand. When the data is extracted the text runs through the pre- processing steps explained in Section 2.1 and is then converted into a CSV file.

3.1.2 Machine_learning.py

This script reads the CSV file that was created by the process script and split it into a training set and test set. The text in the different sets was then converted into a Bag-of-Word representation. A grid search is then conducted on the data in order to fine-tune the parameters on the different classification algorithms.

The program then calculates the accuracy of the different classification algorithms. The errands that were incorrectly classified are put into a new CSV file for possible manual evaluation.

3.1.3 Evaluate.py

This script reads the CSV file containing the incorrectly classified errands.

Then the script calculates how many errands that were connected to each specific tag and shows which tag that the errand was given instead. The script also prints out which text the errand has connected to it.

(15)

4 Results and Analysis

In the tables below the best result in every table will be highlighted with boldface. The tables below the different amounts of data were used in the different prototypes. The text was then pre-processed in different steps explained in Section 2.1. In the column ”Number of unique words” can, therefore, vary depending on which steps were applied in order to create that table. The result will be presented in two different ways, one average result using ten different random seed and the best result that was produced with which random seed.

4.1 First prototype

The number of data points used in order to create this prototype was 711. In this prototype steps, 1-4 explained in Section 2.1 were used. The prototype was built to only produce a result that could be used as a reference when applying more pre-processing on the text. The best accuracy that was produced in this prototype was 66.45% using MLP together with the TF-IDF Bag-of- Word algorithm (See Table 7.3) on the test set while average result and cross- validation result where lower 60.80 % and 56.68% respectively (See Table 7.2), which show the inconsistency of the result. When more pre-processing was applied by removing words manually a slight reduction in classification accuracy was noticed on the average result, but the best results remained the same (See Table 7.4, 7.5) so did the inconsistency. The settings that were used for the classification algorithm in this prototype are shown in Table 7.1.

4.2 Second prototype

The number of data points used in order to create this prototype was

increased to 1762. When introducing new data, a worse result was noticed on both the average and the best result (See Table 7.7, 7.8) compared to the first prototype. The same pre-processing was used on both prototypes and the best result dropped from 66.45% (See Table 7.5) to 64.87% (See Table 7.8) on the test set. Both results were produced using the MLP classification algorithm.

When applying more pre-processing on the text by increasing the word removal list explained in step 4 in Section 2.1 on the text at this stage also showed worse results (See Table 7.9, 7.10). The settings that were used for the classification algorithm in this prototype are shown in Table 7.6.

(16)

4.3 Third prototype

The number of data points used in order to create this prototype was increased to 5427. Increasing the data from 1726 to 5427 errands also increased the amount of accurately predicted errands on both the average and the best result (See Table 7.12, 7.13) compared to both prototypes previously. Pre-processing on the text at this stage had a positive impact on the results by increasing it from 67.50% (See Table 7.12) to 69.15% (See Table 7.14) by including step 5 explained in Section 2.1 on the test set. The inconsistency between the best, average, and cross-validation results was also showed positive. This time SVM was the classification algorithm that produced the best result using the TF-IDF Bag-of-Word algorithm. The results were however not good enough to be useful in practice i.e. a classification accuracy of at least 90%. Table 7.16 shows the data points distribution over the dataset. The settings that were used for the classification algorithm in this prototype are shown in Table 7.11.

4.4 Final prototype

When looking at the data that were incorrectly predicted a pattern was noticed where many errands have identical text attached to them. For illustration,

Overall, 51.74% of the data were connected to only four different texts and those texts were connected to nine out of ten tags. Therefore, these texts do not contain any information that Fortnox can use in order to improve their tagging system because in order to see which improvement that is needed the text needs to be unique for just one tag. By removing those text strings, the best result increased from 69.15% (See Table 7.15) to 94.08% (See Table 4.3). The average result increased from 65.21% (See Table 7.14) to 90.54% (See Table 4.2) and the cross-validation result increased from 66.72% (See Table 7.14) to 93.28% (See Table 4.3). Table 4.4 shows how the best result from Table 7.15 changes when removing some of the macro generated text for MLP and Table 4.5 shows the changes for SVM. Table 4.6 shows how the number of errands is decreasing when removing texts that are attached to an errand using macros.

The number of data points that were left after removing some of the macro generated text was decreased from 5427 to 2619. Table 4.7 and Table 4.8 shows the data points distribution before and after removing macro generated errands over the dataset.

[”Skicka”, ”vidare”, ”detta”, ”ärende”], Tag1 [”Skicka”, ”vidare”, ”detta”, ”ärende”], Tag2

(17)

Algorithm Settings

Naïve Bayes Alpha = 0.03

SVM C=1, Kernel = linear, gamma =

0.001 used in Word count.

C=10, Kernel = linear, gamma = 0.001 used in TF-IDF

MLP activation = logistic, solver=adam,

alpha = 1e-5, hidden_layer_sizes = (20.), random_state = 0, max_iter = 2000

Table 4.1: The settings used in the final prototype by the different classification algorithms

Algorithm Bag-of-Word Number of unique words

Accuracy (%) Cross- validation (%) cv = 5 Naïve

Bayes

Word count 10261 81.11 80.30

Naïve Bayes

TF-IDF 10261 84.64 86.18

SVM Word count 10261 88.99 92.02

SVM TF-IDF 10261 90.00 87.48

MLP Word count 10261 90.46 93.09

MLP TF-IDF 10261 90.57 93.28

Table 4.2: Average result for the final prototype after removing four macro created text strings that occur often (random_state = 0-9)

Accuracy (%)

Naïve Bayes Word count 10261 83.59

Naïve Bayes TF-IDF 10261 88.36

SVM Word count 10261 91.60

SVM TF-IDF 10261 92.75

MLP Word count 10261 93.51

MLP TF-IDF 10261 94.08

Table 4.3: Best result for the final prototype after removing four macro generated text strings that occur often (random_state = 7)

(18)

Figure 4.1: Best result progression for the final prototype after removing macro created errand texts. Note that there is barely any difference between the third and the fourth macro created text is removed.

Figure 4.2: Best result progression for the final prototype after removing macro created errand texts.

0,00%

10,00%

20,00%

30,00%

40,00%

50,00%

60,00%

70,00%

80,00%

90,00%

100,00%

All data 1 removed 2 removed 3 removed 4 removed

SVM TD-IDF

0,00%

10,00%

20,00%

30,00%

40,00%

50,00%

60,00%

70,00%

80,00%

90,00%

100,00%

MLP TD-IDF

(19)

Figure 4.3: Shows how the number of errands is reducing after removing the most frequent macro generated text.

Class Number of data

points

% of the total data

Class 1 1104 20.34

Class 2 716 13.19

Class 3 898 16.55

Class 4 393 7.24

Class 5 369 6.80

Class 6 586 10.80

Class 7 291 5.36

Class 8 720 13.27

Class 9 348 6.41

Class 10 2 0.037

Table 4.4: Number of data points ever class has before removing macro generated errands

0 1000 2000 3000 4000 5000 6000

Number of errands

(20)

Class Number of data points

% of the total data

Class 1 1045 39.90

Class 2 124 4.73

Class 3 225 8.59

Class 4 393 15.00

Class 5 186 6.80

Class 6 216 7.10

Class 7 87 3.32

Class 8 242 9.24

Class 9 99 3.38

Class 10 2 0.076

Table 4.5: Number of data points ever class has after removing macro generated errands

(21)

5 Discussion and Conclusion

The expected result of this project was not intended to be a fully functional program but rather to determine if it was possible to create a program that could be accurate enough in order to be used for automatically tagging errands.

Today, the main obstacle for such a system is the data and how it is created. A big majority of the errands are created by using macros that assign standard text to an errand and therefore the program will never be efficient enough to be used for auto-tagging within Fortnox. By removing the macro generated text and train the model on more individual texts the accuracy of the program could be increased to a level where the accuracy is good enough to be useful in production. This could lead to that some errands were completely removed from the dataset since they did not have any unique text. Also, the benefit of removing these errands is that the relevant information is kept and can be used in order to improve the programs that Fortnox is providing. Because a satisfying result could be achieved with the algorithm defined in O2 explained in Section 1.5 a decision to not continue with O4 was made but can be included in future work.

The pre-processing of the data was also showed to be more efficient in a positive way for accuracy when the amount of data was increased together with the number of words.

6 Future work

Given the time limit and data, only 10 tags were investigated in this project while Fortnox has more than 300. Future work could potentially include a major revision of what kind of tags that are needed by Fortnox and train a system on all those tags as well as in the program.

Other pre-processing steps can be used in the text, such as Stemming.

Stemming is when you transform a word to is core form, for example,

“running” and “runs” both become “run” after stemming has been applied. One drawback of applying stemming is that you can remove the information value in a word that negatively affects the accuracy. Because of this stemming was excluded in the pre-processing for creating this prototype but may be included in future work [26].

(22)

References

[1] M. Ikonomakis, S.Kotsiantis, V.Tampakas, “Text Classification Using Machine Learning Techniques”. [Online] “WSEAS TRANSACTIONS on COMPUTERS vol. 4, pp. 966-974, 2005

[2] “Text Classification With Word2Vec” [Online] Available:

http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with- word2vec/ [Accessed: 17-June-2020]

[3] “Text Classification using Neural Network” [Online]

Available: https://machinelearnings.co/text-classification-using-neural- networks-f5cd7b8765c6 [Accessed: 17-June-2020]

[4] “What is Machine Learning? A definition” Expert system 7-March- 2017[Online]

Available: https://expertsystem.com/machine-learning-definition/. [Accessed:

14-April-2020]

[5] S. Xu, “Bayesian Naïve Bayes classifiers to text classification” Journal of Information Science, vol. 44(1), pp.48–59, 2018

[6] R. Saxena.” How the naïve bayes classifier works in machine learning”

Dataaspirant[Online]

Available: https://dataaspirant.com/2017/02/06/naive-bayes-classifier- machine-learning/ [Accessed: 14-April-2020]

[7] “Applying Multinomail Naïve Bayes to NLP Problems: A Practical Explanation” [Online]

Available: https://medium.com/syncedreview/applying-multinomial-naive- bayes-to-nlp-problems-a-practical-explanation-4f5271768ebf [Accessed: 22- June-2020]

[8] I. Bobriakov “Top NLP Algorithms & Concepts” Data Science Central 21- December-2019 [Online]

Available: https://www.datasciencecentral.com/profiles/blogs/top-nlp- algorithms-amp-concepts [Accessed: 22-June-2020]

[9] R. Gandhi “Support Vector Machine – Introduction to Machine Learning Algorithm” [Online].

Available: https://towardsdatascience.com/support-vector-machine- introduction-to-machine-learning-algorithms-934a444fca47

(23)

[10] O. Knocklein. “Classification using neural network.”

Towardsdatascience [Online].

Available: https://towardsdatascience.com/classification-using-neural- networks-b8e98f3a904f [Accessed 14-April-2020]

[11] J. Brownlee “Crash Course On Multi-Layer Perception Neural Network”

[Online]

Available: https://machinelearningmastery.com/neural-networks-crash- course/?fbclid=IwAR2VZCRrOzpoKKOu8BK_UZXJwSX-i7Dy- qF3wl2gOBPLFzPvyJvkBqbliic [Accessed: 17-June-2020]

[12] V. Yadav, “How neural networks learn nonlinear functions and classify linearly non-separable data?” [Online]

Available: https://medium.com/@vivek.yadav/how-neural-networks-learn- nonlinear-functions-and-classify-linearly-non-separable-data-22328e7e5be1 [Accessed: 17-June-2020]

[13] Dr. Micheal j Garbade. ”A Simple Introduction to Natural Language Processing”[Online]

Available: https://becominghuman.ai/a-simple-introduction-to-natural- language-processing-ea66a1747b32 [Accessed 15-April-2020]

[14] P. Huilgol. ”Quick Intruduction to Bag-of-Word(BoW) and TF-IDF for creating features from text” [Online]

Available: https://www.analyticsvidhya.com/blog/2020/02/quick- introduction-bag-of-words-bow-tf-idf/ [Accesses 15-April-2020]

[15] P Ghaffari. “Text analysis 101: Document classification.”[Online]

Available: https://www.kdnuggets.com/2015/01/text-analysis-101-document- classification.html [Accessed: 27-April-2020]

[16] F. Pascual. “Document classification.” [Online] Available:

https://monkeylearn.com/blog/document-classification/ [Accessed: 27-April- 2020]

[17] S.L. Ting, W.H. Ip, A. H.C. Tsang,” Is Naïve Bayes a Good Classifier for Document Classification?” International Journal of Software Engineering and Its Applications, vol. 5, No. 3, pp. 3-07, 2011

[18] L. Manevitz, M. Yousef,” One-class document classification via Neural Networks” Neurocomputing, vol.70, pp. 1466-1481, 2007

(24)

[19] T. Joachims, “Text categorization with Support Vector Machines:

Learning with many relevant features”,Lecture Notes in Computer Science”

vol.1398, pp. 137-142, 2005

[20] K. Ganesan. “What are Stop Words?” [Online]. Available: https://kavita- ganesan.com/what-are-stop-words/#.Xqg9rWgzY2w [Accessed 28-April- 2020]

[21] S. Singh “NPL Essentials: Removing Stopwords and Performing Text Normalization using NLTK and spaCy in Python” [Online]

Available: https://www.analyticsvidhya.com/blog/2019/08/how-to-remove- stopwords-text-normalization-nltk-spacy-gensim-python/ [Accessed 20-April- 2020]

[22] “Scikit-learn” [Online]

Available: https://scikit-

learn.org/stable/modules/generated/sklearn.model_selection.cross_val_predic t.html [Accessed: 22-June-2020]

[23] D. Shulga “5 Reasons why you should use Cross-Validation in your Data Science Project” [Online]

Available: https://towardsdatascience.com/5-reasons-why-you-should-use- cross-validation-in-your-data-science-project-8163311a1e79 [Accessed: 22- June-2020]

[24] “Scikit-learn” Machine learning in python.” [Online]. Available:

https://scikit-learn.org/stable/ [Accessed: 27-April-2020]

[25] “NumPy.” [Online]. Available: https://numpy.org/doc/stable/ [Accessed:

27-April-2020]

[26] H. Heidenreich “Stemming? Lemmazation? What?” [Online]

Available: https://towardsdatascience.com/stemming-lemmatization-what- ba782b7c0bd8 [Accessed 01-June-2020]

(25)

7 Appendix

SVM C=10, Kernel = linear,

gamma = 0.001

MLP activation = logistic,

solver=adam, alpha = 1e-5, hidden_layer_sizes = (20,), random_state = 0, max_iter = 2000

Table 7.1: Showing the settings used in the First prototype by the different classification algorithm

Accuracy (%)

Cross- validation (%) CV = 5 Naïve Bayes Word count 5169 52.90 39.30

Naïve Bayes TF-IDF 5169 53.94 50.97

SVM Word count 5169 58.71 58.11

SVM TF-IDF 5169 59.10 58.11

MLP Word count 5169 59.93 57.20

MLP TF-IDF 5169 60.80 56.68

Table 7.2: Average result in the first prototype by using the pre-processing steps 1-3 explained in Section 2.1 by different (random_state = 0-9).

Accuracy (%)

Naïve Bayes Word count 5169 53,55

Naïve Bayes TF-IDF 5169 58,06

SVM TF-IDF 5169 62.58

MLP Word count 5169 64.52

MLP TF-IDF 5169 66.45

Table 7.3: Best result in the first prototype by using the pre-processing steps 1-3 explained in Section 2.1. (random_state = 8)

(26)

Cross- validation (%) CV = 5

Naïve Bayes Word count 5093 52.76 38.00

SVM Word count 5093 59.49 58.88

SVM TF-IDF 5093 57.10 57.72

MLP Word count 5093 60.06 58.63

MLP TF-IDF 5093 59.61 59.01

Table 7.4: Average result in the first prototype using pre-processing steps 1-4 explained in Section 2.1 (random_state = 0-9).

Accuracy (%)

SVM TF-IDF 5093 62.58

MLP Word count 5093 66.45

MLP TF-IDF 5093 64.52

Table 7.5: Best result in the first prototype using pre-processing steps 1-4 explained in Section 2.1 (random_state = 2)

SVM C=10, Kernel = linear,

gamma = 0.001

MLP activation = logistic,

solver=adam, alpha = 1e-5, hidden_layer_sizes = (20,), random_state = 0, max_iter = 2000

Table 7.6: Showing the settings used in the second prototype by the different classification algorithm

(27)

SVM Word count 7914 56.52 56.53

SVM TF-IDF 7914 57.31 57.26

MLP Word count 7914 58.90 58.91

MLP TF-IDF 7914 59.04 58.80

Table 7.7: Average result in the second prototype after introducing more data and using pre-processing steps 1-4 explained in Section 2.1

(random_state = 0-9)

Accuracy (%)

SVM TF-IDF 7914 62.89

MLP TF-IDF 7914 64.87

Table 7.8: Best result in the second prototype after introducing more data and using pre-processing steps 1-4 explained in Section 2.1 (random_state = 3)

SVM Word count 7415 56.66 56.58

SVM TF-IDF 7415 57.03 57.09

MLP Word count 7415 58.76 58.97

MLP TF-IDF 7414 58.67 58.97

Table 7.9: Average result in the second prototype with same data as table 7.8 but using pre-processing steps 1-4 explained in Section 2. In this case the list created in step 4 is extended (random_state = 0-9)

(28)

Accuracy (%)

SVM TF-IDF 7415 62.89

MLP TF-IDF 7414 64.31

Table 7.10: Best result in the second prototype by using same data as table 7.8 using pre-processing steps 1-4 explained in Section 2.1 with the same extended list as in table 4.7 (random_state = 3)

SVM C=10, Kernel = linear, gamma =

0.001

MLP activation = logistic, solver=adam,

alpha = 1e-5, hidden_layer_sizes = (20,), random_state = 0, max_iter = 2000

Table 7.11: Showing the settings used in the third prototype by the different classification algorithm

Accuracy (%) Cross- validation (%) CV = 5 Naïve

Bayes

Word count 11642 52.66 51.26

Naïve Bayes

TF-IDF 11624 61.91 63.22

SVM Word count 11624 64.04 65.93

SVM TF-IDF 11624 62.90 66.41

MLP Word count 11624 64.81 66.52

MLP TF-IDF 11624 64.86 66.70

Table 7.12: Average result in the third prototype after introducing even more of data and using pre-processing steps 1-4 explained in Section 2.1 with the same extended list as in table 7.8s (random_state = 0-9)

(29)

Accuracy (%)

SVM TF-IDF 11624 64.36

MLP TF-IDF 11624 67.50

Table 7.13: Best result in the third prototype after introducing even more of data and using pre-processing steps 1-4 explained in Section 2.1 with the same extended list as in table 7.8 (random_state = 0)

Cross- validation (%) CV = 5 Naïve Bayes Word count 10261 53.25 51.93

SVM Word count 10261 64.39 65.91

SVM TF-IDF 10261 63.90 66.41

MLP Word count 10261 65.02 66.72

MLP TF-IDF 10261 65.21 66.72

Table 7.14: Average result in the third prototype by using same data as table 7.12 but using pre-processing steps 1-5 explained in Section 2.1

(random_state = 0-9)

Accuracy (%)

SVM TF-IDF 10261 69.15

MLP TF-IDF 10261 68.79

Table 7.15: Best result in the third prototype using same data as 7.12 but using pre-processing steps 1-5 explained in Section 2.1 (random_state = 6)

Automation of support service using Natural Language Processing: - Automation of errands tagging

Engineering Degree Project