
UPTEC F 18004
Degree project 30 credits (Examensarbete 30 hp), February 1, 2018

Using cloud services and machine learning to improve customer support
A study of the applicability of the method on voice data

Henrik Spens
Johan Lindgren



Abstract

Using cloud services and machine learning to improve customer support

Henrik Spens, Johan Lindgren

This project investigated how machine learning could be used to classify voice calls in a customer support setting. A set of a few hundred labeled voice calls was recorded and used as data. The calls were transcribed to text using a speech-to-text cloud service. This text was then normalized and used to train models able to classify new voice calls. Different algorithms were used to build the models, including support vector machines and neural networks. An extensive parameter search showed that the best-performing model was a support vector machine. Using this model, a program that can classify live voice calls was built.

ISSN: 1401-5757, UPTEC F18 004
Examiner: Tomas Nyberg
Subject reviewer: Dave Zachariah
Supervisor: Martin Lundvall


Acknowledgements

To Dave Zachariah, our subject reviewer, thank you for accepting and reviewing our master's thesis.

We would like to thank our supervisor at Connectel, Martin Lundvall, who was there to help if we had any questions. We also want to thank everyone else at Connectel for being supportive and interested in our work. We are grateful for the opportunity to do a project on this very relevant and emerging technology.


Popular science summary (Populärvetenskaplig sammanfattning)

In today's society, data has started to play an increasingly important role in the future survival of companies. It can come in all kinds of forms, such as images, text or shopping patterns, but what they have in common is that they contain customer trends and patterns. Every time an image is uploaded or a text post is published online, new data is created, in total more than 2.5 exabytes every day[32]. This can be used to predict events such as the flu season: when enough people search for symptoms that indicate influenza, the flu season can be assumed to have started, which can be countered with information from the healthcare system about how the spread of influenza can be prevented. More data also gives rise to new types of services in widely different areas. One example is the ability of a computer to react to human speech. Today there are several examples of companies that have implemented voice-controlled phone queues so that customers themselves can narrow down what their question is about, for example whether it concerns an invoice or a delivery problem. These systems work reasonably well, but they often only react to specific words and expect the customer to use them; for example, the system may ask the customer what the call is about, after which the customer is expected to answer with one of five categories.

A less common solution is to continuously analyze what is being said during the call, in order to gain more and more insight into what the call is about as it proceeds. Recent progress in speech-to-text opens new possibilities for call analysis even for smaller companies that do not have the resources to buy or train their own systems. The state-of-the-art systems in the field reach an accuracy of 95.1% when translating English speech to text, which is on par with a human's ability to interpret speech. The accuracy is not quite as high for Swedish, but it keeps improving as more data becomes available. An area that has been of great importance for the progress in speech-to-text is the so-called neural networks, which produce text given the phonemes a word is built from, where phonemes are the speech sounds of the word, that is, the building blocks that among other things define its pronunciation. New technology, especially new and more powerful graphics cards, has paved the way for deep neural networks with ever more complex structures, where data passes through many different layers before a final prediction is made. Given that large datasets are available for training these models, they can make very accurate predictions within specific domains.

Given recorded calls where the problem is known in advance, a model can be trained with machine learning to learn how customers describe a particular problem. This model can then be used to estimate, in real time, which problem is being described. Based on this it is possible, for example, to suggest a solution, create a summary of the call, or suggest relevant information related to what is being said. Such a system could work in the background and assist the customer support agent, or be implemented together with speech-to-text to create a fully automated customer service.

Such a solution could be used to extend the opening hours of phone support by letting the automated system handle incoming calls during the night. The system could shorten the average call time by delivering valuable information to the customer support agent while the call is in progress, removing the need for manual database searches. The text generated from the call could then be used to further develop the model by retraining it on more calls.

The goal of this work is to build a system that can, in real time, analyze a call and classify which of a predetermined set of problems it concerns. The quality of the estimates will be compared between different models in order to find the one best suited for the task. The calls that are analyzed will stay within a fixed subject area. The work will focus on the techniques needed to refine the text obtained from the speech-to-text service, as well as the models used when classifying the calls. Another important question is whether the method is applicable to Swedish phone support, where the quality of the calls is currently at the lower end of what the speech-to-text services can handle.


Lexicon & abbreviations

Affix: An attachment to a stem which creates a new word. An example is the affix "-ness", which attached to "kind" forms the word "kindness".
BOW: Bag of words
Compounding: A compound is two words merged to form a new word. An example is the word "starlight", which consists of the words "star" and "light".
ConvNet: Convolutional neural network
Derivation: Forming a new word by adding affixes. An example is the word "kind", to which the suffix "-ness" can be added to form the word "kindness".
FF: Feed-forward neural network
Inflection: The transformation of a word to suit a tense, gender or other grammatical situation. An example is the pair "eat" and "ate", where "ate" is an inflection of "eat" based on tense; they still share the same stem.
Lemma: The word representing a lexeme; for the lexeme example below it is the word "run".
Lexeme: The different forms a word can have. Examples are the words "run", "ran" and "running".
MAP: Maximum a posteriori
ML: Machine learning
MNB: Multinomial naive Bayes
Morpheme: The smallest meaningful component of a language. An example is the morpheme "-s", which added to "beer" forms the word "beers"; here the addition "-s" is also known as a suffix, which is a kind of morpheme.
Morphology: How words are related to each other within a language. Examples are the words "swim", "swam" and "swum", which intuitively belong together.
N-gram: Depending on the context, an n-gram can be n words in succession or every combination of n successive characters in a word. An example of the former are the bi-grams (2-grams) of the sentence "How are you?", which are "How are" and "are you?". An example of the latter are the tri-grams (3-grams) of the word "friend", which are "fri", "rie", "ien" and "end".
NN: Neural network
Node: The smallest component of a neural network
NSFW: Not safe for work
POS: Part-of-speech
Prefix: An attachment at the beginning of a stem to form a new word. An example is the prefix "un-", which attached to "interested" forms the word "uninterested".
ReLU: Rectified linear unit
Semantics: The relationship between words in text and how they convey meaning together.
Stem: The part of a word to which affixes can be added to form new words. Sometimes the stem changes with the inflection. An example of the former are the words "run" and "running", which share the stem "run" to which the suffix "-ning" is added. An example of the latter are the words "eat" and "ate", which share the stem "eat".
Suffix: An attachment to the end of a stem to form a new word. An example is the suffix "-ness", which attached to "kind" forms the word "kindness".
SVM: Support vector machine
tanh: Hyperbolic tangent
TF-IDF: Term Frequency - Inverse Document Frequency


Contents

1 Introduction
 1.1 Background
 1.2 Project description
 1.3 Scope

2 Theory
 2.1 Language basics
 2.2 Google Speech API
 2.3 Tokenization
 2.4 Stemming
 2.5 Lemmatization
 2.6 Stopwords
 2.7 Part-of-speech (POS) tagging
 2.8 Shallow Parsing
 2.9 Feature extraction
 2.10 Bias-Variance
 2.11 Supervised learning
  2.11.1 Multinomial naive Bayes classifier
  2.11.2 Support Vector Machine (SVM)
  2.11.3 Feed-forward neural network
  2.11.4 Convolutional neural network
 2.12 Unsupervised learning
 2.13 Evaluation
  2.13.1 Confusion matrix
  2.13.2 Recall & Precision
  2.13.3 F1-Score

3 Implementation
 3.1 Overview
  3.1.1 Transcribe code
  3.1.2 Feature extraction code
  3.1.3 Streaming classification code
 3.2 Hardware
 3.3 Software & Libraries
  3.3.1 NLTK
  3.3.2 Scikit-learn
  3.3.3 TensorFlow
  3.3.4 TensorBoard
  3.3.5 Google Speech API
 3.4 Data
  3.4.1 Preprocessing
  3.4.2 Feature extraction
 3.5 Machine learning algorithms
  3.5.1 Multinomial naive Bayes
  3.5.2 Support vector machine (SVM)
  3.5.3 Feed-forward neural network
  3.5.4 Convolutional neural network
  3.5.5 Unsupervised learning
 3.6 Real-time analysis

4 Results
 4.1 Supervised learning
  4.1.1 Multinomial Naive Bayes
 4.2 Support Vector Machine (SVM)
 4.3 Feed-forward neural network
 4.4 Convolutional neural network (ConvNet)
 4.5 Best model
 4.6 Unsupervised learning

5 Discussion

6 Conclusion

References


1 Introduction

1.1 Background

The field of AI and machine learning is the latest trend in the industry, and everyone is scrambling not to be left behind in the race to capitalize on the data being generated by customers every second of every hour of every day. New advances in the manufacturing of graphics cards have paved the way for deep neural networks to be feasible for the first time in history, and not just a theoretical representation of how an algorithm can be modeled to behave like the human brain. Neural networks were first invented in the 1950s but were infeasible until a few years ago due to the immense processing power required to train them. A neural network needs vast amounts of data to train properly on complex problems, and was long thought of as unstable because it was trained on too little data, thus producing subpar results. Nowadays the networks are trained on multiple terabytes of data and are modeled in an ever more complex fashion. They have become so reliable that they are being used to maneuver automated cars and predict skin cancer from images. Neural networks are also used in speech-to-text systems, where they can predict a sentence from speech by analyzing the spoken phonemes hidden in the utterance of a word. The best speech-to-text systems to date achieve a word error rate of only 4.9%, which is on par with the human ability to transcribe speech. The text produced by such a system can then be used to analyze speech and make predictions based on it. Many companies now implement some sort of speech-based human-computer interaction in their phone support. It is often used as a way to steer all customers with a specific type of problem to an agent specialized in that area. Most of these systems use simple question-answering approaches, where a computer voice asks the customer to either press a specific button or to say one of a few different alternatives. This thesis will explore a way to train models to analyze real-time human speech without the need for such question-answering approaches.

1.2 Project description

This project tries to answer whether a machine learning algorithm can be implemented in a Swedish customer support setting to analyze incoming calls, and whether it will perform well enough to be useful for the agents receiving the calls.

Many companies today have old systems and routines implemented in their customer support. It is common for support calls to be recorded and stored for evaluation purposes, but the process of evaluating previously recorded calls is very time consuming. The data is often not stored properly and it is hard to find relevant information.

There is also a slight learning curve when a new employee is trained to receive calls about problems specific to that company. A model trained to understand how customers describe a specific type of problem could help lower this learning curve by suggesting a solution based on the customer's problem description. It could also help in creating a more easily navigated database of stored calls by attaching a transcript of each call in text form.

In order to train a classifier on problem descriptions, a dataset of such descriptions is needed. As mentioned before, many companies store support calls but they do not store transcripts. The first thing to implement is therefore a way to convert a call to text. This will be implemented with the Google Speech API, which can be used to extract transcriptions of calls of arbitrary length (up to 180 min)[1]. It also provides streaming transcripts, which makes it possible to analyze calls in real time. The API is not 100% accurate in its translation of speech to text, and thus some words are interpreted incorrectly. Some of these mistakes can be fixed in the next step of the chain.


After a set of text transcriptions has been extracted, they need to be preprocessed before being handed to the classifier. The preprocessing stage is essentially a way to normalize the text in order for it to be classified more easily. A prime example of this is lower casing, where all capital letters are replaced with lower case ones. This is done because the meaning of a word is inherently the same whether it is capitalized or not. A few other techniques will be applied as well, although the exact combination depends on the domain of the problem descriptions. A variety of stemming, word substitution, stopword removal and tokenization will be tested in order to find the best solution for the problem at hand.

When the text is preprocessed it can be fed into a classifier so that it can detect hidden structures in the data which can be used to separate problem categories.

A few different approaches of varying complexity will be tested. Due to the small size of the dataset (480 calls covering four different problems), a simple model will probably achieve a higher accuracy than a complex one. The simple classifiers which will be tested are the naive Bayes classifier and a support vector machine classifier.

The complex ones are a feed-forward neural network and a convolutional neural network. The performance will be evaluated on a test set extracted from the original dataset of 480 calls. The training will be done with 70% of the dataset as training data and iterated with different parameters until the evaluation accuracy has stabilized at some maximum.

The fully trained and tested classifier will then be implemented in code which uses the streaming function provided by the Google Speech API. It will then be tested with simulated live calls, where anyone can speak directly into the computer, describe a problem and get a prediction from the classifier as to what problem was described.

This thesis is co-written by Henrik Spens and Johan Lindgren.

Henrik will be responsible for the code connecting to Google Speech API, including the code which handles the live processing of speech. He will also be responsible for the complex classifiers, meaning the feed-forward neural network and the convolutional neural network, which will be implemented in TensorFlow.

Johan will be responsible for the feature extraction stage, in which the text is analyzed and translated into something that can be understood by computers. He will also implement the naive Bayes and support vector machine classifiers, which will be done in the Python library scikit-learn. Johan will also try to implement an unsupervised learning approach for training the classifiers without pre-labeled data.

Both Henrik and Johan will be responsible for writing the report and presentation, the preprocessing code used when normalizing the data and the code for evaluating and finding the optimal parameters for the classifiers.

1.3 Scope

The goal of this master's thesis is to produce a prototype of a ready-to-use classifier which can help interpret and classify what type of problem a customer is describing, preferably in real time.

The system will only be able to handle problems for which it has been pre-trained.

The problems used in this project are all within the same domain and represent common problems with coffee machines, both real and invented.


2 Theory

2.1 Language basics

Spoken language is thought to have evolved some 100,000 years ago, although the literature is divided on the matter[43]. The oldest written language is Sumerian, which is dated to around 2500-3000 BC depending on what is meant by "written language".

Speech is formed by pressing air out of the lungs. This air passes the vocal cords in the throat, which begin to vibrate. The rate of the vibration determines the pitch of the sound. This sound is then formed into words by moving the mouth in different ways to produce different sounds. These sounds are the building blocks of a word and are called phonemes. The English language consists of around 40-45 of these phonemes depending on the dialect. The way they are arranged to form words is called duality of patterning (or just duality)[59]. These 40-45 distinct sounds are the foundation of the around 170,000 words used in the English language. Apart from English, there exist around 7,000 other languages in the world, most of which are nearly extinct and only spoken by a few members of the same tribe or community[19][2].

New words are added to languages all the time, primarily through borrowing from other languages, but also by giving new meanings to existing words or by creating new words. This word-formation process can include adding affixes to words, transforming them by compounding, or changing nouns to verbs; an example is the noun "Google", which can be altered to the verb "google" (to google something)[31].

The area of language, with its syntax and structure, is a complex field currently attracting much attention from companies and researchers. New tools and faster computers are disrupting the field, and what seemed like an impossible feat just 20 years ago, a multilingual translator, might become reality in the near future (Google Translate is not there just yet). This is made possible by increasingly advanced parsing techniques and the introduction of fast neural nets capable of analysing sentences, understanding their semantics and mapping them to another language, which might have a different sentence structure. Each technique will be elaborated on further in its respective section.

2.2 Google Speech API

Google Speech API is an easy-to-use service which outputs text given an audio file containing speech as input[3][1]. The service, provided by Google, was first released in 2011 without any announcement and seems to have been discovered by accident in the Chromium repository[52]. The API was released in its final form on April 18, 2017 and now supports 110 different languages. In a blog post announcing the release of the first stable version of the API, the product manager Dan Aharon said[16]:

"Among early adopters of Cloud Speech API, we have seen two main use cases emerge: speech as a control method for applications and devices like voice search, voice commands and Interactive Voice Response (IVR); and also in speech analytics. Speech analytics opens up a hugely interesting set of capabilities around difficult problems e.g., real-time insights from call centers."

- Dan Aharon, Product Manager, Google Speech Team

The API is able to transcribe synchronous, asynchronous and streaming audio from a microphone, the latter being particularly useful for real-time speech applications such as text classification in phone calls. It has a built-in language filter to censor inappropriate language. Context-specific terms can be added to the vocabulary in order to improve the accuracy on out-of-vocabulary words, which might also be context specific.

Previous speech recognition services used HMMs (Hidden Markov Models) to predict speech data. These models make good predictions but require knowledge about the data to be trained. Recurrent Neural Networks (RNNs), on the other hand, do not require this knowledge and are a more suitable option. However, they require segmented training data and as such have a problem with handwritten text and speech data[17].

Segmented data is data where the sequence of the input is known beforehand. For example, in a sentence like "Hey, how are you?" we know which order the input is in, as it is fed word by word. Imagine the same sentence as audio data instead. It is now tricky to know in what order the actual letters occurred, since they are fed in all at once, and a system must understand the order of the phonemes to predict a word.

Google Speech implements a modified version of an RNN called Connectionist Temporal Classification (CTC), which is capable of classifying unsegmented data. Figure 1 shows the output from such a system, where each peak indicates a spoken phoneme.

Figure 1: Output of a CTC predicting phonemes given the input "Museums in Chicago" as audio

These models (CTCs) replaced the older Gaussian Mixture Models as Google's, and the industry's, new standard for voice analysis[30][29]. Google's approach also uses so-called Long Short-Term Memory[28] within the RNN in order to enhance the predictions, since phonemes are formed with respect to the one preceding them. Try to say "Museums" or "Music"; here the u is clearly pronounced "you", whereas the pronunciation of the u in the word "suppose" is just a "u".

Google Speech API has an error rate of 4.9% for English as of May 2017, a big improvement since July 2016 when the error rate was 8.5%[4]. For reference, according to a developer post by IBM, the word error rate of humans is 4%[55].
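For concreteness, the sketch below shows how a short audio file could be transcribed with the google-cloud-speech Python client. It assumes a recent version of that client library, not necessarily the one used in this project, and the file name, encoding and sample rate are assumptions.

```python
# Hedged sketch: transcribing a short Swedish call recording with the
# google-cloud-speech Python client (file name and audio settings assumed).
from google.cloud import speech

client = speech.SpeechClient()

with open("call.wav", "rb") as f:  # hypothetical mono LINEAR16 recording
    audio = speech.RecognitionAudio(content=f.read())

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="sv-SE",  # Swedish transcription
)

response = client.recognize(config=config, audio=audio)
for result in response.results:
    # Each result carries one or more alternatives; the first is the most likely.
    print(result.alternatives[0].transcript)
```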

2.3 Tokenization

A computer does not read text the same way we humans do. It does not read the text as a whole; instead, we can make the computer understand text better if it is broken down into smaller components which are easier to interpret than a large body of text.

The word token can mean an item, idea, person, feature, etc. which represents part of a larger group. When this larger group is a body of text, a token can be a smaller component of the text. For example, single words can be tokens of a sentence, and a sentence a token of a paragraph. These components can later be used as defining features when training a model to classify text. The process of splitting and breaking down the text data into these tokens is called tokenization. In natural language processing the two most common tokenization methods are word tokenization and sentence tokenization.

In the western world, languages usually separate words by white-spaces, for example English and most European languages. Thus a word tokenizer for these languages can mostly split by white-spaces. This would already be a quite good and simple tokenizer, but there are many exceptions in language that have to be taken into account to improve tokenization. For example, commas, quotations, question marks and other special characters have to be considered. They can be a part of the word, replaced by a whitespace or treated as their own word. Some languages, English included, contain many contractions, for example "you're", "I'd", "you've". These contracted words also require special rules from the tokenizer, being treated either as one or as two words. Misspellings are another problem that one has to be aware of when using a tokenizer.

When tokenizing by sentences, a first step would be to separate by periods, question marks, exclamation marks, new lines and sometimes colons and semicolons. There are of course special rules which have to be applied here as well. Consider the sentence "Hello Mr. Smith, how are you doing?". This sentence contains a period after "Mr", but this does not end the sentence. Several of these types of rules for special punctuation and abbreviation cases have to be used to create an acceptable sentence tokenizer.

Tokenization may seem like a trivial task compared to other processes when working with natural language, but it is of utmost importance. Errors in tokenization can lead to big problems down the line as it is the very foundation of natural language processing.
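The short NLTK example below illustrates both word and sentence tokenization on an invented sentence; the outputs in the comments are approximate and depend on the NLTK version.

```python
# Illustrative only: word and sentence tokenization with NLTK
# (tokenizer models must be downloaded once with nltk.download("punkt")).
from nltk.tokenize import sent_tokenize, word_tokenize

text = "Hello Mr. Smith, how are you doing? I'd like two coffees."

print(sent_tokenize(text))
# e.g. ['Hello Mr. Smith, how are you doing?', "I'd like two coffees."]
print(word_tokenize(text))
# e.g. ['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', '?', ...]
```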

2.4 Stemming

Stemming is a fundamental part of natural language processing and has been around for decades[5]. The process of stemming involves the transformation of a word, often through suffix removal, in order to obtain its stem. A stem is the base form of a word to which affixes can be added to form new words.

The most influential stemming algorithm is the Porter stemmer. It was published in 1980 by Martin Porter and uses suffix removal to increase the speed of the stemming process, which was very important at the time due to limited computing power. The algorithm uses five steps to identify and strip the suffix from a word; the steps are left out here but can be studied in the original paper[49]. Porter described a word as a repeating pattern of consonants and vowels,

$$[C](VC)^{m}[V] \tag{1}$$

where C and V denote one or more consonants or vowels in succession, respectively, m is the number of times the pattern (VC) repeats itself, and [C], [V] denote the possible presence of such consonants or vowels. The Porter stemmer is known for producing non-existing words but performs on par with previously published stemming algorithms, while being considerably faster than the stemmers of its time. Porter reported that it could process 10,000 words in 8.1 seconds on an IBM 370/165 in 1980. That computer had a price of 4.5 million dollars and a processing speed of 12.5 MHz[26][20].

The Porter stemmer was later revisited by Martin Porter, who produced the Porter2 stemmer, also known as the Snowball English stemmer[6]. Porter felt that the rules described in his paper were misinterpreted by researchers and that faulty versions of the algorithm were used, thus discrediting him, so he defined a new language which he named Snowball[50]. Snowball can be used to describe stemming rules in natural language[7].
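As a small illustration, the snippet below runs NLTK's Porter stemmer and its Swedish Snowball stemmer on a few words; the word lists are invented for the example and the exact stems can differ between NLTK versions.

```python
# Illustrative only: Porter and Swedish Snowball stemming with NLTK.
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

porter = PorterStemmer()
snowball_sv = SnowballStemmer("swedish")

for word in ["running", "kindness", "relational"]:
    print(word, "->", porter.stem(word))

# Swedish examples: "fakturan" (the invoice), "leveranserna" (the deliveries).
for word in ["fakturan", "leveranserna"]:
    print(word, "->", snowball_sv.stem(word))
```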

Stemmers can be divided into light and aggressive stemmers. Aggressive stemmers will more often produce an incorrect stem because they strip too much of the word, thus grouping together words which might not be related. An aggressive stemmer might also remove part of the stem, producing a word shorter than its base form. Aggressive stemmers are reported to perform better than light stemmers, as they tend to reduce the feature size more, thus increasing the number of "hits" from a query in an Information Retrieval (IR) system[58].

Apart from the aggression rate of the stemmer, stemmers can also be trained in different ways. The different stemming algorithms can be said to belong to one of the following categories[33][34].

1. Rule-based stemmers: These types of stemmers are often referred to as language-specific stemmers since they are trained to work with a specific language. The training is done by a person with extensive knowledge of the language, a linguist. Rule-based stemmers can further be divided as:

– Brute force or Lookup table: These stemmers use a table containing all words correlating to a specific stem. They fall short if the word is not in the table but they often do better at correctly stemming morphologically complex words. An example of this is the word ”go” which is the stem of itself as well as ”went”.

– Affix removal: The affix of a word is the part that is added to the stem when altering the morphological meaning of the word. Affixes are both prefixes and suffixes, i.e. characters can be added at the beginning of the word (prefix) or at the end of the word (suffix). An example of this is the addition of the suffix ”-ness” to the word ”kind” to create ”kindness”.

Affix stripping stemmers often produce non-existing words since they tend to over-stem or under-stem words, thus producing an incorrect stem and possibly grouping together unrelated words or failing to group together related words.

– Morphological: A stemmer that is trained to handle both inflectional and derivational variants of words. They are hard to train since they require complete knowledge about the language which they are trained for.

2. Statistical stemmers: A statistical stemmer is one that has been trained in an unsupervised or semi-supervised way by analyzing text of some sort in order to learn the language. A strong selling point of the statistical stemmers is that they are language independent and can be retrained in another language quickly by feeding them text in that language. The statistical stemmers are categorized as:

– Lexicon analysis: Learns the morphology of a language through analysis of the language’s lexicon.

– Corpus analysis: These stemmers are trained to group together words that co-occur in corpuses. Two words are said to co-occur if they are present in a text within some word count of each other and are similar with respect to their stem, which is generated by an aggressive stemmer[36].

– Character N-gram: A stemmer that has been trained on the n-gram con- sistency of words. It can learn to identify prefixes and suffixes of words by analysing character succession.


3. Hybrid stemmers: A mix of statistical and rule based stemmers.

The efficiency of a stemming algorithm can be measured by comparing a text before and after stemming and identifying correct and incorrect stems. Two different measurements have been proposed, the under-stemming index (UI) and the over-stemming index (OI)[46]. They are defined as

$$UI = 1 - \text{conflation index} \tag{2}$$

$$OI = 1 - \text{distinctness index} \tag{3}$$

$$\text{conflation index} = \frac{\text{correctly stemmed word pairs}}{\text{total word pairs}} \tag{4}$$

$$\text{distinctness index} = \frac{\text{unsuccessfully stemmed word pairs}}{\text{total word pairs}} \tag{5}$$

Further, a relation between them called the stemming weight (SW) has been proposed and is defined as

$$SW = \frac{OI}{UI} \tag{6}$$

Light stemmers will achieve a high under-stemming rate and low over-stemming rate and aggressive stemmers achieve a high over-stemming rate and low under-stemming rate.

Stemming often increases the recall of information retrieval systems, since it reduces the feature size of the document[27]. Light stemmers increase precision and aggressive stemmers increase recall, see section 2.13. Imagine a search engine which serves its users relevant documents matching their search query. If a user searches for "power walk" the search engine should also match that to "power walking", as it is likely to be relevant to the user. This can be done by indexing stemmed pages where walking is stemmed to walk. However, it may lead to a reduction in precision.

2.5 Lemmatization

Stemming was previously mentioned as a way to reduce the feature space of a corpus in order to classify it more accurately. Lemmatization, too, concerns itself with normalizing and reducing the feature space, but does so in a slightly different way. Stemmers are mostly focused on producing a stem of a word. These stems are sometimes not real words and may not be the best solution for some applications[57].

An alternative is to use a lemmatizer which only produces real words in the form of lemmas[56][42]. The lemma is produced by an affix removal procedure but the word is cross-checked with a lexicon to see whether or not the word actually exists.

A lemmatizer is more advanced than a stemmer, as it has some morphological knowledge in addition to the stemmer's knowledge about affixes. A lemmatizer contains the entire lexicon and can handle complex inflectional variations such as "ate", which is an inflection of "eat". However, a lemmatizer may need an input as to what type of word it is dealing with, as a word can have a different meaning depending on whether it is a noun, a verb or an adjective. For languages with complex inflections a lemmatizer provides higher accuracy than a stemmer, though the performance cost can be high[39].
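The snippet below illustrates this POS dependence with NLTK's WordNet lemmatizer (the WordNet data must be downloaded once); the example words are chosen to mirror the "ate"/"eat" case above.

```python
# Illustrative only: WordNet-based lemmatization in NLTK
# (requires nltk.download("wordnet") the first time).
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("ate", pos="v"))      # -> eat
print(lemmatizer.lemmatize("running", pos="v"))  # -> run
print(lemmatizer.lemmatize("running", pos="n"))  # -> running (treated as a noun)
```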


2.6 Stopwords

Which words give meaning to a language? Even though words removed sentence still probably understand meaning might seem like poorly formulated sentence first.

Have you changed your mind? Did you fill in the gaps as you read?

Stopwords are words that carry little to no real information and only serve as glue in the language for us humans to understand it better. In fact, about 30-50% of all text is made up of stopwords, depending on the size of the corpus[40][44]. According to Zipf's law[60], the most common word in a corpus occurs twice as often as the second most common and thrice as often as the third[18][51].

Stopword removal is an important step in preprocessing before classification. In a large corpus the feature space can quickly reach the range of thousands, often considerably more than the number of classes. By reducing the number of irrelevant features to consider at classification, the accuracy of the classifier is likely to improve.

Another advantage of removing irrelevant words, often stopwords, is the shrinkage of the file size, often by up to 30%.

There are a number of different ways to obtain stopwords in a language. Many programming language libraries come with stopword lists bundled with them, and similar lists can be found on the internet[8]. These lists are often exhaustive and may remove too much. Hence, stopword lists should be used in a trial-and-error fashion, as research has shown that lists with just nine words can reach the same accuracy as lists with 570 words in classification tasks on English text corpora[40].

There are automatic ways to filter out stopwords, which might be preferable for context-specific classification. One such way is the use of TF-IDF (Term Frequency - Inverse Document Frequency), which statistically orders the words by their importance[38]. From this list a user can, for example, use a threshold value to automatically choose the 10-100 most common words, or manually pick words thought to be of little meaning to a classifier. Consider also eliminating the least common words, since very rare words might throw off a classifier, although this is context-specific and hard to generalize beyond words such as "yourselves".

The literature seems to be unanimous in the conclusion that removing stopwords, generally, has a positive impact on the classification accuracy, although it is highly dependent on the context considered[40][38][23][44][25].

For reference, the initial sentence which was stripped of stopwords read: "Even though some words have been removed from this sentence you can still probably understand the meaning of it and it might just seem like a poorly formulated sentence at first".
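A minimal sketch of list-based stopword removal with NLTK is shown below; the sentence is the one discussed above, and the English list is just one of the bundled lists mentioned earlier.

```python
# Illustrative only: removing stopwords with NLTK's bundled English list
# (requires nltk.download("stopwords") and nltk.download("punkt")).
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))
sentence = ("Even though some words have been removed from this sentence "
            "you can still probably understand the meaning of it")

tokens = word_tokenize(sentence)
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)  # roughly the stripped-down version quoted in the text
```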

2.7 Part-of-speech (POS) tagging

All words in a sentence belong to a specific lexical category based on their context.

These categories are called parts of speech, or POS, the most common of which are nouns, adjectives and verbs. The act of automatically classifying words to a part of speech is called part-of-speech tagging, or POS tagging. More specific tags are usually also applied, such as noun-plural. POS tagging is done because, in natural language processing, the more information we have about a word the better. It can be used later to filter out unwanted words, predict neighboring words or chunk together words which may belong together.

The process of tagging words in a text with their correct POS is not always trivial. Words are ambiguous: many words can mean more than one thing, and some words may be either a noun or a verb depending on context. One such word is call. Here it is used as a noun: I missed your call yesterday; and as a verb: I will call you tomorrow. In English, 80-86% of words are unambiguous, which may make this seem like a non-problem. But even though only 10-14% of words are ambiguous, these are often very frequently used words and can make up 55-67% of the words in a common English running text. This means smart POS taggers are required, which not only assign a specific POS to each word but also consider context. Simple POS taggers will be explained first, and then we move on to more complex taggers.

Most taggers are trained on pre-labeled data, but there are POS taggers which only follow specific rules. The regular expression tagger uses regular expression patterns to assign the POS of a word. For example, a word ending with ed is considered a past participle of a verb and a word ending with 's is assigned to be a possessive noun.

Another rule-based tagger is the lookup tagger. It uses a list of the n most common words in a language and tags future text with their most common POS. These types of POS taggers always tag a specific word with the same tag; the context is not considered.

POS taggers can be built using supervised training. This means using already POS-tagged corpora to train and test a tagger. The simplest tagger that can be trained is the default tagger. This tagger looks at the most common POS tag in a corpus and uses that tag as its default when tagging new words. If the most common tag is noun, it will tag all words in future text as nouns. This will, as expected, not produce useful results, as future words will not always be nouns.

A unigram tagger is another trained POS tagger. It will always tag a word with the POS tag which was most common for that word in the training data. For example, if the word call occurred more often as a noun than as a verb in the training text, the unigram tagger will always consider call to be a noun. This model still does not take context into account, but can provide sufficient results if trained on a large corpus.

To get the POS tag which is most likely in the given context the n-gram tagger can be used, which considers not only the current word but also the previous n-1 words' POS tags. If n = 3, the tagger looks at the two preceding words and tags the current word based on that. This means that the n-gram tagger changes its tag based on context, but it also requires a large training data set that includes many different combinations of word sequences. When a combination of words that has not been trained on is seen, the tagger will not be able to tag the word. Even if a large training data set is used, the number of context combinations is massive. If n is larger than three, only a tiny part of these combinations can be trained. To solve this problem a backoff model can be used.

When combining more than one tagger, a backoff model is created. If a more complex tagger cannot tag the current word, it "backs off" to a simpler one. An example would be the following (a code sketch of such a chain follows the list):

1. Try using a bigram tagger (n=2) to tag the word.

2. If it is not able to tag it, use a unigram (n=1) tagger.

3. If the unigram has not seen the word before it will not be able to tag it, then use the default tagger, i.e tag with the most common POS tag in the training data.
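A possible NLTK version of this backoff chain, trained on part of the Brown corpus, is sketched below; the corpus choice and training size are assumptions made for the example.

```python
# Illustrative backoff chain in NLTK, mirroring the three steps above
# (requires nltk.download("brown") and nltk.download("punkt")).
import nltk
from nltk.corpus import brown

train_sents = brown.tagged_sents(categories="news")[:3000]

default = nltk.DefaultTagger("NN")                          # step 3: most common tag
unigram = nltk.UnigramTagger(train_sents, backoff=default)  # step 2
bigram = nltk.BigramTagger(train_sents, backoff=unigram)    # step 1

print(bigram.tag(nltk.word_tokenize("I will call you tomorrow")))
```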


2.8 Shallow Parsing

Text spoken and written by humans is often highly ambiguous. This is one reason why parsers can have a hard time knowing which words of a sentence refer to which other words in that sentence. A lengthy sentence of 20-30 words can contain thousands of different possible sentence structures[48]. The simple sentence

”Jane walked down the street in her new dress.”

is easy to understand for us humans. We know immediately this means Jane walked on a street, and she is wearing her new dress. But a parser might see this as Jane walked down a street, where the street is inside her new dress. A parser needs to consider all these possible alternatives, discard the implausible ones and choose the one that reads as a human would read it.

Computers can easily be used to read structured data such as lists with information, for example a list which maps languages to specific countries. But free text is unstructured data and hard for a computer to interpret. Shallow parsing, also known as light parsing, is the act of grouping together tokens from sentences to create higher-level phrases. The main goals of shallow parsing are to get semantically meaningful phrases, observe relations between words and generate structured data from unstructured data.

With shallow parsing a parse tree can be built from a sentence, providing an easy-to-understand breakdown of it[41]. Figure 2 shows an example of a shallow parse tree:

Figure 2: Shallow parse tree

These trees can be traversed recursively to find the roots and relations of words. The leaves are the individual tokens of the text; these are often linked to their corresponding POS tag, which is used to decide which tokens form a chunk. Each chunk also has its own tag, such as noun phrase, verb phrase or determiner.

Specifying the rules and patterns which decide which tokens should be included in these chunks is called chunking. Most parsers are POS-tag based: they specify rules based on the POS tags of the words in the sentence. For example, if a determiner is followed by an adjective and then by a verb, they are all formed into one chunk.

But there are sentences which have the same POS tags and should still be chunked differently. A supervised, machine-learned parser can be used to reduce these kinds of errors.

The n-gram parser is one of these supervised machine learned parsers. It is trained on already chunked data and works similarly to the n-gram POS-tagger. The parser looks at the n-1 previous tokens’ POS-tags to decide the chunk.


The reverse of chunking is called chinking. Sometimes it may be of greater use to specify not the POS tags to include in chunks, but rather the tags not to include.

An example would be to chink all tokens POS-tagged with adverb.
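A small chunking example with NLTK's regular-expression chunker is given below; the noun-phrase grammar is a simple assumption for illustration, not a grammar used in the project.

```python
# Illustrative only: regular-expression chunking of noun phrases in NLTK
# (requires nltk.download("punkt") and nltk.download("averaged_perceptron_tagger")).
import nltk

sentence = "Jane walked down the street in her new dress"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

# NP chunk: optional determiner, any number of adjectives, one or more nouns.
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)

tree = chunker.parse(tagged)  # a shallow parse tree like the one in Figure 2
tree.pprint()
```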

2.9 Feature extraction

What are features? Features in machine learning are defined as measurable attributes for each data point in a dataset. Features can either be in numerical form or categorical form. Features can be used in a machine learning algorithm to find patterns and create a model. This model will then be able to categorize new features fed into it. Obtaining these features is called feature extraction. When working with text data the features may be the tokens obtained after the text normalization and tokenization processes. Textual data has to be converted from text to numerical features, as all machine learning algorithms are mathematical at heart and expect numerical features.

Vector space models, or term vector models, are often used to convert textual data to numerical features. They are popular in information retrieval and document ranking. A vector space model is a mathematical model representing documents as numeric vectors of specific terms. We can write the model mathematically by denoting a document D in the vector space VS as

$$VS = \{W_1, W_2, \ldots, W_n\} \tag{7}$$

where there are n words W across all the documents considered, and

$$D = (w_{D1}, w_{D2}, \ldots, w_{Dn}) \tag{8}$$

where w_{Dn} is the weight for word n in document D. The weight is a numeric value and can represent things such as frequency, average frequency of occurrence or TF-IDF weight (more on this later).

The simplest of these vectorization models is the bag of words (BOW) model. It converts documents into vectors, so that each document has a vector representing the frequency of all distinct words of the vector space in that document. Thus, in the document D, the weight of each word is equal to the frequency of occurrence of that word in the specific document. For example, consider the sentence

"I like apples, because apples taste really good."

The vector created here with the BOW model would have the value 2 for apples, 1 for taste and 0 for banana, if banana also existed as a word in the vector space. The features in the BOW model are limited to the document vector space of the training data; words not present in the training data will not be considered when using the model on new text documents. There are problems with the BOW model, since it is based on the absolute frequency of occurrence of words in the document. If some words occur a lot in the documents they will have a large weight and may overshadow other features which may be more interesting or effective for identifying the category of the document. Longer documents may also be seen as more important, since they probably contain higher counts of many words. These problems can be handled by using the TF-IDF model[53][21]. The term frequency-inverse document frequency (TF-IDF) model is slightly more complex than the BOW model. It is defined as the multiplication of the term frequency and the inverse document frequency. The term frequency is the same as in the bag of words model, i.e. the frequency of occurrence of a specific term. It can be written as

$$tf(w, D) = f_{wD} \tag{9}$$

where f_{wD} is the frequency of word w in document D. But since using just the term frequency scales up common terms and scales down more unique terms, the inverse document frequency is also used. The inverse document frequency is determined by the logarithm of the inverse of the fraction of documents in which a term/token exists, mathematically:

$$idf(w) = \log\left(\frac{|D|}{1 + |d : w \in d|}\right) \tag{10}$$

where |D| is the total number of documents and |d : w ∈ d| is the number of documents in which the word w occurs. The 1 is added to avoid division by zero. So, if a token occurs a lot in a document it will have a high tf, and if this token exists in many different documents it will have a lower idf. The formula for the TF-IDF weight is then

$$\text{tf-idf}(w) = tf(w, D) \cdot idf(w) \tag{11}$$

The TF-IDF of a token thus has a high weight if that token has a high occurrence in the current document and a low document frequency across all documents.
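The scikit-learn sketch below computes both kinds of weights for two toy documents; the second document is invented for the example, and get_feature_names_out assumes a reasonably recent scikit-learn version.

```python
# Illustrative only: bag-of-words counts and TF-IDF weights with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "I like apples, because apples taste really good.",
    "The coffee machine leaks water.",  # invented second document
]

bow = CountVectorizer()
X_bow = bow.fit_transform(docs)          # raw term counts per document
print(bow.get_feature_names_out())
print(X_bow.toarray())

tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(docs)      # TF-IDF weighted features
print(X_tfidf.toarray().round(2))
```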

2.10 Bias-Variance

Machine learning can be a wonderful tool when working with data prediction. However, it is important that the trained model can be generalized to new data. When training a model it is customary to use part of the data for training, part for testing and another part as validation data. If the model is trained and tested on all data it might be too inflexible. A model which fits the data well is said to have low bias. It might however have a high variance, meaning that it will not generalize well to new data, since it has been trained to fit a specific data set. In general, a complex model will have high variance and low bias, whereas a simple (linear) model will have high bias and low variance. This correlation between bias and variance is known as the bias-variance tradeoff. Bias is defined as the difference between the expected value (mean) of X and the true value of X, written as

$$\mathrm{Bias}_{\theta}[\hat{\theta}] = E[X] - X \tag{12}$$

Variance is defined as the expected squared difference between the true value and the mean value, written as

$$\sigma_x^2 = E[(X - \mu)^2] \tag{13}$$

where µ is the mean of X. A classifier for regression problems is said to contain three types of error sources: the bias, the variance and the irreducible error. This is also known as the bias-variance decomposition and describes how much a classifier differs in its predictions from the true values. The irreducible error comes from the fact that the data itself, on which the model is trained, may contain noise. This error gives the lower bound for classifier accuracy and cannot be reduced. A classifier for classification problems normally uses the 0-1 loss function instead, where an incorrect prediction gives a loss of 1 and a correct prediction a loss of 0. Bias is here defined as incorrect predictions (low bias = accurate predictions), whereas variance is defined as how much the predictions vary from dataset to dataset.

2.11 Supervised learning

2.11.1 Multinomial naive bayes classifier

The multinomial naive Bayes classifier is a variation of the standard naive Bayes classifier and is used when there are more than two classes to predict from. We will first describe the naive Bayes classifier. It is a linear probabilistic classifier based on Bayes' theorem. It is widely used and, even though simple, works well in a number of complex text classification problems. It uses the naive assumption that the features are totally independent. This assumption is usually wrong and makes the classifier overestimate the probability of the predicted class, which would be a problem if we wanted to know the probability of each class. But as the Bayes classifier is used only to state the predicted class, it works well[24]. The classifier is less computationally heavy than many other machine learning algorithms but often performs equivalently[35]. Some uses are spam detection, document classification and NSFW tagging.

The naive Bayes classifier uses the maximum a posteriori (MAP) rule to classify a document. To find the MAP class, calculate the product of the probabilities of each word in a document given a specific class, multiplied by the prior probability of that class. Do this for all classes considered and choose the one with the highest probability. This can be written as

$$c_{map} = \arg\max_{c}\, P(c|d) = \arg\max_{c}\left( P(c) \prod_{k=1}^{n} P(w_k|c) \right) \tag{14}$$

where w_k are the words/tokens in the document d, c is a class such that c ∈ C where C is the set of all defined classes, P(c|d) is the conditional probability of class c given document d, P(c) is the prior probability of the class and P(w_k|c) is the probability of token w_k given class c. To avoid extremely small numbers we usually choose the maximum of the sum of the logarithms instead of the above product,

$$c_{map} = \arg\max_{c}\left( \log P(c) + \sum_{k=1}^{n} \log P(w_k|c) \right) \tag{15}$$

Formula 15 outputs the same class prediction as formula 14. A 1 is added to each count in P(w_k|c) to avoid taking the logarithm of zero.

In a multinomial naive Bayes classifier we also consider how many times each token appears in a document. The feature vector can be the term-frequency weight (bag of words model) or the tf-idf weight. In multinomial naive Bayes we can represent the probability of the feature w in a document of class c as

$$P(w|c) = \frac{T_{cw} + 1}{\sum_{w_i}^{n} (T_{cw_i} + 1)} \tag{16}$$

where T_{cw} is the number of times the feature w occurs in documents of class c and n is the total number of different features in class c. Laplace smoothing, adding one to each count, is done to avoid issues with zero probabilities.
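The snippet below fits a multinomial naive Bayes classifier on TF-IDF features with scikit-learn; the example texts and labels are invented, and alpha=1.0 corresponds to the Laplace smoothing in eq. (16).

```python
# Illustrative only: multinomial naive Bayes text classification in scikit-learn.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["the machine will not start", "the coffee tastes burnt",
         "water is leaking underneath", "it does not turn on"]  # invented
labels = ["power", "taste", "leak", "power"]                    # invented

model = make_pipeline(TfidfVectorizer(), MultinomialNB(alpha=1.0))
model.fit(texts, labels)

print(model.predict(["my machine never starts"]))
```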

2.11.2 Support Vector Machine (SVM)

The supervised machine learning algorithm support vector machine (SVM) is widely used in, for instance, text and image classification. It is a binary linear algorithm and works both for regression and classification. Consider two classes: if an SVM is fed data points from these classes as k-dimensional vectors, its goal is to find the (k-1)-dimensional hyperplane which best separates the points.

A hyperplane is defined as a flat surface of dimension N-1 in an N-dimensional space which splits the space in half. It can be written as

$$w^{\top} x = 0 \tag{17}$$

where x is a set of points and w is the normal vector to the hyperplane. In $\mathbb{R}^2$ the hyperplane is a line, in $\mathbb{R}^3$ it is a plane. The best hyperplane is the one with the largest distance/margin between the two classes, so we choose the hyperplane with the largest distance to the nearest points of both classes; this is known as the maximum-margin hyperplane. These nearest data points to the hyperplane are called the support vectors; if they are altered or removed, the hyperplane will move. This makes the SVM memory efficient, since it only uses the support vectors to do the decision making. When there is a new data point it will be classified based on which side of the hyperplane it appears on. The farther away from the hyperplane data points are, the more confident we are that they are correctly classified.

Figure 3: The hyperplane w best separates the two classes based on the support vectors.

If the data points are linearly separable we can achieve total separation with a hyperplane. If the data points are not linearly separable in the current dimensional space, nonlinear classification can still be done using the kernel trick. This means mapping the feature vectors into higher dimensional vector spaces, where the separation with a hyperplane may be better. This is done by transforming the training feature vectors, X, to X′ using the transformation φ, where X′ is of higher dimension. Find the hyperplane w which best separates X′. Then, when there are new data points to classify, transform X_new to X′_new and linearly classify X′_new using w.

Figure 4: Non-linearly separable features transformed into a higher dimension feature space where they become linearly separable

This kernel trick is not usually used when working with textual data; instead, a linear kernel is used. The feature vectors in a text classification problem can have a dimension of several hundreds or thousands, with single data samples often containing information in only a small fraction of these dimensions. This means that the feature matrices are very sparse, which makes them easily linearly separable[37].

SVM also works when there are more than two classes. Multiclass problems are solved by reducing the problem into multiple binary classification problems. Two common ways of achieving this problem reduction are the one-vs-one and one-vs-all approaches. One-vs-one compares one class against every other class, one by one. The class with the most "wins" is chosen as the predicted class. One-vs-all compares one class against all others. In one-vs-all each class is tested, and the class with the highest output, i.e. the probability of belonging to that class, is chosen as the predicted class.
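As an illustration, the sketch below trains a linear SVM on TF-IDF features with scikit-learn, which handles the multiclass case with a one-vs-rest scheme; the texts, labels and C value are invented for the example.

```python
# Illustrative only: a linear-kernel SVM on sparse TF-IDF features.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

texts = ["the machine will not start", "the coffee tastes burnt",
         "water is leaking underneath", "it does not turn on"]  # invented
labels = ["power", "taste", "leak", "power"]                    # invented

svm = make_pipeline(TfidfVectorizer(), LinearSVC(C=1.0))  # one-vs-rest by default
svm.fit(texts, labels)

print(svm.predict(["there is water under the machine"]))
```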

2.11.3 Feed-forward neural network

Neural networks have become tightly associated with artificial intelligence and are often thought of as a black box which has the capacity to understand everything, be it natural language, driving cars, becoming a world champion in Go or figuring out how to improve itself. In reality, a neural network is just like a circuit board, adding signals together in a predetermined fashion. But where do they get their "intelligence" from? The answer: from their many layers of simple, yet powerful, computation units. It is common to refer to a neural network as "deep". This means that the network implements multiple hidden layers, through which inputs are weighted and summed in a complex way.

A hidden layer is a set of computation nodes, or simply units. A unit has the simple task of performing a summation of its inputs, X, multiplied with some weights, W .

Figure 5: Input data multiplied with a weight matrix and used as an input to a hidden node in a feed-forward neural network.


Figure 6: Input data multiplied with a weight matrix and used as an input to two hidden nodes in a feed-forward neural network.

The input to Unit 1 in fig. 5 is

\[
\text{Input Unit 1} = x(1) \cdot w(1) + x(2) \cdot w(2) + x(3) \cdot w(3) + bias(1) \tag{18}
\]

\[
\text{Input Unit 1} = \sum_{i=1}^{3} x(i) \cdot w(i) + bias(1) \tag{19}
\]

where the bias is a trainable constant of the unit used for shifting the sum. The result is then fed through a so-called activation function. There are different activation functions to choose from, but the most common ones nowadays are ReLU (Rectified Linear Unit) and tanh. The function depicted within the units in fig. 5 and fig. 6 is the tanh activation function, which outputs values in the range (−1, 1). This bounds the output of the unit, which could otherwise grow without limit as the sum in eq. 19 grows (ReLU, in contrast, is unbounded above and takes values in the range [0, ∞)). The process of activating the unit is written as

\[
\tanh(x) = \frac{2}{1 + e^{-2x}} - 1 \tag{20}
\]

where x is the input to the unit, i.e. the result of eq. 19. From eq. 20 it is clear that a large negative sum in eq. 19 produces a large value in the denominator, which means that tanh approaches −1 for large negative sums and +1 for large positive sums. The result of tanh is the output of the unit. In fig. 5 and fig. 6 no outputs have been drawn, because the layout of the network is undefined there. The output from a unit in a network with one hidden layer is the final output of the network, i.e. the class prediction of the network with respect to some input. The output from a unit in a network with multiple hidden layers is an input to all units in the next hidden layer. This is illustrated in fig. 7.


Figure 7: Full overview of the structure of a feed-forward neural network. From left to right: one input layer, two hidden layers, one with two nodes and one with three nodes, and one output layer.

Every connection in fig. 7 is associated with a unique weight w, and all units in the hidden layers, as well as the outputs, have their own bias associated with them. These weights and biases are the trainable parameters of the network and are adjusted in the training process in order to obtain a desired output, y, given some known input, x.
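The following NumPy sketch illustrates a forward pass through a network with the layout of fig. 7; the weights, biases and input values are made-up examples, not trained parameters from this project.

# Minimal NumPy sketch of the forward pass in fig. 7: three inputs, a hidden
# layer with two units, a hidden layer with three units and a single output.
import numpy as np

def tanh(x):
    # eq. 20
    return 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0

rng = np.random.default_rng(0)
x = np.array([0.5, -1.0, 2.0])                        # input layer (three features)

W1, b1 = rng.normal(size=(2, 3)) * 0.1, np.zeros(2)   # hidden layer 1, two units
W2, b2 = rng.normal(size=(3, 2)) * 0.1, np.zeros(3)   # hidden layer 2, three units
W3, b3 = rng.normal(size=(1, 3)) * 0.1, np.zeros(1)   # output unit

h1 = tanh(W1 @ x + b1)                                # eq. 19 per unit, then eq. 20
h2 = tanh(W2 @ h1 + b2)
y = W3 @ h2 + b3                                      # raw network output (score)
print(y)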

Training a neural network, or any machine learning algorithm for that matter, involves minimizing a loss function. Common loss functions for neural networks are the SVM (hinge) loss and the cross-entropy loss of the Softmax classifier. As previously mentioned, all units are associated with an activation function, whereas the output nodes are associated with a loss function. The loss of the Softmax classifier is defined as

\[
L_i = -\log\!\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right) \tag{21}
\]

where the expression inside the parentheses is called the Softmax function.

The Softmax classifier minimizes the cross-entropy loss. It outputs a vector of normalized values which sum to 1. The output values can be interpreted as probabilities, though they depend on the regularization strength (if regularization is used). The loss function is minimized in order to increase prediction accuracy. This is done by reducing the loss (error) between the predicted outputs and the true outputs given some known input. A method called gradient descent (there are others available as well) can then be used to indicate in which direction in weight space the weights need to be adjusted in order to reach a minimum. The gradient is the vector of derivatives of a function, giving the slope of the function at the point of evaluation. This means that by following the negative gradient of the loss function it is possible to move in the direction for which the loss decreases the most.

The gradient of the Softmax loss with respect to the score of class k is[9]

\[
\frac{\partial L_i}{\partial f_k} = p_k - \mathbb{1}(y_i = k) \tag{22}
\]

where 1 is the indicator function, which is 1 if y_i = k is satisfied and 0 otherwise, and p_k is the class probability as predicted by the Softmax classifier. By using a technique called backpropagation all weights can be updated with respect to the prediction error. The error propagates backwards through the network and is multiplied with the derivative of the activation function along the way[54]. Multiplying the partial derivative of the loss function with respect to a certain weight by a chosen, constant, step size and subtracting the result from that weight is called gradient descent, written as

\[
w_i = w_i - \sigma \cdot \frac{\partial L}{\partial w_i} \tag{23}
\]

where w_i is a weight in the network and σ is the step size, also known as the learning rate of the network. A common technique to restrain the complexity of the model is to use regularization, which penalizes large weights. There are a few different regularization functions, one of them being the L2 norm, written as

\[
R(W) = \sum_j \sum_i W_{j,i}^2 \tag{24}
\]

By multiplying this with a constant called the regularization strength it is possible to adjust the complexity of the model by suppressing certain weights. The regularization loss is added to the data loss in order to obtain the total loss as

\[
L = \frac{1}{N} \sum_i L_i + \lambda R(W) \tag{25}
\]

where λ is the regularization strength (often ≪ 1) and L_i is the per-sample loss of the network, i.e. the Softmax or SVM loss.
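Eqs. 21-25 and the gradient-descent update can be summarised in a short NumPy sketch; the shapes, data and constants below are arbitrary examples chosen only to make the computation concrete, not the network used in this project.

# Sketch of eqs. 21-25 in NumPy: Softmax cross-entropy loss, L2 regularization
# and a plain gradient-descent step on a single weight matrix.
import numpy as np

np.random.seed(0)
N, D, C = 5, 4, 3                       # samples, features, classes
X = np.random.randn(N, D)
y = np.array([0, 2, 1, 1, 0])           # true class indices
W = 0.01 * np.random.randn(D, C)
lam, sigma = 1e-3, 0.1                  # regularization strength and step size

scores = X @ W                          # the scores f in eq. 21
scores -= scores.max(axis=1, keepdims=True)
p = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

data_loss = -np.log(p[np.arange(N), y]).mean()   # mean of L_i (eq. 21)
loss = data_loss + lam * np.sum(W ** 2)          # eq. 25 with eq. 24

dscores = p.copy()                      # eq. 22: p_k - 1(y_i = k)
dscores[np.arange(N), y] -= 1
dW = X.T @ dscores / N + 2 * lam * W    # backpropagate to the weights
W -= sigma * dW                         # gradient-descent update (eq. 23)
print(loss)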

2.11.4 Convolutional neural network

The field of image classification has had a great resurgence of late, largely thanks to advances made with convolutional neural networks. They work very well for text classification as well, but the concept is more easily understood in an image classification context.

Convolutional neural networks (CNNs) are similar to standard feed-forward neural networks. They operate on their input in a feed-forward fashion, passing it from layer to layer before arriving at a conclusion. While a standard neural network consists of hidden layers, a CNN consists of a mixture of convolutional layers, pooling layers and fully-connected layers, each with its own advantages. An image consists of pixels stacked in rows and columns with a height H and a width W. The total number of pixels in an image is H ∗ W, which even for small images can reach thousands of pixels. In order to feed an image into a standard feed-forward neural network, a weight vector with thousands of weights would be needed. This would then be multiplied with the number of nodes in the current hidden layer, which might be repeated multiple times for a deep network with many of these hidden layers. The resulting network would contain millions of trainable parameters and be impractical, as it would require a massive amount of training data in order for the loss function to converge. It might also lead to overfitting issues. Another issue is the problem of feature locality. Words and images are fed into standard neural networks as a vector of values. Imagine an image of a red car driving down the road in the center of the image. What happens if a feed-forward network is trained on images of cars driving down the road in the center of the image and an image is fed in where the car is in one of the corners? The classification will fail, since the pattern is completely different from the training data. Convolutional neural networks solve these problems with weight sharing and high-level feature extraction in the convolutional layers.

Convolutional layers operate on images (and text) by performing convolutions on a spatially small region at a time with a filter. This region is known as the receptive field of the node (neuron) and is equivalent to the input to a node in a hidden layer in a feed-forward network. Common filter sizes are 3x3x1 and 3x3x3, or equivalently with a spatial size of 5. The dimensions represent the width, the height and the depth of the image, where a depth of 1 represents a grayscale image and 3 an RGB image. The filter slides over the image and applies a neuron's weights to the pixels as it passes them, usually by moving row-wise one or two pixels at a time. The resulting image is called an activation map and consists of seemingly random values with a spatial dimension equal to or less than that of the original image, depending on whether zero padding was applied or not. The size of the resulting image can be computed with the following formula.

\[
W' = \frac{W - F + 2P}{S} + 1 \tag{26}
\]

If the input image is square, W is the width or height; if the input image is rectangular, W is the width and is replaced with H when calculating the height instead. F is the filter dimension (usually 3-5), P is the padding around the image and S is the stride, i.e. the amount the filter is shifted each time (usually 1-2). The process of moving a filter over the image is repeated multiple times with a new set of weights associated with another neuron. This produces a number of activation maps of the original image, each with its own special knowledge of the features in the image. Some might be able to detect color, some vertical lines and some horizontal lines. Next, an activation function is applied to the output, similar to that of a feed-forward network. A popular choice is the ReLU function.

The result is passed on to the next layer, usually a max-pooling layer. A pooling layer is used as a way to scale down the feature map in order to reduce the processing time and the number of weights required for potential forthcoming convolutional layers. A max-pooling layer works by applying the max operator to the values beneath a filter which slides across the feature map. It does this for all feature maps produced by the previous convolutional layer.

Figure 8: A representation of a max-pooling layer where the maximum value is extracted to form a new feature map. The operation reduces the spatial dimensions of the input map from 4 to 2, see eq. 27.

The output size of the pooling layer can be calculated in a similar fashion to that of the convolutional layer

\[
W' = \frac{W - F}{S} + 1 \tag{27}
\]

If the input map is rectangular, W can be exchanged for H when calculating the height. W is the width of the input map, F is the filter dimension and S is the stride, i.e. how much the filter is shifted at each new calculation. In the case of fig. 8, F = 2, S = 2 and W = 4, resulting in W' = (4 − 2)/2 + 1 = 2, which is the width and height of the resulting output. Lastly, the remaining image is fed through a fully-connected layer with a softmax function attached to it, which is equivalent to the output layer in a regular feed-forward network.
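The two output-size formulas can be captured in a couple of small helper functions; the example calls below simply reproduce the fig. 8 case and an assumed 28x28 input chosen for illustration.

# Helpers implementing eqs. 26 and 27 for the spatial output size of a
# convolutional and a pooling layer (square input assumed for simplicity).
def conv_output_size(W, F, P=0, S=1):
    """Eq. 26: output width of a convolutional layer."""
    return (W - F + 2 * P) // S + 1

def pool_output_size(W, F, S):
    """Eq. 27: output width of a pooling layer."""
    return (W - F) // S + 1

# The max-pooling example from fig. 8: W = 4, F = 2, S = 2 gives 2.
print(pool_output_size(4, 2, 2))            # -> 2
# An assumed 28x28 image, 5x5 filter, no padding, stride 1 gives a 24x24 map.
print(conv_output_size(28, 5, P=0, S=1))    # -> 24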

The process is similar when performing text classification, but depending on the context the pooling layer is sometimes left out. Another difference is that instead of an image with a height H, a vector of words or embeddings with height one is used[47][10].
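A hedged Keras sketch of such a 1-D convolutional text classifier is given below; the vocabulary size, sequence length, layer sizes and number of classes are arbitrary assumptions, not the architecture evaluated in this work.

# Illustrative 1-D convolutional network for sequences of word indices.
import tensorflow as tf
from tensorflow.keras import layers

vocab_size, seq_len, n_classes = 10000, 100, 5

model = tf.keras.Sequential([
    tf.keras.Input(shape=(seq_len,)),              # sequence of word indices
    layers.Embedding(vocab_size, 128),             # word embeddings, "height one"
    layers.Conv1D(64, 5, activation="relu"),       # filter slides over the word axis
    layers.GlobalMaxPooling1D(),                   # max-pooling over the sequence
    layers.Dense(n_classes, activation="softmax"), # fully-connected output layer
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()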

2.12 Unsupervised learning

In all the previous sections on text classification, supervised learning has been described. It has been assumed that the data used to train the model has already been manually labeled. There are, however, several unsupervised learning techniques to classify and group together data sets which do not have predetermined labels. These clustering algorithms try to find structures and patterns within the data based on their features alone and group the data points into different clusters. By manually looking into these clusters a human can then determine what kind of classes the algorithm has found.

When clustering text documents they can be grouped together based on document similarity. Their similarity can be found by calculating the distances, for example the Euclidean distance, between the feature vectors of the documents.

A common clustering technique, and one of the fastest, is k-means clustering. It aims to group n data points into k (manually specified) clusters. It is a centroid-based clustering model, which means it keeps a representative centroid in the feature space for each of the k clusters. A data point is then assigned to the cluster of the nearest centroid. To find these centroids the following algorithm is used:

1. Place k centroids c_1 . . . c_k at random locations.

2. For each data point n:

– Calculate the Euclidean distance from point n to each centroid c_j.

– Choose the centroid c_j with the lowest distance and assign point n to cluster j.

3. For each centroid c_j:

– Calculate the mean of all points n which were assigned to cluster j in step 2.

– Update the centroid c_j to the calculated mean.

4. Repeat steps 2 and 3 until the centroids no longer move in step 3.

These now labeled data points can then be used as training data in supervised machine learning to create a new model which can label new data points.
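As a sketch of how such clustering could look in practice with scikit-learn (the example documents and the choice k = 2 are illustrative assumptions):

# k-means clustering of documents represented as sparse TF-IDF vectors.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["invoice not received", "wrong amount on invoice",
        "parcel is late", "where is my delivery"]

X = TfidfVectorizer().fit_transform(docs)                 # sparse feature vectors
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)   # cluster index per document, to be inspected manually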


2.13 Evaluation

2.13.1 Confusion matrix

A confusion matrix can be seen as a table where the rows contain the predicted labels and the columns contain the true labels. It can be used in both multiclass and binary classification by adding classes on the respective axes. Correct predictions are located on the diagonal of the matrix, while incorrect predictions are located in the upper and lower triangles. The confusion matrix is called a matching matrix when evaluating an unsupervised learning algorithm.
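A minimal scikit-learn example of building a confusion matrix is shown below; the label vectors are made up, and note that scikit-learn places the true labels on the rows, i.e. the transpose of the convention described above.

# Confusion matrix for a made-up two-class example.
from sklearn.metrics import confusion_matrix

y_true = ["billing", "billing", "delivery", "delivery", "delivery"]
y_pred = ["billing", "delivery", "delivery", "delivery", "billing"]

# Correct predictions lie on the diagonal; rows follow the given label order.
print(confusion_matrix(y_true, y_pred, labels=["billing", "delivery"]))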

2.13.2 Recall & Precision

Recall is also known as sensitivity and is defined as the number of true positives over the sum of true positives and false negatives. Precision is defined as the number of true positives over the sum of true positives and false positives.

\[
recall = \frac{t_p}{t_p + f_n} \tag{28}
\]

\[
precision = \frac{t_p}{t_p + f_p} \tag{29}
\]

2.13.3 F1-Score

The F1-score is defined as the harmonic mean of recall and precision. It is popular because it considers both the correctly predicted classes and the falsely predicted classes, something recall and precision, individually, do not. A classifier can obtain a very high recall but still have a low precision.

Consider the case where a classifier is trained on an unbalanced dataset. If 90% of the data belongs to a specific class, the classifier can obtain 100% recall for that class by predicting that label for all data points. The precision will, however, suffer, as the data points of the other class are falsely classified as belonging to the majority class. The function for computing the F1-score is defined as:

\[
F1_{score} = \frac{2 \cdot recall \cdot precision}{recall + precision} \tag{30}
\]
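The three measures can be computed directly from their definitions (eqs. 28-30) and checked against scikit-learn; the label vectors below are made-up examples.

# Recall, precision and F1 from their definitions, verified with scikit-learn.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

recall = tp / (tp + fn)                             # eq. 28
precision = tp / (tp + fp)                          # eq. 29
f1 = 2 * recall * precision / (recall + precision)  # eq. 30

print(recall, recall_score(y_true, y_pred))
print(precision, precision_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))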

References
