
IT 19 059

Degree project, 30 credits (Examensarbete 30 hp)

September 2019

Word Segmentation for Classification of Text

Anusha Venkataraman



Abstract

Word Segmentation for Classification of Text

Anusha Venkataraman

Compounding is a highly productive word-formation process in some languages that is often problematic for natural language processing applications. Word segmentation is the problem of splitting a string of written language into its component words. The purpose of this research is to carry out a comparative study of different word segmentation techniques and to identify the technique that best aids the extraction of keywords from text. English was chosen as the language. Dictionary-based and machine learning approaches were used to split the compound words. This research also aims to evaluate the quality of a word segmentation by comparing it with a reference segmentation.

Results indicated that dictionary-based word segmentation segmented compound words better than machine learning segmentation when technical words were involved. The results also showed that improving the quality of the text alone is not sufficient to improve the text classification.

Printed by: Reprocentralen ITC, IT 19 059

Examiner: Mats Daniels

Subject reviewer: Michael Ashcroft
Supervisor: Annette Hultåker


Sammanfattning

Ordsegmentering för klassificering av text

Compounding is a highly productive word-formation process in some languages that is often problematic for natural language processing applications. Word segmentation is the problem of splitting a string of written language into its component words.

The purpose of this research is to carry out a comparative study of different word segmentation techniques and to identify the best technique for aiding the extraction of keywords from text. English was chosen as the language. Dictionary-based and machine learning approaches were used to split the compound words. This research also aims to evaluate the quality of a word segmentation by comparing it with a reference segmentation.

The results showed that dictionary-based word segmentation gave better results when segmenting compound words than machine learning segmentation when technical words were involved. Moreover, improving the quality of the text alone did not fully improve the results of the text classification.


Acknowledgements

I would like to begin this thesis by thanking my supervisor Annette Hultåker (Scania) for all the support during this thesis. Thank you for giving me such a great opportunity to work at Scania and to deepen my knowledge within NLP. I would also like to thank my colleague Rithika Harish Kumar (KTH University) for providing a different perspective and brainstorming ideas with me.

Thank you, Dr Michael Ashcroft (Uppsala University) for reviewing this thesis. Your valuable inputs and ideas during this thesis were crucial. Also, thank you for arranging the Thursday meetings, they were really helpful.

Thank you, YSEA group and the team of “NLP auto-sorting of failure reports” for your help and support in sharing your knowledge during this thesis.


Table of Contents

Abstract
Acknowledgements
List of Tables
List of Figures
List of Acronyms
1 Introduction
  1.1 Problem Statement
  1.2 Delimitation
  1.3 Background
2 Literature Review
  2.1 Text Pre-processing
3 Theory
  3.1 Naive Bayes Probability model
  3.2 Zipf's law
  3.3 Viterbi algorithm
  3.4 Bernoulli Trial
  3.5 Bayesian network
  3.6 Simulated Annealing
  3.7 Gibbs Sampling
  3.8 Text Representation as N-Grams
4 Methodology
  4.1 Data Processing
    4.1.1 Data-set
    4.1.2 Vocabulary
    4.1.3 Pre-processing
  4.2 Implementation
    4.2.1 Approach 1
    4.2.2 Approach 2
    4.2.3 Approach 3
5 Results
  5.1 Observations on Dataset
  5.2 Evaluation based on Word Boundary
  5.3 Classification Accuracy
  5.4 Error Count Comparison
6 Discussion & Analysis
  6.1 Overall Observation
7 Conclusion


List of Tables

Table 4.1: Effects of De-capitalization
Table 5.1: Word count representation
Table 5.2: Number count representation
Table 5.3: Error count representation
Table 5.4: Example of manual data-set
Table 5.5: Evaluated results
Table 5.6: Classification Accuracy


List of Figures

Figure 1.1: Manual categorization
Figure 3.1: State Diagram for 3-states
Figure 3.2: Bayesian network
Figure 4.1: Enumerating possible segmentations
Figure 5.1: Word count representation
Figure 5.2: Distribution of Numeric count representation
Figure 5.3: Error count representation
Figure 5.4: Word, Number and Error count Comparison


List of Acronyms

POS Part-of-Speech Tagging
FQ Field Quality
NLP Natural Language Processing
ML Machine Learning
I(0)BES Intermediate, Beginning, End and Single characters
C4.5 Decision trees
HMM Hidden Markov Models
CRF Conditional Random Fields
SVM Support Vector Machines
MLP Multi-Layer Perceptrons
MDL Minimum Description Length
DAG Directed Acyclic Graph
IR Information Retrieval
DTC Diagnostic Trouble Codes
MCMC Markov Chain Monte Carlo
DP Dirichlet Process


1 Introduction

Scania receives a huge number of failure reports and technical documents from truck drivers at workshops in different countries, amounting to tens of thousands of reports per year. These short texts are usually written in haste and are therefore sometimes filled with mistakes to a degree where the language is difficult for a machine to evaluate. The main idea of this thesis is to improve the quality of these failure reports with an efficient word segmentation algorithm, which in turn should improve the classification of the reports.

Classification of failure reports is based mainly on the failure description in the report. The description is analyzed for an issue, component model or any other technical detail mentioned. If the issue already exists, the report is assigned the assignment number that matches the description; if the issue is new, a new assignment number is assigned. So, failure reports with similar issues and the same set of component types are categorized under the same assignment number.

Before the reports can be classified, the data requires pre-processing. The pre-processing stage usually contains tasks such as tokenization, stop-word removal, lowercase conversion, stemming, grammar correction, spelling correction, etc. For the extracted data source, lemmatization, stop-word removal and lowercase conversion are used as the major filters before the word segmentation approach is applied. [1]

When the failure description data was analyzed, it mainly consisted of spelling errors, compound words, special codes along with special symbols, technical terms, abbreviations (engine model, component type, part number codes, etc.) and English text mixed with words from other languages. The errors in these descriptions were leading to incorrect categorization of the reports; to improve the quality and categorization of the reports, compound words should be split to give a better understanding of the issue. Along with the clean-up, a vocabulary needs to be created consisting of all the technical terms, special codes and abbreviations, which is helpful as a training data-set. Word segmentation is the initial step for most higher-level natural language processing tasks, such as POS (Part-of-Speech) tagging, parsing and machine translation. It can be regarded as the problem of correctly identifying word forms from a character string. [2]


Word segmentation is the problem of dividing a string of written language into its component words. In English and many other languages using some form of the Latin alphabet, space is a good approximation of a word divider (word delimiter), although this concept has limits because of the variability with which languages regard collocations and compounds. Many English compound nouns are variably written (for example, ice box = ice-box = icebox) with a corresponding variation in whether speakers think of them as noun phrases or single nouns. There are trends in how norms are set, such as that open compounds often tend eventually to solidify by widespread convention, but variation remains systemic. In contrast, German compound nouns show less orthographic variation, with solidification being a stronger norm. However, the equivalent to the word space character is not found in all written scripts, and without it, word segmentation is a difficult problem. Languages which do not have a trivial word segmentation process include Chinese, Japanese, where sentences but not words are delimited, Thai and Lao, where phrases and sentences but not words are delimited, and Vietnamese, where syllables but not words are delimited. [3]

1.1 Problem Statement

This thesis focuses on the following questions:

• How does word segmentation as a pre-processing step affect the performance of the classification of the failure reports?

• Does it improve the classification accuracy?

1.2 Delimitation

To limit the scope of this thesis, only English text was considered. Non-English words are therefore treated as error words even when they are valid words in another language.

1.3 Background

Some relevant work has already been done on classifying the failure reports. Scania has an Advanced Analytics Research group which is working on auto-sorting the failure reports based on keyword extraction. Another research group is working on language translation to make the translation efficient, but the major challenge there is to understand the meaning of the text in order to provide a true translation.

Previously, the sorting of these reports was manual: Field Quality (FQ) engineers classified the reports based on the assignment numbers, the failure description and the component type.


Figure 1.1: Manual categorization

The reports are submitted to Scania by dealers who identify unknown problems at the workshop. These reports are submitted to FQ engineers on a daily basis. The FQ engineers check for:

• new problems,

• known problems under investigation,

• extra information needed from reports to sort them.

From the gathered information, reports are sorted based on their status (new, open, closed, insufficient information, etc.).

But the process is tedious, as each report needs to be checked individually for problems and around 50-80 percent of the issues that arrive are unique in nature. Also, the connections made by FQ engineers are not 100 percent correct.


2 Literature Review

In this chapter, the work related to this thesis is summarized. The articles cover two important aspects. The first aspect covers different text segmentation techniques and the second aspect talks about different evaluation techniques used for segmentation.

2.1 Text Pre-processing

Word segmentation can be very challenging, especially for languages without explicit word boundary delimiters, such as Chinese, Japanese and Vietnamese. Even for space-delimited languages like English or Russian, relying on white space alone generally does not result in adequate segmentation as at least punctuation should usually be separated from the attached words. [2] Another challenge can be to find whether the composition is a valid word segmentation, i.e. it consists of genuine words.

Automatic segmentation is the problem, in natural language processing, of implementing a computer process to segment text. When punctuation and similar clues are not consistently available, the segmentation task often requires techniques such as statistical decision-making, large dictionaries, and consideration of syntactic and semantic constraints. Effective natural language processing systems and text segmentation tools usually operate on text from specific domains and sources.

The process of developing text segmentation tools starts with collecting a large corpus of text in an application domain. [3] There are two general approaches:

• Manual analysis of text and writing custom software

• Annotate the sample corpus with boundary information and use machine learning

Based on these approaches, there are three types of word segmentation algorithms [4]:

• Rule-based segmentation, with rules derived from the grammar

• Dictionary-based segmentation, by dictionary lookup

• Machine learning, data-driven segmentation

Rule-based segmentation can be a simple bigram method that inserts a word break between two characters c1 and c2 if a word break between c1 and c2 is more frequent than a non-break in the training corpus, or a unigram method where segmentation between two characters is achieved by classifying every individual character according to the I(0)BES scheme, (preferred) intermediate (I), beginning (B), end (E) and single (S) characters, in the training corpus. [2]

Commonly used algorithms for dictionary-based segmentation are the minimum word count matching algorithm and the max-match algorithm, where dictionaries are used as the standard against which words are matched. The Triangular Matrix approach uses nested loops and a circular array to store the optimum segmentation of prefix sub-strings. A triangular matrix of parts with increasing lengths is generated, organized as a circular array. This allows constant memory consumption, as intermediate segmentation variants are overwritten once they are no longer required. Instead of storing full segmented strings, only a bit vector of potential space positions is used. This reduces the memory required to store segmentation variants (1 bit instead of 1 char) and is garbage-collection friendly. [5]

More advanced machine learning approaches include clustering algorithms (kNN, kMeans), decision trees (C4.5), Naive Bayes, sequence labeling models (Hidden Markov Models, HMM, and Conditional Random Fields, CRF), as well as Support Vector Machines (SVM) and Multi-Layer Perceptrons (MLP). [4]

There are many other approaches that can be used, such as deep learning with neural networks, inductive learning and classification-based rules, entropy, etc. With the rise of deep learning, neural network models have been investigated for the character tagging approach. The main idea is to replace manual discrete features with automatic real-valued features, which are derived automatically from distributed character representations using neural networks. The derived neural feature representations are fed into a CRF (Conditional Random Field) inference layer.

A transition-based framework [6] is used to segment a sentence incrementally from left to right, scoring partially segmented results using both character-level and word-level features. LSTMs (Long Short-Term Memory networks) are built over the input character sequence and the output word sequence, and the output features are exploited for scoring. Also, beam search is applied to reduce error propagation, and online large-margin training with early update is used for learning from inexact search. The resulting model is a word-based neural segmenter that can leverage rich character-level and word-level features. [7]

The approach is based on paper [8], where the word segmentation method is based on inductive learning and consists of segmenting words, predicting unknown words, extracting rules and proofreading the process. This method uses only the surface information of a character string and has the advantage of being entirely independent of any specific language. The method recursively extracts character strings that frequently occur in the text as word candidates, and it also extracts segmentation rules with context information to deal with segmentation ambiguity. The method classifies the extracted word candidates into different ranks according to the extraction situation, segmenting a text into words with the extracted word candidates.

According to paper [9], MDL (the Minimum Description Length principle [10]) is used to eliminate the discretion in the context length and threshold parameters. At the same time, branching entropy enables a constrained search through the hypothesis space, allowing high performance in terms of F-score and speed, and reduced computational time. This combination has the potential to improve the coverage of morphological analyzers for languages without explicit word boundary markers.

Choosing a set of criteria to evaluate and quantify the result can be a difficult task. The main existing measures evaluate the quality of a text segmentation by comparing it with a reference segmentation. According to paper [11], various metrics such as character boundary evaluation, word boundary evaluation, WindowDiff, WinPR and the Pk metric can be used to evaluate word segmentation.


3 Theory

This chapter provides an introduction to the artificial intelligence and machine learning techniques used in this thesis. The Naive Bayes probability model, Zipf's law, the Viterbi algorithm, Bernoulli trials, Bayesian networks, simulated annealing, Gibbs sampling and n-gram text representation are described in the sections below.

3.1 Naive Bayes Probability model

The simplest statistical model for calculating the probabilities of words is based on Bayes' rule. The simplicity of this model makes it easy to implement, so it comes up often in applications, hence the special name: naive Bayes. Computation of this probability rests on the following assumptions:

• The position of a word in the document does not matter.

• Conditional independence: each occurrence of a word is independent of the others.

Bayes' rule describes the probability of an event based on prior knowledge of conditions that might be related to the event. The probability of a sequence of words is the product of the probabilities of each word given the context of preceding words.

Let n be the length of the word sequence. Then

P(W_{1:n}) = \prod_{k=1}^{n} P(W_k \mid W_{1:k-1})    (3.1)

In a generic case study, Equation 3.1 can be calculated by obtaining the conditional probability of each word given its preceding context and then multiplying these probabilities. If we had data for sequences up to 5-grams, the probability of a 5-word sequence would be the product of the probability of each word given the four previous words (not all previous words). But there are three difficulties with the 5-gram model. First, the 5-gram data is about 30 GB, so it cannot all fit in RAM. Second, many 5-gram counts will be 0, and we would need some strategy for backing off, using shorter sequences to estimate the 5-gram probabilities. Third, the search space of words is large because dependencies extend up to four words away. All three of these difficulties can be managed, at a huge computational and resource cost. Another generic solution could be to introduce a window over the word sequence, dropping the probabilities of non-important words. Instead, a much simpler unigram language model is considered for this case study, which solves all three difficulties at once: the probability of a sequence is just the product of the probability of each word by itself. In this model, context is not important and the probability of each word is independent of the other words:

P(W_{1:n}) = \prod_{k=1}^{n} P(W_k)    (3.2)

One of the disadvantages of this model is the underflow problem. The smallest positive floating-point number that can be represented is about 4.9 × 10^{-324}; anything smaller rounds to 0.0, which causes underflow. To avoid underflow, the simplest solution is to add logarithms of numbers rather than multiplying the numbers themselves. Another disadvantage of this model is that the data is sloppy. For instance, the segmentation of "helloworld" is just "helloworld": this token appears in the dictionary in various forms, commonly enough to outweigh the product of "hello" and "world" alone. Unfortunately, this cannot be fixed by fiddling with the data we already have. Instead, the model can be extended to use bigrams (n ≥ 2) to look at frequency counts of sequences of words. [12], [13]
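As a minimal sketch of this unigram model and the log-probability workaround (the word counts below are invented, not taken from the thesis data), the probability of a sequence under Equation 3.2 can be evaluated in log space:

```python
import math

# Invented unigram counts standing in for the real frequency dictionary.
counts = {"hello": 120, "world": 90, "is": 500, "it": 450, "i": 400, "sit": 30}
total = sum(counts.values())

def log_prob_word(word):
    """Log-probability of a single word; unseen words get a small smoothed mass."""
    return math.log(counts.get(word, 0.5) / total)

def log_prob_sequence(words):
    """Equation 3.2 in log space: summing log-probabilities avoids underflow."""
    return sum(log_prob_word(w) for w in words)

print(log_prob_sequence(["is", "it"]))  # less negative (more likely) ...
print(log_prob_sequence(["i", "sit"]))  # ... than this, under these toy counts
```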

3.2 Zipf’s law

Zipf's law is an empirical law formulated using mathematical statistics. Zipf's law states that, given a large sample of words, the frequency of any word is inversely proportional to its rank in the frequency table. That is, if t_1 is the most common term in the collection, t_2 the next most common, and so on, then the collection frequency cf_i of the i-th most common term is proportional to 1/i:

cf_i \propto \frac{1}{i}    (3.3)

So if the most frequent term occurs cf_1 times, then the second most frequent term has half as many occurrences, the third most frequent term a third as many occurrences, and so on. The intuition is that frequency decreases very rapidly with rank. Equation 3.3 is one of the simplest ways of formalizing such a rapid decrease and it has been found to be a reasonably good model. For example, in the Brown Corpus of American English text, the most frequently occurring word, "the", accounts for nearly 7 percent of all the words (69,971 out of slightly over 1 million). True to Zipf's law, the second-place word "of" accounts for slightly over 3.5 percent of words (36,411 occurrences), followed by "and" (28,852). [14]

Equivalently, we can write Zipf's law as cf_i = c \cdot i^k, or as \log cf_i = \log c + k \log i, where k = -1 and c is a constant. It is therefore a power law with exponent k = -1.

In the English language, the probability of encountering the r-th most common word is given roughly by P(r) = 0.1/r for r up to 1000 or so. The law breaks down for less frequent words, since the harmonic series diverges. Goetz states the law as follows: the frequency of a word is inversely proportional to its statistical rank r such that

P(r) \approx \frac{1}{r \ln(1.78R)}    (3.4)

where R is the number of different words. [15], [16]
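As an illustrative sketch (the toy text below is invented), the fit of Equation 3.3 can be inspected by comparing observed frequencies against the prediction cf_1 / i:

```python
from collections import Counter

# Toy corpus standing in for the failure-report text.
text = "the engine the valve the pump of the truck of engine and valve and the"
counts = Counter(text.split())

# Rank terms by frequency and compare cf_i with the Zipf prediction cf_1 / i.
ranked = counts.most_common()
cf1 = ranked[0][1]
for i, (word, cf) in enumerate(ranked, start=1):
    print(f"rank {i}: {word!r} observed {cf}, Zipf prediction {cf1 / i:.2f}")
```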

3.3 Viterbi algorithm

The Viterbi algorithm is a dynamic programming algorithm for finding the most likely sequence of hidden states—called the Viterbi path—that results in a sequence of observed events. Dynamic programming refers to simplifying a complicated problem by breaking it down into simpler sub-problems in a recursive manner. Decisions that span several points in time do often break apart recursively.

Optimal substructure means that the solution to a given optimization problem can be obtained by combining optimal solutions to its sub-problems. Such optimal substructures are usually described by means of recursion. For example, given a graph G = (V, E), the shortest path p from a vertex u to a vertex v exhibits optimal substructure: take any intermediate vertex w on this shortest path p. If p is truly the shortest path, then it can be split into sub-paths p1 from u to w and p2 from w to v such that these, in turn, are indeed the shortest paths between the corresponding vertices. [17], [18]

Dynamic programming can be achieved in either of two ways [19]:

Top-down approach: This is the direct fall-out of the recursive formulation of any problem. If the solution to any problem can be formulated recursively using the solution to its sub-problems, and if its sub-problems are overlapping, then one can easily memoize or store the solutions to the sub-problems in a table. Whenever we attempt to solve a new sub-problem, we first check the table to see if it is already solved. If a solution has been recorded, we can use it directly, otherwise we solve the sub-problem and add its solution to the table.

Bottom-up approach: Once we formulate the solution to a problem iteratively in terms of its sub-problems, we can try reformulating the problem in a bottom-up fashion: try solving the sub-problems first and use their solutions to build on and arrive at solutions to bigger sub-problems. This is also usually done in tabular form by iteratively generating solutions to bigger and bigger sub-problems using the solutions to smaller sub-problems. For example, if we already know the values of F41 and F40, we can directly calculate the value of F42 (F = Fibonacci).

The Viterbi algorithm provides an efficient way of finding the most likely state sequence, in the maximum a posteriori probability sense, of a process assumed to be a finite-state discrete-time Markov process.

Suppose we have a line of text of n characters. Every character has some significance in the word it belongs to. The position of a character in a particular word plays a major role in changing the context of the word. For example, the characters 'b, o, l, w' can form two different words: 'bowl' and 'blow'. In the same way, adding or removing a character in a word can change the meaning of the word. The significance of a character in a word is stored in a feature vector. [20] Each character yields a feature vector z_i, i = 1, 2, ..., n. Let p(Z|C) denote the conditional probability of the feature vectors Z = z_1, z_2, ..., z_n conditioned on the sequence of identities C = c_1, c_2, ..., c_n, where z_k is the feature vector for the k-th character, and where c_k takes on M values (the number of letters in the alphabet) for k = 1, 2, ..., n. Also, let P(C) be the a priori probability distribution of all sequences of n characters. When the sequence of characters forms a word, the probability of the sequence of characters given the feature vectors is maximal. The probability of correctly segmenting the text is maximized by choosing the sequence of characters that has the maximum a posteriori probability, given by P(C|Z).

From Bayes' rule,

P(C \mid Z) = \frac{p(Z \mid C) P(C)}{p(Z)}    (3.5)

As p(Z)is independent of the sequence C, the equation becomes:

P(C|Z)∝ p(Z|C)P(C) (3.6) The amount of storage required for these probabilities is huge in practice, for this reason, assumptions are made in order to reduce the problem down to manageable size. To simplify P(C|Z)we can assume independence among the feature vectors. This means that the word that we are scrutinizing is only dependent on the word itself and not on the neighbouring words. Using this assumption and taking logarithms the discriminant function Equation 3.6 reduces to:

\log P(C \mid Z) = \sum_{i=1}^{n} \log p(z_i \mid c_i) + \log P(c_1, \dots, c_n)    (3.7)

For the Viterbi algorithm, it is assumed that the process is a first-order Markov process, and Equation 3.7 becomes:

\log P(C \mid Z) = \sum_{i=1}^{n} \log p(z_i \mid c_i) + \log P(c_1 \mid c_0) + \dots + \log P(c_n \mid c_{n-1})    (3.8)


Figure 3.1: State Diagram for 3-states

In this state diagram, the nodes (circles) represent states, arrows represent transitions, and over the course of time the process traces some path from state to state through the state diagram. Now, a length proportional to -\log[p(Z \mid C) P(C)] is assigned to every path. Since there is a one-to-one correspondence between paths and sequences, the path whose length -\log[p(Z \mid C) P(C)] is minimal is found. The total length of the path corresponding to some state sequence C is

-\log[p(Z \mid C) P(C)] = \sum_{k=1}^{n} l(t_k)    (3.9)

where l(t_k) is the length associated with each transition t_k from c_k to c_{k+1}. The shortest such path segment is called the survivor corresponding to the node c_k, and is denoted S(c_k). For any time k > 0, there are M survivors in all, one for each c_k. The shortest complete path S must begin with one of these survivors. To get to time k+1, the survivors are extended by one time unit and the lengths of the extended path segments are computed for each node c_{k+1} to obtain the corresponding time-(k+1) survivors. The iteration proceeds indefinitely.


3.4 Bernoulli Trial

In the theory of probability and statistics, a Bernoulli trial is a random experiment with exactly two possible outcomes, success and failure, in which the probability of success is the same every time the experiment is conducted. [22]

Let p be the probability of success in a Bernoulli trial, and q be the probability of failure. Then the probability of success and the probability of failure sum to unity (one); the events are mutually exclusive and exhaustive. Thus one has the relation

p + q = 1.

Alternatively, these can be stated in terms of odds: given probability p of success and q of failure, the odds for are p : q and the odds against are q : p. These can also be expressed as numbers, by dividing, yielding the odds for, o_f, and the odds against, o_a:

o_f = p/q = p/(1-p) = (1-q)/q    (3.10)

o_a = q/p = (1-p)/p = q/(1-q)    (3.11)

These are multiplicative inverses, so they multiply to 1, with the relations o_f = 1/o_a, o_a = 1/o_f, o_f \cdot o_a = 1.

3.5 Bayesian network

A Bayesian network is a probabilistic graphical model (a type of statistical model) that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG). Bayesian networks are ideal for taking an event that has occurred and predicting the likelihood that any one of several possible known causes was the contributing factor. [23]

Nodes represent variables that may be observable quantities, latent variables, unknown parameters or hypotheses. Edges represent conditional dependencies. Nodes that are not connected to each other, i.e. where no path connects one node to the other, represent variables that are conditionally independent of each other. Each node is associated with a probability function that takes, as input, a particular set of values for the node's parent variables, and gives the probability or probability distribution of the variable represented by the node.

For example, if m parent nodes represent m Boolean variables, then the probability function could be represented by a table of 2^m entries, one entry for each of the 2^m possible parent combinations.


Figure 3.2: A simple Bayesian network with conditional probability tables

In this example, node C has two parents (A and B) and its probability function is represented by a table with 2^2 = 4 entries, one for each combination of parent values. [24]

Bayesian networks satisfy the local Markov property, which states that a node is conditionally independent of its non-descendants given its parents. The joint distribution for a Bayesian network is equal to the product of P(node | parents(node)) over all nodes:

P(X_1, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid X_1, \dots, X_{i-1}) = \prod_{i=1}^{n} P(X_i \mid Parents(X_i))    (3.12)

In larger networks, this property reduces the amount of required computation, since generally most nodes will have few parents relative to the overall size of the network. Using a Bayesian network can also save considerable amounts of memory compared to exhaustive probability tables, if the dependencies in the joint distribution are sparse. For example, a naive way of storing the conditional probabilities of 10 two-valued variables as a table requires storage space for 2^10 = 1024 values. If no variable's local distribution depends on more than three parent variables, the Bayesian network representation stores at most 10 · 2^3 = 80 values.
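As a tiny illustration of the factorization in Equation 3.12 for a three-node network like the one in Figure 3.2 (the conditional probability tables below are invented for the example):

```python
# Joint probability via Equation 3.12 for the network A -> C <- B,
# with invented conditional probability tables.
P_A = {True: 0.3, False: 0.7}
P_B = {True: 0.6, False: 0.4}
P_C_given_AB = {  # P(C=True | A, B): the four entries of node C's table
    (True, True): 0.9, (True, False): 0.5,
    (False, True): 0.4, (False, False): 0.1,
}

def joint(a, b, c):
    """P(A=a, B=b, C=c) = P(a) * P(b) * P(c | a, b)."""
    p_c = P_C_given_AB[(a, b)] if c else 1.0 - P_C_given_AB[(a, b)]
    return P_A[a] * P_B[b] * p_c

print(joint(True, False, True))  # 0.3 * 0.4 * 0.5 = 0.06
```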

Hyperparameters: A hyperparameter is a parameter whose value is set before the training process begins. In Bayesian statistics, a hyperparameter is a parameter of a prior distribution; the term is used to distinguish these from the parameters of the model of the underlying system under analysis.

For example, if one is using a beta distribution to model the distribution of the parameter p of a Bernoulli distribution, then p is a parameter of the underlying system (Bernoulli distribution), and α and β are parameters of the prior distribution (beta distribution), hence hyperparameters. There can be a single value for a given hyperparameter, or a probability distribution can be developed on the hyperparameter itself, which is called a hyperprior. [25]

3.6 Simulated Annealing

Simulated annealing (SA) is a probabilistic technique for approximating the global optimum of a given function. "Annealing" refers to an analogy with thermodynamics, specifically with the way that metals cool and anneal. Simulated annealing uses the objective function of an optimization problem instead of the energy of a material. The simulation can be performed either by a solution of kinetic equations for density functions or by using a stochastic sampling method.

Basic iteration: At each step, the simulated annealing heuristic considers some neighboring state s* of the current state s, and probabilistically decides between moving the system to state s* or staying in state s. These probabilities ultimately lead the system to move to states of lower energy. Typically this step is repeated until the system reaches a state that is good enough for the application, or until a given computation budget is exhausted.

The neighbours of a state: Optimization of a solution involves evaluating the neighbours of a state of the problem, which are new states produced by conservatively altering a given state. The well-defined way in which the states are altered to produce neighbouring states is called a "move", and different moves give different sets of neighbouring states. These moves usually result in minimal alterations of the last state, in an attempt to progressively improve the solution by iteratively improving its parts. Heuristics use the neighbours of a solution as a way to explore the solution space, and although they prefer better neighbours, they also accept worse neighbours in order to avoid getting stuck in local optima. They will find the global optimum if run for a long enough time.

Acceptance probabilities: The probability of making the transition from the current state s to a candidate new state s' is specified by an acceptance probability function P(e, e', T) that depends on the energies e = E(s) and e' = E(s') of the two states, and on a global time-varying parameter T called the temperature. States with a smaller energy are better than those with a greater energy. The probability function P must be positive even when e' is greater than e. This feature prevents the method from becoming stuck at a local minimum that is worse than the global one.

For sufficiently small values of T, the system increasingly favors moves that go "downhill" (i.e., to lower energy values), and avoids those that go "uphill". With T = 0 the procedure reduces to the greedy algorithm, which makes only the downhill transitions. The temperature T plays a crucial role in controlling the evolution of the state s of the system with regard to its sensitivity to variations of the system energies. To be precise, for a large T the evolution of s is sensitive to coarser energy variations, while it is sensitive to finer energy variations when T is small.

The annealing schedule: The algorithm starts with T set to a high value (or infinity), and T is then decreased at each step following some annealing schedule, which may be specified by the user but must end with T = 0 towards the end of the allotted time budget. In this way, the system is expected to wander initially towards a broad region of the search space containing good solutions, ignoring small features of the energy function; then drift towards low-energy regions that become narrower and narrower; and finally move downhill. [26], [27]
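A generic sketch of this loop is given below; the starting temperature, the geometric cooling factor and the toy objective are arbitrary illustrative choices, not values used in the thesis:

```python
import math
import random

def simulated_annealing(initial_state, energy, neighbour,
                        t_start=10.0, t_end=0.01, cooling=0.95):
    """Accept downhill moves always; accept uphill moves with
    probability exp(-(e_new - e) / T), where T follows the schedule."""
    state, e = initial_state, energy(initial_state)
    t = t_start
    while t > t_end:
        candidate = neighbour(state)
        e_new = energy(candidate)
        if e_new < e or random.random() < math.exp(-(e_new - e) / t):
            state, e = candidate, e_new
        t *= cooling  # annealing schedule: geometric cooling
    return state, e

# Toy usage: approximately minimise (x - 3)^2 over real x.
best, best_e = simulated_annealing(
    0.0,
    energy=lambda x: (x - 3.0) ** 2,
    neighbour=lambda x: x + random.uniform(-1.0, 1.0),
)
print(best, best_e)
```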

3.7 Gibbs Sampling

Gibbs sampling is a Markov chain Monte Carlo (MCMC) algorithm for obtaining a sequence of observations approximated from a specified multivariate probability distribution when direct sampling is difficult. This sequence can be used to approximate the joint distribution, to approximate the marginal distribution of one of the variables or some subset of the variables (for example, the unknown parameters or latent variables), or to compute an integral (such as the expected value of one of the variables). The idea in Gibbs sampling is to generate posterior samples by sweeping through each variable (or block of variables) and sampling from its conditional distribution with the remaining variables fixed to their current values. [28]

Suppose we want to obtain k samples of X = (x_1, \dots, x_n) from a joint distribution p(x_1, \dots, x_n). Denote the i-th sample by X^{(i)} = (x_1^{(i)}, \dots, x_n^{(i)}). We proceed as follows:

1. We begin with some initial value X^{(i)}.

2. We want the next sample; call it X^{(i+1)}. Since X^{(i+1)} = (x_1^{(i+1)}, x_2^{(i+1)}, \dots, x_n^{(i+1)}) is a vector, we sample each component of the vector, x_j^{(i+1)}, from the distribution of that component conditioned on all other components sampled so far. But there is a catch: we condition on X^{(i+1)}'s components up to x_{j-1}^{(i+1)}, and thereafter condition on X^{(i)}'s components, starting from x_{j+1}^{(i)} up to x_n^{(i)}. Formally, to sample x_j^{(i+1)}, we update it according to the distribution specified by p(x_j^{(i+1)} \mid x_1^{(i+1)}, \dots, x_{j-1}^{(i+1)}, x_{j+1}^{(i)}, \dots, x_n^{(i)}). Note that we use the value that the (j+1)-th component had in the i-th sample, not the (i+1)-th sample.

3. Repeat the above step k times.
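As a concrete sketch of this sweep, the following samples from a standard bivariate normal distribution, whose conditionals are available in closed form; the target distribution and the correlation value are illustrative assumptions, not part of the thesis:

```python
import math
import random

def gibbs_bivariate_normal(rho, n_samples, x0=0.0, y0=0.0):
    """Gibbs sampler for a standard bivariate normal with correlation rho:
    each coordinate is drawn from its conditional given the other,
    with the other coordinate fixed at its current value."""
    x, y = x0, y0
    sd = math.sqrt(1.0 - rho ** 2)
    samples = []
    for _ in range(n_samples):
        x = random.gauss(rho * y, sd)  # x | y ~ N(rho * y, 1 - rho^2)
        y = random.gauss(rho * x, sd)  # y | x ~ N(rho * x, 1 - rho^2)
        samples.append((x, y))
    return samples

samples = gibbs_bivariate_normal(rho=0.8, n_samples=5000)
print(sum(x * y for x, y in samples) / len(samples))  # roughly 0.8
```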

Gibbs sampling is commonly used for statistical inference. The idea is that observed data is incorporated into the sampling process by creating separate variables for each piece of observed data and fixing the variables in question to their observed values, rather than sampling from those variables. The distribution of the remaining variables is then effectively a posterior distribution conditioned on the observed data. The most likely value of a desired parameter (the mode) could then simply be selected by choosing the sample value that occurs most commonly. [29]

3.8 Text Representation as N-Grams

Language models assign probabilities to sequences of words, and n-grams are used for that. An n-gram is a sequence of N words. A collection of text is called a corpus, and the corpus is treated as a sequence of tokens: words and punctuation. In the context of a text corpus, n-grams typically refer to sequences of words. One word is a unigram, a bigram is a sequence of two words, a trigram is a sequence of three words, etc. The items inside an n-gram may not have any relation between them apart from the fact that they appear next to each other. For example, the sentence "Manifold is broken and exhaust leaked" has 6 unigrams, 5 bigrams ("Manifold is", "is broken", etc.), 4 trigrams and so on. [30]
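A short sketch of extracting n-grams from the example sentence above:

```python
def ngrams(tokens, n):
    """All contiguous n-token sequences, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "Manifold is broken and exhaust leaked".split()
print(ngrams(tokens, 1))  # 6 unigrams
print(ngrams(tokens, 2))  # 5 bigrams: ('Manifold', 'is'), ('is', 'broken'), ...
print(ngrams(tokens, 3))  # 4 trigrams
```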


4 Methodology

This chapter describes the data-set, dictionary and experimental setup used to test the word segmentation. First, the approach taken to prepare the data for the experiments is described. This is followed by the implementation of the approaches.

4.1 Data Processing

The major part of the data used for the approaches was provided by Scania.

4.1.1 Data-set

The data-set mainly comprises failure reports from the years 2013-2018. These failure reports contain:

• Failure description,

• Date when the report is submitted,

• Chassis number which identifies the product with problem,

• Part affected,

• Country of the reporter and the country code

• Placement of the part/component in the truck,

• Attachments- any pictures or documents.

The failure description is free text in different languages such as Swedish, Spanish, English, etc., which can be distinguished based on the country code or the country. But even after filtering the data for English text, reports still contained words from other languages due to local language influence. The description usually consists of detailed information about the complaint, codes related to parts/components, assignment numbers and DTC (Diagnostic Trouble Codes) fault codes. As the description is free text, it involves a lot of spelling errors, compound words, special characters (!, -, ., /, ', :, ;, etc.), company names, FQ engineer names, technical and non-technical abbreviations and codes. The examples below are a few samples of what failure descriptions look like.


Hi FRAS!The 3rd axle cracked on the RHS at the brake flange welding area. Costomer name Bxx Mxxx Site M P.Failure Date 1/1/2014 Running Hrs B40. Best Regards Jim.

Zxxxxxxx Concrete mixer pump Truck Customer complain emission over standard. After check and diagnose it found pressure of pump couldn’t be built up, exchange pump found reductant doser was mulfunction.So exchange reductant doser with new one and clean system then all fults disappeared.Refer snapshot and screenshots on SCR test attached please!!!

HelloDriver, complaines sensitiv ESP

11:57:11, saxbx , Hello FQs! vehiculo ingresal to the workshop for lack of braking efficiency. After the test it is observed that all the liquid is expelled by vent. Finding a fault in the battery retainer. The retarder is repaired and the problem solved.Best

V5 EGB valve seized as per FQIT 1234- Hello Christoffernew truck with ems faultcode 51xx Egb bypass valve fault unable to recalibrate. Replaced new valve calibrated and test all ok.Regards, Gawał ´Swer˙zenwski

Hej,Hvordan sætter man kalibreringsfaktorern i Interactor 100 til 1,008 ??? M.v.h.Scania Danmark A/S Jim Jam mmmabc: Jag har lyckats repetera det här på kontoret och du/ni har hittat något mycket konstigt. Även 1.003 verkar inte gå....

4.1.2 Vocabulary

Before stepping into pre-processing, a major requirement for NLP is to construct a vocabulary. The biggest obstacle in word segmentation is to identify what words are, or rather which strings are most likely to have meaning. If the Oxford dictionary were used as the vocabulary in this case, it would not be able to tell that 'LOL' should be a word.

Google has provided two major data corpora. The first, released in 2006, contains frequency counts of language data in the form of n-grams from over a trillion words from a random sample of the entire internet; only n-grams occurring above a certain threshold are included. Unfortunately this corpus is over a hundred gigabytes of uncompressed data and consequently it can't be found online. The second corpus is more specifically limited to language data organized as n-grams from scanned books. The data was scanned directly from books, so optical character recognition was used to determine which image segments correspond to words, and what those words are. Due to deterioration or printing errors in the scanned books, this results in a lot of misinterpreted tokens. Even if the data were proofread, terms that appear on the internet like 'LOL' would not be available. [12]

For the above reasons, a custom vocabulary was created from different technical documents from Scania along with language data from scanned books. These were the documents considered for the vocabulary:

• Scania Lexicon which consists of different technical description and abbreviations

• Scania workshop manual

• Driver manual

• Dismantling information

• SDP3 (Translated sentences)

• Scania service data

• Name list of FQ engineers

• Language data from scanned books

The length of the dictionary was calculated to be 172 650 words.

4.1.3 Pre-processing

Data needs to be prepared before any algorithm can be applied to it. Text is prepared by pre-processing it. Pre-processing text means bringing the text into a form that is predictable and analyzable for the task. It is used for extracting interesting and non-trivial knowledge from unstructured text data, i.e. Information Retrieval (IR), to satisfy a user's need for information [31]. Text pre-processing is needed in NLP:

1. To reduce the data file size of the data-set

• Stop words removal can account for 20-30 percent word reduction from the main text document,

• Stemming i.e., reducing words to a root by dropping unnecessary characters can also reduce word size to up to 50 percent.

2. To improve the efficiency of the IR system

• Stop words and punctuation are not that important for IR,

• Stemming is used for matching the similar words in a text.

All the terms mentioned are discussed more elaborately in upcoming paragraphs. There are four different parts in pre-processing [32]:

1. Cleaning i.e., getting rid of the less useful parts of text through stop-word removal, dealing with capitalization and special characters.

2. Annotation may include structural markup and part-of-speech tagging.

3. Normalization consists of linguistic reductions through stemming, lemmatization and other forms of standardization.


4. Analysis consists of statistically probing, manipulating and generalizing from the dataset for feature analysis.

There are different ways to pre-process the text. These were the following techniques used for this thesis:

Lower-casing or de-capitalization: Lower-casing is one of the simplest forms of text pre-processing. It can significantly help with the consistency of the expected output. When a dictionary lookup is done, if the words are not lower-cased, there can be several variations of the same word. For example, consider the word 'Data':

Table 4.1: Effects of De-capitalization
Raw: data, Data, DATA
Lower-cased: data

The problem with mixed-case occurrences of the same word is that they can affect the frequency of the word in a dictionary look-up approach, as the same word with mixed case will be counted as two separate words. Lower-casing solves that problem by mapping words with different cases to the same lowercase form. While lower-casing is generally helpful, it may not be applicable to all tasks. [32]

Stop-word removal: Many words recur frequently in documents but are essentially meaningless, as they are only used to join words together in a sentence. Stop-words are words that are commonly encountered in texts without dependency on a topic (e.g., conjunctions, prepositions, articles, etc.). Therefore, stop-words are usually assumed to be irrelevant in text classification studies and are removed prior to classification. Stop-words are specific to the language being studied, as in the case of stemming. The NLTK stop-words package was used to filter out around 179 stop-words. These are some sample stop-words: 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves'. [33]

Removal of special characters: Removing special characters also plays a major role in cleaning the text, especially when '.' and '-' are involved, turning a word into a compound word, for example stop-word or stop.word. These are some of the special characters removed from the data-set: [™!®\"#\\()*+,-./:;<=>?@\\&[\\]–‘| ”“%£].

Stemming: Stemming is the process of reducing inflection in words (e.g. troubled, troubles) to their root form (e.g. trouble). The "root" in this case may not be a real root word, but just a canonical form of the original word. Stemming uses a crude heuristic process that chops off the ends of words in the hope of correctly transforming words into their root form. So the words 'trouble', 'troubled', 'troubles' might actually be converted to 'troubl' instead of 'trouble', because the ends were just chopped off. There are different algorithms for stemming. The most common algorithm, which is also known to be empirically effective for English, is Porter's algorithm. [34]

Lemmatization: Lemmatization is an alternative to stemming for removing inflection. By determining the part of speech and utilizing WordNet's lexical database of English, lemmatization can get better results. Lemmatization is very similar to stemming; the only difference is that lemmatization tries to do it the proper way. It does not just chop the end off, it actually transforms words to their actual root, using a dictionary such as WordNet for mappings or some special rule-based approaches. So the words 'awake', 'awoke', 'awoken' become 'awake'. [32]
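A sketch of such a pre-processing pipeline using NLTK is shown below; the filter order, the character list in the regular expression and the function name preprocess are illustrative assumptions rather than the thesis implementation:

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time downloads: stop-word list and WordNet data for lemmatization.
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()
SPECIAL_CHARS = re.compile(r"[™!®\"#()*+,\-./:;<=>?@&\[\]–‘|”“%£]")

def preprocess(text):
    """Lower-case, strip special characters, drop stop-words and lemmatize."""
    text = SPECIAL_CHARS.sub(" ", text.lower())
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return [LEMMATIZER.lemmatize(t) for t in tokens]

# Stop-words and punctuation are dropped; the remaining tokens are lemmatized.
print(preprocess("The 3rd axle cracked on the RHS at the brake flange welding area."))
```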

4.2 Implementation

4.2.1 Approach 1

Approach 1 used in this thesis is very similar to Peter Norvig's algorithm; this particular method was designed by Jeremy Kun. The approach is a dictionary-based segmentation which works on dynamic programming. When dynamic programming is involved, recursion comes into the picture, so to save computation and complexity cost this approach uses memoization. [12], [35]

As we know, DP is a method to solve a complex problem by breaking it down into a group of simpler sub-problems, solving each of those sub-problems once, and storing their solutions in a cache. The technique of storing solutions to sub-problems instead of recalculating them is called memoization. Every time before the program makes a recursive call to segment the remaining string, the cache is checked to see whether this exact substring has been segmented before. If so, the recursive call is dismissed and the best segmentation result is selected from the cache.

This approach is an ideal algorithm that compares multiple segmentations of the same text and picks the one that is best suited. This is how the approach works:

1. A simplified unigram language model is defined with a probability distribution over all the strings in the language. The frequency of unigrams in the complete text acts as the vocabulary for this model. The probability of each candidate is defined using this model; here a candidate is a word that is to be segmented.

2. The next step is to enumerate the candidates for all the possible segmentations. An n-character string has 2^{n-1} different segmentations and n-1 positions between characters.


Figure 4.1: Enumerating possible segmentations

For example, for the word 'isit' there are 8 different segmentations and 3 positions between characters (Figure 4.1).

3. The most probable segmentation is chosen by applying the language model to each candidate to get its probability and choosing the one with the highest probability. The model uses Bayes' rule to calculate the probabilities and identify the best segmentation. For example, in Figure 4.1 two possible solutions could be correct, 'i sit' and 'is it'; the one with the higher probability is chosen.

For this method, the best segmentation of a string s of length n is described by the recursive relation

seg(s_{0:n}) = \arg\max_{x} \; quality(s_{0:x}, seg(s_{x:n}))    (4.1)

This definition states that the best segmentation of a string of length n will be the highest quality segmentation among n choices. For example, for the short English string "isit", the best segmentation will be the highest quality segmentation of the following:

'i' | seg('sit')
'is' | seg('it')
'isi' | seg('t')
'isit'

This definition relies on recursively calling the function on substrings of the original string, which is where dynamic programming with memoization is useful. The possible segmentations above give the result 'is it'. The reason is that even though 'i' and 'sit' are actual words, the segmentation 'is it' has a higher probability than 'i sit'.

Segmentation ambiguity is not a problem for this case study, as the vocabulary collected is technical and related to the testing data-set, i.e. the failure reports. But it can be a problem when general cases are considered, and it can then be solved by combining this approach with a context probability which gives a conditional distribution over segmentations based on the prior segmentations.

One of the advantages of this approach is that it works well with the dictionary used. It is faster than the other approaches due to memoization. Also, this approach can be used for any language; only the dictionary referred to needs to be changed to the required language. A major disadvantage of this approach is arithmetic underflow: if a calculated probability is smaller than the smallest positive floating-point number that can be represented, about 4.9 × 10^{-324}, it will cause underflow. To avoid this, the simplest solution is to add logarithms of numbers rather than multiplying the numbers themselves.
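A compact sketch in the spirit of this approach is shown below (it is not the thesis code); the unigram counts and the penalty used for unseen words are invented for illustration, and working in log space also sidesteps the underflow problem mentioned above:

```python
import math
from functools import lru_cache

# Invented unigram counts standing in for the custom vocabulary.
COUNTS = {"is": 500, "it": 450, "i": 400, "sit": 30}
TOTAL = sum(COUNTS.values())

def word_logprob(word):
    """Log-probability of a word; unseen words are heavily penalised."""
    return math.log(COUNTS.get(word, 0.1) / TOTAL)

@lru_cache(maxsize=None)  # memoization: each substring is segmented only once
def segment(text):
    """Return (log-probability, words) for the best segmentation of text."""
    if not text:
        return 0.0, ()
    candidates = []
    for i in range(1, len(text) + 1):
        first, rest = text[:i], text[i:]
        rest_lp, rest_words = segment(rest)
        candidates.append((word_logprob(first) + rest_lp, (first,) + rest_words))
    return max(candidates)

print(segment("isit")[1])  # ('is', 'it') rather than ('i', 'sit') under these counts
```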

4.2.2 Approach 2

This approach is also a dictionary-based segmentation which uses DP. The difference between the first approach and this one is that this one does not use memoization. The approach aims to extract hidden words from continuous text by performing word segmentation with a probabilistic unigram model that uses the Viterbi algorithm. A frequency dictionary is built by counting the word frequencies in the custom vocabulary. Zipf's law ensures that the word frequencies in this vocabulary behave similarly to those of the language in general, minimizing the chance that the calculated probabilities are wrong. [36]

Segmentation in this approach follows the main idea of the maximum matching (max-match) algorithm in combination with the Viterbi algorithm. The approach works as follows. First, the word boundaries, i.e. spaces, are removed from the entire text, and the approach reads a continuous string of non-segmented text. The pointer position is at the end of the string. The Viterbi algorithm is applied to each line. It looks up the frequency dictionary to check if the current string is a word in the dictionary. If a match is found, a space is inserted and the process is repeated. If no match is found, the pointer is moved to the left by one position and the process is repeated. If no match is found in the dictionary at all, a single-character token is created. This approach finds the longest word that can be made before a space is inserted. The result is then reversed to recover the segmented sentence. So, for each sentence the Viterbi algorithm is used to find the most likely sequence of words in the list of characters. [37]

The algorithm tries to detect word boundaries while iterating through each character in the input string. These are steps followed by the entire approach:

1. Two lists probScores and lastWordStartIndex are created.

2. The score for a boundary combination is calculated by multiplying single-word probabilities.

(35)

Methodology 24

3. Each possible word boundary combination from the preceding characters to the current index (in the input string) is evaluated to a score. Here, a preceding character can be at most the longest word length away from the current index.

4. The maximum score for a particular index is stored in the probScores list, whereas the starting index of the last word in this combination is stored in the lastWordStartIndex list.

5. This continues till the entire input string is checked.

6. The next step is to extract the words from the input string following these word boundaries.

7. An empty list of words is created to store the extracted words, and the words are extracted starting from the last word.

8. The starting index of the last word in the string is checked in the lastWordStartIndex list to see whether the entire text is covered.

9. This process is repeated for all the continuous text until every word is extracted from the input string.

10. Once all the words are extracted, they are reversed to recover the sentence with segmented words.

Segmentation is performed here based on the maximum total score, which is calculated as the product of single-word probabilities:

P(a, b, c) = P(a) * P(b) * P(c)    (4.2)

For example, consider the continuous text 'thetabledownthere'. Based on the above steps, the Viterbi algorithm tries to identify the hidden words in the continuous text. For this example, words are identified starting from the end of the string and stored in a list, i.e. (there, down, table, the), and then the list is reversed to form the segmented sentence. For the above example, segmentation can be done in two ways:

the table down there
theta bled own there

The first segmentation is chosen because its probability is higher than that of the second one. Zipf's frequency is used to calculate this probability.
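The sketch below mirrors the probScores / lastWordStartIndex bookkeeping described in the steps above; the relative frequencies are invented and the handling of characters not covered by any dictionary word is simplified:

```python
import math

# Invented relative frequencies standing in for the Zipf-based dictionary.
FREQ = {"the": 0.05, "table": 0.001, "down": 0.002, "there": 0.003,
        "theta": 0.0001, "bled": 0.00005, "own": 0.0008}
MAX_WORD_LEN = max(len(w) for w in FREQ)

def viterbi_segment(text):
    """prob_scores[i] holds the best log-score of text[:i];
    last_word_start[i] records where the last word of that segmentation starts."""
    n = len(text)
    prob_scores = [0.0] + [-math.inf] * n
    last_word_start = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(max(0, i - MAX_WORD_LEN), i):
            word = text[j:i]
            if word in FREQ and prob_scores[j] + math.log(FREQ[word]) > prob_scores[i]:
                prob_scores[i] = prob_scores[j] + math.log(FREQ[word])
                last_word_start[i] = j
    # Extract the words starting from the last one, then reverse the list.
    words, i = [], n
    while i > 0:
        j = last_word_start[i]
        words.append(text[j:i])
        i = j
    return list(reversed(words))

print(viterbi_segment("thetabledownthere"))  # ['the', 'table', 'down', 'there']
```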

The advantages of this approach are that it is language independent and, with a proper language dictionary, it can work on any language. It is less storage- and computation-intensive, as it computes in O(n) asymptotic complexity and does not backtrack.

(36)

Methodology 25

The disadvantages of this approach are that it is more time-intensive than the first approach, and it also has the underflow problem. The probability of occurrence is extremely small for most of the words in the dictionary, so the combined probability can be even smaller, and if the sentence is long the combined probability can become zero, causing underflow. The approach also depends on word frequency: as in the above example, if the second result had been the optimum segmentation, the segmentation would not have been correct. A solution for this can be to include more words in the dictionary, making it better; that solution is specific to the case study and will fail for generic cases. A more generic solution could be to use contextual information and a chain rule when calculating the probability, instead of independent probabilities. [36], [37]

4.2.3 Approach 3

This model is an unsupervised learning approach. It builds on Bernoulli trials, Gibbs sampling and simulated annealing. The algorithm focuses on grouping sets of characters into a word and determining whether that word's occurrence is significant. All three techniques mentioned above are used to calculate the probability of assigning a boundary at any position.

Hyperparameters are pre-set before they can be used for training. The beta parameter p, i.e. the number of heads and tails, is defined along with alpha, a DP (Dirichlet process) hyperparameter. A Dirichlet process is a probability distribution whose range is itself a set of probability distributions. It is often used in Bayesian inference to describe prior knowledge about the distribution of random variables, i.e. how the random variables are distributed according to one or another particular distribution. The Dirichlet process is specified by a base distribution H and a positive real number α called the concentration parameter (also known as the scaling parameter). The base distribution is the expected value of the process, i.e. the Dirichlet process draws distributions "around" the base distribution the way a normal distribution draws real numbers around its mean. However, even if the base distribution is continuous, the distributions drawn from the Dirichlet process are almost surely discrete. The scaling parameter specifies how strong this discretization is: in the limit α→0 the realizations are all concentrated at a single value, while in the limit α→∞ the realizations become continuous.

In this approach, alpha is proportional to the probability of a visitor sitting at an unoccupied table, which comes from the Chinese restaurant process. In a Chinese restaurant, when customers enter, a new customer sits down at a table with a probability proportional to the number of customers already sitting there. Additionally, a customer opens a new table with a probability proportional to the scaling parameter α. After infinitely many customers have entered, one obtains a probability distribution over infinitely many tables. This probability distribution over the tables is a random sample of the probabilities of observations drawn from a Dirichlet process with scaling parameter α. [38]
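A minimal simulation of the Chinese restaurant process described above is sketched below; the number of customers and the value of alpha are arbitrary demo choices, not values used in the thesis.

```python
import random

def chinese_restaurant_process(n_customers, alpha, rng=random.Random(0)):
    tables = []                      # tables[k] = customers seated at table k
    for n in range(n_customers):
        # An existing table k is chosen with weight tables[k]; a new table
        # is opened with weight alpha (weights normalise to n + alpha).
        weights = tables + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(tables):
            tables.append(1)         # open a new table
        else:
            tables[k] += 1
    return tables

print(chinese_restaurant_process(100, alpha=1.0))
```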

Initially, the following operations are performed in the given order:

1. Hyperparameters are set.

2. A random boundary probability constant is assigned that controls the probability of initially putting boundaries in the text.

3. Word frequencies are calculated and assigned to a vocabulary after the random boundaries are assigned.

The following steps are then repeated for each iteration (a minimal sketch of the boundary update follows this list):

1. A temperature is assigned for the iteration.

2. For every sentence,

a) a probability is calculated for every position in the word to check whether a boundary can be inserted or removed at that position (the probability is based on occurrence counts, the temperature and a Bernoulli trial),

b) the frequencies of the existing words in the vocabulary are removed,

c) a boundary is then inserted or removed based on a Bernoulli trial of the calculated probability,

d) the vocabulary is updated with the new word frequencies of the updated sentence.

3. As the temperature decreases slowly, more of the solution space is explored.
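The sketch below shows one annealed Gibbs-sampling sweep over boundary positions, assuming a simplified smoothed unigram word model. It illustrates the update loop only; the full approach additionally uses the Dirichlet-process prior and the hyperparameters described above, so this is not the exact sampler.

```python
import math
import random
from collections import Counter

def gibbs_sweep(chars, boundaries, vocab, alpha=0.5, temperature=1.0,
                rng=random.Random(0)):
    n = len(chars)
    for pos in range(1, n):
        # Locate the word span that currently surrounds position pos.
        lo = max((b for b in range(1, pos) if boundaries[b]), default=0)
        hi = min((b for b in range(pos + 1, n) if boundaries[b]), default=n)
        left, right = chars[lo:pos], chars[pos:hi]
        joined = chars[lo:hi]
        # Remove the current analysis of this span from the vocabulary.
        current = [left, right] if boundaries[pos] else [joined]
        for w in current:
            vocab[w] -= 1
        total = sum(vocab.values())

        def prob(words):
            # Smoothed unigram probability of a hypothesis (split or joined).
            return math.prod((vocab[w] + alpha) / (total + alpha * 2)
                             for w in words)

        # Annealing: raise probabilities to 1/temperature before the Bernoulli trial.
        p_split = prob([left, right]) ** (1.0 / temperature)
        p_join = prob([joined]) ** (1.0 / temperature)
        boundaries[pos] = rng.random() < p_split / (p_split + p_join)
        # Put the (possibly new) analysis back into the vocabulary.
        for w in ([left, right] if boundaries[pos] else [joined]):
            vocab[w] += 1

text = "thetable"
bounds = [False] * len(text)       # bounds[i]: True if a boundary precedes chars[i]
vocab = Counter({text: 1})         # vocabulary consistent with the unsegmented start
for it in range(200):
    temp = max(0.1, 1.0 - it / 200)          # simple cooling schedule
    gibbs_sweep(text, bounds, vocab, temperature=temp)
print("".join((" " if b else "") + c for c, b in zip(text, bounds)))
```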

Issues with this approach are illustrated by the following example segmentation:

Hello Customerinfo rmu sthat ACw asnotc ool mec hanic check ing found ACcompressorc lutchwaswo rnoutw hich thegapisabout 12mm So the ychang enewACcompressorc lutc hTh enp roblem. gon e

The above segmentation is the result after the 1000th iteration, where spaces indicate the sampled boundaries. This model did not work well for the following reasons.

1. When a unique word (noun) is involved, there is a high chance of that word being segmented, which should not happen.

2. Another reason is that a suitable limit on the number of iterations cannot be defined for this kind of probabilistic segmentation, i.e. the end result may not be completely segmented, as in the example above.

One advantage of this approach is that it is self-correcting based on the temperature, and no external dictionary is required since it is unsupervised.


5 Results

This chapter describes the observations on the data-set and the evaluation techniques used for the approaches, along with the results obtained.

5.1 Observations on Dataset

Failure reports from the years 2013-2018 were considered. The total number of reports used for word segmentation was around 111 239, of which 85 101 were unique and the remaining 26 138 were duplicates.

The data contained both alphabetic and numeric tokens. The total word count was around 2 972 037 words, of which 54 678 were unique, and the numeric count was around 230 104. Finally, the overall error count was 259 759, of which 41 383 were unique; here the error count includes both word segmentation errors and spelling errors.
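As a hedged illustration of how per-report word and numeric counts of this kind can be derived, a small regex-based sketch is shown below. The exact tokenization and the dictionary-based detection of segmentation and spelling errors are not reproduced here, and the example description is invented.

```python
import re

def report_counts(description):
    # Split the free text into alphabetic and numeric tokens.
    tokens = re.findall(r"[A-Za-z]+|\d+", description)
    words = [t for t in tokens if t.isalpha()]
    numbers = [t for t in tokens if t.isdigit()]
    return len(words), len(numbers)

print(report_counts("AC compressor clutch worn out, gap is about 12 mm"))
# -> (9, 1)
```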

Figure 5.1: Word count representation

Fig. 5.1 shows the distribution of words per report, where the x axis is the number of words and the y axis is the count of failure reports. More than 35 000 reports contain 10-19 words per report description.


Table 5.1: Word count representation

Bins      Count
0-25      67 960
25-50     33 124
50-75     7 180
75-100    1 730
100-200   1 137
>200      108

Table 5.1 shows how words are distributed over the reports, where the bins are word counts and the count is the number of reports per bin. For example, 67 960 reports contain up to 25 words.

Figure 5.2: Distribution of Numeric count representation

Fig. 5.2 shows the distribution of numeric tokens per report, where the x axis is the number of numeric tokens and the y axis is the count of failure reports. More than 65 000 reports contain zero or one number per report description.


Table 5.2: Number count representation

Bins      Count
0-25      68 935
25-50     316
50-75     50
75-100    7
100-200   12
>200      0

Table 5.2 shows how numeric tokens are distributed over the reports, where the bins are number counts and the count is the number of reports per bin. For example, 68 935 reports contain 0-25 numbers.

Figure 5.3: Error count representation

Fig. 5.3 shows the distribution of errors per report, where the x axis is the number of errors and the y axis is the count of failure reports. For more than 80 000 reports, there are no errors in the report description.


Table 5.3: Error count representation

Bins    Count
0       62 220
1       25 198
1-2     11 776
2-3     5 623
3-5     4 214
5-7     1 299
7-10    560
>10     322

Table 5.3 shows how errors are distributed over the reports, where the bins are error counts and the count is the number of reports per bin. 62 220 reports have zero errors, i.e. around 55.93 percent of the reports do not contain any errors.

Figure 5.4: Word, Number and Error count Comparison

Figure 5.4 compares the word, number and error counts per report description, where the x axis represents the failure report count and the y axis represents the word, number and error counts. Blue indicates words, orange indicates numbers and green indicates errors, i.e. segmentation and spelling errors.


5.2 Evaluation based on Word Boundary

Word boundary evaluation judges segmented words based on correct word boundaries [4]. A confusion matrix is used to evaluate the results obtained from the different approaches. A confusion matrix is an NxN matrix that helps in visualizing the performance of an algorithm, where the rows represent the predicted class, the columns represent the actual class and N is the number of classes. [39] One example of its use is in the context of clustering.

Figure 5.5: Confusion matrix

In Figure 5.5, TP indicates that the predicted word is segmented and the actual word is also segmented, TN indicates that the predicted word is not segmented and the actual word is also not segmented, FP indicates that the predicted word is segmented but the actual word is not, and FN indicates that the predicted word is not segmented but the actual word is. Performance measures such as accuracy, precision, recall and F1 score were used to evaluate each approach and were compared against the existing Python segmentation package wordsegment.

Accuracy is one of the most important performance measures and is calculated by dividing the sum of the diagonal values of the confusion matrix by the total sum.

Accuracy = (TP + TN) / (TP + TN + FP + FN) (5.1)

Precision is the fraction of instances marked positive that are really positive. [40]

Precision = TP / (TP + FP) (5.2)

Recall is the percentage of positive instances that are correctly identified. [40]

Recall = TP / (TP + FN) (5.3)

The F1 score is the harmonic mean of precision and recall.

F1 Score = 2 · (Precision · Recall) / (Precision + Recall) (5.4)

A manual data-set was created by selecting 1000 unique non-segmented words, which were fed to the approaches, and the results were compared to the manually corrected words to check whether the word boundaries were placed correctly. From this, a confusion matrix was created and the above metrics were calculated. To compare the results, the Wordsegment package was used, which is an Apache2-licensed module for English word segmentation, written in pure Python and based on a trillion-word corpus.
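A minimal sketch of this word-boundary evaluation is shown below: each manually corrected (gold) segmentation is compared with a predicted one position by position, and the resulting confusion-matrix counts feed the metrics in Equations 5.1-5.4. The two example pairs are hypothetical stand-ins for the 1000-word manual data-set.

```python
def boundary_set(segmented):
    """Indices (in the unsegmented string) where a boundary is placed."""
    positions, offset = set(), 0
    for word in segmented.split()[:-1]:
        offset += len(word)
        positions.add(offset)
    return positions

def evaluate(pairs):
    tp = tn = fp = fn = 0
    for gold, predicted in pairs:
        gold_b = boundary_set(gold)
        pred_b = boundary_set(predicted)
        n = len(gold.replace(" ", ""))
        for pos in range(1, n):                  # every candidate boundary position
            g, p = pos in gold_b, pos in pred_b
            tp += g and p
            tn += (not g) and (not p)
            fp += (not g) and p
            fn += g and (not p)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical (gold, predicted) pairs for illustration only.
pairs = [("root cause", "root cause"), ("gear shifting", "gears hifting")]
print(evaluate(pairs))
```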

Table 5.4: Example of manual data-set

Non-segmented word   Segmented word   Result based on Approach 1   Result based on Approach 2
rootcause            root cause       root cause                   root cause
gearshifting         gear shifting    gear shifting                gear shifting

Table 5.5 presents the results on the created data-set for each model.

Table 5.5: Evaluated results

Metrics/Approaches   Approach 1   Approach 2   Wordsegment
Accuracy             0.87         0.58         0.66
Precision            0.86         0.47         0.62
Recall               0.79         0.92         0.22
F1 score             0.82         0.62         0.33

From Table 5.5 it can be inferred that Approach 1 has better accuracy, precision and F1 score, while Approach 2 scores a better recall, which means that Approach 2 was able to return most of the relevant results. To conclude that Approach 1 is more efficient than Approach 2, the technique used for calculating the word frequency probabilities is taken into account. Approach 3 could not be considered for any evaluation because of the improper word formations.

5.3 Classification Accuracy

Scania has an existing team that works on auto-sorting the failure reports through keyword extraction. The main focus of this evaluation was to combine the word segmentation approaches with the auto-sorting pipeline and then calculate the classification accuracy. Table 5.6 shows the accuracy obtained with and without segmentation.


Table 5.6: Classification Accuracy

Approaches             Accuracy
Without Segmentation   0.674
Approach 1             0.675
Approach 2             0.686
Wordsegment            0.675

From the above table, it can be inferred that Approach 2 scores a better accuracy than all the other approaches, but the differences in accuracy are small. Currently, a limited number of features (country code, country name, language filter, removal of special characters, component number) are used for classification of the failure reports, which has a large impact on the accuracy. If the set of features were extended with, for example, engine type and subgroup, there would be a definite increase in classification accuracy. The obtained values were checked for statistical significance and it was found that the result for Approach 2 is more significant.1
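The thesis does not state which significance test was used; as an illustration only, the sketch below runs a one-sided paired bootstrap test of an accuracy difference on hypothetical per-report correctness arrays (1 = classified correctly, 0 = not).

```python
import random

def paired_bootstrap_p(correct_a, correct_b, n_resamples=10_000,
                       rng=random.Random(0)):
    n = len(correct_a)
    contradicting = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diff = sum(correct_b[i] - correct_a[i] for i in idx)
        if diff <= 0:                 # resample shows no gain for pipeline b
            contradicting += 1
    return contradicting / n_resamples

# Hypothetical correctness arrays for the baseline and for Approach 2.
correct_without = [1, 0, 1, 1, 0, 1, 0, 1] * 50
correct_approach2 = [1, 1, 1, 1, 0, 1, 0, 1] * 50
print(paired_bootstrap_p(correct_without, correct_approach2))
```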

5.4 Error Count Comparison

Data for the error count comparison was obtained by sending the data-set through Scania's auto-sorting algorithm. Table 5.7 shows the error distribution before and after applying the word segmentation approaches in the auto-sorting pipeline.

Table 5.7: Error count comparison

Bins   Data-set   Approach 1   Approach 2   Wordsegment
0      62 220     68 887       69 013       68 901
1-2    36 974     24 754       15 487       27 510
3-6    11 163     7 650        5 092        8 113
7-10   560        210          86           378
>10    322        138          14           201

From the above table, it can be inferred that Approach 2 improves the zero-error count the most, and the error counts in all the other ranges are also reduced.

1 Statistical significance is a concept used in research to test whether a given data set is reliable and to decide whether it can support further decision making or a relevant conclusion. The concept is based on a comparative error figure that uses the sample size and the difference between the percentages of response in the data set in question.

References
