
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2017

Relation Classification Between the Extracted Entities of Swedish Verdicts

NILS DAHLBOM NORGREN

KTH ROYAL INSTITUTE OF TECHNOLOGY


Relation Classification Between the Extracted Entities of Swedish Verdicts

Relationsklassificering mellan extraherade entiteter ur svenska domar

Thesis Report May 8, 2017

NILS DAHLBOM NORGREN nilsdn@kth.se

Master’s Thesis in Computer Science (30 credits)
Supervisor: Alexander Kozlov (akozlov@kth.se)
Examiner: Johan Håstad (johanh@kth.se)


Abstract

This master's thesis investigated how well a multiclass support vector machine can classify a fixed number of interpersonal relations between extracted people entities from Swedish verdicts. With the help of manually tagged pairs of extracted people entities, called relations, a multiclass support vector machine was trained and its classification performance was tested. Different features and parameters were tested to optimize the method, and in the final experiment a micro precision and recall of 91.75% were obtained. The macro precision and recall were 73.29% and 69.29%, respectively. This resulted in a macro F-score of 71.23% and a micro F-score of 91.75%. The results showed that the method worked for a few of the relation classes, but more balanced data would have been needed to answer the research question fully.


Sammanfattning

Detta examensarbete utforskade hur bra en multiklass stödvektormaskin är på att klassificera sociala relationer mellan extraherade personentiteter ur svenska domar. Med hjälp av manuellt taggade par av personentiteter, kallade relationer, har en multiklass stödvektormaskin tränats och testats på att klassificera dessa relationer. Olika attribut och parametrar har testats för att optimera metoden, och för det slutgiltiga experimentet har ett resultat på 91.75% för både mikro precision och återkallning beräknats. För makro precision och återkallning har ett resultat på 73.29% respektive 69.29% beräknats. Detta resulterade i ett makro F-värde på 71.23% och ett mikro F-värde på 91.75%. Resultaten visade att metoden fungerade för några av relationsklasserna, men mer balanserat data skulle ha behövts för att forskningsfrågan skulle kunna besvaras helt.


Contents

1 Introduction

2 Problem statement
  2.1 Question
  2.2 Objective
  2.3 Delimitations

3 Background
  3.1 Relevant theory
    3.1.1 Information extraction
    3.1.2 Natural language processing
      3.1.2.1 Tokenization
      3.1.2.2 Named entity recognition
      3.1.2.3 Bag of words model
    3.1.3 Machine learning
      3.1.3.1 Support vector machines
      3.1.3.2 The kernel trick
      3.1.3.3 Soft margin
      3.1.3.4 Combining soft margin with the kernel trick
      3.1.3.5 Multiclass support vector machines
    3.1.4 Cross validation
    3.1.5 Evaluating classification algorithms
      3.1.5.1 Confusion matrix
      3.1.5.2 Multiclass precision
      3.1.5.3 Multiclass recall
    3.1.6 Multiclass F-score
  3.2 Related work
    3.2.1 Named entity recognition
      3.2.1.1 Named entity recognition for the Swedish language
    3.2.2 Relation classification and extraction
      3.2.2.1 Pattern matching
      3.2.2.2 Machine learning

4 Methodology
  4.1 Data
  4.2 Implementation
    4.2.1 Preprocessing of data
    4.2.2 Named entity recognition
    4.2.3 Relations: Entity pairing and feature extraction
    4.2.4 Manual classification of relations
    4.2.5 Splitting the data
    4.2.6 Support vector machine
  4.3 Baseline methods for relation classification
    4.3.1 Most probable baseline
    4.3.2 Random guessing with weight baseline

5 Experiments and results
  5.1 Computer requirements
  5.2 Information about the dataset used in experiments
  5.3 Parameters used in the SVM
  5.4 Experiment delimitations
  5.5 Experiment 1
    5.5.1 Optimizing C and γ
    5.5.2 Grid search
    5.5.3 Grid search results
    5.5.4 Random search
    5.5.5 Random search results
    5.5.6 Overall results
  5.6 Experiment 2
    5.6.1 Number of surrounding words feature
    5.6.2 Results
  5.7 Experiment 3
    5.7.1 Combinations of used features
    5.7.2 Results
  5.8 Experiment 4
    5.8.1 Best parameters and features with the validation set
    5.8.2 Results
  5.9 Baseline results on the validation set
    5.9.1 Most probable guessing results
    5.9.2 Random guessing with weights results

6 Discussion
  6.1 Weaknesses
  6.2 Advantages
  6.3 Ethics surrounding this thesis
  6.4 Future work

7 Conclusion
  7.1 Experiments conclusion
    7.1.1 C and γ
    7.1.2 Number of surrounding words feature
    7.1.3 Combination of used features
    7.1.4 Final experiment

Bibliography

A Appendices
  A.1 List of stopwords


Chapter 1

Introduction

Extracting information from text is a task that is more important than ever [11]. Due to the rapid growth of digital text documents available to us, more information emerges, and new methods for finding and extracting valuable information need to be researched and developed.

Extracting information from documents can vary widely in complexity, mainly depending on the structure of the text in the documents. This work focuses on extracting and structuring information from unstructured or semi-structured text, a task also called information extraction (IE). What is extracted and how it is structured depends on the application it is used for. For example, a very common and large area within IE is transforming document texts into a structured knowledge database, where the title, authors and other information are separated from each other.

This work focuses on the task of relation extraction, a small part of the IE area concerned with extracting relations between named entities in text documents. For example, a relation can consist of who lives where or who knows whom. These relations are of great importance for many different applications today. Say, for example, that all relations in a set of documents were extracted and classified; the resulting dataset could then be used in a search application that supports more complex queries than usual. This is possible because of the extracted and structured data.

From a scientific point of view, the interesting part of relation extraction is how many small information extraction tasks are combined to create something more powerful. For relation extraction, a few modules that can be used are part-of-speech tagging, named entity recognition, tokenization, pattern matching and machine learning. In this work, the interesting part is to see if these modules can be used to solve a new problem: relation extraction with Swedish verdicts as the dataset.

To be able to answer that question, a machine learning approach is used to extract and classify relations in the dataset. Different approaches can be used to solve similar problems. Some research suggests pattern matching as an approach for solving the problem [3, 1], but due to the large amount of data that needs to be processed, machine learning approaches have shown to be more efficient.

One of the more interesting parts of this work is to see whether a dataset with a narrow purpose contains information that could be useful in other applications. A concrete example could be an application where one wants to find how people are related to each other: if a database with relations existed, this could be retrieved by a series of queries and presented in many different ways. This is interesting for the reasons stated above, namely that finding structured information in a large set of unstructured data helps the user retrieve and understand the data.

Chapter 2

Problem statement

Extracting and classifying relations between named entities in data is a task that has been done before. Different approaches have been tried, with varying results [17, 3, 1, 15, 30, 7]. What they have in common is that a relation is assumed to lie within one sentence or under some other special constraint. This project instead focuses on finding and classifying relations between entities in documents without limiting them to the same sentence. The problem is interesting for several reasons, mainly because of the applications the method can be used in: with structured information, humans can understand and draw conclusions from data, which is harder when dealing with unstructured data. Another interesting aspect is that this dataset and method have not been tried together before, so this work can yield new insights that can hopefully be used in other applications. The dataset consists of the public Swedish verdicts under the category person crimes up until 02/28/2016 and contains many named entities such as people, companies and locations. This should make the dataset very useful for relation extraction.

2.1 Question

If you extract named entities of people from a set of verdicts, is it possible to accurately classify interpersonal relations between the people in the set? The relations consist of who knows each other and who works together.

2.2 Objective

The objective of this project is to see how well a machine learning approach works for finding and classifying relations between named entities of people in a set of Swedish verdicts. The goal is that, with the help of machine learning, an accurate classification of the relations in the Swedish verdicts can be obtained. The classification could then be useful in other information extraction applications.

The main outcome is for the reader to get a deeper understanding of the topic that this work covers. It is also important that a functional prototype is created and thoroughly evaluated.

2.3 Delimitations

There are a few delimitations in this project. The first and most important one is that this work only targets documents written in Swedish; it is therefore not certain that the approach works for other languages. The work focuses mainly on the classification of named entity relations, not on finding named entities in documents. Named entity recognition is an important task to improve, but it is not in the scope of this work.

Another important delimitation is that the method to be tested is a novel idea, so there are no existing results that this work can be compared with. It is therefore necessary to decompose and test the method and dataset thoroughly to be able to draw conclusions from the results.

The work is also delimited to relations between named entities of people. This is a necessary delimitation for controlling the number of relations extracted, and it is also chosen because the Swedish verdicts contain mostly interpersonal relations.

Lastly, one important delimitation is that many of the documents in the dataset contain a lot of noise due to being scanned from paper; these documents are dismissed. Documents that are too small are also dismissed, since they contain no information.

Chapter 3

Background

In this chapter, the background to the problem is explained. The background consists of a relevant theory section and a related work section.

3.1 Relevant theory

This section describes the relevant theory regarding the techniques that were used within this project. Firstly, the idea behind information extraction is described in Section 3.1.1. In Section 3.1.2, the theory behind a few components of natural language processing are described. Afterwards, in Section 3.1.3, the basic concepts of machine learning and mainly support vector machines and multiclass support vector machines are described. In Section 3.1.4, the idea behind cross validation is explained and lastly, in Section 3.1.5, a mathematical explanation on how to evaluate support vector machines is given.

3.1.1 Information extraction

Information extraction (IE) is a subtask of natural language processing (NLP) that focuses on the process of extracting structured information from semi-structured and/or unstructured text [6]. The structured information is used for various applications; two common examples are databases and search engines.

To extract information, many different approaches are used. A few examples of methods are sequence models, classifiers, and rules, patterns and expression matching. Sequence modeling is a probabilistic approach and can use, for instance, hidden Markov models or conditional random fields. Support vector machines are a good example of classifiers. The last one, rules, patterns and expression matching, is an approach that matches the exact information that should be extracted [16].

3.1.2 Natural language processing

Natural language processing is a large field within computer science that concerns the interaction between computers and human language. This section describes a few theories used within this field that are relevant for this work.

3.1.2.1 Tokenization

Tokenization is the task of breaking up a text into smaller elements, tokens, such as words or phrases [20]. In many languages, tokens are separated by whitespace or punctuation marks, which makes tokenization a fairly easy task. There are different ways to implement a tokenizer, but regardless of the implementation, the reason for doing tokenization is that it is useful in later processing steps: tokenization gives a standard form of the text that is easy to use in other modules.

3.1.2.2 Named entity recognition

Named entity recognition (NER) is the task of locating and classifying named entities in text. Named entities are specific types of words that refer to the names of, for example, locations, organizations and people [29]. An example of the output from an entity recognizer can be seen in Figure 3.1. There are several different methods for solving this task; a few common examples are rules, pattern recognition and machine learning, supervised or semi-supervised.

Figure 3.1: Example of named entity recognition. In the sentence "George works in Washington at Cleanup Inc.", George is tagged as a person, Washington as a location and Cleanup Inc. as an organization.

3.1.2.3 Bag of words model

Bag of words is a model for representing text in a simple mathematical structure. It uses a multiset of the words in a text to represent the text as an array; it keeps track of the frequency, but not the order, of the words in the text. For example, the two small documents

Document 1. "Eric likes to watch movies at night"

Document 2. "Eric also likes to watch football. Eric loves football."

can be converted to the array

A = [Also, At, Eric, Football, Likes, Loves, Movies, Night, To, Watch]

where only unique words from the two documents appear. After this, the two documents can be described by these two vectors:

Document vector 1. [0,1,1,0,1,0,1,1,1,1]

Document vector 2. [1,0,2,2,1,1,0,0,1,1]

These two vectors map the frequency of the terms in the original documents using the array A. Take, for example, vector 2: it says that the first word of array A appears once in the original document. This is correct, because the word "Also" appears one time in the document. The same check can be done for all values in the vectors.

Bag of words is often called term frequency and is a common method for representing text in machine learning algorithms. One common variant is a binary bag of words, where the vectors only record whether or not a word is present in the original document. Because a support vector machine only takes numerical values as features, bag of words is a good way to represent text for it.

3.1.3 Machine learning

Machine learning is a field within computer science that focuses on learning algorithms that make predictions from data. There are several different tasks within the machine learning field; the two most common are supervised and unsupervised learning. In supervised learning, the algorithm is given input and the desired output, and is then trained to map which input relates to which output. In unsupervised learning, the algorithm is given input but no output, and the goal is to find patterns or structure in the input. In this work, only supervised learning is considered, because only specific relations are used. If a general algorithm for finding relations were to be created, both supervised and unsupervised methods could be of interest.

In this project, a supervised learning method called support vector machines (SVMs) is used. The reason for choosing this method over many other machine learning algorithms, such as neural networks or naive Bayes, can be described by a few general viewpoints. The first reason for choosing SVMs is that they have been shown to work fairly well for multiclass relation classification, as discussed in Section 3.2.2.2. Another very important viewpoint is that SVMs are widely used for classification tasks; with a limited amount of time to research and create this method, a well-known approach is preferable for achieving results that can be discussed. Also, when using SVMs, ideas and thoughts from other similar studies can be taken into consideration.

3.1.3.1 Support vector machines

Support vector machines were initially non-probabilistic supervised methods used for binary classification [5]. To achieve binary classification between two classes, the SVM first receives training data, known as observations. These consist of observed features and a corresponding classification. By mapping the observations into an n-dimensional space, the SVM tries to separate the observations into two classes, I and J. This is done by creating a hyperplane that divides the classes and maximizes the margin to the nearest observations of each class on each side. The hyperplane that is created can be described by the following equation:

$$w \cdot x + b = 0 \tag{3.1}$$

The observations on the margin are called the support vectors. The margin on which these lie can be described by the two hyperplanes shown in Equations 3.2 and 3.3 below. In these equations, x is a set of observations in space, w is the normal vector to the hyperplane described in Equation 3.1, and b is a constant.

$$w \cdot x + b = 1 \text{ for observations of class } I \tag{3.2}$$

$$w \cdot x + b = -1 \text{ for observations of class } J \tag{3.3}$$

If we add an extra constraint so that the observations can be on the margin, but also further away, we get:

$$w \cdot x_i + b \geq 1 \text{ for } x_i \text{ of class } I \quad \text{and} \quad w \cdot x_i + b \leq -1 \text{ for } x_i \text{ of class } J \tag{3.4}$$

If these equations are combined we get Equation 3.5, where $y_i$ is 1 or -1 depending on whether the observation belongs to class I or J.

$$y_i (w \cdot x_i + b) \geq 1, \quad \text{for all } 1 \leq i \leq \text{number of observations} \tag{3.5}$$

After the data is separated, new unclassified data can be mapped onto the same space and classified depending on which side of the hyperplane it falls. Mathematically, the idea behind an SVM is to maximize $2/\|w\|$, which is the distance between the support vectors. Instead of maximizing, the problem is transformed into a minimization problem, since minimizing $\|w\|$ maximizes the above expression [14]. This step is done for mathematical convenience, as the problem will later be simpler to solve in this form.

$$\max_{w,b} \frac{2}{\|w\|} \;\Rightarrow\; \min_{w,b} \frac{\|w\|^2}{2} \tag{3.6}$$

subject to $y_i (w \cdot x_i + b) \geq 1$ for every $i = 1, \dots,$ number of points, where $y_i$ is 1 for class I and -1 for class J.

The problem in this form is called the "primal problem", and it could be solved at this step by quadratic programming. Quadratic programming is a mathematical problem where the objective is to maximize or minimize a quadratic function of multiple variables subject to linear constraints on these variables, exactly like Equation 3.6. A smarter approach, however, is to transform the problem in Equation 3.6 into the "dual problem" [23]. This is done by using Lagrangian multipliers with the Karush-Kuhn-Tucker conditions, which allow inequality constraints [28]. The Lagrangian for Equation 3.6 can be seen in Equation 3.7, where $\alpha$ is the introduced Lagrangian multiplier.

The following formulas (Equations 3.7, 3.8 and 3.9) give the basic idea behind the transformation to the "dual problem". These formulas are from the paper Support vector machines by Andrew Ng [23], where the steps are explained more thoroughly.

$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{m} \alpha_i \left[ y_i (w^\top x_i + b) - 1 \right] \tag{3.7}$$

By using the Lagrangian multiplier, the optimization in Equation 3.6 can be expressed as Equation 3.8, where m equals the number of data points.

$$\max_{\alpha} W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y_i y_j \alpha_i \alpha_j\, x_i^\top x_j \tag{3.8}$$

$$\text{subject to } \alpha_i \geq 0, \quad \sum_{i=1}^{m} \alpha_i y_i = 0$$

Now the SVM optimization problem only depends on the newly introduced Lagrangian multiplier $\alpha$. The problem in this form is called the "dual problem". By solving this optimization problem with quadratic programming, a hyperplane with the maximum margin to the nearest data points on each side is created. To classify a new unseen data point $x'$, Equation 3.9 is used, where the value of the function indicates the class of the unseen data point $x'$ [23].

$$\operatorname{sign}\!\left( \sum_{i=1}^{m} \alpha_i y_i\, x_i^\top x' + b \right) \tag{3.9}$$

The interesting part about both the optimization problem (Equation 3.8) and the classification function (Equation 3.9) is that both only use the inner product of two feature vectors $x_i$ and $x_j$. The reason for transforming the problem into the "dual problem", depending only on $\alpha$ and the inner product between two feature vectors, is that the kernel trick can now be used to effectively find support vectors for non-linearly separable and/or high dimensional data.

3.1.3.2 The kernel trick

In the previous section, the "dual problem" and its corresponding classification function were described, both depending on $\alpha$ and the inner product of two feature vectors. Say that we want the SVM to learn a nonlinear classification. One way to solve this is to map the feature vector x into another, higher-dimensional space with the help of a function $\phi(x)$, in which the data is linearly separable. This can be done by replacing the inner product $\langle x, y \rangle$ with $\langle \phi(x), \phi(y) \rangle$ in Equation 3.8. This feature mapping is called the kernel, and can be described as in Equation 3.10.

$$K(x, y) = \phi(x)^\top \phi(y) \tag{3.10}$$

We can now replace all inner products in Equations 3.8 and 3.9 with K(x, y). It is possible to calculate the value of K by applying the feature mappings $\phi(x)$ and $\phi(y)$ and taking the inner product between the results. The interesting part is that even though the mapping $\phi$ is expensive to compute, the function K can be much cheaper, even though it depends on the two mappings. An example from Support vector machines by Andrew Ng [23] explains this very intuitively. If we have a kernel $K(x, y) = (x^\top y)^2$, it can be written as

$$K(x, y) = \left(\sum_{i=1}^{n} x_i y_i\right)\left(\sum_{j=1}^{n} x_j y_j\right) = \sum_{i=1}^{n}\sum_{j=1}^{n} x_i x_j y_i y_j = \sum_{i,j=1}^{n} (x_i x_j)(y_i y_j)$$

We can now see that $K(x, y) = \phi(x)^\top \phi(y)$, where $\phi$ is (for n = 3 above)

$$\phi(x) = \begin{bmatrix} x_1 x_1 \\ x_1 x_2 \\ x_1 x_3 \\ x_2 x_1 \\ x_2 x_2 \\ x_2 x_3 \\ x_3 x_1 \\ x_3 x_2 \\ x_3 x_3 \end{bmatrix}$$

Here we can see that finding $\phi(x)$ takes $O(n^2)$ time, while finding K takes $O(n)$ time, and we don't need to calculate anything in the transformed dimensional space. So to summarize, using the kernel trick, an SVM can effectively learn in a high dimensional feature space without mapping all features into a higher dimensional space.

In the example above, with $K(x, y) = (x^\top y)^2$, we had an example of a polynomial kernel. There are many other kernels that could be used, but a few common ones are:

1. Linear: $x^\top y$

2. Polynomial: $(x^\top y + c)^d$

3. Radial basis function (RBF): $\exp(-\gamma \|x - y\|^2)$

The linear kernel only computes the inner product between the two feature vectors. The polynomial kernel takes the dot product, adds a constant and raises the value to a degree d chosen by the user; this results in a mapping to a feature space of polynomials of the points in input space. The RBF kernel, which is a Gaussian function, uses the exponential of the squared Euclidean distance between the two feature vectors multiplied by a free parameter γ. Both the degree in the polynomial kernel and the γ in the RBF kernel heavily impact the performance of the SVM's classification, and the choice of kernel function can change the classification drastically.

For example, with an RBF kernel the impact of γ can be visualized as in Figure 3.2. Yellow circles are instances of one class, and purple squares another. From the figure we can see that the higher the value of γ, the more the SVM "surrounds" the data points. This is because RBF is a Gaussian function, and the parameter γ defines the width of the peak of the corresponding curve: a lower γ gives a much wider peak, while a large γ gives a pointy peak.

Figure 3.2: The impact of the γ parameter on classification with an RBF kernel, for γ = 1, 10, 100 and 1000, visualized with the LIBSVM SVM-Toy tool [4].

3.1.3.3 Soft margin

Another way to extend the SVM is to allow some data points to be on the wrong side of the margin; this is called a soft margin. Every data point on the wrong side of the margin receives an error e proportional to its distance from the margin. Equation 3.11 below is explained in A practical guide to support vector classification [19] by Hsu et al. The value C can be changed to scale the error for each data point on the wrong side of the margin, which influences the classification of the SVM: small values of C give a larger-margin separating hyperplane with more misclassified data points, while large values of C give a smaller margin but are stricter about misclassifying data points. By changing the value of C, the SVM can be optimized to better classify the data.

$$\min_{w,b,e} \frac{\|w\|^2}{2} + C \sum_{i=1}^{l} e_i \tag{3.11}$$

subject to $y_i (w \cdot x_i + b) \geq 1 - e_i$, $e_i \geq 0$, for every $i = 1, \dots,$ number of points.

To visualize this, Figure 3.3 shows an SVM with an RBF kernel and different C values. Yellow circles are one class, and purple squares another. Here we can see that the higher the value of C, the stricter the SVM is about misclassifying points; with a really low value, it classifies all data points as the same class.

Figure 3.3: The impact of the C parameter on classification with an RBF kernel, for C = 0.1, 100, 10000 and 100000, visualized with the LIBSVM SVM-Toy tool [4].

3.1.3.4 Combining soft margin with the kernel trick

It is of course possible to combine the idea behind soft margin classifiers with the kernel trick. Equation 3.12 shows how.

$$\min_{w,b,e} \frac{\|w\|^2}{2} + C \sum_{i=1}^{l} e_i \tag{3.12}$$

subject to $y_i (w \cdot \phi(x_i) + b) \geq 1 - e_i$, $e_i \geq 0$, for every $i = 1, \dots,$ number of points.

This can then be rewritten as the dual problem seen in Equation 3.13.

$$\max_{\alpha} W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y_i y_j \alpha_i \alpha_j K(x_i, x_j) \tag{3.13}$$

$$\text{subject to } 0 \leq \alpha_i \leq C, \quad \sum_{i=1}^{m} \alpha_i y_i = 0$$

Combining these two ideas yields a classifier that has both a soft margin and a kernel function. This gives the user good control over the classifier and enables it to be optimized to best suit the problem.
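To make the roles of C and γ concrete, the sketch below trains a soft-margin RBF SVM with scikit-learn (which wraps LIBSVM). This is only an illustration; the toy data and parameter values are invented and are not the thesis's setup.

```python
# Hedged sketch: soft-margin SVM with an RBF kernel, exposing C and gamma.
# Requires scikit-learn; the data below is a toy example, not the thesis data.
from sklearn.svm import SVC

X = [[0.0, 0.0], [0.2, 0.1], [0.9, 1.0], [1.0, 0.8]]  # feature vectors
y = [0, 0, 1, 1]                                      # class labels

# C controls the soft-margin penalty, gamma the width of the RBF kernel.
clf = SVC(kernel="rbf", C=100, gamma=10)
clf.fit(X, y)

print(clf.predict([[0.1, 0.0], [0.95, 0.9]]))  # expected: [0 1]
```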

3.1.3.5 Multiclass support vector machines

Multiclass SVM is an extension of the classic SVM to problems where the number of classes is greater than two. The core idea is the same as for a binary classifier, but the method used to solve the classification is far more complicated. The two most-used methods are One-versus-All and One-versus-One [18]. The One-versus-All approach involves training one classifier per class, with all observations from that class as positive and the rest as negative; each classifier then needs to output a real-valued confidence score. Making a decision with a One-versus-All classifier means running all classifiers on an unseen sample K and choosing the class from the classifier with the highest confidence score. A One-versus-One classifier instead trains one classifier for each pair of classes; the number of classifiers is N(N-1)/2, where N equals the number of classes. Making a decision with a One-versus-One classifier means running all classifiers on an unseen sample K, where every classifier votes for one of its two classes, and the sample K is then classified according to the maximum-voting strategy. In this work, the number of classes corresponds to the number of relations we want to be able to classify.
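The following sketch (assuming scikit-learn; not part of the thesis) shows the two decomposition strategies side by side. For N = 4 classes, as with the four relation classes used later, One-versus-One fits 4·3/2 = 6 binary SVMs while One-versus-Rest fits 4.

```python
# Hedged sketch: One-versus-Rest vs One-versus-One decomposition of a multiclass SVM.
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import SVC

X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 2], [2, 3], [3, 2], [3, 3]]
y = [0, 0, 1, 1, 2, 2, 3, 3]  # four classes, mirroring the four relation classes

ovr = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)  # N binary classifiers
ovo = OneVsOneClassifier(SVC(kernel="rbf")).fit(X, y)   # N(N-1)/2 binary classifiers

print(len(ovr.estimators_))  # 4
print(len(ovo.estimators_))  # 6
print(ovr.predict([[2.5, 2.5]]), ovo.predict([[2.5, 2.5]]))
```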

3.1.4 Cross validation

Cross validation is a method for optimizing the unknown parameters of a model to fit the data as well as possible. The idea is to generate independent sets of training and validation data from a given dataset. For example, in k-fold cross validation, the original dataset is split into k different groups, and from these k groups, k models are trained to fit the unknown parameters to the data. In each optimization, one of the k groups is used as the validation set and the rest as the training set. This is done to prevent overfitting. Overfitting is when a model is trained to perform excessively well on a training set, even fitting random error and noise; it therefore misses the real relationship in the data, which yields bad performance on a validation set. When all k optimizations of the cross validation are done, their results can be averaged to find the set of parameters that best fits the data.
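A minimal sketch of k-fold cross validation (assuming scikit-learn; the synthetic dataset, model and parameter values are illustrative only):

```python
# Hedged sketch: 5-fold cross validation of an RBF SVM on synthetic data.
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=10, n_classes=3,
                           n_informative=5, random_state=0)

scores = cross_val_score(SVC(kernel="rbf", C=1, gamma=0.1), X, y, cv=5)
print(scores)         # one accuracy score per fold
print(scores.mean())  # averaged over the k folds
```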

3.1.5 Evaluating classification algorithms

When using machine learning for multiclass classification, several measurements are needed to evaluate the quality of the classification algorithm. In this work, precision, recall and F-score are used as measurements. These measurements come in two different flavors, micro and macro: macro treats all classes equally, while micro favors bigger classes [27]. The following notation is used for explaining how to evaluate multiclass classification and appears in the equations in Sections 3.1.5.1, 3.1.5.2 and 3.1.5.3.

tp i = true positive, number of retrieved items correctly labeled as class i.

fp i = false positive, number of retrieved items incorrectly labeled as class i.

tn i = true negative, number of items not labeled as class i and not of class i.

fn i = false negative, number of items not labeled as class i but of class i.

N = the number of classes in the support vector machine


3.1.5.1 Confusion matrix

In machine learning, a confusion matrix is a common way to visualize the performance of the classification algorithm used. The rows show the actual classes, while the columns represent the predicted classes, so one cell in the matrix shows how many instances of actual class i were predicted as class j. An example of a confusion matrix (M) can be seen in Table 3.1.

Table 3.1: Example of a confusion matrix (M)

                      Predicted class
                 C1    C2    C3    C4
Actual   C1       3     2     0     1
class    C2       2     3     1     0
         C3       2     2     0     0
         C4       3     3     1     1

In Table 3.2, the notation explained in Section 3.1.5 is applied to the confusion matrix from Table 3.1, for class C1.

Table 3.2: Example of the notation in a confusion matrix, for class C1

                      Predicted class
                 C1    C2    C3    C4
Actual   C1      tp    fn    fn    fn
class    C2      fp    tn    tn    tn
         C3      fp    tn    tn    tn
         C4      fp    tn    tn    tn

3.1.5.2 Multiclass precision

Precision measures the fraction of correctly classified relations among the retrieved relations.

$$\text{Micro Precision} = \frac{\sum_{i=1}^{N} tp_i}{\sum_{i=1}^{N} (tp_i + fp_i)} \tag{3.14}$$

$$\text{Macro Precision} = \frac{\sum_{i=1}^{N} \frac{tp_i}{tp_i + fp_i}}{N} \tag{3.15}$$

If the precision is to be calculated for only one class i, the following formula, based on the confusion matrix (M), is used. The idea is to divide the true positives of class i by the total number of items predicted as class i.

$$\text{Precision}_i = \frac{M_{ii}}{\sum_{j} M_{ji}} \tag{3.16}$$

3.1.5.3 Multiclass recall

Recall measures the fraction of correctly classified relations among the total number of relations in the set, detected or not.

$$\text{Micro Recall} = \frac{\sum_{i=1}^{N} tp_i}{\sum_{i=1}^{N} (tp_i + fn_i)} \tag{3.17}$$

$$\text{Macro Recall} = \frac{\sum_{i=1}^{N} \frac{tp_i}{tp_i + fn_i}}{N} \tag{3.18}$$

If the recall is to be calculated for only one class, the following formula, based on the confusion matrix (M), is used. The idea is to divide the true positives of class i by the total number of items actually belonging to class i.

$$\text{Recall}_i = \frac{M_{ii}}{\sum_{j} M_{ij}} \tag{3.19}$$

3.1.6 Multiclass F-score

The F-score is a measurement that combines precision and recall and is defined as the harmonic mean between them. In the basic case, F1, recall and precision are evenly weighted.

$$\text{Micro } F_1 = \frac{2 \times \text{Micro Precision} \times \text{Micro Recall}}{\text{Micro Precision} + \text{Micro Recall}} \tag{3.20}$$

$$\text{Macro } F_1 = \frac{2 \times \text{Macro Precision} \times \text{Macro Recall}}{\text{Macro Precision} + \text{Macro Recall}} \tag{3.21}$$

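To make the micro/macro distinction concrete, here is a small Python sketch (not from the thesis) that computes the measures above directly from the confusion matrix of Table 3.1:

```python
# Hedged sketch: micro/macro precision, recall and F1 from a confusion matrix.
# Rows = actual classes, columns = predicted classes (as in Table 3.1).
M = [
    [3, 2, 0, 1],
    [2, 3, 1, 0],
    [2, 2, 0, 0],
    [3, 3, 1, 1],
]
N = len(M)

tp = [M[i][i] for i in range(N)]
fp = [sum(M[j][i] for j in range(N)) - M[i][i] for i in range(N)]  # column sum minus diagonal
fn = [sum(M[i]) - M[i][i] for i in range(N)]                       # row sum minus diagonal

micro_p = sum(tp) / sum(tp[i] + fp[i] for i in range(N))
micro_r = sum(tp) / sum(tp[i] + fn[i] for i in range(N))
macro_p = sum(tp[i] / (tp[i] + fp[i]) for i in range(N)) / N
macro_r = sum(tp[i] / (tp[i] + fn[i]) for i in range(N)) / N

f1 = lambda p, r: 2 * p * r / (p + r)
print(micro_p, micro_r, f1(micro_p, micro_r))
print(macro_p, macro_r, f1(macro_p, macro_r))
```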

3.2 Related work

In this section, previous work related to this thesis is described. The methods and ideas described here are partially used in the method.

3.2.1 Named entity recognition

The task of named entity recognition was first introduced at the Message Understanding Conference in 1996 [26, 12]. Since then, the area has been researched throughout the years, with new methods and improvements. In the beginning, the approach was to use a dictionary or patterns/rules to find the named entities. Lately, approaches such as decision trees, support vector machines and conditional random fields have been used. A brief survey of named entity recognition in English from 1991 to 2006 can be read in the paper A survey of named entity recognition and classification by David Nadeau and Satoshi Sekine [22].

3.2.1.1 Named entity recognition for the Swedish language

In 2001, Dalianis and Åström created SweNam, a named entity recognizer for four types of entities in the Swedish language [8]. In the article they describe the method as a mix of rules, lexicons and training strategies. It starts off with a small set of rules and lexicons that match entities, for example that all words ending with AB (English: Corp or Corporation) are organizations; the program can then learn new entities and expand the lexicon. The training data consists of 108 000 Swedish news articles from 2000-2001, and the evaluation data consists of 100 manually tagged Swedish text documents. The results of this approach can be seen in Table 3.3.

Table 3.3: Results from the SweNam named entity recognizer

                 Before training   After training
AVG precision         74%               72%
AVG recall            29%               36%
AVG F-score           40%               49%

In 2005, in the paper Named entity recognition for the mainland Scandinavian languages, Johannessen et al. proposed a new method for finding named entities in Swedish and other Scandinavian languages [21]. It was restricted to six entity categories and is a statistical method with gazetteers. On a test corpus of 1800 words, the Swedish NER with gazetteers achieved an average recall of 91% and precision of 93%. Without gazetteers, on a test corpus of 40000 words, the recall dropped from 91% to 53%.

In the paper Identification of Entities in Swedish [25] from 2012, Salomonsson, Marinov and Nugues proposed a machine learning approach for identification of entities in Swedish. A linear classifier, LIBLINEAR [10], was used to train and test the system on the Stockholm-Umeå Corpus (SUC) [9]. The method starts with tokenization of the text and sentence detection. By then tagging each token with its corresponding part-of-speech tag and running it through the NER, a precision of 75.77% and a recall of 72.35% were achieved.

3.2.2 Relation classification and extraction

3.2.2.1 Pattern matching

In the article Extracting patterns and relations from the world wide web [3] from 1999, Sergey Brin proposed a method called DIPRE (Dual Iterative Pattern Relation Expansion) for extracting relations from the world wide web. In this article, the extraction was limited to one relation: author to book.

The idea is to start with a small, manually constructed sample of authors and books. With this sample, an algorithm finds all occurrences of the sample and extracts the surrounding text. With the help of this extracted surrounding text and a pattern generator, patterns that match the samples and other relations can be created. By running the patterns against the database, new relations can be extracted, and new patterns can be created. When the number of patterns is large enough, the algorithm stops. The work lists no quantitative results, but the article states that it did indeed find authors and books that were labeled correctly.

In 2000, Agichtein and Gravano proposed a new, improved relation extractor [1] built upon the techniques described by Brin. The differences from Brin's work are these three main contributions:

1. A new strategy for generating patterns and extracting tuples. By adding named entity tags they are able to ignore unwanted entities and focus on the correct entities. They therefore achieve higher coverage, and the patterns become more selective.

2. Strategies for evaluating patterns. A new strategy for estimating the reliability of the newly extracted patterns and tuples is added.

3. Evaluation methodology and metrics. A new scalable evaluation methodology and its metrics are added.

According to Agichtein and Gravano, this method performs better than Brin's method, with higher precision and recall on all tests. In 2004, Hasegawa et al. proposed, in the article Discovering relations among named entities from large corpora, a method for discovering relations between entities in a large corpus [15]. The method is based on context-based clustering of entity pairs. The assumption is that entity pairs occurring in similar contexts can be clustered and are instances of the same relation. The method flow consists of five steps:

1. Tag named entities in the documents.

2. Retrieve co-occurring entity pairs and their context. A co-occurring entity pair is two named entities co-occurring in the same sentence.

3. Measure the context similarity between entity pairs. This is done with the cosine similarity, which is a measure of the cosine of the angle between two vectors in n-dimensional space.

4. Cluster entity-pairs.

5. Label each cluster of entity-pairs.

The relations that were evaluated are Person (PER) - Geo-Political Entity (GPE) and Company (COM) - Company (COM). With the context-similarity cosine threshold set just above zero, the results were the following:

Table 3.4: Results of the method by Hasegawa et al. [15]

            Precision   Recall   F-measure
PER-GPE        79%        83%       80%
COM-COM        76%        74%       75%

3.2.2.2 Machine learning

In the paper Relation extraction using support vector machine [17] from 2005, Hong describes a method for finding relations in sentences with the help of SVMs. The number of relations/classes in the SVM is limited to five; the classes and a few examples of their subtypes can be seen in Table 3.5 below.

Table 3.5: Hong's five relations/classes [17]

Relation type:   At           Near         Part          Role       Social
Examples:        based-in,    relative-    part-of,      client,    children,
                 location     location     subsidiary    founder    parent

By first using NER to extract all entities from a large set of documents, he finds entity pairs in the same sentence and calculates features between them. The features used are:

• Words

• Part of speech tag

• Entity type

• Entity mention type

• Chunk tag

• Grammatical function tag

• IOB chain

• Head word Path

• Distance

• Order


By then using an SVM to train on and classify new data, a precision of 68.8% was measured when using all features, and the F-measure was calculated to be 58.8%.


Chapter 4

Methodology

This chapter describes the method used to solve the problem. In Section 4.1, the data used in this project is described. In Section 4.2, the implementation is described, with subsections on every step. In the last section, Section 4.3, two baseline methods to compare the results with are described.

4.1 Data

In this work, a subset of all Swedish verdicts was used as data for the experiments. The reason for using this data was its interesting domain and that no other work surrounding it had been done before. Other sources could have been chosen, but no source had an already classified set of relations for the method that this project tried to create.

The language in all documents used is Swedish, and the verdicts varied only in length and case content. Every document in the data has been parsed with the help of optical character recognition (OCR) from portable document format (PDF) to rich text format (RTF).

The subset used for implementing and evaluating the method consisted of 100 of the latest verdicts under the category "crimes against people", with two exceptions. Due to the limitations set in the background chapter, documents smaller than 10 kilobytes were dismissed because they contained almost no information; most documents of 10 kilobytes or less are only about one page. Documents that were unreadable, for example because of character encoding errors or errors in the OCR scanning, were also dismissed.

4.2 Implementation

In this section, all steps of the implementation process are described. In Figure 4.1 below, the data pipeline can be seen with all corresponding steps, and in the sections below, each step is explained more thoroughly. Each number in the figure corresponds to one of the sections below.

Figure 4.1: Pipeline of the method used: (1) set of Swedish documents, (2) tokenization and filtering, (3) named entity recognition, (4) entity pairing and feature calculation, (5) manual tagging of data, (6) training and test data, (7) multiclass support vector machine, and finally the results.

4.2.1 Preprocessing of data

Before using the data in any relation extraction algorithm, it needed to be preprocessed. This was done to simplify the data for later use and to extract valuable information. Tokenization was the first step in this pipeline. Tokenization refers to breaking up a text into words, which we call tokens, and can be read about in Section 3.1.2.1. To extract tokens, an existing library was used: the Natural Language Toolkit (NLTK), which is used for building Python programs that handle natural language.

NLTK has many methods for tokenizing text, with support for many different languages. In this project, the NLTK module named "Tokenizer" was used [24]. The tokenizer module has several different functions to tokenize text; this project used the "word tokenize" function, which splits sentences into tokens based on whitespace and punctuation. The input to the tokenizer consisted of all the documents described in Section 4.1, and the output consisted of equally many documents, where every document was tokenized with one word per line.
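A minimal sketch of this step (assuming NLTK is installed and its "punkt" tokenizer data has been downloaded; the example sentence and output file name are invented):

```python
# Hedged sketch: tokenizing a Swedish sentence with NLTK's word_tokenize,
# then writing one token per line as described above.
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)  # tokenizer models, needed once

text = "Anna Svensson arbetar med Erik Larsson på företaget."
tokens = word_tokenize(text, language="swedish")

with open("tokens.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(tokens))

print(tokens)
```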

4.2.2 Named entity recognition

After the preprocessing of the data, the next step was to find the named entities of people in the tokenized texts. In this work, an existing module was used for this task: the Stanford NLP Group has created a program for training and finding named entities in many languages [13]. With the help of this module, together with a Swedish language model for named entities, it can parse the tokenized files and return a list of the tokenized texts with the corresponding named entity tags. The three entity types that the Stanford NER looks for are locations, organizations and people. In accordance with the limitations set for this work, only the named entities of people were considered. The function used within the Stanford NER is called CRFClassifier and is a conditional random field classifier.

The input to the Stanford NER module in this work consisted of line-separated tokens, and the output was line-separated tokens with the corresponding NER tag. If the module did not find an entity for a token, it was represented as a zero.

After the NER module had been used, a regular expression filter was applied to remove the most common errors made by the NER module. The filter was made by first listing all extracted named entities of people from the NER module by number of occurrences. The next step was to take the named entities from this list that occurred more than two times and were incorrectly labeled as people entities, and put these into a filter list. The reason for choosing the limit of more than two occurrences was that below that, many of the wrongly labeled entities were spelling mistakes or OCR scanning errors, and it was impossible to take all of these into account; if an error occurred more often, it could probably happen again. By then matching every entity extracted by the NER against the filter, the entities that contained a part of the filter were converted to non-people entities. A list of the filters used in this work can be found in Appendix A.1.

The next step after filtering out the most common errors was to combine people entities that occurred directly after each other. This was done because of how the NER module handles tokens: in the NER module every word can be considered an entity, but in this work an entity was defined as the full name of a person. To achieve that, all people entities were checked, and those that occurred directly after each other in the documents, without punctuation or any other characters in between, were combined.
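A sketch of this entity-combining step (the token/tag format below is invented for illustration; the thesis's actual file format may differ), merging consecutive person-tagged tokens into full names:

```python
# Hedged sketch: merge consecutive PERSON-tagged tokens into full-name entities.
# Each item is (token, tag), where "0" means no entity, mirroring the description above.
tagged = [
    ("Anna", "PERSON"), ("Svensson", "PERSON"), ("arbetar", "0"),
    ("med", "0"), ("Erik", "PERSON"), ("Larsson", "PERSON"), (".", "0"),
]

def merge_people(tagged_tokens):
    people, current = [], []
    for token, tag in tagged_tokens:
        if tag == "PERSON":
            current.append(token)          # extend the current name
        else:
            if current:
                people.append(" ".join(current))
            current = []
    if current:
        people.append(" ".join(current))
    return people

print(merge_people(tagged))  # ['Anna Svensson', 'Erik Larsson']
```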

After that, the named entities were ready to be used for the next step, entity pairing and feature calculation.

4.2.3 Relations: Entity pairing and feature extraction

After finding the named entities in the text, the entities needed to be combined and assigned relation classes. In the scope of this work, four relation classes between people were considered; the reason for using only four was that these were the most common relations in the dataset. The relation classes were:

• People who have none of the below relations.

• People who work together at the same company.

• People who are on the same side of the verdict.

• People who are on the opposite side of the verdict.

The first step in the entity pairing was to find the first occurrence of every entity. All other occurrences of entities with the same name were dismissed. This was done to create a list of only unique people entities per document.

When all unique occurrences were found, they all needed to be paired with each other, for the simple reason that any pair of people entities could possibly be a relation. For a document with Z unique people entities, this resulted in a total of Z(Z-1)/2 relations. By doing this entity pairing for every document, all entity pairs were extracted. In this work, a pair of people entities will henceforth be called a relation.

After this step, the features for every relation needed to be calculated. The features for every relation were used as feature vectors in the support vector machine for classification. The features used for each pair of entities were the following, where M ∈ ℕ:

Feature 1: First entity word of the relation. Represented as a binary bag of words that contains words from all verdicts.

Feature 2: Second entity word of the relation. Represented as a binary bag of words that contains words from all verdicts.

Feature 3: M words preceding the first entity. Each word represented as a binary bag of words.

Feature 4: M words following the first entity. Each word represented as a binary bag of words.

Feature 5: M words preceding the second entity. Each word represented as a binary bag of words.

Feature 6: M words following the second entity. Each word represented as a binary bag of words.

Feature 7: Distance between the entities in the relation. Represented as the number of words between the two entities divided by the length of the longest document in the dataset.

Feature 8: Relative position of the relation's first entity in the document. Represented as a real number between zero and one.

Feature 9: Relative position of the relation's second entity in the document. Represented as a real number between zero and one.

Feature 10: Order. In what order the entities appear in the document. Represented as zero if entity 1 comes before entity 2, one otherwise.

The variable M was the same for all features; from this point on, M will be referred to as the number of surrounding words.

The first step in calculating features was to retrieve a set of all unique words in all documents. This set was used with the binary bag of words to represent many of the features above. To create the list of all unique words in the document set, every document was inspected word by word; if a word was not in the list, it was added, and every word got a unique number to represent it.

To represent the word features used in the SVM, every word needed to be translated into a number, because the SVM only takes numerical values as features. The first six features described above were therefore represented as binary bags of words. For example, the first feature, "Entity 1 word", was represented by multiple features, where every feature corresponded to one of the elements in the unique word vector, as described in Section 3.1.2.3. The number of features for every bag of words was therefore as large as the number of unique words in the document set.

For example, if there were 100 unique words in the document set, then every bag-of-words feature was represented as 100 independent features, all with the value zero except the one corresponding to the word, which had the value one.

The last four features could, on the other hand, be described by a single number instead of a binary bag of words. These features are described in the following equations. The position variable in the feature equations below represents where the word appears in the document as a number P, where P corresponds to the number of words from the beginning of the document to where the word is located.

Feature 7, the distance between the entities, corresponds to the number of words between the two entities in the relation divided by the length of the longest document; the division is done to normalize the feature. The following Equation 4.1 is used.

$$\text{Distance} = \frac{|\text{Entity 1 word position} - \text{Entity 2 word position}|}{\text{Length of the longest document}} \tag{4.1}$$

For features 8 and 9, the location of each entity is described by its word position in the document divided by the number of words in the document, as seen in Equation 4.2.

$$\text{Location of entity } K = \frac{\text{Entity } K \text{ word position}}{\text{Number of words in the corresponding document}} \tag{4.2}$$

The last feature, the order of the entities, is described by a zero if entity 1 comes before entity 2 and by a one if it is the other way around, so the following Equation 4.3 is used.

$$\text{Order} = \begin{cases} 0 & \text{if entity 1 position} < \text{entity 2 position} \\ 1 & \text{if entity 1 position} > \text{entity 2 position} \end{cases} \tag{4.3}$$

After all entity pairs had been extracted and the features calculated, everything was saved for later use. The data was stored in the following format for each entity pair:

class Feature1:FeatureValue1 Feature2:FeatureValue2 ...
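A simplified sketch of the pairing and of features 7-10 (the document representation, entity names and positions are invented; the bag-of-words features and the thesis's exact output format are omitted), writing one sparse class feature:value line per relation:

```python
# Hedged sketch: pair unique people entities and compute features 7-10 for each pair.
from itertools import combinations

# Invented example: (entity name, word position of its first occurrence).
entities = [("Anna Svensson", 12), ("Erik Larsson", 45), ("Maria Berg", 230)]
doc_length = 400           # words in this document
longest_doc_length = 5000  # words in the longest document in the dataset

lines = []
for (name1, pos1), (name2, pos2) in combinations(entities, 2):  # Z*(Z-1)/2 pairs
    distance = abs(pos1 - pos2) / longest_doc_length  # feature 7
    loc1 = pos1 / doc_length                          # feature 8
    loc2 = pos2 / doc_length                          # feature 9
    order = 0 if pos1 < pos2 else 1                   # feature 10
    label = 0  # class is unknown until manual tagging
    lines.append(f"{label} 7:{distance:.4f} 8:{loc1:.4f} 9:{loc2:.4f} 10:{order}")

print("\n".join(lines))
```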


4.2.4 Manual classification of relations

After all entities had been paired into relations, the relations needed to be manually classified. This step was necessary to be able to run and evaluate the performance of the SVM. First, an SVM needs classified training data to be able to create a model that can later be used to classify new data. Second, without classified data, measurements such as precision and recall cannot be calculated.

To manually classify the relations, every relation extracted in the previous step had to be inspected. For each relation, the original document corresponding to that pair was opened and read to see whether there was a relation between the pair of entities, and if so, which kind of relation it was. After this was done, the data was stored in the same format as above, but with the class attribute set to the correct class.

4.2.5 Splitting the data

Lastly, before using the manually classified data in the support vector machine, the data had to be split into a training set and a validation set. The training set was used for almost all experiments, from tweaking parameters to analyzing the importance of the features. The validation set was held out for the last experiment, which shows how the method works with the best setup of parameters and features. A machine learning algorithm should not be trained on the same data it is validated on, and this was also the only way to show how the method works on new, unseen data.

4.2.6 Support vector machine

After all preprocessing was done, the data was used in a support vector machine. In this work, the LIBSVM library was used [4]. LIBSVM is a library for support vector machines that is written in C and has a Java interface, of which a few classes and methods were used. The parameter class defines all the parameters that the SVM should use; the most important parameters in this class are C, γ and the SVM type. These parameters and their impact on the SVM were explained in Section 3.1.3.1 and Section 3.1.3.3. The parameter class is used in the "svm train" method, which is the core method of the library and trains a classifier on the input data with the given parameters. This method can also save the calculated model for analysis and for classifying new data.

The second method used is the N-fold cross validation method. It splits the data into N sets and uses N-1 of them for training and the last one for validation. The cross validation is done by calling the "svm train" method N times, so that every set is used once for validation and the rest for training. By then calculating the classification measurements for every run and taking the average over all runs, an overall measurement can be obtained that gives an indication of how the SVM performs with these parameters.
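The thesis used LIBSVM's Java interface; the sketch below shows the equivalent calls through LIBSVM's official Python wrapper (the file name and parameter values are placeholders), where the data file uses the sparse class feature:value format described in Section 4.2.3:

```python
# Hedged sketch: training a C-SVC with an RBF kernel via LIBSVM's Python interface.
# relations.txt is a placeholder file in LIBSVM's sparse "label index:value" format.
from libsvm.svmutil import svm_read_problem, svm_train, svm_predict

y, x = svm_read_problem("relations.txt")

# -s 0: C-SVC, -t 2: RBF kernel, -c: C, -g: gamma, -v 5: 5-fold cross validation.
cv_accuracy = svm_train(y, x, "-s 0 -t 2 -c 100 -g 0.01 -v 5")

# Without -v, svm_train returns a model that can classify new data.
model = svm_train(y, x, "-s 0 -t 2 -c 100 -g 0.01")
predicted_labels, accuracy, _ = svm_predict(y, x, model)
```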


4.3 Baseline methods for relation classification

This section contains information about the baseline methods. Baseline methods are used for comparison with other methods and define a point of reference for the results. As there is no other work that this one can be compared with, two very simple baselines are described in the following sections.

4.3.1 Most probable baseline

The first baseline used was the most probable class: every data point in the validation set was guessed to belong to the most probable class in the training set, i.e. the class with the most data points belonging to it.

4.3.2 Random guessing with weight baseline

The second baseline was weighted random guessing. The data points in the validation set were guessed to belong to one of the four classes with different probabilities, where the probability of class i corresponded to the number of data points of class i in the training set divided by the total number of data points in the training set.
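A minimal sketch of both baselines (standard library only; the training label counts are taken from Table 5.1 in the next chapter, and the validation size from Table 5.2):

```python
# Hedged sketch: the two baselines, given training and validation label lists.
import random
from collections import Counter

train_labels = [0] * 2467 + [1] * 277 + [2] * 184 + [3] * 112  # counts as in Table 5.1
validation_size = 449

counts = Counter(train_labels)

# Baseline 1: always guess the most probable class from the training set.
most_probable = counts.most_common(1)[0][0]
baseline1 = [most_probable] * validation_size

# Baseline 2: random guessing weighted by the class distribution of the training set.
classes = sorted(counts)
weights = [counts[c] / len(train_labels) for c in classes]
baseline2 = random.choices(classes, weights=weights, k=validation_size)

print(baseline1[:5], baseline2[:5])
```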


Chapter 5

Experiments and results

In this chapter, all experiments and their corresponding results are explained and shown. At the beginning of the chapter, a few notes about the computer resources, the dataset and the parameters used in the SVM are given.

5.1 Computer requirements

The experiments have low minimum computer requirements. It takes around five minutes to create an SVM model from the data in these experiments on a typical office computer (3 GHz CPU, 8 GB RAM), depending on the parameters chosen. The memory usage depends on how many features and data points are used, as memory is allocated as the number of features times the number of data points. For creating a model from the training set described in Section 5.2, 100 MB of memory is allocated. The LIBSVM package can use 4 GB of memory for creating an SVM model, so there is room for more features and data points.

5.2 Information about the dataset used in experiments

The dataset that was used in the first three experiments consisted of 83 verdict documents. From these verdicts, 3040 relations were extracted by pairing the named entities of people as described in Section 4.2.3. The distribution of these 3040 relations between the different classes can be seen in Table 5.1. This training set is used with the cross validation method described in Section 3.1.4 to prevent overfitting.


Table 5.1: Distribution of classes in the training data

Class   Explanation                 Number of relations
0       No relation                 2467
1       Works together              277
2       Same side of verdict        184
3       Opposite sides of verdict   112

The validation dataset used for the last experiment consisted of ten verdict documents with 449 relations. The distribution of these relation classes can be seen in Table 5.2.

Table 5.2: Distribution of classes in the validation data

Class   Explanation                 Number of relations
0       No relation                 372
1       Works together              39
2       Same side of verdict        23
3       Opposite sides of verdict   15

5.3 Parameters used in the SVM

The SVM used in these experiments was the one explained in Section 4.2.6. The type of the SVM was C-SVC, a multiclass classifier where the parameter C can be set. C-SVC is only implemented to use One-vs-All classification, and therefore One-vs-One was not considered in the following experiments. One-vs-All classification is explained in Section 3.1.3.5. The kernel that was used is a radial basis function (RBF kernel), because it is often considered to perform best on non-linear data [19]. The RBF kernel can be read about in Section 3.1.3.2. When using the RBF kernel, the parameter γ can also be set.

5.4 Experiment delimitations

A few delimitations were needed to conduct the upcoming experiments. An optimal tuning of parameters and features requires a program that runs over every combination of these, which is called an exhaustive search. This could not be done due to its time complexity; instead, the experiments followed the ideas described in A practical guide to support vector classification by Chih-Wei Hsu et al. [19], which describes a procedure for tuning an SVM in a few steps. In this thesis, the same idea is applied but with different methods, described in the following list.

1. Use cross-validation to find the best parameters C and γ by conducting an experiment.


2. Use the best parameters C and γ from step 1 to find the best combination of features to use by conducting several experiments.

3. Validate the parameters and features on the validation set.

The reason for reusing the best parameters from the first experiment in the upcoming experiments is the curse of dimensionality. Ideally, the parameters would be optimized together with the different features, but the time needed would become extremely large, as explained in the beginning of this section.

5.5 Experiment 1

5.5.1 Optimizing C and γ

Because of the choice to use an SVM of type C-SVC and the RBF kernel, both the parameters C and γ can be set by the user to enhance the classification performance of the SVM. These parameters are explained in Sections 3.1.3.2 and 3.1.3.3.

To find good values for the C and γ parameters, an experiment that tests a large number of different parameter pairs needs to be conducted. Finding the optimal combination of parameters is not feasible due to the large amount of time it would take, but it is possible to find a combination that is a good approximation. In this experiment, two methods were used to find this approximation of the C and γ parameters. The first one, grid search, is a common method for parameter optimization. The second one, random search, is less common than grid search but is also used for SVM parameter optimization. The reason for applying both methods is that it is not possible to say in advance which one yields the best result for the current dataset.

There exist, of course, other methods for optimizing the hyperparameters of an SVM. One example is the Nelder-Mead method, which evaluates the SVM at the vertices of a large simplex (a tetrahedron generalized to n parameter dimensions) and then iteratively shrinks the simplex around the best vertex until a stopping criterion is met. The reason for not choosing a more complex method such as Nelder-Mead in this work was that both grid search and random search are commonly used for similar tasks.

5.5.2 Grid search

Grid search is a method that splits the ranges of the parameters into a finite set of values and tests all combinations, which can be visualized as a two-dimensional grid. The advantage of grid search is that it is a systematic approach and can be applied iteratively with smaller grids. The disadvantage is that it is easy to miss interesting areas of the grid if the steps are too large. In the grid search experiment, two iterations of grid search are performed. First, a grid search that covers a large area is done; then a new grid search is done that focuses only on the best area of the previous grid. The first grid search is chosen to have the following range for C:

C = (1E0, 1E1, . . . ,1E9)

This results in ten different values for parameter C. The range for the variable γ has been chosen as

γ = (1E-8, 1E-7, . . . , 1E-2)

This results in seven different values for parameter γ. Combining all C and γ values gives a total of 70 combinations, and every combination is tested with ten-fold cross validation on the training set. From each test, the macro precision (3.15) is calculated. The number of surrounding words, explained in Section 4.2.3, is kept constant at ten for each test in the grid search. This value is chosen as a starting point and is evaluated in Experiment 2.
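A sketch of this grid search is shown below. For brevity it uses LIBSVM's built-in svm_cross_validation routine instead of the cross-validation procedure described in Section 4.2.6, and computeMacroPrecision stands in for the macro precision of equation (3.15); the parameter object is assumed to be pre-configured for C-SVC with an RBF kernel.

```java
import libsvm.*;

public class GridSearch {
    // Tests every (C, gamma) pair from the two ranges with ten-fold
    // cross-validation and keeps the pair with the highest macro precision.
    public static double[] search(svm_problem prob, svm_parameter param) {
        double bestC = 0, bestGamma = 0, bestScore = -1;
        for (int i = 0; i <= 9; i++) {              // C = 1E0 ... 1E9
            for (int j = -8; j <= -2; j++) {        // gamma = 1E-8 ... 1E-2
                param.C = Math.pow(10, i);
                param.gamma = Math.pow(10, j);
                double[] predicted = new double[prob.l];
                svm.svm_cross_validation(prob, param, 10, predicted);
                double score = computeMacroPrecision(prob.y, predicted);
                if (score > bestScore) {
                    bestScore = score;
                    bestC = param.C;
                    bestGamma = param.gamma;
                }
            }
        }
        return new double[] { bestC, bestGamma, bestScore };
    }

    // Placeholder: macro precision averaged over the four relation classes.
    static double computeMacroPrecision(double[] truth, double[] predicted) {
        double sum = 0;
        for (int c = 0; c < 4; c++) {
            int tp = 0, fp = 0;
            for (int i = 0; i < truth.length; i++) {
                if (predicted[i] == c) { if (truth[i] == c) tp++; else fp++; }
            }
            sum += (tp + fp) == 0 ? 0 : (double) tp / (tp + fp);
        }
        return sum / 4;
    }
}
```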

The reason for choosing these particular ranges for the parameters C and γ in the first grid search is partially explained in A practical guide to support vector classification [19], where Hsu et al. propose a range from 2^-5 to 2^15 for the C parameter and 2^-15 to 2^3 for γ. In this work, the ranges of Hsu et al. were used as a starting point and guideline. As there are no exact ranges to use, since it all depends on how the dataset looks, some calculations and testing were used to find the ranges for this experiment.

Range of C

The largest value of C was chosen to be 1E9. This value was found by testing a few different values for C and drawing conclusions from the results. The tests showed that using a value higher than 1E9 drastically decreased the performance of the classifier. Even though 1E9 is a larger value than Hsu et al. propose, it can, with the help of a low γ, result in high classification performance.

The smallest value of C was chosen to be 1E0, because a smaller value would scale down the error term, and testing a few values below 1E0 (e.g. 1E-1, 1E-2) showed that they drastically decreased the performance of the classifier.

Range of γ

The reason for choosing the largest value of γ in the first grid search iteration to be 1E-2 can be explained by a few examples and some knowledge about the RBF kernel. First of all, the RBF kernel value is the part that is influenced by the value of γ. The RBF kernel is a measure of how similar two feature vectors are, where zero means not similar at all and one completely similar. The RBF kernel is calculated by the equation K(x, x′) = exp(−γ‖x − x′‖²), where x and x′ are two different feature vectors (see Section 3.1.3.2). The feature vectors in this work consist of 88 features when the number of surrounding words is set to 10. As all feature values are normalized between zero and one, the maximum value of ‖x − x′‖² is 88 for this experiment.
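To make the argument concrete, the small sketch below evaluates the RBF kernel for two maximally different feature vectors; the numbers are only meant to illustrate why values of γ far above 1E-2 push almost all kernel values towards zero for this feature space.

```java
public class RbfExample {
    // K(x, x') = exp(-gamma * ||x - x'||^2)
    public static double rbf(double[] x, double[] xPrime, double gamma) {
        double squaredDistance = 0;
        for (int j = 0; j < x.length; j++) {
            double diff = x[j] - xPrime[j];
            squaredDistance += diff * diff;
        }
        return Math.exp(-gamma * squaredDistance);
    }

    public static void main(String[] args) {
        double[] zeros = new double[88];              // all 88 features at 0
        double[] ones = new double[88];               // all 88 features at 1
        java.util.Arrays.fill(ones, 1.0);             // squared distance = 88
        System.out.println(rbf(zeros, ones, 1e-2));   // exp(-0.88) ≈ 0.41, still informative
        System.out.println(rbf(zeros, ones, 1.0));    // exp(-88) ≈ 0, nearly all pairs look dissimilar
    }
}
```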
