
UPTEC F 20029

Degree project, 30 credits, June 2020

Extracting Particular Information from Swedish Public Procurement Using Machine Learning

Eystein Waade



Abstract

Extracting Particular Information from Swedish Public Procurement Using Machine Learning

Eystein Waade

The Swedish procurement process has a yearly value of 706 billion SEK spread over approximately 18 000 procurements. Each process involves many documents, written in different formats, that must be understood in order to submit a competitive tender. With the development of new technology in the age of machine learning, it is of great interest to investigate how this knowledge can be used to enhance the way we procure.

The goal of this project was to investigate whether public procurements written in Swedish in PDF format can be parsed and segmented into a structured format. This process was divided into three parts: pre-processing, annotation, and training/evaluation. The pre-processing was accomplished using an open-source PDF parser called pdfalto that produces structured XML files with layout and lexical information. The annotation process consisted of generalizing a procurement into high-level segments that are applicable to different document structures, as well as finding relevant features. This was accomplished by identifying frequent document formats so that many documents could be annotated using deterministic rules. Finally, a linear-chain Conditional Random Field was trained and tested to segment the documents.

The models showed high performance when tested on documents of the same formats as they were trained on. However, the data from five different document formats were not sufficient or general enough for the model to make reliable predictions on a sixth format that it had not seen before. The best result was a total accuracy of 90.6%, where two of the labels had an F1-score above 95% and the two other labels had F1-scores of 51.8% and 63.3%.

Supervisor: Tim Lachmann


Popular Science Summary (Populärvetenskaplig sammanfattning)

The Swedish procurement process has a yearly value of 706 billion SEK spread over 18 000 procurements. Each process involves many documents with different structures that must be understood in order to compete in a tendering process. In a time of new technology and at the beginning of the machine learning era, it is of great interest to investigate whether this knowledge can be used to improve and streamline how we procure.

The goal of this thesis was to investigate whether public procurements written in Swedish in PDF format can be segmented into a structured format. The project can be divided into three parts: pre-processing, annotation, and training/evaluation. The pre-processing was carried out using an open-source PDF parser called pdfalto, which produces structured XML files with layout and lexical information. The annotation process consisted of generalizing procurements into high-level segments that are applicable to different document structures and useful for finding relevant features. The work consisted of identifying frequent document structures in order to annotate the data efficiently. Finally, a linear-chain Conditional Random Field was trained to segment the documents.

The models showed high performance when tested on the same kinds of documents as they had been trained on. However, it turned out that data from five different document structures was not sufficient or general enough for the model to make reliable predictions on a sixth document structure that it had not seen before. The best model had an accuracy of 90.6%, where two of the classes had an F1-score above 95% and the two other classes had F1-scores of 51.8% and 63.3%.


Contents

1 Introduction
1.1 Tendium
1.2 Background
1.3 Project Aim
1.4 Delimitation
1.5 Outline
2 Related Work
2.1 Conditional Random Fields in Natural Language Processing
2.2 GROBID
3 Machine Learning
3.1 Sequential Text Data
3.2 Discriminative versus Generative classifiers
3.3 Classification models
3.3.1 Random Forest
3.3.2 Hidden Markov Models & Conditional Random Fields
3.4 Evaluation Metrics
3.4.1 Confusion Matrix
4 Data process
4.1 Pdfalto
4.1.1 Analyzed Layout and Text Object
4.1.2 Processed procurement
4.2 Procurement Data
4.2.1 High-level segmentation labels
4.2.2 Annotation of training data
4.2.3 A line-based approach
4.2.4 Features of continuous variables
4.2.5 Categorical Features
4.2.6 Note on labels
4.3 Computer implementation
4.3.1 Sklearn Random Forest Classifier
4.3.2 Sklearn crfsuite
5 Results
5.1 Differences in annotated training data for the non-sequential and sequential model
5.2 Non-sequential labeling
5.3 Sequential labeling - selective formats
5.4 Sequential labeling - all formats
6 Evaluation
6.1 Discussion - models
6.2 Discussion - data
7 Conclusion
7.1 Future work


List of Figures

3.1 Simplified Random Forest classifier with n number of decision trees of depth 2.
3.2 Graphical model of HMM-like linear-chain CRF. Taken from [1].
3.3 Graphical model of a linear-chain CRF in which the transition score depends on the current observation. Taken from [1].
4.1 ALTO xml structure.
4.2 Layout XML tree structure.
4.3 The featured document formats: Tendsign, Eavrop, Police Authority and Kommers.
4.4 The featured document formats: Upphandlingscenter and Rikshem.
4.5 Illustration of high-level labels in Eavrop and Tendsign.
4.6 Example of a feature vector of categorical features.
5.1 Confusion matrix for Random Forest classifier trained and tested on Tendsign and Eavrop.
5.2 Confusion matrix for Random Forest classifier trained on Tendsign and tested on Eavrop.
5.3 Precision and recall of each class label from training with Tendsign, Rikshem and testing with Eavrop.
5.4 Precision and recall of each class label from training with Tendsign, Rikshem, Eavrop and testing with Eavrop.
5.5 Precision and recall of each class label from training with all formats except Eavrop and testing with Eavrop.
5.6 Precision and recall of each class label from training with all document formats and testing with Eavrop.


List of Tables

3.1 Confusion matrix for binary classification.
3.2 Confusion matrix for classification with four classes.
4.1 Total number of documents and TextLines available in this project.
4.2 Example features with fixed categories.
4.3 Example features with unfixed categories.
4.4 Example of punctuation_profile.
4.5 Visualisation of which labels the different document formats contain. Green means yes and red means no.
5.1 Data used in the CRF with selected formats.
5.2 State transition weights.
5.3 The highest positive and negative weights for BODY, FOOTNOTE and HEADNOTE.
5.4 Data used in the CRF with selected formats where Eavrop is included in training.
5.5 Data used in the CRF with all formats where Eavrop is not included in training.
5.6 State transition weights with all formats except Eavrop.
5.7 Data used in the CRF with all formats where Eavrop is included in training.
5.8 State transition weights with all formats including Eavrop.


Acronyms

ALTO Analyzed Layout and Text Object
CRF Conditional Random Field
HMM Hidden Markov Model
NER Named Entity Recognition
NLP Natural Language Processing
RF Random Forest
SVM Support Vector Machine


Chapter 1 Introduction

A public procurement can be defined as the acquisition of a wide range of missions, supplies or services required by the state. It can range from the purchase of routine equipment to placing contracts for large-scale infrastructure projects. The common goal of procurements is to ensure an efficient and secure use of public funds. It is common to separate public and private procurements. Even though the processes seem similar, their goals differ: where the private sector procures to increase shareholders' profits, the public sector procures on a non-profit basis where the goal is to add value to the social supply chain. Having said that, the differences might not be as distinct as this definition suggests; it nevertheless helps build an impression of how the public and private sectors differ.

The Swedish Government Office describes public procurement as follows:

Public procurement must be efficient and legally certain, and make use of market competition. It must also promote innovative solutions and take environmental and social considerations into account. The procurement law framework must also help realise the internal market and facilitate the free movement of goods and services in the European Union. Opening purchases made by public authorities and public bodies to competition can mean better deals for the public sector and a more efficient use of public funds. [2]

A key part of this description is having transparent procurements that are economically efficient. However, economic efficiency does not mean consistently choosing the cheapest deal, but rather the deal that ensures satisfying results at a minimal cost. Managing the balance between cost and satisfying results is probably the main challenge of procuring; the goal of the work in this thesis project, however, is to bring new insight and to enhance specific parts of the procurement process.


1.1 Tendium

Tendium is a company that was established in 2018 and is currently researching the use of artificial intelligence to make the procurement process more efficient and secure. This project is done in collaboration with Tendium.

1.2 Background

The Swedish procurement process has a yearly value of 706 billion SEK. There are annually more than 18 000 procurements, where 69% of these are advertised by municipalities [3]. With such numbers, it is of huge interest to make this process more efficient. Companies participating in public tenders build their own competence to improve their work with procurements.

This is a fragile system, as it requires a certain amount of experience to understand the procurements, and it relies on individuals' experience and knowledge within specific types of procurements. When employees with this experience quit or are absent for longer periods of time, companies lose ground in the procurement process. Some companies might simply not have enough resources to invest in procurements, which results in smaller companies not being able to procure. Reading and understanding procurements is time-consuming work that is limited by a person's reading capacity and the analytical tools available. If some of these tasks can be moved to a computer, the process would be considerably accelerated.

Security and transparency are two important keywords for public procurements. The introduction of AI might not only be more efficient from an economic perspective, as described in the previous paragraph, but might also improve the security and transparency of the process. It may also increase the number of companies that are able to procure, as fewer resources and less time will be required.

1.3 Project Aim

The aim of this project is to investigate whether some parts of the public procurement process can be assisted, or in some cases replaced, by machine learning models. The ideal model will take all documents related to a procurement and return the content in a structured format that is able to highlight important parts. This should be achieved using pdfalto [4] as a pre-processor of the documents, followed by feature engineering and annotation of the pre-processed documents. The annotated training data will then be used to train both a sequential and a non-sequential classifier to segment the documents into a more readable format that can give new and important insight into specific procurements, or into the entire process.

1.4 Delimitation

Procurements exist in numerous languages that this project could benefit greatly from including. However, for simplicity, Swedish procurements written in Swedish, from the cleaning services, were chosen for this project. This decision was made based on the data available at the start of the project.

When introducing AI in the public procurement process, legal questions might arise; these will not be discussed in this thesis. This thesis focuses on the technical application of extracting particular information from procurements. For a discussion from a legal point of view I refer to the student thesis written by Carolina Wibring [5].

1.5 Outline

Chapter 2 describes projects related to this thesis, where Natural Language Processing is the main topic. Chapter 3 describes the two models used in this project and the theory behind them. Chapter 4 gives a detailed description of how PDF documents are parsed and annotated so that they can be used in a machine learning application. Chapter 5 contains the results from both of the models used in this project. Finally, chapter 6 contains a discussion of the data process and the results, followed by chapter 7, which concludes the work.


Chapter 2

Related Work

In this chapter, Conditional Random Fields within Natural Language Processing are introduced in section 2.1, followed in section 2.2 by the work that inspired this thesis: the open-source software GROBID.


2.1 Conditional Random Fields in Natural Lan- guage Processing

A Conditional Random Field (CRF) is considered a strong model for text classification within Natural Language Processing (NLP). It allows for the rich and flexible features that are needed for a computer to be able to interpret human language. The history of NLP generally starts in the 1950s, and the field was for a long time defined by sets of complex handwritten rules. This was the case up until the end of the 1980s, which in the field of NLP is often considered the "statistical revolution" [6] and the beginning of applied machine learning. Practical use of machine learning continued to grow as computational power increased and became economically accessible. The development of computational power can be described by Moore's law, which states that the number of transistors in a given area doubles roughly every 24 months [7].

Some of the first applications were tree-based models taking advantage of older rule-based methods. Another important part of NLP is part-of-speech tagging. This introduced new statistical approaches using probabilistic decisions based on weights assigned to the inputs, where the use of the Hidden Markov Model (HMM) can be seen as the starting point. A linear-chain CRF can be understood as a generalization of the HMM, and the theory behind these models is described in section 3.3.2.

One application of CRF is information extraction from research papers.

In 2004, a study was conducted by Fuchun Peng and Andrew McCallum to compare the performance of CRFs with state-of-the-art extractors based on Support Vector Machines (SVM) and HMMs [8]. The study concluded that the CRF model reduced the F1-error by 36% compared to previous SVM models, with an even larger reduction compared to HMMs. The study was based on header and citation extraction from research papers, where title, author, abstract, date and email are some of the labels that were extracted.

This research was important for GROBID, which will be described in section 2.2.

2.2 GROBID

GROBID stands for GeneRation Of BIbliographic Data and is an open-source machine learning library for extracting, parsing and re-structuring raw documents such as PDFs into structured formats, with a particular focus on scientific publications [9]. The work started in 2008, was made public in 2011, and is now used by acknowledged publication sites like ResearchGate. In a complete PDF processing run, GROBID manages 55 final labels to structure the information from traditional publication metadata or full-text structures. This project is of high interest to my work, as the high-level parsing done by GROBID is in many ways what we are trying to achieve in this project.

GROBID is developed in Java and consists of multiple Conditional Random Field models, each trained for a specific task. The project consists of 11 different CRF models, each with its own set of features, set of training data and normalization. To understand GROBID better, one can consider the project as a cascade of linear-chain CRFs. The first CRF in the hierarchy exploits layout features combined with lexical features to do the initial segmentation into 7 classes: cover, header, body, footnotes, headnotes, biblio and annexes. New CRFs are then applied to each class until certain conditions are met. Header, as an example, is divided into title, author, affiliation, abstract, date and keywords using the header segmentation model. Sub-level models are then applied to each of these segments to extract the particular information. In general, the high-level segmentation is done using layout and lexical features from text row to text row, whereas the sub-level segmentation is done from word to word using mainly lexical features.

It is also worth mentioning that GROBID and pdfalto have been developed by the same people. GROBID uses pdfalto for the initial processing of PDF documents, and this processing step is the idea behind this thesis. Pdfalto is described in section 4.1. In this project a similar approach has been used to process and structure public procurements, and inspiration has been taken from the feature engineering in GROBID to build a robust conditional random field model. The feature engineering process in GROBID has been crucial for my work, as it has inspired creative ways of taking advantage of the Analyzed Layout and Text Object (ALTO) format produced by pdfalto. The feature engineering process is further described in chapter 4.

What is important to remember is that GROBID is trained on a special type of document, namely academic papers. These documents follow a specific pattern, i.e. the model expects the text to have an abstract, authors, references etc. This constrains GROBID to dealing with documents of these formats. However, when only considering academic and research papers of the expected format, GROBID has high performance and gives valuable results. One great example is the reference parser, which can determine who has been cited in a research paper. Another important model is the full-text parser, which can categorize texts based on their content. This simplifies the filtering process for readers when they want to find relevant scientific publications.


Chapter 3

Machine Learning

In this chapter the theory behind the models used in my project is presented. The first part, section 3.1, is an introduction to the concept of sequential data in text applications, followed by section 3.2, which compares discriminative and generative classifiers. Section 3.3 describes the theory behind the models. Finally, section 3.4 is an introduction to common measurements used to evaluate classifiers.


3.1 Sequential Text Data

When working with text applications it is important to understand that you are often dealing with sequential data. The problems are often complex, and context is important, as words may have separate meanings in different settings. A great example of this is homonyms such as "date" or "leaves" that have different meanings depending on their context.

It was on our third date that I asked her to marry me.

At what date is your birthday?

The cleaning company clears the street of leaves every morning through September.

Her assistant leaves for work every morning at 6:30 AM.

These dependencies are intuitive to humans, as we are able to determine the context when we read a sentence; however, this is not intuitive for a computer.

When your data is text, sequential models have numerous advantages. Before discussing these models, let us consider the task of Named Entity Recognition (NER), which is an important application in NLP. NER is the problem of identifying and classifying entities in text, such as people, locations, organizations, expressions of time, etc. Examples of these are George Washington, France, World Health Organization and 12PM, which at first sight seem intuitive; however, some entities are too rare to occur even in large training sets, which means that the model needs to be able to identify words it has never seen before. Therefore context is important, as the model can learn the meaning of different contexts and thereby correctly identify unknown words.

3.2 Discriminative versus Generative classifiers

Generative classifiers learn a model of the joint probability function p(x, y) of inputs x and class labels y, and make their predictions by applying Bayes' rule to calculate p(y|x), to find the most probable label y given input x.

A discriminative classifier models the posterior p(y|x) directly, which is the same as learning a map from input x to label y. In other words, the discriminative model learns what separates the different classes y, whereas the generative model learns what defines each class. This means that discriminative models are computationally more efficient, as they skip learning the joint probability, but at the cost of flexibility. Discriminative models are often preferred in supervised learning tasks as they give higher performance. However, this is highly dependent on the classification problem being solved and the data available. The advantages and disadvantages of either model are discussed further by Andrew Y. Ng and Michael I. Jordan [10].

3.3 Classification models

Two methods have been applied to the training data throughout this project: a sequential and a non-sequential model. In this section the theory behind both models is described to build an understanding of how they work.

3.3.1 Random Forest

Random Forest (RF) is an ensemble method with its roots in decision trees. Ensemble methods combine weaker classifiers to build a stronger one. One common ensemble technique applied to increase performance is bootstrap aggregation, also known as bagging, which is a key part of RF. By introducing bagging, the decision trees are trained on randomly sampled subsets of the data, which allows for a significant reduction in variance compared to single decision tree classifiers. RF also takes advantage of feature bagging, which means that only a subset of the available features is considered at each split. This allows for a more robust model, as the correlation between the decision trees is reduced, which in the end reduces the variance of the ensemble without increasing the risk of overfitting. It is worth mentioning that decision trees in general are sensitive to noisy data, which implies that a successful RF model is the result of well-prepared and structured data.


Figure 3.1: Simplified Random Forest classifier with n number of decision trees of depth 2.

Figure 3.1 shows a simplification of how RF works. A forest of n decision trees is initialized, and each tree is trained on a subset of the data utilizing a subset of the available features. How many features to consider at each split depends on the problem being solved as well as on how many features are available. Each decision tree makes a prediction for the corresponding input data, and the final class is based on a majority vote between all decision trees. In cases where no class has a majority, the final class is determined by a coin flip.

When constructing decision trees, it is common to calculate the impurity of a node before deciding which feature a split should be based upon. Common measurements for impurity are the Gini index and entropy [11]. The Gini index is defined as

$$\text{Gini index} = 1 - \sum_{i=1}^{n} p_i^2, \tag{3.3.1}$$

where the Gini index takes a value between 0 and 1 and is computed from the squared class probabilities $p_i$ of each class present in the sample. A value of 0 means a perfectly homogeneous sample, and 1 corresponds to maximum inequality among the elements. The same logic follows for the entropy, which is calculated as

$$\text{Entropy} = -\sum_{i=1}^{n} p_i \log(p_i). \tag{3.3.2}$$

Both estimates are popular, and one is not always preferred over the other. They are both used to minimize the final classification error of the model by calculating the optimal splits during the construction of the decision tree. In Random Forest, the decision trees do not need to be equal in shape as presented in Figure 3.1. Every decision tree can have a unique depth or number of leaf nodes. This depends on the construction of each tree and on the nature of the classification problem itself.
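As a concrete illustration, the following is a minimal sketch (my own, not code from the thesis) of how the two impurity measures can be computed for the class labels that fall into a single node:

import numpy as np

def gini_index(labels):
    # Gini index = 1 - sum of squared class probabilities, eq. (3.3.1)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # Entropy = -sum_i p_i * log(p_i), eq. (3.3.2)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

node = ["BODY", "BODY", "BODY", "COVER"]  # labels reaching one node
print(gini_index(node))  # 0.375
print(entropy(node))     # approx. 0.562

A pure node (all labels equal) gives a Gini index and entropy of 0, which is why splits are chosen so as to drive the child nodes towards homogeneity.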

3.3.2 Hidden Markov Models & Conditional Random Fields

Before understanding how CRFs work, it is of high interest to have a look at the HMM. If we again consider the NER problem mentioned in the beginning of this chapter, it might be tempting to solve it with a naive approach, i.e. assuming all of the entities are independent. However, it turns out that neighbouring words are dependent; while New York is a location, New York Times is an organization [1]. One popular approach to this problem is to arrange the labels in a linear chain, which is what is done in an HMM [12]. An HMM models a sequence of observations $X = \{x_n\}_{n=1}^{N}$ by assuming there is an underlying sequence of states $Y = \{y_n\}_{n=1}^{N}$ drawn from a finite state set $S$. The HMM is generative, meaning we model the joint probability distribution $p(x, y)$. To do this the HMM makes two independence assumptions [1]. First, it assumes that each state depends only on the previous state, i.e. $y_n$ is independent of all ancestors $y_1, y_2, \ldots, y_{n-2}$ given the previous state $y_{n-1}$. Secondly, it is assumed that each feature variable $x_n$ depends only on the current state $y_n$. This allows us to write the joint probability of a state sequence $y$ and an observation sequence $x$ as

$$p(x, y) = \prod_{n=1}^{N} p(y_n \mid y_{n-1})\, p(x_n \mid y_n), \tag{3.3.3}$$


where (3.3.3) can be rewritten in a more general form before defining a linear-chain CRF,

$$p(x, y) = \frac{1}{Z} \prod_{n=1}^{N} \exp\Big\{ \sum_{i,j \in S} \theta_{ij}\,\mathbf{1}_{\{y_n = i\}}\mathbf{1}_{\{y_{n-1} = j\}} + \sum_{i \in S} \sum_{o \in O} \mu_{oi}\,\mathbf{1}_{\{y_n = i\}}\mathbf{1}_{\{x_n = o\}} \Big\}, \tag{3.3.4}$$

where $\theta = \{\theta_{ij}, \mu_{oi}\}$ are real-valued parameters of the distribution and $Z$ is a normalization constant that makes the distribution sum to one. $\mathbf{1}$ is the indicator function, $(i, j)$ corresponds to a transition and $(i, o)$ is a state-observation pair. This definition of an HMM can be further generalized by introducing the concept of feature functions of the form $f_k(y_n, y_{n-1}, x_n)$. To further develop (3.3.4) we define one feature $f_{ij}(y, y', x) = \mathbf{1}_{\{y = i\}}\mathbf{1}_{\{y' = j\}}$ for each transition $(i, j)$ and one feature $f_{io}(y, y', x) = \mathbf{1}_{\{y = i\}}\mathbf{1}_{\{x = o\}}$ for each state-observation pair $(i, o)$. By generalizing the feature function as $f_k$ and letting it range over both $f_{ij}$ and $f_{io}$ we can rewrite (3.3.4) as

$$p(x, y) = \frac{1}{Z} \prod_{n=1}^{N} \exp\Big\{ \sum_{k=1}^{K} \theta_k f_k(y_n, y_{n-1}, x_n) \Big\}, \tag{3.3.5}$$

where $f_k$ is a general feature function and $Z$ is a normalization constant chosen so the distribution sums to one. The final step is to write the conditional distribution $p(y \mid x)$ from the HMM defined in (3.3.5) as

$$p(y \mid x) = \frac{p(x, y)}{\sum_{y'} p(y', x)} = \frac{\prod_{n=1}^{N} \exp\big\{ \sum_{k=1}^{K} \theta_k f_k(y_n, y_{n-1}, x_n) \big\}}{\sum_{y'} \prod_{n=1}^{N} \exp\big\{ \sum_{k=1}^{K} \theta_k f_k(y'_n, y'_{n-1}, x_n) \big\}}. \tag{3.3.6}$$

When the feature functions $f_k$ are indicator functions, this results in a special type of linear-chain CRF, one which only includes features from the current word's identity. General feature functions allow more flexibility and lead to the general definition of a linear-chain CRF in (3.3.6). Figure 3.2 shows an example of this HMM-like CRF. In this case, the score received from state $i$ to $j$ is independent of the input, i.e. the score is the same for different inputs. In text applications it is desirable that these scores depend on the input $x$, which can be achieved by adding a transition feature $\mathbf{1}_{\{y_n = j\}}\mathbf{1}_{\{y_{n-1} = i\}}\mathbf{1}_{\{x_n = o\}}$. The graphical model of this kind of linear-chain CRF is presented in Figure 3.3.


Figure 3.2: Graphical model of HMM-like linear-chain CRF. Taken from [1].

Figure 3.3: Graphical model of a linear-chain CRF in which the transition score depends on the current observation. Taken from [1].

For linear-chain CRFs it is important to understand that each feature function can depend on observations from any time step. The observation argument is passed to $f_k$ as a vector $x_n$. This means that $x_n$ is assumed to contain all the global observations $x$ that are needed to compute the features at step $n$. An example of this is when $x_n$ uses the previous word $x_{n-1}$ as a feature; then the feature vector $x_n$ has to include the identity of $x_{n-1}$. This allows the linear-chain CRF to use complex feature vectors at each time step.
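To make the notation concrete, here is a toy sketch (entirely my own; the labels, features and weights are invented) of the unnormalized score that appears inside (3.3.5) and (3.3.6) for one label sequence. Computing the actual probability p(y|x) additionally requires the normalizer Z, i.e. a sum of such scores over all possible label sequences:

import math

def f1(y, y_prev, x):
    # Transition feature: fires on HEADNOTE -> BODY
    return 1.0 if (y_prev, y) == ("HEADNOTE", "BODY") else 0.0

def f2(y, y_prev, x):
    # State-observation feature: HEADNOTE at the start of a page
    return 1.0 if y == "HEADNOTE" and x["page_status"] == "PAGESTART" else 0.0

features, weights = [f1, f2], [1.2, 5.4]

def unnormalized_score(ys, xs):
    # exp of the sum over positions of sum_k theta_k * f_k(y_n, y_{n-1}, x_n)
    total = 0.0
    for n, (y, x) in enumerate(zip(ys, xs)):
        y_prev = ys[n - 1] if n > 0 else None
        total += sum(w * f(y, y_prev, x) for w, f in zip(weights, features))
    return math.exp(total)

xs = [{"page_status": "PAGESTART"}, {"page_status": "PAGEIN"}]
print(unnormalized_score(["HEADNOTE", "BODY"], xs))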

3.4 Evaluation Metrics

In classification problems it is crucial to be able to analyze your results analytically. This is especially important for non-binary classification problems, as you want to determine the performance of your model on each class individually. In this section I will give a brief explanation of the most common metrics.

3.4.1 Confusion Matrix

For a binary classification problem the confusion matrix can be presented as in table 3.1. The diagonal of the confusion matrix corresponds to the correct predictions of each class, which in the binary example are the True Positives (TP) and the True Negatives (TN). Negatives falsely classified as positives are called False Positives (FP), and positives falsely classified as negatives are called False Negatives (FN). The confusion matrix can easily be extended to present the result of a classification problem with more than two classes.

Table 3.1: Confusion matrix for binary classification.

                     Predicted Positive     Predicted Negative
Actual Positive      True Positive (TP)     False Negative (FN)
Actual Negative      False Positive (FP)    True Negative (TN)

In the binary problem, a glance at the confusion matrix may be sufficient to determine the performance of your model. However, this might not be intuitive when extending to multiple classes. Therefore it is smart to define a few more evaluation metrics. The following terms are often used to understand the performance of a classification model:

Accuracy

Accuracy may be the most intuitive metric, as it describes the total number of correct predictions divided by the total number of predictions, and is defined as

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.$$

Precision

Precision is the number of correct predictions of a class divided by the total number of predictions of that class. In other words, it is a measurement of how many of the predictions the model made of a specific class were correct.

$$\text{Precision} = \frac{TP}{TP + FP}$$

Recall

Recall is a measurement of how many instances of a specific class were successfully classified.

$$\text{Recall} = \frac{TP}{TP + FN}$$

F1-score

The F1-score does not bring new information compared to the previous metrics; however, it is a useful tool for analyzing precision and recall with one number. The F1-score is the harmonic mean of precision and recall, where a score of 1 is the result of perfect recall and precision.

$$F_1 = \frac{2}{\text{precision}^{-1} + \text{recall}^{-1}} = 2\,\frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

All these metrics can easily be applied to a classification problem with more than two classes. The dimension of the confusion matrix corresponds to the number of classes, where the diagonal will contain the true positives of each class. Table 3.2 shows the confusion matrix for a classification problem with four classes.

Table 3.2: Confusion matrix for classification with four classes.

                   Predicted
                   Class A       Class B       Class C       Class D
Actual  Class A    TP_classA     E_{1,2}       E_{1,3}       E_{1,4}
        Class B    E_{2,1}       TP_classB     E_{2,3}       E_{2,4}
        Class C    E_{3,1}       E_{3,2}       TP_classC     E_{3,4}
        Class D    E_{4,1}       E_{4,2}       E_{4,3}       TP_classD

The accuracy, precision and recall can be calculated for the confusion matrix with four classes as well. In the multi-class case, precision and recall are calculated class-wise as

$$\text{Precision}_{\text{class A}} = \frac{TP_{\text{class A}}}{TP_{\text{class A}} + E_{2,1} + E_{3,1} + E_{4,1}}$$

$$\text{Recall}_{\text{class A}} = \frac{TP_{\text{class A}}}{TP_{\text{class A}} + E_{1,2} + E_{1,3} + E_{1,4}},$$

where the F1-score for each class is the harmonic mean of the corresponding precision and recall.
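As an illustration (my own sketch with made-up numbers), the class-wise metrics can be read straight off a confusion matrix laid out as in Table 3.2:

import numpy as np

# Rows are actual classes, columns are predicted classes.
cm = np.array([[50,  2,  1,  0],
               [ 3, 40,  4,  1],
               [ 0,  5, 30,  2],
               [ 1,  0,  2, 25]])

accuracy = np.trace(cm) / cm.sum()                  # correct / total
precision = np.diag(cm) / cm.sum(axis=0)            # TP / predicted per class
recall = np.diag(cm) / cm.sum(axis=1)               # TP / actual per class
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(accuracy, precision, recall, f1)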


Chapter 4 Data process

This chapter describes the data used in the project and how it was acquired. Most of the time spent on this project was related to the annotation of training data. Section 4.1 describes the pre-processor that was used on the PDF documents. Section 4.2 shows what a typical procurement looks like and describes the annotation process. In section 4.3 the Python libraries that were used to implement the machine learning models are described.


4.1 Pdfalto

Pdfalto [4] is an open-source software developed in C for parsing PDF documents. In this thesis project pdfalto is used as the main tool for processing PDF documents. The first version of pdfalto was officially released on June 12th 2018, and the project started as a fork of pdf2xml [13]. Pdfalto was developed into its own project with modifications for robustness, additional features and an enhanced output format. Pdfalto processes PDF documents into XML files based on a format called ALTO.

4.1.1 Analyzed Layout and Text Object

ALTO, short for Analyzed Layout and Text Object, is a structured way of storing information from a document, with enough detail that the original document can be reconstructed. It is useful for storing layout and content information related to an object from a text document. This is the output format of pdfalto-processed documents, and its enhanced format opens up innovative feature extraction options. In machine learning applications, understanding your data is essential for achieving success, which makes the feature engineering process on ALTO XML files the core of this project. Figure 4.1 shows the ALTO structure; it consists of three major sections as children of the root <alto> element: [14]

• <Description>

• <Styles>

• <Layout>

<Description> contains metadata related to the ALTO file itself and information about how it was created. This structure is visualized in Figure 4.1. The child <sourceImageInformation> holds the path to the folder where the images extracted from the document are stored. The <Styles> section has one child, <TextStyle/>, which contains the different fonts used in the document and the corresponding font specifications. Finally, the <Layout> section contains the extracted information of the document, divided into pages.


<?xml version="1.0"?>
<alto>
  <Description>
    <MeasurementUnit/>
    <sourceImageInformation/>
    <Processing/>
  </Description>
  <Styles>
    <TextStyle/>
  </Styles>
  <Layout>
    <Page>
    </Page>
  </Layout>
</alto>

Figure 4.1: ALTO xml structure.

4.1.2 Processed procurement

The tree structure shown in Figure 4.1 is the tip of the iceberg; the details come within the <Page> branch. The full tree structure from <Layout> is shown in Figure 4.2. The children of <Layout> are <Page>, <TextBlock>, <TextLine> and <String>. The least intuitive of these is the <TextBlock>, which is a customized feature in pdfalto that aims to capture related <TextLine> elements. A <TextBlock> can be understood as a paragraph. The <TextBlock> opens up new features in the CRF model, but also has limitations. These advantages and disadvantages are discussed in further detail in section 6.2. <TextLine> is the most essential part of pdfalto, as this is where the information in the document is stored. Each row of words in the document is defined as a <TextLine> with a position and size. TextLines are described in further detail in section 4.2.3. The <TextLine> then contains all words of the row, represented as <String> elements. The attributes of the <String> are its position, size, font and the word itself under the key "CONTENT". Information regarding the different fonts is stored in the <Styles> branch.


<Layout>
  <Page ID="Page1" PHYSICAL_IMG_NR="1" WIDTH="595.320" HEIGHT="841.920">
    <PrintSpace>
      <TextBlock ID="p1_b1" HPOS="433.200" VPOS="40.5529" HEIGHT="14.4612" WIDTH="123.954">
        <TextLine WIDTH="123.954" HEIGHT="14.4612" ID="p1_t1" HPOS="433.200" VPOS="40.5529">
          <String ID="p1_w1" CONTENT="INTRODUKTION" HPOS="433.200" VPOS="40.5529" WIDTH="123.954" HEIGHT="14.4612" STYLEREFS="font0"/>
        </TextLine>
      </TextBlock>
      ...

Figure 4.2: Layout XML tree structure.
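To make this structure concrete, the following is a small sketch (my own; the file name is hypothetical, and depending on the pdfalto version the elements may sit in an XML namespace that must be prefixed to the tag names) of how TextLines and their attributes can be read with Python's standard library:

import xml.etree.ElementTree as ET

tree = ET.parse("procurement.xml")  # an ALTO file produced by pdfalto
root = tree.getroot()

# Walk Layout -> Page -> PrintSpace -> TextBlock -> TextLine and print
# the layout attributes together with the reconstructed line of text.
for line in root.iter("TextLine"):
    words = [s.get("CONTENT") for s in line.iter("String")]
    print(line.get("VPOS"), line.get("HPOS"), line.get("HEIGHT"), " ".join(words))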

4.2 Procurement Data

The Swedish procurements are publicly available and were acquired by Tendium before the start of this project. The amount of data used to train and evaluate the model has increased as frequently occurring formats have been discovered. The public procurements span many industries, but this project has been limited to documents from the cleaning services. Table 4.1 shows the total number of documents available in each format as well as the total number of TextLines in each format.

Table 4.1: Total number of documents and TextLines available in this project.

Format Documents TextLines

Tendsign 853 261164

Eavrop 139 23338

Police 42 33279

Kommers 12 5666

Upph. Center 21 10636

Rikshem 8 4815

The idea is to find frequent document formats to create a simpler and more accurate annotation process, as the unprocessed documents are unlabeled. This procedure allows us to use a rule-based deterministic system to label the documents processed by pdfalto. At the current state of the project, six different document formats have been chosen with the expectation that these formats are sufficient to generalize a procurement from the cleaning services. Figures 4.3 and 4.4 show what the typical front page looks like in these formats.

(a) Tendsign procurement format. (b) Eavrop procurement format.

(c) Police procurement format. (d) Kommers procurement format.

Figure 4.3: The featured document formats: Tendsign, Eavrop, Police Au- thority and Kommers.


(a) Upph. center procurement format. (b) Rikshem procurement format.

Figure 4.4: The featured document formats: Upphandlingscenter and Rik- shem.

4.2.1 High-level segmentation labels

The goal of the high-level segmentation is to extract some information as well as to divide the document into two main classes. The idea is to extract the headnotes and the footnotes and segment the document into COVER and BODY. The COVER corresponds to the front page, table of contents etc. The BODY class is the less specific of the two and will contain the rest of the document, which will need further work to retrieve the desired information. In the non-sequential model with features of continuous variables there were two more classes, i.e. headline and sub headline. The idea with these classes was to be able to connect paragraphs of text to their related headlines and sub headlines. In the sequential model these classes were integrated into BODY, and that segmentation was moved one level down. Figure 4.5 is a visualization of which labels are assigned to different parts of a document page.


(a) Eavrop example. (b) Tendsign example.

Figure 4.5: Illustration of high-level labels in Eavrop and Tendsign.

4.2.2 Annotation of training data

The annotation process was divided into two parts. First, a data set was annotated using five continuous features based on the layout. This set was used to train a Random Forest model to see how well it could perform the segmentation task. The second part of the project was to annotate a dataset for a sequential model, namely a linear-chain CRF. In this set we could only allow categorical features, which means that the feature engineering process needed to be extended.

4.2.3 A line-based approach

In text classification tasks it is common to create feature vectors xn for each word of the text and define each feature in relation to this word. In this project we have chosen a different approach based on the same idea. For our segmentation task we have chosen an approach similar to the one used in GROBID [9] for high-level segmentation. Instead of extracting features from each word in the text, we extract features from each line of text in the document. Pdfalto makes this approach possible, as pre-processed documents in the ALTO format keep layout information such as position and size, and keep all words belonging to a line in the same object (TextLine).


4.2.4 Features of continuous variables

The first annotation process was done with five layout features, where each datapoint corresponds to one line from the document with the following features:

• dist_nxt - vertical distance to next line

• dist_prev - vertical distance to previous line

• h_pos - horizontal position

• v_pos - vertical position

• height - height of biggest letter in the line

In addition, each TextLine was assigned one of the labels: Cover, Headline, Subheadline, Content, Headnote, Footnote. The goal was to train a classifier that can successfully segment a document accordingly, which opens up for numerous NLP and NER applications. To be able to compare documents of different formats it is important to normalize the data. The normalization was done based on the largest and smallest value of each feature for each document. This gives each feature a value between zero and one, which makes different formats comparable for the model. This type of normalization is called min-max feature scaling and is defined as [15]

$$X' = \frac{X - X_{\min}}{X_{\max} - X_{\min}}, \tag{4.2.1}$$

which guarantees that all values lie in the interval [0, 1], provided the maximum and minimum values are known.
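A minimal sketch of this normalization (my own illustration with made-up values):

import numpy as np

def min_max_scale(column):
    # Min-max feature scaling, eq. (4.2.1): maps the values of one
    # feature within one document onto [0, 1].
    return (column - column.min()) / (column.max() - column.min())

v_pos = np.array([40.5, 120.0, 300.2, 790.8])  # vertical positions of TextLines
print(min_max_scale(v_pos))  # [0.0, 0.106, 0.346, 1.0]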

4.2.5 Categorical Features

For the linear-chain CRF we use categorical features based on both lexical and layout information. I will not explain all of the features used in the CRF; however, I will present a subset to give an understanding of the concept of categorical features and how they are applied to the TextLine-based feature vectors. Each category of a non-binary feature is an exclusive indicator that is either True or False. This means that the actual number of features used in the model is the sum of the numbers of categories across all features.


Table 4.2: Example features with fixed categories.

Feature           Categories
bold              binary
capitalisation    INITCAP, ALLCAP, NOCAP
digit             NODIGIT, CONTAINDIGIT, ALLDIGIT
italic            binary
page_status       PAGESTART, PAGEIN, PAGEEND

Table 4.2 shows some of the features for which all categories are known before applying them to the data. bold and italic are examples of features with binary categories, i.e. they are either True or False; in this dataset this corresponds to whether the first word in the TextLine is bold or italic. capitalisation is designed to separate words that have the first letter capitalised, all letters capitalised or no letters capitalised. digit uses the same logic as capitalisation for digits in the first word of the TextLine. page_status is designed to capture the first and last TextLine on each page.
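As an illustration of how such categories might be assigned (my own reconstruction; the exact rules used in the thesis are not spelled out), consider the first word of a TextLine:

def capitalisation(word: str) -> str:
    # Categories from Table 4.2
    if word.isupper():
        return "ALLCAP"
    if word[:1].isupper():
        return "INITCAP"
    return "NOCAP"

def digit(word: str) -> str:
    if word.isdigit():
        return "ALLDIGIT"
    if any(ch.isdigit() for ch in word):
        return "CONTAINDIGIT"
    return "NODIGIT"

print(capitalisation("INTRODUKTION"), digit("3(20)"))  # ALLCAP CONTAINDIGIT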

Table 4.3: Example features with unfixed categories.

Feature               Categories
punctuation_profile   all special characters in the TextLine
string_1              first word in the TextLine

Table 4.3 shows examples of features whose categories are created during annotation of the data. punctuation_profile is the sequence of all special characters occurring in a TextLine. This feature is especially helpful in classifying FOOTNOTE, which often contains dates or page numbers in different configurations. Table 4.4 shows examples of what the punctuation_profile looks like.

Table 4.4: Example of punctuation_profile.

TextLine content    punctuation_profile
2020-04-20          "- -"
page 3/20           "/"
3(20)               "()"
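A sketch of how such a profile could be computed (my own reconstruction; exactly which characters count as special, and how they are joined, is a design choice not specified in the thesis):

def punctuation_profile(line: str) -> str:
    # Keep the non-alphanumeric, non-space characters in order of occurrence.
    return "".join(ch for ch in line if not ch.isalnum() and not ch.isspace())

print(punctuation_profile("2020-04-20"))  # --
print(punctuation_profile("page 3/20"))   # /
print(punctuation_profile("3(20)"))       # ()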

Figure 4.6 shows all of the features used in the sequential model. Many of these features are the same features used by GROBID.


Figure 4.6: Example of a feature vector of categorical features.

4.2.6 Note on labels

Among the data acquired for training, three of the six document formats have both a headnote and a footnote.

Table 4.5: Visualisation of which labels (headnote, footnote, cover, body) each document format (Tendsign, Eavrop, Police, Kommers, Upphandlingscenter, Rikshem) contains. Green means yes and red means no. [Colour-coded cells not reproducible in text.]

4.3 Computer implementation

The work in this project is done using Python. Pdfalto and GROBID are the two open-source tools that have been used. Pdfalto, written in C, is the pre-processor and the basis for annotation and feature engineering. GROBID is an implicit part of this project, as its processing approach and feature engineering are reused here. The two main Python libraries that have been used for building the models are sklearn.ensemble, where RF is implemented, and sklearn_crfsuite, an extension to scikit-learn where a linear-chain CRF is implemented.


4.3.1 Sklearn Random Forest Classifier

The Sklearn Random Forest Classifier is part of the sklearn.ensemble library. In my implementation the default values were used, which is a forest of 100 estimators where the number of features considered at each split is the square root of the total number of features available. This means that each split considers two to three features.
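A sketch of the corresponding setup (my own; the feature matrix here is random stand-in data, with one row of normalized layout features per TextLine):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 5))          # five layout features per TextLine
y = rng.integers(0, 4, size=200)  # stand-in segment labels

# Defaults as described above: 100 trees, Gini impurity, and
# sqrt(n_features) candidate features considered at each split.
clf = RandomForestClassifier(n_estimators=100, criterion="gini", max_features="sqrt")
clf.fit(X, y)
print(clf.predict(X[:5]))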

4.3.2 Sklearn crfsuite

There are numerous tools already developed for conditional random fields; sklearn_crfsuite was chosen for this project. It is a Python library with the familiar fit() and predict() functions that are used across scikit-learn libraries. There are a few training options, namely the type of optimizer and regularization. Limited-Memory BFGS was used as the optimizer with L2 regularization. Tests were also done with L1 regularization, which uses the Orthant-Wise Limited-memory Quasi-Newton optimizer instead. The CRF models in chapter 5 were trained using Limited-Memory BFGS and L2 regularization, with the regularization coefficient (c2 in sklearn_crfsuite) set to 0.1.
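A sketch of the corresponding training call (my own; the feature dicts are illustrative stand-ins, since sklearn_crfsuite expects one list of feature dicts per document, aligned with one list of labels):

import sklearn_crfsuite

# One document = one sequence of TextLines, each line a dict of
# categorical features; the label sequence is aligned line by line.
X_train = [[{"page_status": "PAGESTART", "bold": True, "string_1": "INTRODUKTION"},
            {"page_status": "PAGEIN", "bold": False, "string_1": "Den"}]]
y_train = [["HEADNOTE", "BODY"]]

crf = sklearn_crfsuite.CRF(
    algorithm="lbfgs",  # Limited-Memory BFGS optimizer
    c2=0.1,             # L2 regularization coefficient
    max_iterations=100,
)
crf.fit(X_train, y_train)
print(crf.predict(X_train))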


Chapter 5 Results

This chapter shows the performance of the models that were tested during the project. In section 5.1 the differences between the annotated data for the non-sequential and sequential models are presented. Section 5.2 shows the results from the RF model. Finally, the results from the CRF model using selected formats are presented in section 5.3, followed by the results from a CRF model trained on all available formats in section 5.4.


5.1 Differences in annotated training data for the non-sequential and sequential model

The presentation of the results from the RF and CRF models differs, as the training and testing processes were done with different sets of final labels. In the RF model, DocInfo is equivalent to COVER in the CRF; the label had its name changed for logical reasons. Content, headline and sub headline correspond to a segmentation of BODY. In the RF model this segmentation was done in the same model, whereas the goal for the sequential model was to first do a high-level segmentation with BODY, COVER, FOOTNOTE and HEADNOTE, and then create a new model with different features that can classify BODY into new segments.

5.2 Non-sequential labeling

The result from the RF model is visualized as a confusion matrix in Figure 5.1. The model is trained using a forest of 100 trees with the Gini index as impurity measurement. Each class has both a recall and a precision above 95%. These results were the first evaluation of the training data and were mainly used to see how well the features generalized the problem we are trying to solve.

Figure 5.1: Confusion matrix for Random Forest classifier trained and tested on Tendsign and Eavrop.


The confusion matrix in Figure 5.2 shows the results from an RF model that was trained on the layout of Tendsign documents and then tested on Eavrop. As one can see, the performance drops when the model has not seen Eavrop documents before segmentation. It can perform well on a few of the labels, but in general it has a hard time performing the segmentation using only layout information. In the next section the CRF is used, which allows us to use more complex and specific features.

Figure 5.2: Confusion matrix for Random Forest classifier trained on Tend- sign and tested on Eavrop.

5.3 Sequential labeling - selective formats

The results presented in this section are from a CRF trained on Tendsign and Rikshem and tested on Eavrop. The CRF model was trained using Limited-Memory BFGS optimization and L2 regularization with coefficient 0.1. The number of TextLines from each format is presented in table 5.1 below.

Table 5.1: Data used in the CRF with selected formats.

TextLines train TextLines Test

Tendsign 21357 0

Rikshem 4815 0

Eavrop 0 14006


With 4 classes there are 16 possible state transitions. Table 5.2 shows the weights assigned to each transition, where a positive number indicates that a transition is more likely and a negative value indicates that the transition is less likely. The highest weights are found on the diagonal of the table, which means that each state is most likely to transition into itself. The transitions with the highest assigned weights when the state changes are FOOTNOTE→HEADNOTE, BODY→FOOTNOTE, HEADNOTE→BODY and HEADNOTE→COVER. This tells us how each state is most likely to change.

Table 5.2: State transition weights.

When evaluating the linear-chain CRF it is interesting to look at what the model has learned. In table 5.3 some of the most positive and negative features are presented with their corresponding weights. A positive value means that the category of the related feature has a high impact on the label. Negative values indicate that the corresponding label is less likely given the feature.

Table 5.3: The highest positive and negative weights for BODY, FOOT- NOTE and HEADNOTE.

Class Feature Weight

BODY fontsize_status:HIGHERFONT +2.340

BODY page_status:PAGESTART -3.617

FOOTNOTE string_2:/ +3.876

FOOTNOTE digit:NODIGIT -2.986

HEADNOTE page_status:PAGESTART +5.376

HEADNOTE page_status:PAGEEND -4.852


From table 5.3 one can see that the model has learned that a HEADNOTE is very likely to occur as the first TextLine of a page and not likely to occur as the last TextLine of the page, which is reasonable. For FOOTNOTE, the second string often contains a "/", and when the TextLine does not contain a digit it is less likely to be a FOOTNOTE.

It is worth investigating how the linear-chain CRF performs on a document format that it is not explicitly trained on. The result in Figure 5.3 shows a model that has been trained and tested according to table 5.1. The result from this evaluation shows that FOOTNOTE and HEADNOTE have a precision close to 100% and a recall above 50%. COVER is classified with a precision close to 80% and a recall of 40%. BODY has close to 100% recall but a lower precision. This means that the model predicts BODY too often.

Figure 5.3: Precision and recall of each class label from training with Tend- sign, Rikshem and testing with Eavrop.

When the model is tested on the same document formats as it is trained on, both precision and recall are above 95%. This result is shown in Figure 5.4. The data used in this model is shown in table 5.4.


Table 5.4: Data used in the CRF with selected formats where Eavrop is included in training.

TextLines train TextLines Test

Tendsign 21357 0

Rikshem 4815 0

Eavrop 12055 14006

Figure 5.4: Precision and recall of each class label from training with Tend- sign, Rikshem, Eavrop and testing with Eavrop.

5.4 Sequential labeling - all formats

In this section we will see how the model is affected when all the acquired formats are used to train the model. A summary of the number of TextLines used from each format is presented in table 5.5.


Table 5.5: Data used in the CRF with all formats where Eavrop is not included in training.

TextLines train TextLines Test

Tendsign 21357 0

Rikshem 4815 0

Eavrop 0 14006

Kommers 4252 0

Police 19467 0

In Table 5.6 one can see how the state transition weights change when more formats are included in the training data. Figure 5.5 shows that the overall precision and recall decrease when more formats are included in the training, compared to Figure 5.3 where the model is trained on Tendsign and Rikshem. The most significant change is that the transition FOOTNOTE→HEADNOTE is less likely, which can explain the lower precision on HEADNOTE.

Table 5.6: State transition weights with all formats except Eavrop.


Figure 5.5: Precision and recall of each class label from training with all formats except Eavrop and testing with Eavrop.

The following results are based on a CRF trained on the data shown in table 5.7. As we saw in the previous section, the model performs very well when Eavrop is included in the training data. The results from training with all formats and testing with Eavrop are presented in table 5.8 and Figure 5.6.

Table 5.7: Data used in the CRF with all formats where Eavrop is included in training.

TextLines train TextLines Test

Tendsign 21357 0

Rikshem 4815 0

Eavrop 12055 14006

Kommers 4252 0

Police 19467 0


Table 5.8: State transition weights with all formats including Eavrop.

Figure 5.6: Precision and recall of each class label from training with all document formats and testing with Eavrop.


Chapter 6 Evaluation

In this chapter the models used in this project are evaluated. Section 6.1 covers the general discussion of the results, and section 6.2 is an evaluation of the training data that was produced through the use of pdfalto and the subsequent annotation process.


6.1 Discussion - models

The annotation process for the RF model was limited to features based on the layout of two document formats. What we wanted to achieve with this model was an evaluation of whether the layout information of a procurement can be used on its own as features for document segmentation. What we saw was that the layout features contain valuable information for a classifier. The model was able to segment the documents into the desired classes when it was trained and tested on the same formats. However, the results in Figure 5.2 show that the model is not able to separate the classes when it was trained on Tendsign and tested on Eavrop. The next step was to include lexical features. Instead of investing more time in the RF model we decided to continue with a sequential model, i.e. the linear-chain CRF. This way we used the knowledge acquired from the RF model to apply a linear-chain CRF, which is preferred when working with sequential data.

To evaluate the features used in the sequential model, and the model itself, one can compare Figures 5.3 and 5.4. These models were trained on the data presented in tables 5.1 and 5.4. The difference between the two results is whether the model was trained on Eavrop or not. The main challenge for the model that is not trained on Eavrop is identifying HEADNOTE and COVER. It detects half of the HEADNOTEs and COVERs; however, there are almost no false positives of these classes. BODY has a higher recall than precision, which means that the model prefers this class, causing many false positives of BODY. This indicates that the training data is not general enough for the model to identify segments of a document format it is not explicitly trained on. Further research on the feature engineering process and exploration of more document formats could improve the generality of the model.

It is interesting to see what happens when the model is trained on all available formats. Figures 5.5 and 5.6 show how the results change when documents from Police and Kommers are added to the training data. The difference between the two results is the same as above, i.e. whether Eavrop is included in the training data. Some state transition weights change, and the model is left with only four positive weights. The biggest difference is that the transitions BODY→FOOTNOTE and FOOTNOTE→HEADNOTE become neutral. One explanation for this change is that Kommers does not have a HEADNOTE and Police does not have a FOOTNOTE, which means that these transitions do not occur in the added data and therefore decrease the performance of the model when it segments documents from Eavrop.
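The state transition weights discussed here can be inspected directly on a fitted model. The sketch below assumes a sklearn-crfsuite model like the one in the earlier sketch, whose transition_features_ attribute maps (from-label, to-label) pairs to their learned weights.

```python
# Sketch: reading out learned state transition weights from a fitted
# sklearn-crfsuite model (`crf` as trained in the earlier sketch).
from collections import Counter

transitions = Counter(crf.transition_features_)

# Sorted from the strongest positive weight to the strongest negative one,
# e.g. ("BODY", "FOOTNOTE") -> weight.
for (label_from, label_to), weight in transitions.most_common():
    print(f"{label_from} -> {label_to}: {weight:.3f}")
```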

This is a great challenge when it comes to generalising documents that follow different structures, and it needs to be studied further to create more robust models.

6.2 Discussion - data

The annotation process was accomplished deterministically. This means that the layout of a few documents from each format was assumed to be the same as that of the rest of the documents. To ensure the quality of the training data, further tests need to be conducted, as there was not enough time during the project to do this. Incorrect annotation of the training data could explain why the model was unsuccessful in segmenting documents it was not explicitly trained on.

Some of the features used in the sequential model are directly dependent on the accuracy of pdfalto, namely page_status and block_status. As a page of the document is easily defined, there is no uncertainty in using the page information from pdfalto to define the feature page_status, which indicates whether a TextLine is the first line of a page, the last, or somewhere in between. The same logic is applied to each block produced by pdfalto; however, a block is more loosely defined than a page. A block is normally a set of TextLines located close to each other, the simplest example being a paragraph. The block_status feature will therefore in many cases indicate the start and end of a paragraph. However, not all blocks in a procurement are simple paragraphs. When the documents are less structured and follow less logical patterns, this feature becomes heavily reliant on the performance of pdfalto. To gain a deeper understanding of this feature, more evaluations of the XML-ALTO data need to be done.
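To make the two features concrete, the sketch below shows one way to derive page_status and block_status from pdfalto's ALTO output using only the standard library; the ALTO namespace URI and the file name are assumptions, and the status vocabulary (START/IN/END/SINGLE) is illustrative rather than the project's exact encoding.

```python
# Sketch: deriving page_status and block_status from an ALTO file produced
# by pdfalto. Namespace URI, file name and status names are assumptions.
import xml.etree.ElementTree as ET

ALTO_NS = "{http://www.loc.gov/standards/alto/ns-v3#}"  # assumed ALTO version

def status(index, count):
    """START for the first line, END for the last, SINGLE if both, else IN."""
    if count == 1:
        return "SINGLE"
    if index == 0:
        return "START"
    if index == count - 1:
        return "END"
    return "IN"

line_features = []
tree = ET.parse("procurement.xml")  # hypothetical ALTO file
for page in tree.iter(f"{ALTO_NS}Page"):
    page_lines = page.findall(f".//{ALTO_NS}TextLine")
    for block in page.iter(f"{ALTO_NS}TextBlock"):
        block_lines = block.findall(f"{ALTO_NS}TextLine")
        for i, line in enumerate(block_lines):
            line_features.append({
                "page_status": status(page_lines.index(line), len(page_lines)),
                "block_status": status(i, len(block_lines)),
            })
```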

Some of the features presented in Figure 4.6, like string_1 and string_2, may not seem logical at first glance. These features represent the first two words of a TextLine. The reason for using the first two words of each line as lexical features is based on benchmarks of GROBID. Multiple tests were conducted using different numbers of strings from each TextLine, using up to the four first words of each line, in addition to trials with the last words of each line. One particular advantage of using the first word of each line is that its capitalisation can indicate that the TextLine starts a new paragraph, a new headline etc. This information alone does not tell everything about the state of the TextLine, but combined with the other features it brings valuable information, as sketched below.
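The feature names string_1 and string_2 follow Figure 4.6, while the helper itself is an illustrative reconstruction rather than the project's code.

```python
# Sketch: string_1/string_2 lexical features, i.e. the first two words of a
# TextLine plus a capitalisation indicator for the first word.
def lexical_features(text_line: str) -> dict:
    words = text_line.split()
    return {
        "string_1": words[0].lower() if len(words) > 0 else "",
        "string_2": words[1].lower() if len(words) > 1 else "",
        # A capitalised first word can signal a new paragraph or headline.
        "string_1_capitalised": bool(words) and words[0][0].isupper(),
    }

print(lexical_features("Anbudets giltighetstid"))
# {'string_1': 'anbudets', 'string_2': 'giltighetstid', 'string_1_capitalised': True}
```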


Chapter 7 Conclusion

This thesis aimed to investigate whether a machine learning model can be trained to parse public procurement documents in PDF format into structured data of appropriate segments. This work should serve as a basis for increasing the knowledge of different procurement formats and open up for more specific models that can perform NER on the structured data. Both models evaluated were able to perform with accuracies close to 100%; however, the performance dropped drastically when the model had not been explicitly trained on the document format it was tested on. The model was still able to classify FOOTNOTEs correctly despite never having seen a document from Eavrop.

This confirms that the model was able to generalise a procurement document to a certain extent. However, further work needs to be done to improve the performance on HEADNOTE and COVER. This can be achieved by annotating more data or by developing new features based on domain knowledge from procurements.

7.1 Future work

The most promising part of this project is the feature engineering process. I believe that further analysis of procurements can bring new insights into which features should be used, or even introduce new ones. One interesting feature could be a dictionary feature that is true when specific words occur in the text, as sketched below. This would require a deeper understanding of the procurements and of which types of words occur in the different segments.
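A minimal sketch of what such a dictionary feature could look like is given below; the word list is a made-up example and not derived from actual procurement data.

```python
# Sketch: a boolean dictionary feature that fires when a TextLine contains
# any word from a domain word list. The word list here is hypothetical.
FOOTNOTE_WORDS = {"sid", "sida", "utskrivet"}  # assumed indicator words

def dictionary_feature(text_line: str, vocabulary: set) -> bool:
    return any(word.lower().strip(".,:") in vocabulary
               for word in text_line.split())

print(dictionary_feature("Utskrivet: 2020-05-12 Sida 3 av 41", FOOTNOTE_WORDS))
# True
```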

Time is the limiting factor when it comes to exploring different document formats. During this thesis we were continuously searching our data for frequent document formats upon which we could build a deterministic model for annotation. As our knowledge within the domain of procurement documents develops, there will be a natural increase in the diversity of the annotated data. The annotation of high-quality data is crucial for the success of segmenting procurement documents.


Bibliography

[1] C. Sutton and A. McCallum, "An introduction to conditional random fields." https://arxiv.org/pdf/1011.4088v1.pdf. [Viewed on May 21st 2020].

[2] Government Offices of Sweden. https://www.government.se/government-policy/. [Viewed on February 27th 2020].

[3] Upphandlingsmyndigheten. https://www.upphandlingsmyndigheten.se/verktyg/statistik-om-offentlig-upphandling/. [Viewed on February 27th 2020].

[4] P. Lopez and A. Azhar, "pdfalto." https://github.com/kermitt2/pdfalto.

[5] C. Wibring, "En intelligentare upphandling? – en utredning om möjligheten att använda artificiell intelligens som hjälpmedel vid upphandling." http://lup.lub.lu.se/student-papers/record/9000582, 2019. Student Paper.

[6] M. Johnson, "How the statistical revolution changes (computational) linguistics." https://www.aclweb.org/anthology/W09-0103.pdf. [Viewed on June 4th 2020].

[7] "Moore's law." https://en.wikipedia.org/wiki/Moore%27s_law. [Viewed on June 5th 2020].

[8] F. Peng and A. McCallum, "Accurate information extraction from research papers using conditional random fields." https://www.aclweb.org/anthology/N04-1042.pdf. [Viewed on June 4th 2020].

[9] "GROBID." https://github.com/kermitt2/grobid, 2008–2020.

[10] A. Y. Ng and M. I. Jordan, "On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes." http://ai.stanford.edu/~ang/papers/nips01-discriminativegenerative.pdf. [Viewed on May 6th 2020].

[11] R. Mlola, "Entropy, information gain, and Gini index; the crux of a decision tree." https://blog.clairvoyantsoft.com/. [Viewed on May 20th 2020].

[12] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition." https://www.ece.ucsb.edu/Faculty/Rabiner/ece259/Reprints/tutorialonhmmandapplications.pdf. [Viewed on May 22nd 2020].

[13] H. Déjean, "pdf2xml." https://sourceforge.net/projects/pdf2xml/.

[14] ALTO. http://www.loc.gov/standards/alto/techcenter/structure.html. [Viewed on February 20th 2020].

[15] "Normalization (statistics)." https://en.wikipedia.org/wiki/Normalization_(statistics). [Viewed on May 26th 2020].
