
Contextualising government reports using Named Entity Recognition

Almir Aljic, Theodor Kraft

KTH Royal Institute of Technology
First cycle, 15 credits
Stockholm, Sweden 2020

Abstract -- The science of making a computer understand and process text, natural language processing, is a topic of great interest among researchers. This study aims to further that research by comparing the BERT algorithm with classic logistic regression for identifying names of public organizations. The results show that BERT outperforms logistic regression on this task, using data consisting of Swedish public state inquiries and reports.

Furthermore, a literature study was conducted to explore how a system for NER can be implemented in the management of an organization. The study found that there are many ways of carrying out such an implementation, but suggests three main areas that should be focused on to ensure success: recognising the right entities, trusting the system, and presentation of data.

Sammanfattning -- The science of how computers can understand and work with free text, natural language processing, is a field that has become popular among researchers. This thesis aims to extend that field by comparing BERT with logistic regression for identifying mentions of Swedish public agencies through NER. BERT shows better results than the logistic regression model at identifying the names of agencies in texts from public state inquiries and reports.

A literature study was also conducted to investigate how a system for NER can be implemented in an organization. The study showed that there are several ways of doing this, but primarily suggests three areas that should be focused on for a successful implementation: use of the right entities, trust in the system, and presentation of data.

Index Terms: Machine learning, Natural language processing, Logistic regression, BERT

I. INTRODUCTION

A. Background & Goals

This study consists of two closely related parts. The first part compares two models that use natural language processing (NLP) to classify words in text. The second part discusses possible implementations of such models within an organization from the perspective of industrial management. Ultimately, the goal is to provide a protocol for organizations that wish to incorporate NLP into their work; this is done through a literature study.

1) Technical problem

NLP has gained popularity over the past few years thanks to its wide range of potential applications, such as automating various processes in chat bots and service desks, sentiment analysis, and information retrieval from advanced documentation, among others. This study focuses primarily on NLP, specifically Named Entity Recognition (NER), which can identify pre-determined entities (such as person, organization, date, time, zip code) in documents.

The goal of this study is to identify which of two NER models is better at identifying the names of organizations in official governmental reports.

2) Literature Study

The second part of the study aims to identify how an organization can implement machine learning, and mainly NLP, in its management systems. There can be many reasons for wanting to use such a system in an organization, but this study focuses on implementations where NLP is used to contextualize and extract quantifiable data from plain text. The results of this part of the study will help in prioritizing what kind of data is suitable for NLP-guided management and how such methods can and should be implemented in a relevant and effective way.

3) Company

This study is conducted in cooperation with two official government agencies in Sweden: Research Institutes of Sweden (RISE) and the Swedish National Financial Management Authority (Swedish: Ekonomistyrningsverket, ESV). ESV develops efficient financial management for central government agencies, and analyzes and makes forecasts of central government finances. RISE is Sweden's research institute and innovation partner, whose aims are to ensure the competitiveness of the Swedish business community on an international level and to contribute to a sustainable society. Vinnova is Sweden's innovation agency, focused on building Sweden's innovation capacity and contributing to sustainable growth. In 2019, ESV launched the project Resultaten i staten (here abbreviated RiS), financed by Vinnova and collaborated on by researchers from RISE, KTH and Uppsala University. This study is conducted in conjunction with RISE and ESV, as part of the project RiS, in which one of the key challenges is analyzing, with the help of NLP, official reports of the Swedish government.

4) Ethics and implications for society

In recent years, there has been growing concern about the rise of artificial intelligence and machine learning, as it is being used in a variety of sectors for different purposes. The concern lies in the potential for this form of technology to replace or drastically reduce the need for human labor. This sentiment is manifested in an increasing number of media reports on the subject, where a substantial number of articles discuss and analyze which jobs are at low risk of automation. As of February 2020, the Google search "jobs safe from ai" produces 65 million results. According to McKinsey & Company, automation will eliminate a number of jobs but will also in turn create new jobs, often requiring a new set of skills with higher educational requirements [3]. The scope of this study, albeit small in relative terms, is part of the change toward future automation. However, such widespread changes in the labor market will not be a direct result of this study, and the full range of consequences associated with mass automation is not known, which is why the authors do not place any particular emphasis on potential moral concerns.

B. Scientific inquiry

The scientific question that is examined in this study is: Can the NER algorithm BERT identify the names of organizations in a set of official reports of the Swedish government better than a Logistic regression algorithm?

For the part of the study focused on industrial engineering and management, the scientific inquiry is: How can NER and other machine learning systems successfully be implemented into the management system of an organization?

1) Problem definition:

The problem faced in this study is a classification problem where a model has to be built so that it can extract the words in a text that represent public government agencies and companies. This means that the model has to be able to find features within a word that are likely to indicate that said word is in fact a public organization.

One challenge of the study was to find a relevant dataset to use as a baseline and training set for the models. A corpus from Språkbanken was selected which contained tagged entities such as "company" and "government organization". That corpus is, with regard to style of writing, very similar to the dataset of government reports used for the evaluation in this project, which makes it a relevant and useful source for testing the performance of the models against a baseline.

2) Scientific relevance

This study is relevant from a scientific point of view because it compares two different models of NER and helps visualize their differences in construction and performance.

This study is also relevant to companies and organizations interested in creating multi-dimensional NER models in Swedish, capable of identifying a wide range of different entities, in that it may serve as a starting point for further work.

Additionally, the results are of particular importance to ESV as part of RiS, as they are the first step toward graphically and visually representing connections between organizations mentioned in the official reports. Such information may be used to gain a deeper understanding of the general relationship between different organizations, how they relate to each other and how such relations change over time. This sort of information may be relevant for ESV in conducting financial analyses and prognoses on behalf of the Swedish government.

3) Hypothesis

The problem of classifying entities can be summarized in a hypothesis H1. For H1 it is assumed that BERT can solve this problem with a higher F1 score than the logistic regression algorithm. This hypothesis is motivated by BERT previously having shown very impressive results on similar problems [5].

II. THEORY

A. Classification problems

Machine learning has many subfields, one of which is classification of data. It involves placing an observation into one of a definite set of pre-determined categories. Classification is a supervised field of machine learning, meaning that it requires the training data to be correctly labeled according to the available categories. The classification algorithms in this study are perceptron-based [9].

B. Word2Vec

In order to process plain text and store it in the computer's memory, it needs to be encoded in some way. This means that words written with normal characters should be converted into sequences of numbers, allowing every word to be represented as a vector that can be used in computations and compared to other words. One approach to this problem is the Word2Vec model, which represents a way of vectorizing words [10].
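As an illustration of Word2Vec only, the following is a minimal sketch using the gensim library (a library not used in the paper; the toy corpus and parameters are made up for the example):

# Minimal, illustrative Word2Vec sketch with gensim; corpus and settings are hypothetical.
from gensim.models import Word2Vec

corpus = [
    ["skatteverket", "publicerar", "en", "rapport"],
    ["regeringen", "ger", "skatteverket", "ett", "uppdrag"],
]

# Train word vectors of dimension 50 on the tiny corpus above.
model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, epochs=20)

vec = model.wv["skatteverket"]        # dense vector representation of the word
print(vec.shape)                      # (50,)
print(model.wv.most_similar("skatteverket", topn=2))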

Another approach to vectorization is feature extraction, a method where a set of attributes is evaluated for the input. The input can then be represented as a vector where each cell indicates whether the input fulfils that attribute or not.

C. Logistic Regression

Logistic regression is an algorithm commonly used for determining the link between a set of features and a specific outcome, and is one of the fundamental techniques in machine learning. When used for classification, it takes as input a feature vector representing a vectorization of the token to be classified, and a sigmoid function assigns the token to either the desired class or its complement. A training algorithm adjusts the weight each feature is assigned during the calculation of the sigmoid function.

Each token in the text input is represented as a vector of distinguishing features, and the entire text input can be represented as a matrix of those vectors. There are various methods of feature selection, but for the sake of simplicity this will not be discussed further. If x_1 and x_2 are binary features with value 0 or 1, they can be represented as a feature vector (x_1, x_2). Additionally, the following expression can be posited: θ_0 + θ_1 x_1 + θ_2 x_2. By feeding this expression through a sigmoid function, which always returns a value between 0 and 1, it is possible to obtain an output that acts as a prediction of whether a word token (which, remember, is represented by a feature vector) belongs to the positive or the negative class; in this case, positive means "is an organization" and negative means "is not an organization".

Given a large set of training data, each word token can be represented as a feature vector, which means all x-values are in place. Additionally, their sigmoid output, given by the variable y, is also known: y = 0 if a token is not a name, and y = 1 if a token is a name. The θ-values (the weights) remain to be found. This is done through a process called gradient descent, in which a loss function is minimized by computing its partial derivatives with respect to each θ-value and solving for each respective θ; this can be done numerically. There are different gradient descent methods, and depending on the task at hand, the computing power and the time available, it is difficult to say objectively which one is best to use. As gradients are continuously calculated during training, the θ-values are updated by subtracting the gradient (multiplied by a learning rate factor) from θ. Note that the gradient, by definition, indicates the direction of steepest ascent of the loss function. By subtracting the gradient, the opposite direction is taken, the steepest descent, which is also the end goal: to minimize the loss function.
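A minimal from-scratch sketch of this procedure is shown below (illustrative only, not the model used in the study; the toy feature vectors, labels and learning rate are hypothetical):

# Binary logistic regression trained with gradient descent on toy data.
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^(-z)), always between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

# Feature vectors (x1, x2) for four tokens, with a leading 1 for the bias theta_0.
X = np.array([[1, 1, 1],
              [1, 1, 0],
              [1, 0, 1],
              [1, 0, 0]], dtype=float)
y = np.array([1, 1, 0, 0], dtype=float)   # 1 = "is an organization", 0 = "is not"

theta = np.zeros(3)
learning_rate = 0.1

for _ in range(1000):
    predictions = sigmoid(X @ theta)             # current sigmoid output per token
    gradient = X.T @ (predictions - y) / len(y)  # gradient of the cross-entropy loss
    theta -= learning_rate * gradient            # step in the direction of steepest descent

print(theta)                 # learned weights theta_0, theta_1, theta_2
print(sigmoid(X @ theta))    # predicted probability that each token is an organization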

D. BERT

BERT is a language representation model developed in 2018 by researchers at Google AI Language and is a state-of-the-art model for various NLP tasks. As opposed to most previous NLP algorithms, which are sequential, meaning that they read text from left to right (or in reverse), BERT uses a non-sequential approach. Instead of reading text in a directional manner, it uses a so-called bidirectional approach, reading the whole text at once. This can be seen as reading without direction, as the algorithm uses all of a word's surroundings to learn its context. When used for NER, the algorithm receives a text and is required to label the entities it finds. For training, an output vector representing each token is used for evaluation [5].

E. Transformers & Training BERT

In order to gather contextual information about the relationships between different words in a given sentence, an attention mechanism such as the transformer can be used, and is used by BERT. The most basic form of a transformer consists of an encoder, capable of representing text input in a machine-readable manner, and a decoder, which generates a prediction for the NER task. The most appropriate way of representing text input for a machine to process is to transform it into vectors. These vector representations of text input include tokens that indicate the start and end of sentences. Additionally, each word token (the representation of an actual word in the sentence) is marked with the sentence it belongs to (sentence 1, sentence 2, and so on) and includes a positional marker indicating exactly where the word token appears in the sentence.
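A sketch of this input representation using the Hugging Face transformers tokenizer is shown below; the Swedish checkpoint name is an assumption made for the example and is not taken from the paper:

# Illustrative encoding of a sentence pair into BERT's input representation.
from transformers import AutoTokenizer

# Assumed publicly available Swedish BERT checkpoint (illustrative choice).
tokenizer = AutoTokenizer.from_pretrained("KB/bert-base-swedish-cased")

# Encode a pair of sentences (sentence A, sentence B).
encoded = tokenizer("Skatteverket lämnade en rapport.", "Regeringen tog emot den.")

tokens = tokenizer.convert_ids_to_tokens(encoded["input_ids"])
print(tokens)                      # ['[CLS]', ..., '[SEP]', ..., '[SEP]'] with start/end markers
print(encoded["token_type_ids"])   # 0 for tokens in sentence A, 1 for tokens in sentence B
# Positional information (the positional marker described above) is added inside the
# model based on each token's index in this sequence.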

Training is done using Masked LM (MLM) and Next Sentence Prediction (NSP). MLM works by masking a certain percentage of the words in the text input at random; the selected word tokens are replaced by a MASK token. The masked tokens are then predicted by BERT using the other, surrounding words. Additionally, NSP is performed, which predicts whether a randomly selected sentence B follows another randomly selected sentence A in the original text. As a result, BERT's loss function can be minimized and the model thereby optimized [5].
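A simplified sketch of the masking step only is given below (illustrative; the full MLM procedure in BERT also replaces some selected tokens with random or unchanged tokens rather than always using the mask token):

# Replace roughly 15% of word tokens with a [MASK] token at random.
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    masked, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            masked.append(mask_token)
            targets.append(tok)        # the model is trained to predict these tokens
        else:
            masked.append(tok)
            targets.append(None)       # no prediction needed for unmasked tokens
    return masked, targets

print(mask_tokens("myndigheten lämnade sin rapport till regeringen".split()))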

F. NER in industrial management

This study examines how machine learning and automated computer systems can aid management in making rational and fact-based decisions, in particular how NER can help find and display relevant data in such a way that the managerial task of making decisions becomes easier. Letting the data decide what actions to take is called data-driven decision making (DDD), meaning that a decision is based on analysis of data rather than intuition [11]. DDD has been shown to increase a firm's productivity, market value and profitability, thus providing a solid ground for using machine learning as a natural part of the management process [12].

G. Trusting the system

Studies have shown that when an automatic data system is implemented without workers understanding how it functions or why it is useful, they tend to use and appreciate it much less [8]. Thus, the development of such a system needs to include representatives from the user side, both to build trust and to ensure that the right functionality is created [13]. People working with automated management systems have also been shown to work around what they perceive as flaws in the systems in order to achieve what they believe is a more effective work-flow. Whether these actions are actual improvements is beside the point; rather, they show that people tend to trust their intuition over automatic systems [8].

H. Visualizing data

Presenting the results of a data system in a way that is easy to understand for everyone in the chain of management has been shown to be as important as the results themselves. Research shows that three of the main challenges when designing a system for visualising data are language design, performance optimization and design.

It has also been shown that latency in the system has a clear negative effect on data exploration sessions. This means that when designing the presentation interface for the end user, complexity sometimes has to be reduced in order to reduce computation time [14].

III. PREVIOUS STUDIES

Named entity recognition is a field of study that has become widely researched in recent years as its usefulness in solving various NLP tasks has been proven. However, the specific problem faced in this study has not yet been studied directly. Many studies have compared different NER models against each other on various tasks. There are many different implementations of RNNs, and most of them have been tested and evaluated in previous studies. LSTM has been shown to give promising results on speech-related problems compared to, for example, the gated recurrent unit [4]. The LSTM model has also been shown to outperform other RNN algorithms on benchmark illustrative problems [6].

Additionally, binary logistic regression has been shown to produce adequate results when used to classify words as names or non-names in email messages [18]. Features used in that study included 'first letter in token is capitalized' and 'token ends with a common last-name ending', among others.

List-lookups were also used to determine whether tokens were common first names, last names or city names, which were then included in the list of possible features. Fine-tuning of the model was performed by altering the learning rate, convergence margin and maximum number of iterations in gradient descent. Lastly, accuracy, precision, recall and F-score were measured to evaluate the model; the resulting F-score was 66.65%. Better results were achieved by applying regular expressions and logistic regression to short SMS messages [1], identifying entities such as name, location, date, time and telephone number, where the resulting F-score was 86%.

Furthermore, a pre-trained BERT model was shown to produce good results on a Portuguese NER task. By using BERT with a Linear-Chain Conditional Random Field (CRF), an F-score of up to 83.24% was achieved, while BERT alone yielded an F-score of up to 83.3%, both of which were state-of-the-art results [19]. Two transfer-learning approaches were used in achieving these results: feature-based and fine-tuning. The feature-based approach kept the BERT model weights fixed while the CRF layer was trained, whereas the fine-tuning approach updated the weights of both BERT and CRF during training. For pre-training the BERT model, the brWaC corpus was used, a large data set consisting of 2.68 billion tokens. Pre-training was conducted with whole-word masking for 1,000,000 steps, with a linearly decaying learning rate of 1e-4. The BERT weights were initialized from an equivalent English BERT model.

Using machine learning, and specifically NER, as part of the information systems used to aid and guide management has become more and more common as the technology has evolved. The recent advancements in this field, combined with the ever increasing amounts of raw data available, have both forced and enabled management teams to implement these new solutions in their decision chains. NLP can substantially enhance most phases of an organization's information cycle, primarily the analysis and validation of data but also the processing and presentation of data [7].

It has also been shown that humans sometimes tend to distrust management systems built using machine learning. This comes from a study of organizations that to a great extent use machine learning to allocate, optimize and evaluate the work of their staff [8]. The study showed that workers at Uber and Lyft, two ridesharing services, sometimes felt reluctant to perform tasks assigned by a machine learning system. This was said to depend mainly on the fact that they did not fully understand how the systems worked and therefore did not always trust their judgement. Lessons can be drawn from this study when designing data-driven and automatic information systems for management, in order to increase workers' likelihood of using and trusting those systems.

IV. METHODOLOGY

A. Technical problem

1) Data:

The data used in this study is available to everyone under the Public Access to Information and Secrecy Act in Sweden, which regulates access to government reports and files. It consists of state public inquiries and reports written by various public agencies between 2006 and 2019.

2) Processing the data:

Originally, the data was in the form of PDF files of various shapes and styles, as each public agency has its own way of writing reports. The files were converted to an HTML-like text format using the ABBYY FineReader software.

To train the logistic regression model, each word in a subset of the reports was manually annotated as either "organization" or "other". This subset contained approximately 200,000 tokens from randomly selected reports, but the model was trained on only 80% of the data; the remaining 20% was used for evaluation. Evaluation was thus performed on a data set of approximately 40,000 tokens, for both logistic regression and BERT.
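A minimal sketch of such an 80/20 split is shown below (illustrative only; the tiny annotated example data is hypothetical and not taken from the reports):

# Hold out 20% of the annotated sentences for evaluation.
import random

annotated_sentences = [
    [("Skatteverket", "organization"), ("lämnade", "other"), ("rapporten", "other")],
    [("Regeringen", "organization"), ("tog", "other"), ("emot", "other"), ("den", "other")],
    [("Utredningen", "other"), ("pågår", "other")],
    [("ESV", "organization"), ("analyserar", "other"), ("utfallet", "other")],
    [("Rapporten", "other"), ("publicerades", "other"), ("2019", "other")],
]

random.seed(0)
random.shuffle(annotated_sentences)

split = int(0.8 * len(annotated_sentences))
train_set = annotated_sentences[:split]   # used to train the logistic regression model
test_set = annotated_sentences[split:]    # held out for evaluating both models
print(len(train_set), len(test_set))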

3) Developing the BERT model

Any NLP model has to be pre-trained in order to determine how its parameters should be weighted. This training should be done on a separate dataset from the one used for evaluation and actual use. The BERT model was fine-tuned using the SUC 3.0 corpus. SUC is short for Stockholm-Umeå Corpus, a resource developed by Språkbanken that consists of a balanced collection of Swedish texts that are part-of-speech tagged and annotated. The pre-trained model was developed by the National Library of Sweden and was used as the base for the BERT implementation in this study. According to the authors of the model, it was pre-trained on approximately 15-20 GB of data consisting of 3 billion tokens from a variety of sources, including books, government publications, Wikipedia articles and so forth. The hyperparameters were the same as the ones used in the original BERT model developed by Google [5]. The completed BERT model took strings of text as input and produced an output detailing which entities it was able to identify. Because the Python implementation of the model could only be run on small pieces of text at any given time, a simple data stream was created in python3 to automatically feed the model new strings of text from the test data set, which contained roughly 40,000 tokens. In short, the difference between evaluating the logistic regression and the BERT models, in terms of structuring the input data, was that LR required the data to be BIO-tagged whereas BERT required streams of pure text, including punctuation.
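A sketch of such a data stream is shown below, assuming the Hugging Face transformers NER pipeline and the publicly available checkpoint KB/bert-base-swedish-cased-ner; the exact model ID, its ORG label and the chunk size are assumptions, not details taken from the paper:

# Feed short pieces of report text to a Swedish BERT NER pipeline and keep ORG entities.
from transformers import pipeline

ner = pipeline("ner", model="KB/bert-base-swedish-cased-ner",
               aggregation_strategy="simple")

def stream_chunks(text, max_chars=1000):
    # Yield short pieces of text, since the model cannot take a whole report at once.
    for start in range(0, len(text), max_chars):
        yield text[start:start + max_chars]

report_text = "Skatteverket och Ekonomistyrningsverket lämnade en gemensam rapport."
for chunk in stream_chunks(report_text):
    for entity in ner(chunk):
        if entity["entity_group"] == "ORG":   # assumed organization label of this checkpoint
            print(entity["word"], round(entity["score"], 3))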

4) Developing the LR model

The logistic regression model is based on an outline created by Johan Boye and Patrik Jonnell, which was further developed and modified by the authors of this thesis. The model uses gradient descent to minimize a loss function, and classification is done using a sigmoid function. The inputs were vectorized using three simple features: capitalized first letter, length of token, and all letters capitalized. This means that the structure of the model was fixed from the start and only its weights were adjusted during training. This way of developing machine learning is far less demanding in terms of computing power than models like BERT, where far more parameters are adjusted throughout the training process. The data sets for both training and evaluation were BIO-tagged and saved as .csv files, which was the most effective way of structuring the data given that the model was based on word-level features. Given a BIO-tagged input, the completed model produced a confusion matrix detailing the number of identified true and false positives and negatives.
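The three word-level features can be sketched as follows (illustrative, not the authors' exact code):

# Vectorize a token into the three simple features named above.
def extract_features(token):
    return [
        1 if token[:1].isupper() else 0,   # capitalized first letter
        len(token),                        # length of token
        1 if token.isupper() else 0,       # all letters capitalized
    ]

print(extract_features("Skatteverket"))  # [1, 12, 0]
print(extract_features("ESV"))           # [1, 3, 1]
print(extract_features("rapporten"))     # [0, 9, 0]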

5) Evaluating the models

In order to evaluate the models correctly, different sets of data were used for training and evaluation, to avoid the common problem of a model overfitting its training set and therefore not performing accurately on new data. The two models were not trained on exactly the same dataset, mainly due to a lack of computing power. However, they were tested on exactly the same data so that a comparative analysis could be made.

To evaluate and analyse the performance of the algorithms, a variety of methods were used. The F score is a widely used measure of how accurate a NER model is; it combines the precision and recall scores.

The formula for precision is: P = TP / (TP + FP)

The formula for recall is: R = TP / (TP + FN)

The formula for the F1 score is: F1 = 2 × P × R / (P + R)

where TP, FP and FN denote the number of true positives, false positives and false negatives, respectively.

By considering both precision and recall, the F1 score often provides a more realistic and accurate description of a model's performance. This is especially true if the classes used for classification are of very different sizes, as they are in this study. Since the number of true positives is substantially smaller than the number of true negatives, it can be argued that the evaluation should be more accepting of false positives than of false negatives, because if even a small number of the true positives are missed by the model, a large part of the results could be lost.
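As a small worked example of these metrics (the counts below are made up purely for illustration):

# Precision, recall and F1 computed from confusion-matrix counts.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1_score(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# With tp=100, fp=20, fn=30: P = 0.8333, R = 0.7692, F1 = 0.8.
print(round(f1_score(tp=100, fp=20, fn=30), 4))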

B. Literature study

1) Gathering data

To answer the scientific inquiry related to industrial engineering and management, a literature study was conducted. The study focused on finding relevant research on organizations that have implemented various machine learning and data systems into their management systems. Google Scholar was used as the main source for finding relevant research material. Some of the search terms used were "data driven", "management", "machine learning", "project management", "named entity recognition" and "information systems". These were combined in various ways to find a large number of articles and studies that could be relevant to this project.

2) Sorting relevant data

The collected articles were reviewed briefly, first by removing those whose titles and abstracts had nothing to do with the question studied here. The summaries of the remaining articles were read to further identify which articles were relevant. A broad perspective was used to look into multiple ways of analyzing an implementation of machine learning systems, without a pre-decided agenda for the cases to be studied. The most interesting studies were then read in depth and used as a theoretical base for this thesis. The selection was based on subjective judgements of the quality and relevance of the research.

3) Areas of interest

After the initial research, three main areas of interest were identified as most relevant for answering the scientific inquiry: trusting the system, understanding data, and presentation. These keywords were then used as a tool for acquiring further knowledge from related work.

4) Limitations

The literature study did not include investigations of what kinds of systems to implement, how data-driven management affects organizational structure, or the market opportunities of NER systems.

V. RESULTS

A. Results of BERT NER

Applying the BERT model to a sample of test data from reports of the Swedish government, consisting of 40,239 tokens, yielded 1,068 true positive identifications of unique mentions of entities belonging to the ORG (organization) class, 140 false positives, 39,004 true negatives and 180 false negatives. These numbers yield an F1 score of approximately 0.8697.

B. Results of Logistic Regression NER

Similarly, applying the binary classifier model to the same corpus as BERT yielded 870 true positives, 732 false positives, 38,413 true negatives and 377 false negatives. The results are presented in the tables below. These numbers result in an F1 score of approximately 0.6259.

Table I. Confusion matrix

Model                  TP      FP      TN       FN
BERT                   1068    140     39004    180
Logistic regression    870     732     38413    377

Table II. Accuracy matrix

C. Results of literature study

The literature study showed that there are many difficulties in implementing a data system intended to aid management in making decisions in an organization. Three main areas of importance were identified as essential for achieving success. The results of the literature study do not depend on the results produced by the two classification algorithms; the implementation of a machine learning system into the management system of an organization does not consider the technical architecture of such a system. Therefore, the results of the literature study should be considered separately from the results of the technical inquiry.

1) Importance of trust

Relying on automated data requires trusting the output of the algorithms and systems involved. The literature study showed that when an automatic system is implemented without workers understanding how it functions, they tend to use and appreciate it much less. Transparency is therefore key when automating work, so that those using the automatically generated data understand its origin. One issue is the disparity between the one who does the work and the one who gets the benefit; in this case, the developer of a NER model is rarely the one who will use it in everyday work. Thus, the development of such a system needs to include representatives from the user side, both to build trust and to ensure that the right functionality is created [13].

2) Understanding data

NER is by design not limited to any particular task and can, given the right pre-training, be used to identify almost any kind of entity. In the practical implementation of this study, the recognition was limited to names of public agencies and organizations. In order to be able to use that data, one first has to understand its meaning. Through the studied literature and discussions with ESV, it was found that one major problem when implementing a machine learning system is a lack of technical knowledge.

3) Importance of presentation

Presenting the results of an NLP algorithm in a way that is easy to understand for everyone in the chain of management is just as important as the results themselves. In this age of increasing amounts of data for management to consider, it is ever more important to carefully consider how data can be visualised. In order to display data usefully, the system has to communicate with the user in an easy and natural way, ensuring that there is no language barrier between the human and the machine. This is preferably achieved by limiting the amount of information presented to the user and focusing on displaying precisely what is desired and nothing more.

4) Advantages of data driven management

Increased investments in IT in general and data-driven decision tools in particular have been shown to increase productivity and profitability for firms. However, public organizations and agencies are not driven by the same incentives regarding investor and shareholder value as private corporations. Therefore, the advantages observed for private corporations have to be examined to determine whether they are relevant from the perspective of a public agency. Productivity is arguably relevant in any organization, no matter its structure or ownership model. The same holds for profitability, which is essentially another way of phrasing an increase in productivity. These benefits are thus relevant for a public organization as well.

VI. DISCUSSION

A. Evaluating the results

The hypothesis stated in the introduction will now be evaluated. H1 stated that BERT can solve this problem with a higher F1 score than the Logistic Regression algorithm.

The results produce an F1 score of 0.8558 for BERT and 0.6677 for logistic regression. Looking at other measures of performance, the results differ somewhat. The overall accuracy, which takes the vast number of true negatives into consideration, is 0.9921 and 0.9725 respectively for the two algorithms, which in both cases is respectable. This means that the hypothesis H1 is supported.

Statistically, the dataset creates some problems, since the number of words that are names of organizations is very small compared to the rest of the words. For the algorithms to have a statistically significant number of names to identify, the rest of the dataset therefore has to be very large. When evaluating the results, this yields a very large number of true negative classifications, which is why the accuracy score is significantly higher than the F1 score used to evaluate the hypothesis.

When comparing the classifications of the models, and particularly when analyzing the results of the logistic regression, the latter is relatively speaking more prone to being overly generous in classifying words as organizations compared to BERT. Its number of false positives is around twice the number of false negatives, whereas BERT has roughly the same number of each. This can be explained by examining the feature design of the logistic regression model and realising that many names of entities other than organizations also match those features. For any given method of binary classification, it can be argued that a baseline F1 score of 0.5 is reasonable. This is because with random predictions there would be an equal proportion of each real and predicted class in the confusion matrix; in other words, the results would be TP = FP = TN = FN = 250 for a data sample of 1,000 tokens, which yields an F1 of 0.5. With that said, the model generated using logistic regression may instead serve as a more accurate baseline for evaluating the performance of BERT. It turns out that regardless of which baseline is used, the F1 score of BERT is significantly higher than that of logistic regression, which is why the former model is preferable. However, it would be interesting to somehow quantify the trade-off in time and resources in order to determine more precisely which model is the best fit for a given project.

When comparing the classifications made by the two models, it is seen that they identify roughly the same number of true positives. However, the logistic regression model produced roughly five times as many false positives and twice as many false negatives, indicating that it is far too generous when labeling words as organizations and also misses a substantial number of words that ought to be labelled as organizations.

B. How to determine true and false

When evaluating whether the algorithms correctly or incorrectly labeled the words, there were words for which both the label organization and the label other could reasonably be true. One example is the word "koncernen" ("the corporate group"). The word itself is not the name of a public organization, which argues that it should be labeled as other. However, understanding the context of the word could lead to the conclusion that it refers to a previously named organization, and therefore its implied meaning is that of an organization. This is not the only example; such situations appear regularly when analyzing the texts.

It raises the question of what the actual goal of NLP is on a fundamental level. It is not just about finding words, but about truly understanding the meaning of written language. Thus, the approach taken here to marking words such as "koncernen" in the previous example stands as a logical way of approaching entity recognition.

C. Difference in training data

The two models for NER were trained on two different datasets; the possible implications of this for the outcome of the study are discussed here. Due to limitations in computing power, the LR model was trained on a smaller dataset than the BERT model. This is of course not ideal, but precautions were taken to limit the problem as much as possible. The corpus used for LR consisted of a subset of the corpus used for BERT and a few other texts very similar in content and style of writing.

Since the LR model is much simpler, with fewer features, it can be assumed that it does not need the same bulk of text as BERT to tune its parameters. Therefore, approximately the same results can be expected from an LR model trained on a smaller corpus as from the same model trained on a larger one. This holds only when the quality of the corpus can be assumed to be sufficient, meaning that it contains a large enough number of entities to classify.

D. How can NER and other machine learning systems successfully be implemented into the management system of an organization?

When developing new computer systems, it seems reasonable to use a data-driven approach to measure and achieve success. This means using data collected from the users of the system to measure how they interact with it, and using that as a basis for further development. This approach is popular today since there are far better tools for gathering such data now than in the past. Using quantitative data can help settle disagreements between developers and users regarding subjective perceptions of the system's design. It is a time-consuming task, since the process is heavily dependent on the quality of the interaction data collected, which means that much thought and many resources have to be put into this part of development for it to be usable as a tool. This creates a higher development cost which can be hard to justify, but one argument for using the method is that the cost of making changes in software is generally much higher in the later stages than in the early ones.

A user's interaction with any kind of computer system, no matter its purpose, can be seen as a process of building a relationship. It contains all the stages of establishing a new relationship with an actual person and can also be analysed from that perspective. This means that a designer of the user interface should develop the product in a way that lets the user slowly get to know the system and learn to interact with it, in order to establish trust and understanding. This process could benefit from data-driven testing to help developers understand how the interactions work. Questions such as how long certain tasks take depending on the user's experience, and whether there are points in the system where users are prone to give up their work, are relevant to investigate. These can help point out whether the development of an understanding between the user and the system advances too fast or too slowly at specific points in the process.

Developing a well-functioning interface is very important, but a distinction must be made between systems used by external customers and users, and systems developed for use within the organization; this thesis focuses on the latter. That relieves the pressure on the interface somewhat, compared to systems aimed at people on the outside, where a system like this could be the only way of communicating. Internal systems allow for a little more headroom when it comes to small imperfections, since knowledge can be developed over time and in-house support technicians can aid with any problems that arise.


E. Applications of the models

The actual usefulness of the more successful BERT algorithm in this study is not necessarily obvious, since its effectiveness is somewhat limited. Firstly, one faces the problem of determining whether the accuracy of the model is acceptable for practical use. Since there were relatively many false positives and negatives, the model does produce errors, which is not what an end user expects. That is something which would prevent the model from developing trust with the user, one aspect identified as important for a successful implementation. To what extent this problem would limit the development of trust would depend on the situation, and mainly on how transparent the model is about its inability to classify a corpus 100% correctly. If the accuracy of the model is clearly communicated to the user, so that there is an understanding that small parts of the data may be missed in the classification, the trust problem could be avoided.

The data retrieved from the models in this study can be used in a number of ways, mainly as a tool for gaining an overview of which parties are involved in a particular government project or study. By identifying all of the organizations present in a report, a quick and easy summary of the participants is obtained. This data can be used to coordinate resources and attention toward those involved in a project. It also serves as a way of identifying which organizations are responsible for a given subject or project, making it easier to communicate with the right institution.

F. Cost of implementation

When discussing ways of implementing data-driven management in general, and NLP systems specifically, there is one crucial but so far undiscussed aspect of such a project: its cost. As has been shown, such systems have positive effects on organizations, but these must always be compared to the alternative costs and results of doing something else. That means comparing this particular solution both to other methods of achieving the same thing and to completely different projects within the organization that compete for the same limited resources. Since access to this kind of data is impossible to put a value on, it can be hard to justify such an implementation. This is true for all data, and thus the value delivered by such a system is always subjectively estimated.

G. Limitations

The computation performed in the technical part of this study was constrained by multiple factors, mainly time, computing power and the lack of resources due to restrictions imposed during the corona pandemic. With limited computing power, the technical part of the study had to be scaled down in order to be executed successfully. Under optimal conditions, the dataset tested would be expanded to contain more tokens. That would not necessarily improve the results, but would rather solidify the performance statistically and give the results more weight.

Another limitation lies in the human factor of the evaluation, since it relies on manual identification of words. This work could have been expanded to include a panel of people classifying the words used as the baseline for evaluating the respective results of the two algorithms, which would reduce the risk of the baseline itself being misclassified.

H. Future Research

The results of this study are mainly to be viewed as a step towards a more advanced model that can analyze the contexts of the identified organization entities. The goal of such a model, for which this study is the foundation, would be to identify which organizations have developed good or bad relationships with each other and which ones are dependent on each other. Such a tool could help management focus its relationship-building work on the organizations it, according to the model, depends on the most in order to do its job.

VII. CONCLUSION

In conclusion, this study shows that the BERT model performs significantly better at NER than the LR model when the F1 score is used as the measurement. However, when analyzing overall accuracy the difference is not as obvious, showing that the much simpler and easier-to-understand LR model can in some respects be considered at least sufficient. This means that with other means of comparison the LR model could at least be considered as an alternative, mainly due to its simplicity.

The other part of this study focused on the implementation of a NER or similar system and aimed to identify the main areas organizations should attend to in order to achieve a successful implementation. The three areas trusting the system, understanding data, and visualization are recommended as the most important factors for successfully implementing a technical system of this kind in an organization.


VIII. REFERENCES

[1] Ek, Tobias, et al. "Named entity recognition for short text messages." Procedia - Social and Behavioral Sciences 27 (2011): 178-187.

[2] Finkel, Jenny Rose, Trond Grenager, and Christopher Manning. "Incorporating non-local information into information extraction systems by Gibbs sampling." Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, (2005): 363-370.

[3] Chui, Michael, Lind, Susan, and Grumbel, Peter. "How will automation affect jobs, skills and wages?" McKinsey Global Institute (2018). Available at: www.mckinsey.com/featured-insights/future-of-work/how-will-automation-affect-jobs-skills-and-wages

[4] Chung, Junyoung, et al. "Empirical evaluation of gated recurrent neural networks on sequence modeling." arXiv preprint arXiv:1412.3555 (2014).

[5] Devlin, Jacob, et al. "BERT: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).

[6] Gers, Felix A., Jürgen Schmidhuber, and Fred Cummins. "Learning to forget: Continual prediction with LSTM." (1999): 850-855.

[7] Métais, Elisabeth. "Enhancing information systems management with natural language processing techniques." Data & Knowledge Engineering 41.2-3 (2002): 247-272.

[8] Lee, Min Kyung, et al. "Working with machines: The impact of algorithmic and data-driven management on human workers." Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems. 2015.

[9] Kotsiantis, Sotiris B., I. Zaharakis, and P. Pintelas. "Supervised machine learning: A review of classification techniques." Emerging Artificial Intelligence Applications in Computer Engineering 160 (2007): 3-24.

[10] Rong, Xin. "word2vec parameter learning explained." arXiv preprint arXiv:1411.2738 (2014).

[11] Provost, Foster, and Tom Fawcett. "Data science and its relationship to big data and data-driven decision making." Big Data 1.1 (2013): 51-59.

[12] Brynjolfsson, Erik, Lorin M. Hitt, and Heekyung Hellen Kim. "Strength in numbers: How does data-driven decisionmaking affect firm performance?" Available at SSRN 1819486 (2011).

[13] Grudin, Jonathan. "Why CSCW applications fail: Problems in the design and evaluation of organizational interfaces." Proceedings of the 1988 ACM Conference on Computer-Supported Cooperative Work. 1988.

[14] Wu, Eugene, et al. "Combining design and performance in a data visualization management system." CIDR. 2017.

[15] Sahlgren, Magnus. "An introduction to random indexing." Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering. 2005.

[16] Shleifer, Andrei. "Psychologists at the gate: A review of Daniel Kahneman's Thinking, Fast and Slow." Journal of Economic Literature 50.4 (2012): 1080-1091.

[17] Hall, Mark Andrew. "Correlation-based feature selection for machine learning." (1999).

[18] Olby, Linnea, and Isabel Thomander. "A step toward GDPR compliance: Processing of personal data in email." (2018).

[19] Souza, Fábio, Rodrigo Nogueira, and Roberto Lotufo. "Portuguese Named Entity Recognition using BERT-CRF." arXiv preprint arXiv:1909.10649 (2019).

Theodor Kraft: Theodor is a student at KTH Royal Institute of Technology. He has contributed to all parts of the study.

Almir Aljic: Almir is a student at KTH Royal Institute of Technology. He has contributed to all parts of the study.
