Linköping University | Department of Computer and Information Science Bachelor’s thesis, 18 ECTS | Cognitive Science 2021 | LIU-IDA/KOGVET-G--21/006--SE

Named-entity recognition with BERT for anonymization of medical records

Olle Bridal

Supervisor: Erik Marsja
Examiner: Arne Jönsson


Upphovsrätt

This document is held available on the Internet – or its future replacement – for a period of 25 years from the date of publication, provided that no extraordinary circumstances arise.

Access to the document implies permission for anyone to read, download, and print out single copies for personal use, and to use it unchanged for non-commercial research and for teaching. Transfer of the copyright at a later date cannot revoke this permission. All other use of the document requires the consent of the author. To guarantee authenticity, security, and accessibility, there are solutions of a technical and administrative nature.

The author's moral rights include the right to be mentioned as the author, to the extent required by good practice, when the document is used in the ways described above, as well as protection against the document being altered or presented in such a form or context that is offensive to the author's literary or artistic reputation or distinctive character.

For additional information about Linköping University Electronic Press, see the publisher's home page https://ep.liu.se/.

Copyright

The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: https://ep.liu.se/.


Abstract

Sharing data is an important part of scientific progress in many fields. In the largely deep-learning-dominated field of natural language processing, textual resources are in high demand. In certain domains, such as that of medical records, the sharing of data is limited by ethical and legal restrictions and therefore requires anonymization. Manual anonymization is tedious and expensive, so automated anonymization is of great value. Since medical records consist of unstructured text, pieces of sensitive information have to be identified in order to be masked for anonymization. Named-entity recognition (NER) is the subtask of information extraction in which named entities, such as person names or locations, are identified and categorized. Recently, models that leverage unsupervised training on large quantities of unlabeled data have performed impressively on the NER task, which shows promise for their use in anonymization. In this study, a small set of medical records was annotated with named-entity tags. Because no domain-specific training data was available, a BERT model already fine-tuned for NER was evaluated on this test set. The aim was to find out how well the model performs NER on medical records, and to explore the possibility of using the model to anonymize them. The most positive result was that the model identified all person names in the dataset. The average accuracy across all entity types was, however, relatively low. It is argued that the success in identifying person names shows promise for the model's application to anonymization. However, because the overall accuracy is significantly worse than that of models fine-tuned on domain-specific data, it is suggested that there might be better methods for anonymization in the absence of relevant training data.


Acknowledgement

I would like to thank my internal and external supervisors for keeping me on track, helping me find relevant literature and methods, and advising me on how to keep the writing flame burning when my motivation was low. I am also very thankful for the seminar and chat sessions with our thesis group. And a special thank you to Emma, who spent hours and hours helping me with various technical difficulties, far exceeding our expertise.


Table of Contents

Upphovsrätt
Copyright
Abstract
Acknowledgement
1. Introduction
  1.1 Aim
  1.2 Research Questions
2. Theory
  2.1 NER and anonymization
  2.2 BERT
    2.2.1 Word embeddings
    2.2.2 Pre-training
    2.2.3 Fine-tuning
    2.2.4 Other NER models
  2.3 Swedish BERT-NER
  2.4 Related work
3. Methods
  3.1 Data
    3.1.1 Creating the test set
  3.3 Implementing KB-BERT-NER
  3.4 Evaluation
4. Evaluation Results
5. Discussion
  5.1 Discussion of results
  5.2 Method criticism
  5.3 Ethical implications
6. Conclusions
7. Reference List


1. Introduction

Sharing data plays an imperative part in the progress of science, technology, and innovation. Data that is made publicly available enables the possibility of replication and comparisons, and not least the ability to continue and improve upon previous work. In the field of natural language processing (NLP), publicly available data is key.

NLP is now largely dominated by deep learning models, which depend on textual data for training. For neural network approaches to be successful in NLP tasks, models need to be trained on domain-specific data, and as neural language models grow larger, so does the need for data to train them. In certain domains, sharing data is constrained by ethical and legal limitations. Medical records contain sensitive information, and in accordance with the General Data Protection Regulation (GDPR), data cannot be made public if it contains information that can be traced back to individuals. In order to make medical records available as data for a larger community to research and develop statistical models, the data must be anonymized (or de-identified). In other words, all instances of names, addresses, dates of birth, social security numbers, etc. that can be connected to an individual need to be removed or masked. Manual anonymization is laborious work, so automated anonymization is of great value.

Automated anonymization of medical records is however easier said than done. The unstructured, free-text nature of medical records presents challenges to the process of anonymization. Pieces of potentially sensitive information must first be identified and categorized before they can be masked or removed. Recent advances in the performance of deep-learning approaches to named-entity recognition (NER) show promise for applying NER to anonymization.

1.1 Aim

This paper describes an exploration of using a transfer-learning approach to NER to anonymize a dataset of medical records in Swedish. The study aims to investigate how accurate a pre-trained and fine-tuned BERT model is at finding and categorizing named entities in a dataset of unstructured medical records in Swedish. By doing so, the study also aspires to explore the potential of using such a model for anonymization.

1.2 Research Questions

By evaluating the performance of the BERT model on a dataset of medical records, the study's purpose is to answer the following research questions:

• How accurate is a BERT model, fine-tuned for NER, at identifying named entities in medical records?

• Could such a model be used to anonymize medical records?


2. Theory

2.1 NER and anonymization

Named-entity recognition (NER) is the sub-task of information extraction where the goal is to detect and categorize entities, such as names, locations, organizations, and objects, in unstructured text (Nadeau & Sekine, 2007). An entity can consist of several words in a phrase and is not necessarily a single word. For instance, in the example sentence 'John Doe is a patient', the person name 'John Doe' is a named entity consisting of two tokens.

Anonymizing a document of unstructured text consists of two steps: first, identifying the pieces of sensitive information, and second, masking or removing them from the document. The first step can be, and has been, treated as a NER problem. For instance, in Gardner and Xiong (2008), a conditional random fields (CRF) model was trained on medical records where sensitive information was annotated and treated as named entities.

Anonymization could be seen as a binary classification problem, with sensitive pieces of information on the one hand and all other words on the other. However, identifying the categories as well is useful when it comes to masking, because masking unwanted entities with a category-representing token has less of an impact on downstream NLP tasks performed on the anonymized text. Even better is to replace sensitive information with pseudonyms, that is, arbitrary words of the same category (Berg et al., 2020).
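To illustrate the masking step, the following minimal Python sketch (not part of the study's implementation) replaces every token tagged as an entity with a category-representing placeholder. The tag set and the flat token/tag format are assumptions made for the example.

```python
# Hypothetical helper, not from the thesis: mask tagged tokens with
# category-representing placeholders such as [PER] or [LOC].

def mask_entities(tokens, tags):
    """Replace every token that belongs to a named entity with a
    placeholder carrying the entity category, e.g. 'John' -> '[PER]'."""
    masked = []
    for token, tag in zip(tokens, tags):
        if tag == "O":           # outside any entity: keep the token
            masked.append(token)
        else:                    # sensitive token: mask with its category
            masked.append(f"[{tag}]")
    return masked

# Example: 'John Doe is a patient' with flat (non-IOB) tags
print(mask_entities(["John", "Doe", "is", "a", "patient"],
                    ["PER", "PER", "O", "O", "O"]))
# -> ['[PER]', '[PER]', 'is', 'a', 'patient']
```

Replacing placeholders with pseudonyms of the same category, as suggested by Berg et al. (2020), would only change the last step of such a function.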

Meystre et al. (2010) describe that automated anonymization of unstructured medical data has primarily been done with two groups of methods: lexicon-based methods and machine learning methods. Lexicon-based methods do not need any training data, whereas machine learning models rely on large amounts of labeled data. A lexicon-based model, however, usually requires months of design work by professionals with domain expertise. A lexicon-based method has the advantage of being able to identify named entities that exist in the lexicon but that a machine learning model might miss if they never occur in the training data. The main advantage of machine learning methods, such as the CRF model mentioned earlier, is that they can find named entities that are not in the lexicon, by learning in what contexts named entities usually occur.

2.2 BERT

Bidirectional Encoder Representations from Transformers (BERT) is a powerful language model that, when it was introduced in Devlin et al. (2018), advanced the state of the art on several NLP tasks. Even the base model (as opposed to the large version) is sizable at 110 million parameters, which would normally require a substantial amount of training data to fit. However, BERT is pre-trained in an unsupervised manner using vast amounts of unlabeled data. Fine-tuning the pre-trained model for a specific classification task, such as NER, is then done by adding one output layer and training the model on a relatively small labeled dataset, so relatively few parameters have to be trained from scratch. BERT's pre-training and fine-tuning technique is a form of transfer learning, where general language knowledge gained from large amounts of raw text is leveraged, which tackles the problem of needing a large, labeled dataset.

2.2.1 Word embeddings

Each token in a BERT input sequence is represented using WordPiece embeddings (Wu et al., 2016). A WordPiece embedding is a vector that maps to a specific token in the model's vocabulary, which consists of 30 000 tokens. When a word in an input sequence does not exist in the vocabulary, the word is split into sub-tokens, or wordpieces. Sub-tokens that exist in the vocabulary can be word endings, letter pairs, and even single letters. As an example, the word "playing" does not exist in the English BERT vocabulary, but the root word "play" and the word ending "ing" do. The word "playing" in an input sequence is then represented as two tokens, with two corresponding WordPiece embeddings. Since all letters of the alphabet exist in the BERT vocabulary as sub-tokens, practically every possible word can be split into wordpieces that exist in the vocabulary.
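The sub-token splitting can be illustrated with the Hugging Face Transformers library. The sketch below is purely illustrative (it is not code from the thesis), the model name is just an example, and the exact splits depend on the chosen model's vocabulary.

```python
# Minimal sketch of WordPiece tokenization; not part of the thesis's code.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

for word in ["playing", "anonymization", "electrocardiogram"]:
    # Out-of-vocabulary words are split into sub-tokens; continuation
    # pieces are prefixed with '##' in the WordPiece convention.
    print(word, "->", tokenizer.tokenize(word))
```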

In addition to the WordPiece embedding, BERT also leverages an embedding vector representing the token's position in the sequence, as well as an embedding vector representing the surrounding sequence. Ultimately, each token is represented by the sum of these three embedding vectors (Devlin et al., 2018). The contextualized embeddings used in BERT facilitate a semantic understanding of words that can handle both word ambiguity and synonyms (Rogers et al., 2020). As an example of word ambiguity, the embedding representing the word "bank" in the context of the bank of a river will be different from the embedding of the financial institution. Moreover, if two words tend to occur in the same contexts, such as the words "couch" and "sofa", their respective word embeddings will be closer in the vector space (Wang et al., 2019).

2.2.2 Pre-training

Devlin et al. (2018) explain that the BERT model is pre-trained using two methods, masked language modeling (MLM) and next sentence prediction (NSP). Neither of the methods relies on labeled training data, thus the model can “train itself” on large amounts of unstructured and unlabeled text.

With MLM, 15% of the tokens in every input sequence are masked by replacing the token with a special "[MASK]" token. The model tries to predict what the masked tokens are and calculates a cross-entropy loss for the error. MLM enables the model to learn deep bidirectional representations by learning which tokens (or sub-tokens) fit in which contexts (Devlin et al., 2018).

In the other pre-training method, NSP, the model is trained by predicting adjacent sentences. For a given sentence A, the model predicts whether sentence B follows sentence A or not. 50% of the time, sentence B does follow sentence A, and 50% of the time, sentence B is a random sentence from somewhere else in the data. This method of pre-training is, however, more relevant for tasks such as Question Answering and Natural Language Inference.


2.2.3 Fine-tuning

Because of the immense knowledge of language gained through the pre-training methods, a relatively small amount of labeled data is required to fine-tune BERT for satisfying results on downstream tasks. For a question answering task, BERT has already been trained to recognize patterns in sentence pairs; it only has to learn to connect certain sentences with a question label and others with an answer label. Similarly, since BERT already has a modelled representation of which words occur in which contexts, it does not take much to update the parameters that connect tokens occurring in specific contexts to an entity label. Fine-tuning for a classification task is done simply by adding an additional classification output layer at the end of BERT and then further training all parameters, including the already pre-trained layers (Devlin et al., 2018).

2.2.4 Other NER models

Although BERT can be considered state of the art, other models introduced since the release of BERT have performed even better on the NER task. On the CoNLL-2003 dataset for named entity recognition, introduced in Sang & De Meulder (2003), the LUKE model achieved an F1 score of 94.3 (Yamada et al., 2020). An even higher F1 score of 94.6 on the CoNLL-2003 dataset was achieved by Wang et al. (2020), using the ACE model. This can be compared to BERT's F1 of 92.4 (or 92.8 using the larger version of the model) on the same task and dataset. The LUKE and ACE models are, however, not available pre-trained on Swedish data, which BERT is.

2.3 Swedish BERT-NER

Malmsten et al. (2020) pre-trained a Swedish BERT model, called KB-BERT, at the National Library of Sweden (Kungliga biblioteket), which served as a great resource of digitized textual data. They used text from newspapers, government reports, legal documents, and every Wikipedia article in Swedish. The combined data from the different categories of text consisted of 260 million sentences, at 18 341 MB. To pre-train KB-BERT, the same method and hyperparameters as proposed in the original BERT paper were used.

In order to evaluate the model, the pre-trained Swedish BERT was fine-tuned for two downstream tasks, POS-tagging and NER, using the labeled SUC 3.0 corpus. The corpus was divided into training, validation, and test subsets. For both POS-tagging and NER, the model outperformed another Swedish BERT model pre-trained by Arbetsförmedlingen, as well as M-BERT, the multilingual model trained on corpora from multiple languages that was released along with the original BERT paper. The improvement over previous models on the POS-tagging task was minimal, with an F1 score increase of less than 1%. However, for the NER task, KB-BERT achieved an average F1 score of 0.927, compared to the next best, M-BERT, which produced an F1 score of 0.906 on the same test.

The high accuracy on the NER task reported in Malmsten et al. (2020) shows promise for the model's usability for an anonymization task. KB-BERT is not trained on medical data per se. However, the diversity of both the large amount of pre-training data and SUC 3.0, a balanced corpus with texts from a variety of genres, indicates that the model is robust and could possibly perform well on a NER task in the medical domain.

The KB-BERT model fine-tuned for NER on the SUC 3.0 dataset is the model that was evaluated on the dataset in this study. KB-BERT-NER is available for download from the Hugging Face website, where it is stated that the tags the model uses are person name (PER), location (LOC), time (TME), organization (ORG), and event (EVN). The model is therefore evidently not trained specifically for anonymization; rather, it is trained to identify the types of named entities that are annotated in the SUC 3.0 dataset. The PER tag is the entity type of most interest, since masking person names is essential for the anonymization task. The rest of the tags do not directly translate to the task of anonymization. The LOC tag, for instance, covers all forms of locations, so the name of a street, a city, or a country are all tagged with LOC. For anonymization, a specific address is of interest, whereas a city or a country is not enough information to connect a medical record to an individual.

2.4 Related work

MEDDOCAN 2019, described in Marimon et al. (2019), was a shared task where the objective was to explore solutions for automated anonymization of medical documents in Spanish. A dataset divided into training and test partitions, in which pieces of sensitive information had been labeled, was provided. Several teams set out to train different NER models on the data to achieve the highest F1 score. Many different deep-learning approaches were implemented by participants, and some variations of BERT were used by the top contenders. For instance, García-Pablos et al. (2020) fine-tuned the multilingual BERT model on the MEDDOCAN dataset to achieve an F1 score of around 0.96.


3. Methods

To investigate the potential of using KB-BERT-NER for anonymization of medical records, the model was downloaded and implemented to run offline. To measure the accuracy of the NER predictions, a test set was created through manual annotation using a simple Python script. The test set was then passed through KB-BERT-NER to produce the model's predictions. Lastly, precision, recall, and F1 were calculated for the model's predictions compared to the annotations.

3.1 Data

The data used for the experiment in this study consisted of electronic medical records (EMRs) from two clinics at Linköping University Hospital. The entire dataset was made up of almost one million EMRs, amounting to around 71 million words, as seen in Table 1. The larger part of the EMRs came from the cardiology clinic, and the rest from the neurosurgery clinic.

However, only a fraction of the dataset (1000 EMRs) was used in this study.

Clinic         EMRs      Words
Cardiology     664 821   45 780 055
Neurosurgery   314 669   25 440 484
Total          979 490   71 220 539

Table 1: Number of EMRs and words from the two clinics in the dataset

The EMRs were extracted and stored as separate sentences, one per line, in a text file. This was the structure of the data used for this project.

Because the records are not anonymized and contain sensitive data, no examples or extracts from the data can be distributed or presented in this report. For the same reason, the experiment had to be carried out on a computer at the hospital without an internet connection. For the experimenter to be allowed to look at and work with the data, a confidentiality agreement was signed.

3.1.1 Creating the test set

To evaluate how the model performed when extracting named entities, a test set with gold-standard entity annotations had to be created. Using Python, a text-based tool was written to facilitate and speed up the annotation process. There is open-source software available for this purpose, but because of the offline environment it could not be used.

The Python tool was simple and minimalistic. One token at a time was printed, and the annotator labeled the token by entering a number from 0 to 5, where each number represented a tag. The entire sentence surrounding every token was also printed, so the annotator could see the context. 1000 EMR entries were treated as sentences and annotated with the tool. The EMRs used were picked at random from a file containing records from both the cardiology and the neurosurgery clinics. BERT's tokenizer was used to tokenize the sentences and print the tokens individually, so as to keep consistency between the model's predictions and the annotated set.
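A simplified sketch of such an annotation loop is shown below. It is not the actual script used in the study; the tag numbering and the Hugging Face model identifier are assumptions made for illustration.

```python
# Hypothetical sketch of a minimal text-based annotation loop.
from transformers import AutoTokenizer

TAGS = {0: "O", 1: "PER", 2: "LOC", 3: "TME", 4: "ORG", 5: "EVN"}
# Assumed Hugging Face identifier for KB-BERT-NER.
tokenizer = AutoTokenizer.from_pretrained("KB/bert-base-swedish-cased-ner")

def annotate(sentence):
    """Print each token together with its sentence and read a 0-5 label."""
    annotated = []
    for token in tokenizer.tokenize(sentence):
        print(f"\nSentence: {sentence}")
        print(f"Token:    {token}")
        label = int(input("Tag (0=O, 1=PER, 2=LOC, 3=TME, 4=ORG, 5=EVN): "))
        annotated.append((token, TAGS[label]))
    return annotated
```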

The entity tags in the output of KB-BERT-NER were not in IOB format: whether an entity consisted of several words or several entities were adjacent, the tags were the same. For example, in the name 'John Doe', the token 'John' would be tagged with 'PER' instead of 'B-PER', and 'Doe' would be tagged with 'PER' instead of 'I-PER'. Because the model's output tags were in this format, the annotation was done in the same way.

The named-entity tags used to annotate were the ones mentioned on the KB-BERT-NER repository, namely person name (PER), location (LOC), time (TME), organization (ORG), and event (EVN).

Table 2 shows the statistics of the created gold-standard test set. The 1000 annotated sentences consisted of 9234 tokens, most of which were outside-words, meaning that they did not belong to any entity. There were more occurrences of PER and TME than of LOC and ORG, which were very few. No EVN entity was found in the annotation process, so that tag was left out.

Sentences   Tokens   O      PER   TME   LOC   ORG
1000        9234     8480   90    391   9     10

Table 2: The number of sentences, tokens, and occurrences of each tag in the test set

3.3 Implementing KB-BERT-NER

The model was implemented using Python and Hugging Face's Transformers library (Wolf et al., 2019), which was the method recommended by Malmsten et al. (2020). The only tweak to the code on the repository was that the model, configuration, and vocabulary files had to be downloaded and run locally, because of the restricted internet access of the working environment. Once the model had been implemented, a sentence of raw text could be passed through it with the call of a function, which returned a dictionary with the tokens of the sentence and their corresponding predicted entity tags.
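A hedged sketch of this setup, assuming locally downloaded model files and the Transformers pipeline API, could look as follows; the local path and the example sentence are placeholders, not details from the study.

```python
# Sketch: loading KB-BERT-NER from a local directory and querying it.
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          pipeline)

model_dir = "./kb-bert-ner"   # placeholder for the locally stored files
ner = pipeline(
    "ner",
    model=AutoModelForTokenClassification.from_pretrained(model_dir),
    tokenizer=AutoTokenizer.from_pretrained(model_dir),
)

# Each returned dict contains (among other fields) the token and its
# predicted entity tag.
for prediction in ner("John Doe är en patient i Linköping."):
    print(prediction["word"], prediction["entity"])
```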


3.4 Evaluation

The measures precision, recall, and F1 were calculated for the entity-types person name, location, time, and organization, for every token. For instance, for person names, a true positive was recorded if the model predicted that a token was a (or part of a) person name, and the token had been annotated as a person name. If an entity then consisted of three words, and the model predicted only two of them as part of the entity, that would count as two correct predictions out of three, instead of being counted as missing the entire entity.

Most tokens in a text are not entities, thus simply measuring a model’s accuracy is not very meaningful. In many cases, always predicting every word to not be a named entity would result in an accuracy of over 90 percent. Counting true positives, true negatives, false positives, and false negatives for every tag, to then calculate precision, recall and F1 is therefore more telling (Jurafsky & Martin, 2020).

Precision is a measure of how many of the model’s positive predictions are true positives. The maximum score is 1.0, meaning that all the model’s positive predictions are actually true positives. Precision is calculated as:

• true positives / (true positives + false positives)

Recall is a measure of how many of the actual true positives the model predicted as true positives. The maximum score is 1.0, which means that the model predicted every single actual true positive as a positive. Recall is calculated as:

• true positives / (true positives + false negatives)

F1 is a measure that balances precision and recall. As described in Jurafsky & Martin (2020), F1 is a harmonic mean (as opposed to an arithmetic mean, or average). Essentially, this means that it weighs the lower of the two values more. F1 is calculated as:

• 2 × (precision × recall) / (precision + recall)
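As an illustration of how these token-level measures can be computed (a sketch, not the script used in the study), the following Python function derives precision, recall, and F1 per tag from parallel lists of gold and predicted tags:

```python
# Sketch: token-level precision, recall, and F1 per entity tag.
from collections import defaultdict

def scores_per_tag(gold_tags, predicted_tags,
                   tags=("PER", "TME", "LOC", "ORG")):
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for gold, pred in zip(gold_tags, predicted_tags):
        for tag in tags:
            if pred == tag and gold == tag:
                counts[tag]["tp"] += 1      # correctly predicted tag
            elif pred == tag and gold != tag:
                counts[tag]["fp"] += 1      # predicted tag, gold disagrees
            elif pred != tag and gold == tag:
                counts[tag]["fn"] += 1      # gold tag missed by the model
    results = {}
    for tag, c in counts.items():
        p = c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 0.0
        r = c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        results[tag] = {"precision": p, "recall": r, "f1": f1}
    return results
```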


4. Evaluation Results

Table 3 shows the precision, recall, and F1 for the model's predictions compared to the gold standard, for each tag. The table also shows the average of each measure across all tags.

The model had the most success identifying person name entities, with an F1 score of 0.8295 and a perfect recall of 1.0, meaning that all 90 person name tokens were tagged as such. The precision of 0.7087 for PER, however, means that only roughly 70 percent of the tokens predicted as person names actually were person names.

The entity the model struggled with most was ORG. A precision of 0.2500 and a recall of 0.2000 resulted in an F1 score of 0.2222. For the LOC entity, there was a big difference between recall and precision: the model did not miss any LOC entities, but more than half of the tokens tagged as LOC were false positives. It is important to note that there were only 9 LOC and 10 ORG entities in the dataset.

An average precision of 0.5122 and an average recall of 0.7329 resulted in an overall F1 of 0.5888.

Tag   Precision   Recall   F1
PER   0.7087      1.0      0.8295
TME   0.6398      0.7315   0.6826
LOC   0.4500      1.0      0.6207
ORG   0.2500      0.2000   0.2222
AVG   0.5122      0.7329   0.5888

Table 3: Precision, recall, and F1 for each tag, and the averages across all tags


5. Discussion

In this chapter, there is a discussion of what the results of the evaluation mean, a critical review of the methods used, and a discussion of what ethical implications the study might have.

5.1 Discussion of results

For anonymization, the most important named entities to detect are person names, because of their ability to connect a medical record to an individual. The model performed surprisingly well at predicting person names. A recall of 1.0 for the PER tag means that the model predicted all actual person names as person names; in other words, the model did not miss a single person name token. However, the precision score of 0.7087 was not as good: only roughly 70 percent of the tokens predicted as person names were actual person names according to the gold standard. Although recall might be considered the more important measure when it comes to anonymization, a low precision score has a negative impact on downstream NLP tasks. In this case, the 30 percent of tokens mistakenly tagged as names might be names of medicines, illnesses, implants or something else, which may well be essential for other tasks. The F1 score of 0.8295 for PER entities is substantially lower than what Malmsten et al. (2020) achieved. This was not unexpected, since their evaluation was done on a test set from the same source as the training data.

The model performed slightly worse at predicting tokens referencing units of time (TME). This could be because dates and times of day are written in very inconsistent ways in the medical records. It is also possible that the measures for the TME tag are more reliable, because there were more TME entities in the gold-standard set. The recall of 1.0 for PER might decrease on a larger test set.

For the tags LOC and ORG, the scores were below par. These entities were, however, few and far between in the test data, rendering the measures quite unreliable. The 1.0 recall for the LOC tag means that the model found all LOC entities.

The results of this evaluation are poor in comparison with the results in García-Pablos et al. (2020), where a BERT model was used for NER on the MEDDOCAN test set. Their evaluation showed an average F1 score of roughly 0.96, whereas the average F1 of this evaluation was about 0.59. The evaluations are not directly comparable, since the tags used in this study and theirs differ. It is, however, clear that BERT performs significantly better when fine-tuned on a domain-specific labeled dataset.

5.2 Method criticism

The method of using a BERT model fine-tuned on a dataset containing documents from other domains to find named entities in medical records is quite obviously not optimal. Fine-tuning a BERT model on labeled text from medical records would almost certainly result in significantly better performance. This study was done to investigate whether a model trained on a different type of text would produce adequate results, given the lack of a dataset of medical records with labeled named entities.


For anonymization, some of the tags used by KB-BERT-NER are redundant, and some important tags are missing. For instance, a location such as the name of a city is not enough to identify an individual, but an address is. A NER model trained for the specific use of anonymization should instead be trained by treating specifically the sensitive data as named entities.

Having a larger test set than the one created in this study would be preferable. 1000 sentences with 9234 tokens are a lot of samples, which takes some time to annotate; however, because most sentences consist solely of words that are not of any interest, it does not amount to many named entities. Expanding the test set would improve the reliability and generalizability of the results.

The test set could further be improved by having an additional person annotate the gold standard, so that an inter-rater agreement score could be measured. This was, however, not possible in this case, due to ethical and legal reasons regarding the sensitivity of the data.

Furthermore, it would have been valuable to compare several NER models on the test set that was created. Because of the lack of the labeled dataset of medical records required to fine-tune BERT, a lexicon-based method might have yielded better results.

5.3 Ethical implications

The motivation for automated anonymization is fundamentally an ethical one. We want to protect and preserve the privacy of individuals, especially with information as sensitive as medical records. The possibility of saving a lot of time and money is, however, also a powerful motivator in all areas.

It is important that the evaluation of an anonymization model is done accurately and extensively. Putting too much trust in a system that is supposed to anonymize perfectly can be problematic. Bias in the training data, or an error somewhere in the implementation or evaluation process, could lead to a faulty system that we nevertheless expect to anonymize correctly. However, one could argue that manual anonymization performed by inconsistent humans is just as prone to error.

Although the purpose of this study was to contribute to the area of anonymization, a NER tool specialized in finding sensitive information can theoretically be used for the opposite. A very effective system might, for instance, be able to find occurrences of sensitive information where the anonymizer has made an error, and then use this information for some sinister purpose. Hopefully, successful and effective anonymization tools will be used for the former purpose rather than the latter.


6. Conclusions

In this study, a pre-trained and already fine-tuned BERT model was evaluated on new data. This was done as an attempt to investigate the potential of using the model to anonymize medical records.

The first research question regarded how accurate a pre-trained and fine-tuned BERT model would be at identifying named entities in medical records. This question was answered by annotating a small test set of medical records, evaluating the model's performance, and comparing the results to other similar studies.

The second research question regarded whether the evaluated model could be used for anonymization of medical records. With all of the shortcomings of the study in mind, the evaluation did yield some positive results. The success of the model at extracting person names, even though it was trained on data that was not domain-specific, shows promise for using a BERT model for anonymization. If the goal is to anonymize medical records by masking all person names, the results of the evaluation indicate that the model could succeed: even when fine-tuned on data from unrelated domains, the model found all person names in the test set. The model is, however, not optimal, and the results are perhaps not satisfying enough. If the model were used to mask all person names, roughly 30 percent of the masked tokens would, according to this evaluation, be something other than person names. Moreover, the PER tag is the only one of the model's tags that is essential for an anonymization task, whereas for anonymization one would also want the model to identify and tag, for instance, social security numbers and addresses. The off-the-shelf method of using the readily available KB-BERT-NER for identifying named entities in text from the medical domain is attractive, but perhaps optimistic. To fine-tune a BERT model for the specific case of anonymization in the specific domain of medical records, and thereby obtain a more satisfying result, a labeled dataset is needed. However, because of the knowledge already present in the pre-trained model, the labeled dataset fortunately does not have to be very large.


Reference List

Berg, H., Henriksson, A., & Dalianis, H. (2020). The impact of de-identification on downstream named entity recognition in clinical text. In Proceedings of the 11th International Workshop on Health Text Mining and Information Analysis (pp. 1-11).

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

García-Pablos, A., Perez, N., & Cuadros, M. (2020). Sensitive data detection and classification in Spanish clinical text: Experiments with BERT. arXiv preprint arXiv:2003.03106.

Gardner, J., & Xiong, L. (2008). HIDE: An integrated system for health information de-identification. In 2008 21st IEEE International Symposium on Computer-Based Medical Systems. IEEE.

Wang, B., Wang, A., Chen, F., Wang, Y., & Kuo, C. C. J. (2019). Evaluating word embedding models: Methods and experimental results. APSIPA Transactions on Signal and Information Processing, 8.

Rogers, A., Kovaleva, O., & Rumshisky, A. (2020). A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics, 8, 842-866.

Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., ... & Dean, J. (2016). Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Jurafsky, D., & Martin, J. H. (2020). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition (3rd edition draft). https://web.stanford.edu/~jurafsky/slp3/ed3book_dec302020.pdf

Malmsten, M., Börjeson, L., & Haffenden, C. (2020). Playing with Words at the National Library of Sweden--Making a Swedish BERT. arXiv preprint arXiv:2007.01658.

Marimon, M., Gonzalez-Agirre, A., Intxaurrondo, A., Rodriguez, H., Martin, J. L., Villegas, M., & Krallinger, M. (2019). Automatic de-identification of medical texts in Spanish: The MEDDOCAN track, corpus, guidelines, methods and evaluation of results. In IberLEF@SEPLN (pp. 618-638).

Meystre, S. M., Friedlin, F. J., South, B. R., Shen, S., & Samore, M. H. (2010). Automatic de-identification of textual documents in the electronic health record: a review of recent research. BMC medical research methodology, 10(1), 1-16.

Nadeau, D., & Sekine, S. (2007). A survey of named entity recognition and classification. Lingvisticae Investigationes, 30(1), 3-26.


Sang, E. F., & De Meulder, F. (2003). Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. arXiv preprint cs/0306050.

Wang, X., Jiang, Y., Bach, N., Wang, T., Huang, Z., Huang, F., & Tu, K. (2020). Automated Concatenation of Embeddings for Structured Prediction. arXiv preprint arXiv:2010.05006.

Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., ... & Rush, A. M. (2019). HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.

Yadav, V., & Bethard, S. (2019). A survey on recent advances in named entity recognition from deep learning models. arXiv preprint arXiv:1910.11470.

Yamada, I., Asai, A., Shindo, H., Takeda, H., & Matsumoto, Y. (2020). LUKE: deep contextualized entity representations with entity-aware self-attention. arXiv preprint arXiv:2010.01057.
