Word embeddings and Patient records: The identification of MRI risk patients

Linköpings universitet SE–581 83 Linköping

Linköping University | Department of Computer and Information Science

Bachelor’s thesis, 18 ECTS | Cognitive Science

2019 | LIU-IDA/KOGVET-G--19/002--SE

Erik Kindberg

Supervisor: Robert Eklund
Examiner: Arne Jönsson

Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

Abstract

Identification of risks ahead of MRI examinations is regarded as a cumbersome and time-consuming process at the Linköping University Hospital radiology clinic. The hospital staff often have to search through large amounts of unstructured patient data to find information about implants. Word embeddings have been identified as a possible tool to speed up this process. The purpose of this thesis is to evaluate this method, which is done by training a Word2Vec model on patient journal data and analyzing the close neighbours of key search words by calculating cosine similarity. The 50 closest neighbours of each search word are categorized and annotated as relevant or not relevant to the task of identifying risk patients ahead of MRI examinations. 10 search words were explored, leading to a total of 500 terms being annotated. In total, 14 different categories were observed in the result, and out of these 8 were considered relevant. Out of the 500 terms, 340 (68%) were considered relevant. In addition, 48 implant models could be observed, which are particularly interesting because if a patient has an implant, hospital staff need to determine its exact model and the MRI conditions of that model. Overall, these findings point towards a positive answer for the aim of the thesis, although further developments are needed.

Acknowledgments

To my supervisor at Cambio Healthcare Systems, Marcus Petersson for the continuous support and putting a lot of faith in me. To Hampus Arvå for your friendship and expert knowledge in Word embeddings. To Patricia Lindblad for always being there when you are needed.

List of Figures

4.1 Distribution of categories across all annotated terms

4.2 Visualization of vector clusters for the search terms

List of Tables

3.1 Search words

3.2 Search categories, explanation in body text

4.1 Distribution of categories for each search word A

1 Introduction

Swedish health care faces increasing challenges from a growing number of patients due to an aging population (Statistics Sweden, 2018). In March 2019, the median waiting time for an MRI examination was 36 days across all regions of Sweden. These MRI examinations are vital for the diagnosis of a multitude of diseases, such as various types of cancer or Multiple Sclerosis. Due to the increasing age of the population, there is an increasing number of patients with different implants such as pacemakers and stents, implants designed to keep a passageway in a blood vessel open. Magnetic resonance imaging (MRI) scanners make use of strong magnetic fields, which means that metallic implants may cause harm to the patient because the magnetic field causes the implants to shift in the body or heat up, causing internal burns. They also risk interfering with the functions of a pacemaker, which could be lethal. This is why it is important to screen patients for implants before an MRI examination. Currently, the screening is done by self-reporting. The patient fills in a paper form where all implants are declared. For a number of reasons, this is not a reliable method. It can be difficult even for a healthy person to remember their medical history, and patients often suffer from impaired memory or have trouble communicating. The consequence is that in many cases health care staff have to manually review the entire medical history of the patient, a method that is expensive in both time and manpower.

Significant gains in time can be made if health care staff can receive computational support in the task of MRI risk identification. Currently, patient journal data are largely unstructured. The structured parts of the journals include data such as time stamps and descriptive data about the patient, but lack any sort of indication of implants. To computationally process the data, a method that can handle unstructured data is required. Word embeddings have been identified as a potential solution to the problem of pre-MRI screenings. There is a large amount of patient data available to train models on, and the data itself is unstructured free text, written by a large number of different health care staff who do not necessarily adhere to a strict terminology. In theory, different models of implants will be represented by similar word vectors, abbreviations will have similar vectors to their full words and even misspellings could be caught by the model.

1.1 Aim

The aim of this thesis is to evaluate the usefulness of Word embeddings in helping health care staff identify risk patients ahead of MRI examinations, by generating Word embeddings models from patient journal data. Furthermore, the aim is to develop a pipeline for the processing and evaluation of patient journal data in the context above. This will aid further development, such as expansion of the data set to multiple departments or other areas of use with similar techniques.

1.2 Research question

This thesis will explore the following research question:

• Can Word embeddings models be used to identify abbreviations, synonyms, semantically similar terms and terms with spelling errors in patient journal data?

1.3 Delimitations

The focus for this thesis is the evaluation of Word embeddings as a solution to the specific problem of pre-MRI screening. It is not intended to develop novel methods for Word embeddings. Another delimitation is the method of creating Word embeddings. This project will only consider Word2Vec. The main reason is that the purpose is to evaluate Word embeddings as a method in this specific domain and not to compare different variants of Word embeddings.

1.4 Structure

This thesis is divided into six chapters. Chapter 2 provides an overview of the theoretical background for Word embeddings and natural language processing (NLP) in health care. Chapter 3 is a description of the method used for generating Word embeddings and how they are evaluated. Chapter 4 presents the results in a similar order to chapter 3. In chapter 5, method and results are discussed. Chapter 6 will conclude the thesis and offer suggestions for future research.

2 Theory

This chapter is divided into three sections. The first section covers the theoretical background for Word embeddings. The second section narrows the theory to Word2Vec specifically. The third section is an overview of NLP in the health care domain.

2.1 Word Embeddings

The fundamental idea of Word embeddings is to represent a word as a multidimensional vector (Jurafsky & Martin, 2009), because it is then possible to perform calculations on words and quantify them relative to each other within a certain data set. This makes it possible to perform semantic analyses such as word or sentence similarity, because the vectors of words that are used in a similar manner end up near each other in the vector space. Two popular and modern implementations of Word embeddings are Word2Vec, created by Google engineers, and GloVe, created by researchers at Stanford University. These are considered to perform on a similar level, with some empirical support towards Word2Vec being slightly ahead. In a study performed by Naili, Chaibi, and Ben Ghezala (2017), Word2Vec and GloVe were evaluated on topic segmentation in English and Arabic, and Word2Vec performed slightly better.

2.1.1 Vectors

A vector is a mathematical object with two attributes, a magnitude and a direction (Ivanov, 2011). A vector is represented by a straight line, where the magnitude decides how long the vector is and the direction decides where the vector points in a space. Cosine similarity is a method to measure the similarity between two vectors. By measuring cosine similarity, you measure the cosine of the inner angle between two vectors (Lei, Si, Wen, & Shen, 2017). This gives a measure between -1 and 1, where two vectors of the same orientation have a value of 1, two vectors at 90 degrees have a cosine similarity of 0 and two vectors at 180 degrees have a value of -1. Euclidean distance is another measure for vector comparison. It is the length of the straight line between the end points of the two vectors, calculated by applying the Pythagorean theorem to the differences between the vectors' coordinates. The main difference between cosine similarity and Euclidean distance is that the magnitude of the vectors involved matters for the latter.
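The two measures can be illustrated with a minimal sketch in Python. The function names and the example vectors are chosen here for illustration only and are not taken from the scripts used in this thesis.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; ignores vector magnitude."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    """Straight-line distance between the end points of u and v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Illustrative two-dimensional vectors (assumed values, not real word vectors).
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])       # 90 degrees apart
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])       # 180 degrees apart
distance = euclidean_distance([1.0, 2.0], [2.0, 4.0])        # sensitive to magnitude
```

The first pair of vectors points in the same direction but differs in length, so the cosine similarity is at its maximum while the Euclidean distance is non-zero.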

2.1.2 Neural Networks

Artificial neural networks (ANNs) are an architecture for machine learning roughly inspired by the function of biological neural networks in human brains (Norvig & Russell, 2009). The fundamental idea that is being reproduced in ANNs is that activity in the brain is driven by brain cells, also known as neurons. These neurons are structured in large and complicated networks. There is no strong consensus on the number of neurons in a human brain, but the number is estimated to be in the billions. In simple terms, each neuron has a number of input connections from other neurons and one output connection. The output is a function of the sum of the inputs, and a neuron is said to activate once this function reaches a specific threshold value.

An artificial neural network consists of a number of units (nodes) that are constructed to roughly mimic the function of biological neurons. There is an input layer, an output layer and a number of hidden layers in between. Every node has an activation function that sums all inputs, and if the sum is above a certain threshold value, the node activates, much like a biological neuron. This activation is binary: if the sum of all inputs is below the threshold the activation is 0, and if it is above the threshold the activation is 1. In recent years, artificial neural networks have shown many promising results within the domain of learning. A notable example is Google DeepMind's chess AI, AlphaZero (Silver et al., 2018).
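The binary threshold activation described above can be sketched in a few lines of Python. The per-input weights, the threshold value and the function name are assumptions made for the example, and the sketch shows a single node rather than a full network.

```python
def threshold_unit(inputs, weights, threshold):
    """Return 1 if the weighted sum of the inputs exceeds the threshold, else 0."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return 1 if weighted_sum > threshold else 0

# Hypothetical weights and threshold chosen so that the node behaves like a logical AND.
and_weights = [1.0, 1.0]
and_threshold = 1.5
outputs = [threshold_unit([a, b], and_weights, and_threshold)
           for a in (0, 1) for b in (0, 1)]
```

With these assumed values the node only activates when both inputs are 1.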

2.1.3 Towards a Standard data set of Swedish Word Vectors

Towards a Standard data set of Swedish Word Vectors describes a comparison of different word embedding models (Fallgren, Segeblad, & Kuhlmann, 2016). The Skip-Gram and Continuous bag-of-words methods from Word2Vec are compared with GloVe. In this paper, the Word2Vec methods come out slightly ahead of GloVe. The comparison is based on an intrinsic measure known as QVEC-CCA (Tsvetkov, Faruqui, & Dyer, 2016). The measure describes how well generated word vectors match manually created vectors. The rough idea behind these manually created vectors is that they represent the frequencies of terms, where terms represent some kind of linguistic quality. According to Fallgren et al., this measure correlates well with downstream tasks. The scripts used for generating the models are publicly available and provide an out-of-the-box solution for generating Word embeddings.

2.1.4 Evaluation

A common perspective on the evaluation of artificial intelligence models is the Cranfield Paradigm (Dalianis, 2018). This paradigm is based on the idea of creating a gold standard, a set of data that is considered the target for the task that the AI model is supposed to perform. With a gold standard, precision and recall can be calculated and parameters can be continuously tweaked with the goal of optimization in mind. These gold standards are often manually crafted. For manual annotation tasks it is recommended to have at least two different annotators working separately. After working separately, the annotators convene and compare their annotations, and from this an Inter-Annotator Agreement (IAA) score can be calculated as a measure of the agreement between the annotators. Generally, a high IAA score suggests that the annotation has a higher degree of validity, because different annotators have reached a similar conclusion separately. A common approach is to have an odd number of annotators, so that a majority decision can be made when disagreements occur.
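As an illustration of how precision and recall follow from a gold standard, consider the minimal sketch below. The terms in both sets are invented toy data, and the function name is chosen for this example only.

```python
def precision_recall(predicted, gold):
    """Compare a set of predicted items with a gold-standard set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical toy data: terms returned by a model vs. terms annotators marked as relevant.
predicted_terms = {"pacemaker", "stent", "volt", "rakt"}
gold_terms = {"pacemaker", "stent", "volt", "icd", "graft"}
precision, recall = precision_recall(predicted_terms, gold_terms)
```

Here three of the four predicted terms are in the gold standard, and three of the five gold terms were found, which gives a precision of 0.75 and a recall of 0.6.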

2.2 Word2Vec

Word2Vec is an architecture for Word embeddings created by Mikolov, Sutskever, Chen, Corrado and Dean (2013). It is based on a shallow neural network of two layers. There are two different methods for word vector generation available in Word2Vec, Continuous bag-of-words and Skip-gram. Word2Vec makes use of subsampling of frequent words. The basic principle behind this is that words that appear very frequently provide less information than less frequent words. In Word2Vec, occurrences of very frequent words are randomly discarded before training, which gives less frequent words a larger significance in the model. The training data is used as input to the neural network and a learned word vector for each word is the output.

Word2Vec has four hyperparameters to consider when generating Word embeddings: method (Skip-Gram or Continuous bag-of-words), dimensionality, window size and number of iterations. Skip-Gram (SGNS) is a method that attempts to make predictions about a context, given a specific word in the training data (Mikolov et al., 2013). SGNS is less time efficient than CBOW, but performs better on low-frequency words. Continuous bag-of-words (CBOW), in contrast to Skip-Gram, attempts to predict the current word from its context. It is called a bag-of-words architecture because it does not take the order of the surrounding words into consideration (Mikolov, Chen, Corrado, & Dean, 2013). Dimensionality controls how many dimensions the generated vectors will have. A two-dimensional vector can be represented in a coordinate system and will have an x value and a y value. Window size controls the size of the context that is being looked at by the algorithm. For example, if a window size of five is chosen, the algorithm will look five words ahead and five words behind when deciding the context. Number of iterations controls how many times the algorithm will rerun the data and optimize its weights.
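The role of the window size can be sketched by listing the (word, context word) pairs that a Skip-Gram style method would try to predict. This is a simplified illustration with an invented example note and a fixed window; it is not the Word2Vec implementation itself, which among other things varies the effective window during training.

```python
def skipgram_pairs(tokens, window):
    """Yield (target, context) pairs for every token and its neighbours within the window."""
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                yield target, tokens[j]

# Hypothetical journal note, already tokenized and lowercased.
note = ["patient", "has", "pacemaker", "from", "medtronic"]
pairs = list(skipgram_pairs(note, window=1))
```

With a window size of one, "pacemaker" is paired only with "has" and "from". A larger window would also pair it with "patient" and "medtronic".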

2.2.1 Visualization

Visualization of Word embeddings helps in the understanding and analysis of a model. Word2Vec is difficult to visualize because it is a multidimensional vector space with N dimensions, where N is set by the dimensionality hyperparameter. One method to approach this problem is the t-distributed Stochastic Neighbor Embedding (t-SNE) algorithm. The idea behind this algorithm is to reduce dimensionality while keeping relative pairwise distances (van der Maaten & Hinton, 2008). Stochastic Neighbor Embedding (SNE) transforms high-dimensional Euclidean distances between vectors into conditional probabilities. In simplified terms, the conditional probability $p_{j|i}$ is a measure of how likely it is that $x_i$ would pick $x_j$ as a neighbor. A corresponding conditional probability $q_{j|i}$ can be defined for the low-dimensional counterparts of the vectors. If the low-dimensional points correctly model the similarities between the high-dimensional vectors, then $q_{j|i}$ is equal to $p_{j|i}$. The SNE algorithm strives to find low-dimensional data points that minimize the difference between $q_{j|i}$ and $p_{j|i}$. To measure this difference, a measure known as Kullback-Leibler divergence is used. This measure is not symmetric, in the sense that different types of errors between pairwise distances in the low-dimensional space are not weighted equally. The SNE algorithm is a good method for visualization, but it has been shown to have a cost function that is difficult to optimize and to suffer from what is known as the crowding problem (van der Maaten & Hinton, 2008). t-distributed SNE is a method to solve these problems. It uses a simpler, symmetric cost function and a Student-t distribution in the low-dimensional space, rather than the Gaussian distribution used in the original SNE algorithm.
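The asymmetry of the Kullback-Leibler divergence can be made concrete with a small sketch. The two probability distributions are invented values standing in for the neighbour probabilities of a single point, and the function is a plain textbook formulation rather than code from any t-SNE implementation.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(P || Q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical neighbour probabilities for one point over three other points.
p = [0.7, 0.2, 0.1]   # high-dimensional similarities
q = [0.4, 0.4, 0.2]   # low-dimensional similarities

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
kl_pp = kl_divergence(p, p)
```

The divergence is zero only when the two distributions are identical, and swapping the arguments gives a different value, which is the sense in which the measure is not symmetric.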

2.3 Patient Journal Data

In this section, the unique attributes of patient journal data will be presented in three subsections. The first subsection concerns patient records, the second subsection covers medical abbreviations and the third subsection covers the unique terminology of health care language.

2.3.1 Patient records

Patient records are mainly written for internal use, often in stressful environments. This means that patient records have several distinct properties that need to be considered when performing NLP tasks on such data. A study by Pakhomov (2005) estimates that 30 percent of patient journal data are non-word tokens, abbreviations, acronyms, misspellings and wrongly used grammar. The authors and target audience are expert medical staff, who use domain-specific technical jargon. The jargon is so specific for each discipline within medicine that jargon, abbreviations and terms may be incomprehensible to other disciplines (Dalianis, 2018). Patient records are created for efficiency, which means that sentences are often incomplete, on top of the heavy use of abbreviations and jargon. Record systems often do not include support systems such as grammar checks, because they often interfere with the very technical language of patient journal data.

2.3.2 Abbreviation Detection

Estimations are that patient records contain between 3 and 10 percent abbreviations, many of which are ambiguous (Dalianis, 2018). It may be helpful to detect and expand abbreviations for NLP tasks. Multiple approaches have been tried, ranging from rule-based systems to machine learning. Isenius (2012) created a prototype of a heuristic-based system, a Swedish Clinical Abbreviation Normalizer (SCAN), that attempted to detect abbreviations by searching for common abbreviation patterns, such as hyphenations. In evaluation, this method scored a precision of 85 percent and a recall of 75 percent on a test set of 2050 abbreviations manually annotated by a senior physician.

2.3.3 Terminology

Patient journal data makes use of expansive terminologies, that are often domain specific. One of the main efforts to standardize labels for diagnoses is the ICD-10 system (Dalianis, 2018). In this case, ICD stands for International Statistical Classification of Diseases, and the number denotes what version of the standard is being used, with version ten being the latest version in use. The ICD system offers a codification of diseases and is used both for clinical reasons and for administrative methods. The codes have two parts, the first is one of 22 general categories signified with a letter from A to Z, followed by two numbers for subcategories. The second part designates a more specific diagnosis or illness by up to four numbers. For example the code Z95.810 is used for the presence of automatic (implantable) cardiac defibrillator.

Another effort to systematically categorize clinical terminology is the Systematized Nomenclature of Medicine Clinical Terminology (SNOMED CT) database (Dalianis, 2018). SNOMED CT attempts to create a hierarchical structure of clinical terminology. There is a Swedish edition that is being curated by the National Board of Health and Welfare (Socialstyrelsen). A public web browser is available, but a license is required to access and download a complete database. For example, the term "pacemaker" has the parents "heart implant" and "life sustaining equipment".

3 Method

This chapter describes the process of generating Word embeddings from the specified data set, and how they are evaluated. It also describes how Word embeddings can be visualized.

3.1 Data

This section describes what data was used and how it was preprocessed before it was used for model generation.

3.1.1 About the data

The data set for the Word embeddings consists of patient journal data from the cardiology clinic at Linköping University Hospital. The data set consists of one large collection of patient journal notes. The journal notes are anonymized and unsorted. The data is stored in a csv file with a size of 543 MB. After cleaning, the data is stored in a text file of 492 MB, where each entry is separated by a new line. The cleaned data consists of 660 thousand lines and about 66 million tokens.

3.1.2 Cleaning

The data needs to be structured into a format suitable for the Word2Vec model generation script. All scripts described in this thesis are written in Python 3. According to Fallgren et al. (2016), the script used takes one sentence per line. In the patient journal data set, many sentences consist of one or only a few words. Using one sentence per line would therefore lead either to decreased vector quality, because the full window size could not be utilized in many cases, or to a substantial amount of training data being excluded, leading to lower overall model quality. Instead, the choice has been made to structure the data with one journal note per line, both to include a larger set of data and to utilize the window size of Word2Vec.

The csv file is processed by a script that extracts the data column that contains the journal notes. Then, non-word characters, with the exception of hyphens, are removed from the journal notes. Next, uppercase characters are converted to lowercase characters. After these characters have been stripped away, the number of journal notes and tokens in the data set are counted. Finally, the cleaned journal notes are written to a plain text file formatted with one journal note per line.
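The cleaning steps above could be sketched roughly as follows. This is not the script used in the thesis. The column name "note", the regular expression and the two sample rows are assumptions made for the example, and an in-memory string stands in for the real csv file.

```python
import csv
import io
import re

# Matches everything except word characters, whitespace and hyphens.
NON_WORD = re.compile(r"[^\w\s-]")

def clean_note(text):
    """Remove non-word characters except hyphens, lowercase, and collapse whitespace."""
    text = NON_WORD.sub("", text).lower()
    return " ".join(text.split())

def clean_notes(csv_file, column):
    """Return the cleaned notes from one csv column, skipping empty ones."""
    reader = csv.DictReader(csv_file)
    cleaned = (clean_note(row[column]) for row in reader)
    return [note for note in cleaned if note]

# Hypothetical two-row csv standing in for the real export; "note" is an assumed column name.
sample = io.StringIO('id,note\n1,"Pacemaker (Medtronic), check-up OK."\n2,"CRT-D implanted!"\n')
notes = clean_notes(sample, "note")
token_count = sum(len(note.split()) for note in notes)
output_text = "\n".join(notes) + "\n"   # one journal note per line
```

In this sketch hyphenated tokens such as "crt-d" and "check-up" survive the cleaning, while brackets and punctuation are removed.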

3.2 Word Embeddings

This section describes how the word embedding models are generated and how they are going to be evaluated.

3.2.1 Model generation

This thesis uses a word vector generation script developed by Fallgren et al. (2016). The Gensim Python package is used to generate the word embedding models (Řehůřek & Sojka, 2010). The Skip-gram method is chosen for the model generation because it performs better on less frequent words (Mikolov et al., 2013) and time is not a major constraint. The assumption is made that misspellings and non-standard abbreviations are infrequent in the data. The other hyperparameters are set based on the findings in Towards a Standard data set of Swedish Word Vectors: Dimensionality = 300, Window size = 6 and Number of iterations = 10. The script is modified to generate 10 models from the data set at once. This modification adds a fifth parameter for running the script, the number of models to be generated. Word2Vec randomly initializes the values of the embeddings at the start of generation, and this randomness leads to some variation in the results. The cosine distances for each respective search word are aggregated across the 10 models by calculating a mean for each word embedding. This means that the cosine distances that this thesis is based on are the mean of the cosine distances from 10 models generated on the same data set.
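One way the aggregation across models could look is sketched below. The data structure, the function name and the values are assumptions for illustration. The sketch also assumes that a higher value means a closer neighbour and that a neighbour is averaged only over the models in which it appears, neither of which is specified in the text.

```python
from collections import defaultdict
from statistics import mean

def aggregate_neighbours(model_results):
    """Average each neighbour's cosine value over the models in which it appears."""
    collected = defaultdict(list)
    for neighbours in model_results:
        for term, cosine in neighbours.items():
            collected[term].append(cosine)
    averaged = {term: mean(values) for term, values in collected.items()}
    # Sort so that the neighbour with the highest mean value comes first.
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)

# Hypothetical neighbour lists for one search word from three independently trained models.
runs = [
    {"icd": 0.80, "crt": 0.70},
    {"icd": 0.90, "crt": 0.60},
    {"icd": 0.70, "crt": 0.80},
]
ranking = aggregate_neighbours(runs)
```

Averaging in this way dampens the run-to-run variation that comes from the random initialization of each model.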

3.3 Model evaluation

The generated models are evaluated by analyzing the cosine distance of the 50 closest neighbours to a set of search words. The search words were determined by medical staff at the radiology clinic of Linköping University Hospital to be relevant for MRI safety, specifically for the category of patient journal notes in the data set. The search terms are related to implants found in patients. The search terms are presented in table 3.1; for descriptions of the search terms, see appendix A.

Table 3.1: Search words

Search word    English translation
Pacemaker      Pacemaker
icd            icd
crtp           crtp
crtd           crtd
klaff          heart valve
stent          stent
graft          graft
implantat      implant
elektrod       electrode
protes         prosthesis

To speed up the evaluation process, a script is written that writes the 50 closest neighbours, their cosine distance to the search word in question and their frequency into a csv file that can be directly imported into a spreadsheet. The cleaned data set is loaded into the script, and non-English characters are stripped because the word embedding generation script removes these characters. This is necessary to match word frequencies with Word embeddings in the data set. A simple word count for each unique token is performed on the data set and the frequencies are stored in a dictionary, with tokens stored as keys and their frequencies stored as values. Then the aggregated Word embeddings model is loaded. Each search word is written to a new csv file as a label, accompanied by its frequency. Then each of the 50 closest neighbours to the search word is written into the csv file, with its cosine distance to the search word and its frequency separated by commas. This csv file is then imported into a spreadsheet.
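The export step could be sketched as follows. The corpus, the neighbour list and the function name are invented for the example, an in-memory buffer stands in for the csv file, and loading neighbours from the aggregated model is left out.

```python
import csv
import io
from collections import Counter

def write_neighbour_report(out_file, search_word, neighbours, frequencies):
    """Write the search word with its frequency, then one row per neighbour."""
    writer = csv.writer(out_file)
    writer.writerow([search_word, frequencies[search_word]])
    for term, cosine in neighbours:
        writer.writerow([term, cosine, frequencies[term]])

# Hypothetical cleaned corpus (one note per line) and neighbour list.
corpus = "pacemaker icd pacemaker\nicd crt pacemaker\n"
frequencies = Counter(corpus.split())
neighbours = [("icd", 0.81), ("crt", 0.74)]

buffer = io.StringIO()   # stands in for a csv file opened with newline=""
write_neighbour_report(buffer, "pacemaker", neighbours, frequencies)
report = buffer.getvalue()
```

Each neighbour row holds the term, its cosine value and its frequency in the corpus, which is the layout described above for import into a spreadsheet.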

Each of the close neighbours is manually annotated with an appropriate category. A wide range of resources is used to correctly annotate the 50 closest neighbours to the search words. Documentation from manufacturers' websites is used both to identify manufacturers in the results and to identify models of implants. The internal resource for MRI safety of the radiology clinic, known as SMRLink, is used to identify types of implants and examinations, procedures and treatments. Since the focus of the resource is on MRI safety, it is an especially valuable resource for deciding if a term is relevant or not. SNOMED CT is used as well, since it provides a large-scale database that covers a wide area of medical terms and can be accessed through a public web browser. The annotation is performed by the author and validated by MRI physicists at the radiology clinic of Linköping University Hospital. This arrangement was chosen because of a lack of time and resources. Then the observed categories are divided into two groups: categories that are useful for the identification of risk patients ahead of MRI examinations and categories that are not useful for this purpose. The criterion for a useful category is that its terms would stop, or at least postpone, an MRI examination until the context of the term has been investigated completely. The categories are data-driven in the sense that they were gradually created as the data was annotated. Whenever a term was annotated that did not fit an existing category, an appropriate category was created. Whenever a new category was created, already annotated terms were revisited and re-categorized if necessary. The categories are presented in table 3.2.

Table 3.2: Search categories, explanation in body text.

Number  Swedish category    English translation     Example
1       EPT1                EPT1                    Implanterad (Implanted)
2       Variant av sökord   Variant of search word  stentet (The stent)
3       Elektrisk enhet     Electrical unit         Volt
4       Teknisk term        Technical term          devicen (The device)
5       Anatomisk term      Anatomical term         Mitralis
6       Typ av implantat    Type of implant         crt
7       Diagnos 1           Diagnosis 1             aneurysm
8       Implantatmodell     Implant model           Protecta
9       Stoppord            Stop Words              rakt (Straight)
10      Tillverkare         Manufacturer            Medtronic
11      Läkemedel           Drugs                   chirocaine
12      Oidentifierad term  Unidentified term       er
13      EPT2                EPT2                    Kirurgi (Surgery)
14      Diagnos 2           Diagnosis 2             Hudförändring (Skin change)

1. Examination/Procedure/Treatment (EPT) 1 (Swedish: Undersökning/Ingrepp/Behandling 1)

Category 1 is a broad category containing terms that suggest an action taken towards the betterment of a patient and are considered relevant for MRI safety.

2. Variant of search word (Swedish: Variant av sökord)

Category 2 represents variants of the search word in question. These terms could, for example be the search word with different determiners, or abbreviations of the search word. This category is seen as relevant for the identification of MRI risks.

3. Electrical Unit (Swedish: Elektrisk enhet)

Category 3 consists of terms that signify electrical units that only appear when electrical devices are present. This category is seen as relevant.

4. Technical term (Swedish: Teknisk term)

Category 4 consists of technical terms concerning parts of and adjustments made to electrical devices. This category is seen as relevant for the identification of MRI risks.

5. Anatomical term (Swedish: Anatomisk term)

Category 5 gathers all terms concerning body tissue and parts of the biological body. This category is not seen as relevant for the identification of MRI risks.

6. Type of implant (Swedish: Typ av implantat)

Category 6 is a collection of terms that relate to types of implants that are not variants of the search words. Other search terms may occur here, as well as types of implants with a hierarchical relation to the search term, such as subordinate terms. This category is considered relevant.

7. Diagnosis/Basis of diagnosis 1 (Swedish: Diagnos/Underlag för diagnos)

All terms that relate to a diagnosis or any kind of basis for a diagnosis that are considered relevant for MRI safety are gathered in category 7. This includes illnesses and test results, among other things.

8. Implant model (Swedish: Implantatmodell/nummer)

Category 8 collects all terms related to models of implants. These terms can, for example, be serial numbers or names of implant models. This category is relevant for MRI safety.

9. Stop Words (Swedish: Stoppord)

Category 9 consists of general language terms that do not carry meaning by themselves but are rather used for linguistic reasons. This category is not relevant for MRI safety.

10. Manufacturer (Swedish: Tillverkare av implantat)

Category 10 consists of all terms related to manufacturers of implants. This category is relevant for MRI safety.

11. Drugs (Swedish: Läkemedel)

Category 11 consists of terms for drugs and anesthesia. This category is not relevant for MRI safety.

12. Undecided term (Swedish: Obestämda)

Category 12 consists of terms that neither the annotator nor the MRI physicists were able to determine. As the meaning behind these terms is undetermined, the category is seen as not relevant for MRI safety.

13. Examination/Procedure/Treatment - 2 (Swedish: Undersökning/Ingrepp/Behandling - 2)

Category 13 consists of terms that belonged to category 1 but were determined not to be relevant for the task of identifying MRI risks.

14. Diagnosis/Basis of diagnosis - 2 (Swedish: Diagnos/Underlag för diagnos - 2)

Category 14 consists of terms that belonged to category 7 but were determined not to be relevant for the task of identifying MRI risks.

The findings in category eight are compiled into a list of detected implants. Some implant models may consist of multiple terms present in the description of the generated model. By using documentation of the models from manufacturers' websites, a list of implant models can be compiled.

3.3.1 Visualization

Scikit-learn is a Python library for machine learning (Pedregosa et al., 2011). Its implementation of the t-SNE algorithm is used in this thesis to reduce dimensionality. Another Python library, matplotlib, is used to plot the diagram in a 2D space (Hunter, 2007). Because visualizations of many terms are prone to clutter and page space is limited, the ten closest terms were chosen for each search word, rather than the fifty closest terms that were chosen for the manual annotation.


To generate the visualization, the aggregated model is loaded. Then the 10 closest neighbours of each search word are extracted from the model. Each search word and its neighbours make up a cluster of terms. These clusters are converted to arrays using NumPy, a core package of the SciPy ecosystem (Jones, Oliphant, Peterson, et al., 2001). Using these arrays, a t-SNE model in a 2D space is generated using the scikit-learn implementation of the algorithm. The implementation of t-SNE has three parameters to adjust. Perplexity can be seen as a smooth measure of the effective number of close neighbours; Van der Maaten & Hinton (2008) recommend a value between 5 and 50. The number of components decides how many dimensions the low-dimensional representation should have, typically 2 or 3. The third parameter decides the type of initialization to be used. The generated t-SNE model is transformed back into a NumPy array, which is then plotted as a scatter plot using matplotlib. The plot is then exported as a PNG file.
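The steps described above can be sketched roughly as follows. This is a minimal illustration, not the code used in the thesis: the function names, the perplexity value, the PCA initialization and the fixed random seed are all assumptions chosen for the example, and the input is assumed to be an array with one row per term.

```python
import numpy as np
from sklearn.manifold import TSNE


def project_to_2d(vectors, perplexity=30.0, seed=0):
    """Reduce word vectors of shape (n_terms, n_dims) to 2D coordinates."""
    data = np.asarray(vectors, dtype=float)
    tsne = TSNE(
        n_components=2,         # dimensions of the low-dimensional map
        perplexity=perplexity,  # must be smaller than the number of terms
        init="pca",             # type of initialization (assumed here)
        random_state=seed,      # fixes the otherwise stochastic layout
    )
    return tsne.fit_transform(data)


def plot_terms(coords, labels, path):
    """Scatter-plot the 2D coordinates, label each point, save as PNG."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(8, 8))
    ax.scatter(coords[:, 0], coords[:, 1], s=10)
    for (x, y), label in zip(coords, labels):
        ax.annotate(label, (x, y), fontsize=7)
    fig.savefig(path, dpi=200)
    plt.close(fig)
```

With ten search words and ten neighbours each, the input would hold roughly 110 rows, so a perplexity within the recommended range of 5 to 50 stays below the number of points.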


4 Results

In the following section, the results of the word embedding generation are presented. In table 4.1 and table 4.2 the distribution of terms between the 14 categories is presented.

Table 4.1: Distribution of categories for each search word A

Category  Useful  Pacemaker  ICD  CRTP  CRTD  Heart Valve
1         x       1          8    7     5     3
2         x       1          4    6     2     1
3         x       13         1    0     0     0
4         x       10         8    4     5     2
5                 5          0    2     2     19
6         x       8          10   11    15    10
7         x       0          2    0     0     0
8         x       0          12   15    12    3
9                 1          2    1     0     2
10        x       0          0    4     6     0
11                0          0    0     0     0
12                8          0    0     1     1
13                1          1    0     1     1
14                2          2    0     1     8

Table 4.2: Distribution of categories for each search word B

Category  Useful  Stent  Graft  Electrode  Implant  Prosthesis
1         x       3      6      2          11       8
2         x       2      0      0          0        0
3         x       0      0      0          0        0
4         x       2      2      3          7        2
5                 10     16     4          8        12
6         x       12     14     11         10       19
7         x       1      0      0          0        2
8         x       14     0      17         0        0
9                 2      2      2          5        0
10        x       0      0      2          0        1
11                0      0      1          0        0
12                4      0      5          2        1
13                0      5      1          2        1
14                0      5      1          5        0


In figure 4.1 a graph of the distribution of categories is presented. The largest category is category six, types of implants, as 23% of all the terms are classified into this category. Category six is followed by category eight and five with 15.4% and 15.8% respectively.

Figure 4.1: Distribution of categories across all annotated terms

The result is that 68% of the close neighbours can be considered useful for the identification of MRI risk patients. Categories two and six cover synonyms, semantically similar terms, abbreviations and misspelled terms; together these two categories make up 26.2% of the total annotated terms.
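How such shares are derived from annotated counts can be sketched as follows. The helper names and the small counts in the usage note are purely illustrative and are not the thesis data; the only assumption taken from the text is that each category has a term count and is either marked useful or not.

```python
def category_shares(counts):
    """Return each category's share of all annotated terms."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no annotated terms")
    return {category: n / total for category, n in counts.items()}


def useful_share(counts, useful_categories):
    """Return the share of terms that fall into useful categories."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no annotated terms")
    useful = sum(n for category, n in counts.items()
                 if category in useful_categories)
    return useful / total
```

For example, with hypothetical counts `{1: 4, 5: 3, 6: 10, 9: 3}` and categories 1 and 6 marked as useful, `useful_share` gives 14 of 20 terms, that is 0.7.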

4.1 Implant Model Detection

In total, 48 implant models could be found within the top 50 closest vectors for each search word. For a complete table see appendix A.
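As a reminder of what "closest vectors" means here, the ranking by cosine similarity can be sketched without any library. This is an illustration only: the thesis retrieved neighbours from the model generated with Gensim, whereas the function names and the toy two-dimensional vectors below are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def closest_terms(query, vectors, k=50):
    """Return the k terms most similar to `query`, best first."""
    query_vector = vectors[query]
    scored = [(term, cosine_similarity(query_vector, vector))
              for term, vector in vectors.items() if term != query]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy vectors, not taken from the thesis model.
toy_vectors = {
    "icd": [1.0, 0.0],
    "crtd": [0.9, 0.1],
    "stent": [0.0, 1.0],
    "graft": [-1.0, 0.0],
}
```

Calling `closest_terms("icd", toy_vectors, k=2)` ranks "crtd" first and "stent" second, because "graft" points in the opposite direction.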


4.2 Visualization

The top 10 closest neighbours for each search word are visualized in figure 4.2.

Figure 4.2: Visualization of vector clusters for the search terms


5 Discussion

This chapter is divided into three sections. The first section is a discussion of the results. The second section is a discussion of the methods used. The third section will place the thesis in a wider context.

5.1 Results

The following section is divided into two subsections, the first one discussing the results from the model evaluation and the second subsection discussing the results from the visualization.

5.1.1 Model evaluation

68% of the 50 closest terms were considered useful for identifying risk patients ahead of MRI. This means that the prospects of using Word embeddings for this type of task are promising. The challenge will be to find a meaningful way to utilize these findings so that they help medical staff by reducing their administrative load. The largest share of non-useful terms were anatomical terms, which is logical because an implant is often mentioned in patient journal data in relation to its position in the body. Particularly interesting are the findings of implant models and numbers, because staff who decide whether a patient is able to undergo an MRI examination need to know the specific implant model: different models may be MRI safe under certain conditions. It is not as simple as a binary classification of patients as safe or unsafe. Due to the typical naming of implant models, this task is likely easier on Swedish data than on English data. Typical implant names contain English words, such as ellipse or allure, and these do not occur naturally in the Swedish language. Another thing to note is that hyperparameter optimization can likely lead to better results, because the parameters that were used were optimized by Fallgren et al. (2016) on non-medical data, which is considerably different. It was not possible to optimize them in this thesis due to the lack of a measure to optimize against. The results are considered good enough to demonstrate the potential of Word embeddings for the task of MRI risk identification.

5.1.2 Visualization

Other than providing a clear image of how the terms are related to each other, certain points about the model can be made from the visualization. The visualization can be used to verify that the model is semantically sound. The clusters for the search terms "crtd", "crtp" and "icd" are closely related, meaning that the model has captured that these are tightly connected. A CRT-D implant is essentially a combined CRT-P and ICD implant. One thing to note is that a visualization using the t-SNE algorithm is a low-dimensional representation of a multidimensional space and thus not an exact visualization of the model. The idea behind the algorithm is to keep the relative distance between vectors. But since the representation is based on a stochastic algorithm, there may be some random variation between runs on the same data.

5.2 Method

In the following section, the method used in the thesis will be discussed. This includes the following subsections, data preparation and model evaluation.


5.2.1 Data preparation

In this subsection the data preparation is discussed. This includes how non-word tokens, stop words and abbreviations were handled.

5.2.1.1 Non-word tokens and stop-words

Non-word tokens are stripped from the data set before model generation because they interfere with the vector generation. For example, the final word of a sentence, with its trailing period attached, would otherwise become a separate token, so the algorithm would generate different vectors for "crtp." and "crtp". An exception is made for hyphens because they carry semantic value. For example, "crtp" can also be written as "crt-p". These variations do not seem to follow a pattern, and both forms are used inconsistently. This means that it would be very difficult to create a rule-based cleaning process that preserves hyphens where they are needed for semantic value and removes them where they are not. Stop words are not removed, because they are not considered to have a significant influence on the data set. Stop words are generally words that do not carry much meaning in themselves but are used for linguistic flow. Stop words do not have a significant presence because patient journal data are not written with linguistic flow in mind, but rather with the purpose of relaying very specific information in a time-effective manner.
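A cleaning step of this kind could look roughly like the sketch below. It is not the cleaning script used in the thesis: lowercasing, whitespace tokenization, keeping digits, and the function name are all assumptions made for the illustration. The only behaviour taken from the text is that punctuation is stripped while hyphens inside a token are preserved.

```python
import re

# Matches every character that is neither a word character nor a hyphen.
_NON_WORD = re.compile(r"[^\w-]")


def clean_tokens(text):
    """Lowercase, split on whitespace, strip punctuation, keep inner hyphens."""
    tokens = []
    for raw in text.lower().split():
        token = _NON_WORD.sub("", raw).strip("-")  # drop stray edge hyphens
        if token:  # skip tokens that consisted only of punctuation
            tokens.append(token)
    return tokens
```

With this sketch, "crtp." and "crtp" map to the same token, while "crt-p" keeps its hyphen and therefore remains a separate term, which mirrors the inconsistency described above.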

5.2.1.2 Abbreviation detection

Abbreviations are not expanded because, while there are lexicons for medical abbreviations (see section 2.3.2), these resources were created for emergency room and radiology notes. The abbreviations in the lexicons are a poor fit for the patient journal data used in this thesis. Key abbreviations such as "CRT" or "ICD" are not present in the lexicons. Even if abbreviations appear, one should be careful when expanding them due to their ambiguity. Take, for example, the abbreviation "ICD": in this thesis it may mean both "International Statistical Classification of Diseases and Related Health Problems" and "Implantable Cardioverter Defibrillator". A system that is designed to detect risks for MRI examinations and that mistakes one of these meanings for the other is not a safe system.

5.2.2 Evaluating a model

This subsection discusses how the model was evaluated.

5.2.2.1 Creating a gold standard

The Cranfield paradigm (see section 2.2.2) outlines the value of creating a gold standard to use for evaluation of a model. This way you can calculate a standardized metric score, often in the form of accuracy, recall and possibly F1-score. The design behind a gold standard is very important because careful consideration needs to be taken with regard to what is actually measured. In the context of this thesis, a valid gold standard captures whether a Word embeddings model can detect risks for MRI examinations in patient data or not. One way to do this is to create a data set of patient journals labelled according to whether the patient was sent to an MRI examination or not. This method is not possible, however, because the data for MRI examinations were not present in the actual data set from the cardiology clinic and the patient journal data was anonymized, meaning that it was not possible to create patient profiles from the data. Another possible way to create a gold standard is to compare the number of implant models in the data set with the number of implants found in the model within a certain range of cosine distance. This method requires a complete reference of the types of implants that are being used and have been used in the past. Unfortunately, this resource does not exist.
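Had a reference list of implant models existed, the comparison could have been scored along the following lines. The sketch computes precision, recall and F1 over sets of terms; note that precision is used here rather than the accuracy mentioned above, and that the function name and any term sets passed to it are hypothetical, since no such gold standard was available in this thesis.

```python
def precision_recall_f1(retrieved, gold):
    """Score a set of retrieved terms against a gold-standard set."""
    retrieved, gold = set(retrieved), set(gold)
    true_positives = len(retrieved & gold)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

If, for instance, four terms were retrieved and three of them appeared in a gold list of five implant models, precision would be 0.75 and recall 0.6.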


Furthermore, there are attempts to systematize medical terminology that could potentially be used in a gold standard in a similar manner as implant models. The issue with this approach is that patient journal data is not structured, and thus the structured terminology models reflect the actual data poorly. There are no ICD-10 codes present in proximity to their respective search words, and many of the terms used in the data have different names or are not represented at all in the SNOMED CT database. SNOMED CT does not cover abbreviations, which makes the issues with abbreviation detection more critical, and even if correctly expanded the terms may not be present in the database in its current form. For example, the term "ICD" is not present in the Swedish edition of SNOMED CT at the time of writing. The expanded version, "Implantable cardiac defibrillator", is not present either. The term "Implantable cardiac pacemaker" is present, but that term is ambiguous because it is uncertain whether it is used for the "ICD" type of implant or for the more general term "pacemaker". Instead, a descriptive measure is provided as a basis for answering the stated research question. The descriptive measure can be used to demonstrate the potential of this method for solving the issue of MRI safety.

5.2.2.2 Annotation

For annotation tasks, at least two annotators should be used so that an IAA can be calculated (see section 2.2.2). Due to resource constraints, this was not possible. One annotator worked alone on the annotation, and the annotation was verified afterwards by the annotator and a group of MRI physicists. In the verification process, any uncertain terms were resolved and mistakes corrected. During this process, the categories were verified as well. Afterwards, the annotation group determined which categories were useful or not. It is important to note that the annotation group (the author and the three MRI physicists) does not have the same domain-specific knowledge of cardiology as experts in the field. A cardiologist may annotate the terms differently, but they have been annotated with a specific purpose in mind, relative to how MRI physicists would use the terms to identify risk patients. It is also important to note that errors may occur due to the very domain-specific nature of patient journal data. Even though MRI physicists are highly skilled medical staff and regularly read patient journals from the cardiology clinic, they were not fully acquainted with the language of cardiology notes.
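With a second annotator, agreement could have been quantified, for example with Cohen's kappa, one common IAA measure. The sketch below is an assumption about how such a calculation might look; the thesis does not state which IAA measure would have been used, and the function name and any label sequences passed to it are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators' category labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label sequences")
    n = len(labels_a)
    # Observed agreement: share of terms given the same category by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both annotated independently at their own rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical category
    return (observed - expected) / (1 - expected)
```

A value of 1 indicates perfect agreement, while a value near 0 indicates agreement no better than chance.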

5.3 The work in a wider context

Health care is a very interesting domain for NLP research for two main reasons. This thesis highlights both the potential of NLP in health care and the potential roadblocks. Firstly, it is a field with high potential for usefulness. Lowering the administrative burden of medical staff means that they can spend more time treating and otherwise helping people. Secondly, it provides a significant technological challenge for the field of NLP. Because of the nature of patient journal data, the threshold for effective systems is higher compared to language in general. At the same time, the room for error is much smaller than in NLP in general. A system error may have dire consequences for both staff and patients, in terms of both money and lives lost. A conclusion that may be drawn is that unsupervised systems are less desirable than decision support systems for medical staff. This is due to the stochastic nature of many machine learning algorithms, including Word embeddings.


6 Conclusion

The aim of this thesis was to answer whether Word embeddings are an appropriate method to identify risk patients ahead of MRI examinations. To answer this question, Word embeddings were generated on patient journal data from the cardiology department of Linköping University Hospital by using the Word2Vec algorithm implemented through Gensim. To evaluate the generated Word embeddings model, interesting key words were chosen by staff at the radiology department at Linköping University Hospital, and the 50 closest vectors by cosine distance were selected and manually annotated with categories that were classified as relevant or irrelevant for the task of identifying risk patients ahead of MRI examinations. This resulted in a distribution of 68% useful terms. Of particular interest was the category of implant models and numbers: 49 implants were detected in the model evaluation. Word embeddings have been shown to have potential for the task of identifying risk patients, although due to limitations inherent in patient journal data further work is required to develop an effective method that may be implemented in existing journal systems.

Further developments should primarily be fundamental building blocks required for the implementation of an effective system. To guarantee that a warning system for MRI risks suits the medical staff’s needs, thorough user research should be conducted. To effectively evaluate system performance, a gold standard should be generated consisting of patient journal data from multiple clinics, annotated with the MRI measures taken and with whether a patient underwent an MRI examination or not. Extensive terminologies containing abbreviations will aid in medical NLP tasks; these terminologies would need to be clinic-specific.


Bibliography

Dalianis, H. (2018). Clinical text mining: Secondary use of electronic patient records. New York: Springer International Publishing.

Fallgren, P., Segeblad, J., & Kuhlmann, M. (2016). Towards a standard dataset of swedish word vectors. In Proceedings of the sixth swedish language technology conference (sltc). Umeå, Sweden.

Hunter, J. D. (2007). Matplotlib: A 2d graphics environment. Computing in Science & Engineering, 9(3), 90–95. doi:10.1109/MCSE.2007.55

Isenius, N. (2012). Abbreviation detection in swedish medical records: The development of scan, a swedish clinical abbreviation normalizer (Master’s thesis, Stockholm University).

Ivanov, A. (2011). Vector. Retrieved April 18, 2019, from https://www.encyclopediaofmath.org/index.php/Vector

Jones, E., Oliphant, T., Peterson, P., et al. (2001). SciPy: Open source scientific tools for Python. [Online; accessed 2019-05-27]. Retrieved from http://www.scipy.org/

Jurafsky, D. & Martin, J. H. (2009). Speech and language processing (2nd edition). Upper Saddle River, NJ, USA: Prentice-Hall, Inc.

Lei, K., Si, S., Wen, D., & Shen, Y. (2017). An enhanced computational feature selection method for medical synonym identification via bilingualism and multi-corpus training. (pp. 909–914). doi:10.1109/ICBDA.2017.8078771

Mikolov, T., Chen, K., Corrado, G. S. [Greg S.], & Dean, J. [Jeffrey]. (2013). Efficient estimation of word representations in vector space. Retrieved from http://arxiv.org/abs/1301.3781

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S. [Greg S], & Dean, J. [Jeff]. (2013). Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, & K. Q. Weinberger (Eds.), Advances in neural information processing systems 26 (pp. 3111–3119). Curran Associates, Inc. Retrieved from http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf

Naili, M., Chaibi, A. H., & Ghezala, H. H. B. (2017). Comparative study of word embedding methods in topic segmentation. Procedia Computer Science, 112, 340–349. doi:10.1016/j.procs.2017.08.009

Norvig, P. & Russell, S. J. (2009). Artificial intelligence: A modern approach. Upper Saddle River, NJ, USA: Prentice Hall.

Pakhomov, S., Pedersen, T., & Chute, C. G. (2005). Abbreviation and acronym disambiguation in clinical discourse. AMIA Annual Symposium Proceedings, 2005, 589. Retrieved from https://www.ncbi.nlm.nih.gov/pubmed/16779108

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., . . . Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

Řehůřek, R. & Sojka, P. (2010). Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks (pp. 45–50). http://is.muni.cz/publication/884893/en. Valletta, Malta: ELRA.

Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., . . . Hassabis, D. (2018). A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419), 1140–1144. doi:10.1126/science.aar6404. eprint: https://science.sciencemag.org/content/362/6419/1140.full.pdf

Statistics Sweden. (2018). The future population of sweden 2018–2070. Retrieved May 27, 2019, from https://www.scb.se/en/finding-statistics/statistics-by-subject-area/population/population-projections/population-projections/pong/statistical-news/the-future-population-of-sweden-2018-2070/

Tsvetkov, Y., Faruqui, M., & Dyer, C. (2016). Correlation-based intrinsic evaluation of word vector representations. In Proceedings of the 1st workshop on evaluating vector-space representations for NLP (pp. 111–115). Berlin, Germany: Association for Computational Linguistics. doi:10.18653/v1/W16-2520

van der Maaten, L. & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9, 2579–2605. Retrieved from http://www.jmlr.org/papers/v9/vandermaaten08a.html


A

Descriptions of Search terms

• Pacemaker

A pacemaker is implanted to help recover the heart’s normal rhythm when it detects a slow or irregular heartbeat.

• Implantable Cardioverter Defibrillator (ICD)

An ICD is an implanted device that sends electrical impulses to restore your heart’s normal rhythm when it detects a potentially dangerous and very fast heartbeat. (Abbott)

• Cardiac Resynchronization Therapy Pacemaker (CRT-P)

A CRT-P device functions much like a regular pacemaker; in addition, it sends small electrical impulses to both ventricles of the heart to help them pump blood in synchrony.

• Cardiac Resynchronization Therapy Defibrillator (CRT-D)

A CRT-D device is a CRT-P device combined with an ICD device.

• Heart Valve

This term can refer both to the biological heart valves in a healthy heart and, in this case, to artificial heart valves used to replace damaged biological heart valves. (Source)

• Stent

Coronary stents are mesh tubes made out of wire that are implanted into a clogged artery with the purpose of widening it and preventing the artery from closing again.

• Graft

Grafting is, in general terms, a procedure where body tissue is moved from one part of a patient’s body to another. In this specific instance, the term refers to grafts where the grafted tissue is manufactured artificially.

• Implant

Implant is a general term for artificial devices inserted into a patient to replace a damaged or missing biological structure.

• Electrode

An electrode is a conductive part of an implanted device used to send signals from the device to the heart. Electrodes can sometimes be left in a patient when the device is removed.

• Prosthesis


B

Table of detected Implants

Pacemakers CRT-D/ICD CRT-P Heart Valves

Vitatron G20A1 Amplia Allure Sapien3

Assurity Fortify Anthem Sapien

Endurity Fortify Assura Solara

Unify Syncra Unify Assura Protecta Ellipse Promote Vitality Viva Evera Quadra Assura Unify Quadra Brava

Stents Electrodes Prosthesises

Rotablator Surescan 4076 Carpentier-Edwards Perimount Biofreedom Surescan 5076

Integrity Tendril 2088tc

Emerge Quartet 1458q

Coroflex Durata 7122q

Synergy Durata 7120q

Orsiro 4968 CapSure Epi Xience Sprint Quattro 6935M

Onyx Quickflex

Absorb CapSure 5033

Promus premier CapSure 5034 CapSure 5023m
