
Linköpings universitet SE–581 83 Linköping

Linköping University | Department of Computer and Information Science

Master’s thesis, 30 ECTS | Statistics and Machine Learning

2020 | LIU-IDA/STAT-A--20/027--SE

Exploring Cross-lingual Sublanguage

Classification with Multi-lingual Word

Embeddings

Min-Chun Shih

Supervisor: Marco Kuhlmann
Examiner: Oleg Sysoev



Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Abstract

Cross-lingual text classification is an important task due to globalization and the increased availability of multilingual data. This thesis explores a method for implementing cross-lingual classification on Swedish and English medical corpora. Specifically, it explores a simple convolutional neural network (CNN) with MUSE pre-trained word embeddings to approach binary classification of sublanguages ("lay" and "specialized"), transferring from Swedish healthcare texts to English healthcare texts. MUSE is a library that provides state-of-the-art multilingual word embeddings and large-scale high-quality bilingual dictionaries. The thesis presents experiments with imbalanced and balanced class distributions in the training and test data to examine the effect of class distribution, and also examines the influence of clean versus noisy test datasets. The results show that a balanced class distribution in the training data performs significantly better than imbalanced training data, and that clean test data benefits the transfer of labels from one language to another. The thesis also compares the performance of the simple convolutional neural network model with a Naive Bayes baseline. The results show that, on this task, a simple Naive Bayes classifier based on bag-of-words translated using the MUSE English-Swedish dictionary outperforms a simple CNN model based on MUSE pre-trained word embeddings in several experimental settings.


Acknowledgments

I would like to thank my supervisor Marco Kuhlmann for the patient guidance and advice. Thank you for providing vital feedback throughout this thesis. I would also like to thank my external supervisor Marina Santini for taking me on in this project. Thank you for your tremendous help, patience, and encouragement. I have been fortunate to have these supervisors guiding me throughout the process and helping me complete this thesis.

Thanks to my examiner Oleg Sysoev for taking on the role and giving me comments. Thanks to Hariprasath Govindarajan for helping me improve the thesis. Thanks to Matilda Guthartz Pålsson for helping with my English writing. Many thanks to all the people who gave me support, advice, and company during this tough time.

I thank the wonderful people I met at LiU and in Sweden. Special thanks to my friend Jiawei Wu, who brightened my life at LiU.

Last but not least, thanks to my mom and my sister. Thank you for always being so supportive and trusting, and for giving me unconditional encouragement; without your backing, I would not have been able to make this journey.


Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables

1 Introduction
  1.1 Background
  1.2 Aim
  1.3 Research Questions
  1.4 Ethical Considerations

2 Theory
  2.1 Text Classification
  2.2 Cross-lingual Text Classification
  2.3 Convolutional Neural Network
  2.4 Word Embeddings
  2.5 Cross-lingual Word Embeddings
  2.6 Convolutional Neural Network in NLP
  2.7 Naive Bayes Classifier
  2.8 Evaluation
  2.9 Statistical Hypothesis Testing

3 Data
  3.1 Dataset
  3.2 Pre-trained Cross-lingual Word Embeddings

4 Method
  4.1 Modeling
  4.2 Simple CNN Model
  4.3 Mono-lingual Text Classification
  4.4 Cross-lingual Text Classification
  4.5 Naive Bayes Classifier
  4.6 Result Evaluation

5 Results
  5.1 Mono-lingual Classification
  5.2 Cross-lingual Classification

6 Discussion
  6.1 Results
  6.2 Differences and Challenges
  6.3 Method

7 Conclusion
  7.1 Research Questions
  7.2 Future Work


List of Figures

2.1 CNN architecture
2.2 The convolutional operation
2.3 Max pooling
2.4 Word vector representations of words in English and Swedish
2.5 The method of MUSE (Conneau et al. [7])
2.6 The model architecture of CNN in NLP (adapted from Kim [15])
2.7 Detail of the CNN architecture for text classification (adapted from a web page)
2.8 Representation of a sentence: each row is a word, represented by a k-dimensional word vector
2.9 ROC curve & AUC
2.10 Rejection regions for different decision rules
3.1 Distribution of word count in the Swedish eCare subcorpus
3.2 Distribution of word count in the English eCare subcorpus
3.3 Distribution of word count in the NHS-PubMed English subcorpus
4.1 Preprocessing: prefixing "sv" to each token in the Swedish documents and to every word in the Swedish word embedding; the English subcorpus receives the same preprocessing (adapted from Laippala et al. [17])
4.2 The approach of the simple CNN (adapted from Laippala et al. [17])
4.3 Visualizing the CNN model in the experiments
4.4 All data sets are split into training and test sets with an 80%/20% split; 10-fold cross-validation is applied to the 80% training set
5.1 The ROC curve & AUC for five mono-lingual models
5.2 ROC curve & AUC for cross-lingual models; Swedish eCare training data: lay (154) vs. specialized (308)
5.3 ROC curve & AUC for cross-lingual models; Swedish eCare training data: lay (154) vs. specialized (154)


List of Tables

2.1 Co-occurrence matrix
2.2 Confusion matrix
3.1 The sizes of the two classes in the three subcorpora
3.2 Descriptive statistics of Swedish eCare subcorpus word count
3.3 Descriptive statistics of English eCare subcorpus word count
3.4 Descriptive statistics of NHS-PubMed English subcorpus word count
3.5 Unique words and the words covered in the word embeddings
5.1 Results of evaluating on the test dataset with different settings for the number of filters (epochs = 20, filter size = 1, input tokens = 500)
5.2 Results of evaluating on the test dataset with different settings for the filter size (epochs = 20, number of filters = 128, input tokens = 500)
5.3 Results of evaluating on the test dataset with different settings for the number of input tokens (epochs = 20, number of filters = 128, filter size = 1)
5.4 Results of evaluating on the test dataset in mono-lingual classification (epochs = 20, number of filters = 128, filter size = 1, input tokens = 500)
5.5 Cross-lingual classification (epochs = 20, number of filters = 128, filter size = 1, input tokens = 500)
5.6 All results in mono- and cross-lingual training settings
5.7 p-values of the Wilcoxon signed-rank test on the weighted average F1-score ('*' marks the 0.05 significance level)
6.1 Dataset A (positive: 30 vs. negative: 90)


1 Introduction

Text classification (a.k.a. text categorization) has been applied in many Natural Language Processing (NLP) tasks, such as document organization, spam filtering, information retrieval, and web searching. Automatic text classification has been in development for decades. Due to the increased availability of data in electronic format and the availability of more powerful hardware, it became more important in the early '90s as document management gained prominence. Nowadays, much larger amounts of text exist in various digital forms, such as e-mails, blogs, news, etc. Therefore, the importance of efficient classification and information retrieval has increased significantly. The automated classification of texts is a machine learning technique: it learns the characteristics of the classes from a set of pre-classified documents and then assigns classes to unlabelled documents automatically. Automatic classification of text documents without any human involvement not only saves the cost of manual labor considerably, but can also be easily transferred to similar tasks in different domains (Dalal and Zaveri [8], Sebastiani [30]).

With globalization and the growth of the internet, the availability of multilingual data is increasing, so it is of interest to organize documents in different languages into the same taxonomy of categories. For instance, a global company may want to organize documents in various languages from its branches into one taxonomy of categories. However, documents do not always have labels in every language dataset. In order to categorize a new target language, we can approach this task in two ways. The first way is to manually label the new target dataset, which is time-consuming and usually expensive. The other way is to transfer the characteristics from one language to the target language; this approach is called cross-language text classification. Cross-language text classification (CLTC) is the task of categorizing unclassified documents in two different languages by learning a text categorization model from a set of pre-labelled documents in one language $L_s$ and then classifying new documents in another language $L_t$ (Wei et al. [37], Xu and Yang [39]). Thus, if reliable and efficient methods could be found to accelerate the annotation of unlabelled data, it would benefit many cross-lingual classification tasks: various language resources would quickly become available, and tasks could be transferred to other languages immediately.

Cross-lingual text categorization has been researched in many prior studies. For instance, Bel et al. [4] used three different translation strategies (document translation, terminology translation, and profile-based translation) to approach cross-lingual (English and Spanish in their study) text classification. Later on, Wan [34] proposed machine translation and the co-training approach to improve the accuracy of cross-lingual review sentiment polarity identification. However, these methods require a parallel corpus or a thesaurus, and these are not always available. In NLP, a corpus (plural: corpora) is defined as a collection of texts.

Word embedding is a technique for word representation that maps a set of words to a high-dimensional vector space. It has been found very useful in many mono-lingual NLP problems. In 2013, Mikolov et al. [22] proposed the idea that words with a strong semantic similarity have similar vector representations across different languages, where semantics refers to the meaning of the words. The concept of similarity in vector spaces supports the development of applying multi-lingual word embeddings to categorize multi-lingual text. With multi-lingual word embeddings, what a model learns from one language can easily be transferred to another language with a context similar to the training corpus. The goal of this thesis is to apply pre-trained multi-lingual word embeddings to approach text classification cross-linguistically.

1.1 Background

Medical terminology can be very difficult to understand for people who do not have a specialized education. Healthcare experts use professional terminology to communicate with each other, but patients and caregivers often have difficulty fully comprehending medical texts. This lack of mutual understanding is potentially dangerous, since patients may misunderstand medical recommendations or apply them incorrectly, putting their own health at risk. This problem is common to all languages in the world, because all languages are characterized by standard vs. domain-specific linguistic variations.

E-care@home (http://ecareathome.se) is a Swedish research environment which includes cross-disciplinary competences such as Artificial Intelligence, Semantic Web, Internet of Things and Sensors for Health. The eCare corpus (whose creation was funded by the E-care@home project) gathers texts from many Swedish web pages; it is a concept-specific medical collection that contains descriptions of chronic diseases (e.g. "ansiktstics" or "lungemfysem"). The English corpora comprise English eCare and NHS-PubMed, which have content similar to the Swedish eCare corpus. The English eCare corpus was gathered by the E-care@home project from multiple websites, and the NHS-PubMed corpus was newly crawled from the NHS website and downloaded from the PubMed FTP service for this thesis. Based on the diverse data sources, we call the English eCare corpus noisy data and the NHS-PubMed clean data.

Now, the eCare project wants to label the newly crawled English documents (English eCare and NHS-PubMed) to transfer the research domain from Swedish to the new language. Nevertheless, manually annotating the unlabelled English texts is time-consuming, so they want to find an efficient way to approach automatic labelling. Multi-lingual classification is the idea they want to apply in this task: learning the features on the Swedish corpus and then transferring them to the unlabelled English corpora without any Swedish resources.

1.2 Aim

This thesis aims to explore the feasibility and the efficiency of cross-lingual classification ("lay" and "specialized") on Swedish and English corpora in the medical domain, identifying lay-specialized linguistic variations from one language to another without human intervention, thus reducing pre-processing overhead. The results achieved in this thesis will be used as an empirical baseline with which future experiments will be compared.

Rather than starting from scratch, we build upon the technique proposed in Laippala et al. [17]. We therefore focus on a simple convolutional neural network (CNN) model with MUSE (https://github.com/facebookresearch/MUSE) pre-trained word embeddings to implement the cross-lingual text classification. In order to assess the performance of our cross-lingual model based on MUSE word embeddings, we use a cross-lingual Naive Bayes classifier based on the MUSE bilingual dictionary as the baseline.

We are aware that the class distribution ("lay" vs. "specialized") is imbalanced, and this might affect the performance of the models. Hence, we try different class proportions for the training and test data, and also consider the texts (noisy vs. clean) of the two English data sets. Therefore, in addition to examining the performance of the CNN model with MUSE pre-trained word embeddings, there are three main aspects we want to explore in this thesis:

• Examine the effect of the model on balanced and imbalanced training data.
• Examine the effect of the model on balanced and imbalanced test data.
• Examine the effect of the model on noisy and clean test data.

1.3 Research Questions

Given different class distributions of the training and test data for the convolutional neural network model with MUSE pre-trained word embeddings, and given the Naive Bayes model as a baseline, there are four research questions that we want to answer in this thesis:

1. Does the simple convolutional neural network approach with MUSE pre-trained word embeddings work for classifying cross-lingual corpora in the medical domain? What is its performance with respect to the Naive Bayes baseline?

2. Does the class proportion of the training data affect classification performance? If so, how?

3. Does the class proportion of the test data affect classification performance? If so, how?

4. How is the model performance affected when using noisy versus clean test datasets?

1.4 Ethical Considerations

The text data in this thesis were gathered from various web pages, and all the texts are freely available on the web. The web texts do not contain any confidential data that would require ethical consent for their use, and we use these articles only for academic research to detect linguistic variations.

Copyright

For the eCare corpus (eCare_sv), the eCare website (http://santini.se/eCareCorpus/) has the following disclaimer: "Copyright is held by the author/owner(s) of the web documents included in the corpus. The documents in the corpus can be used for research purposes ONLY. We are ready to delete any documents in the corpus upon the author/owner(s)' request."

For the NHS lay corpus, according to the terms and conditions stated by the NHS, it is allowed to use the NHS Website Content "including copying it, adapting it, and using it for any purpose, including commercially, provided you follow these terms and conditions and the terms of the OGL (Open Government Licence, http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/)". There is no personal data in the information included.

The PubMed specialized corpus is based on the PMC Open Access Subset, which is a part of the total collection of articles in PMC (https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/). The articles in the Open Access Subset are made available under a Creative Commons or similar license. We downloaded the documents from the FTP Service (ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline-2018-sample/), which is the official service used for downloading articles from the Open Access Subset.


2 Theory

This chapter provides related work, relevant ideas and theoretical explanations of different components that are used in this thesis.

2.1 Text Classification

Text classification (categorization) is an important field. It has been applied in many Natural Language Processing (NLP) tasks such as document organization, spam filtering, information retrieval, and web searching (Dalal and Zaveri [8]). Along with the increased availability of documents in digital forms (such as e-mails, blogs, news, etc.), the importance of text classification cannot be neglected. The goal of text classification is to learn a model from a set of documents with pre-defined classes and automatically assign classes to unclassified documents. Formally, let

$$\{(d_1, y_1), (d_2, y_2), \dots, (d_n, y_n)\}$$

be a set of n pre-defined documents, where $d_i \in D$, $D$ is a domain of documents, and $y_i \in C$, $C$ is the set of pre-defined classes. The text classification model learns a function $\phi: D \rightarrow C$ that approximates the unknown target function $\Phi: D \rightarrow C$, such that $\phi$ and $\Phi$ coincide as much as possible (Wei et al. [37]).

There are various approaches to text classification; for example, Naive Bayes (Kim et al. [14], Meena and Chandran [19]) and Support Vector Machines (Wang et al. [35], Zhang et al. [40]) are common supervised machine learning methods for text classification tasks. Naive Bayes and Support Vector Machines with bag-of-words sentence representations are often applied as baselines for text categorization tasks (Wang and Manning [36]). Besides linear classifiers, many neural network models have been proposed recently (Collobert and Weston [6], Zhang et al. [41]), and pre-trained word embeddings have proven to be very useful in NLP tasks. For instance, Socher et al. [32] used semi-supervised recursive autoencoders to predict sentiment distributions; Kim [15] proposed a CNN architecture combined with Word2Vec (Mikolov et al. [22]) pre-trained word embeddings for sentence classification.


Sublanguage

Hirschman and Sager [13] and Lehrberger [18] defined the notion of sublanguage as the particular language (e.g., special terminology or mathematical formulas) used in circumscribed fields, defined by specialists and applied when they communicate with each other. A layman has difficulty understanding the subject matter of texts written in sublanguages, and this can be a communication barrier between an expert and a layman.

Sublanguage Text Classification

Santini et al. [27] have explored monolingual binary classification (easy-to-read variety vs. standard language) on corpora from three different resources. In the paper, they combined four different word representations with different classification methods and compared the performances. One of the datasets Santini et al. [27] used is eCare, the same resource as our Swedish medical subcorpus. In their experiments, the best weighted average F-measure achieved was 0.82, with the combination of autoencoder feature representations and SVM methods.

2.2 Cross-lingual Text Classification

Cross-lingual text classification is the task of classifying unlabelled documents in two different languages into the same taxonomy of categories by learning a model in the source language $L_s$ and applying it to the target language $L_t$. Much previous text classification research focused on classifying monolingual documents; as the availability of multi-lingual data has been increasing, more studies are concentrating on classifying documents in more than one language.

Cross-lingual text classification has been approached with various methods. Most approaches either need a parallel corpus or a bilingual dictionary, or use machine translation technology. For instance, Bel et al. [4] used document translation, terminology translation, and profile-based translation to implement cross-lingual text classification. Mihalcea et al. [20] approached cross-lingual classification by translating the subjectivity lexicon from the source language to the target language. Later on, Wan [34] proposed machine translation and the co-training approach to improve the accuracy of cross-lingual review sentiment polarity identification. As more multi-lingual word embedding methods have been developed, cross-lingual word embeddings have been applied to document representations with neural network models. For example, Schwenk and Li [29] extended the previous work of Klementiev et al. [16]; they compared different methods for classifying the Reuters corpus in various languages by training a simple one-layer convolutional neural network on top of word embeddings, and achieved strong performance.

Laippala et al. [17] leveraged English online data to categorize a Finnish online corpus, using the English data as training data. Their approach applied a simple convolutional neural network and MUSE pre-trained word embeddings (Conneau et al. [7]), training only on an English corpus while identifying the genre in a Finnish corpus without any Finnish resources. MUSE (https://github.com/facebookresearch/MUSE) is a library that provides pre-trained multilingual word embeddings and large-scale high-quality bilingual dictionaries. Their method achieved good performance while predicting six registers (genres) from English to Finnish. With pre-trained word embeddings, cross-lingual classification does not need to involve any parallel data or pre-translation, which is a benefit for language pairs with limited resources.

While the research of Laippala et al. [17] focused on cross-lingual register identification, our task in this thesis is exploring the feasibility and the efficiency of transferring labels between Swedish and English in a domain-specific setting, namely the lay-specialized sublanguages in healthcare texts. Instead of starting from scratch, we build upon the method proposed in Laippala et al. [17] to explore our task. However, there are some differences between our task and theirs that need to be considered.

From Cross-lingual Register Classification to Cross-lingual Sublanguage Classification

In Laippala et al. [17], six online register (genre) classes were transferred from English to Finnish with a simple CNN model and MUSE pre-trained word embeddings, and their approach achieved competitive results. The register is an important predictor of language variation. The notion of a genre is about the purpose of a document, not its content, which means texts from the same genre can have distinct topics (Finn and Kushmerick [10]). For instance, the Narrative genre in Laippala et al. [17] includes News, Short stories, and Personal blogs as sub-registers, and these articles can describe various topics. In contrast, we want to approach a binary classification of sublanguages from Swedish healthcare texts to English healthcare texts. A sublanguage is jargon or a specialized language used by a particular group and characterized by domain-specific terms. The notion of sublanguage is theoretically and computationally easier to capture than the notion of the register (genre). However, there are still several differences and challenges when we adapt the method from Laippala et al. [17].

• The language pair: The experiments in Laippala et al. [17] trained on English documents and then predicted on Finnish documents, while we want to transfer the labels from Swedish to English. Swedish and English belong to the Germanic language family, while Finnish belongs to the Uralic family. A language pair from the same language family might be beneficial when transferring labels from one language to another.

• The content: Our data is a collection of medical texts, while MUSE provides word vectors trained with fastText (on Wikipedia corpora) and includes only the 200k most common words. The domain-specific terminology in our datasets might therefore not be included in MUSE.

• The sizes of the data: Compared to the data used in Laippala et al. [17], there is a limited amount of data in Swedish eCare and English eCare. The small data sizes might impact the performance of the models.

• The distribution of classes: The proportions of the lay and specialized labels are imbalanced in all datasets.

• Clean vs. noisy data: The eCare Swedish corpus that we use as training data is noisy because it contains typographical inaccuracies caused by the crawler when removing boilerplates; it occasionally also contains URLs and even words in other languages. We see value in building models based on noisy data, because the data cleaning step is time-consuming and sometimes alters the actual noise inherent to textual data. This is a notable difference from the Finnish paper we were inspired by, since those authors used a curated corpus as training data. As test data, we use both noisy data (the eCare English corpus) and clean data (the NHS-PubMed English corpus). By clean data here, we mean that the texts in the corpus contain fewer inaccuracies and inconsistencies, although they have not been curated.

2.3 Convolutional Neural Network

Neural networks are one of the approaches applied in machine learning, inspired by the processes of the human brain. The convolutional neural network (CNN) is a type of neural network which has accomplished remarkable success in many computer vision tasks in recent decades; object detection and face recognition are examples of its applications.

Figure 2.1 shows the architecture of a convolutional neural network. It is composed of two convolutional layers, two pooling layers, and one fully connected layer; these are the main kinds of layers in a convolutional neural network. The next three subsections explain these three main building blocks in detail.

Figure 2.1: CNN architecture

2.3.1 Convolution

The convolutional layer is the first building block; it extracts features from an input matrix, such as an image. It includes a convolution operation and an activation function; Figure 2.1, for instance, uses ReLU, one of the most commonly used activation functions in deep learning. The convolution operation is a mathematical operation that multiplies an input matrix with filters (so-called kernels). The convolution operation is shown in Figure 2.2, where the operation is denoted by the symbol $\otimes$. The method is as follows: in the first big matrix in Figure 2.2, the upper-left area with the grey background is a $3 \times 3$ matrix; each component of this matrix is multiplied with the corresponding component of the filter. That is,

$$3 \times 1 + 0 \times 0 + 1 \times 0 + 1 \times 0 + 1 \times 1 + 2 \times 0 + 2 \times 0 + 4 \times 0 + 2 \times 2 = 6$$

The value 6 is the upper-left component of the result matrix. Next, the filter matrix is moved one unit to the right, and the convolution operation is repeated for each move. By repeating this procedure until the filter reaches the lower right, we obtain the feature map.

Different filters can extract various kinds of information from single or multiple inputs. The step by which the filter is moved is called the stride; moving the filter by one unit or by multiple units extracts different features.
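To make the sliding-window procedure concrete, here is a minimal NumPy sketch of the operation just described (strictly speaking, the cross-correlation variant that CNN libraries compute). The function name and the example values are illustrative, not taken from the thesis:

```python
import numpy as np

def convolve2d(x: np.ndarray, w: np.ndarray, stride: int = 1) -> np.ndarray:
    """Slide filter w over x, summing the element-wise products at each position."""
    h, k = w.shape
    rows = (x.shape[0] - h) // stride + 1
    cols = (x.shape[1] - k) // stride + 1
    out = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            patch = x[i * stride:i * stride + h, j * stride:j * stride + k]
            out[i, j] = np.sum(patch * w)  # one component of the feature map
    return out

x = np.random.randint(0, 5, size=(5, 5))  # illustrative 5x5 input matrix
w = np.eye(3)                             # illustrative 3x3 filter
print(convolve2d(x, w))                   # 3x3 feature map for stride 1
```

A larger stride shrinks the output feature map, since the filter visits fewer positions.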

2.3.2 Pooling

The pooling layer is also called sub-sampling. It summarizes the presence of features in patches of the feature map in order to reduce the number of parameters and prevent overfitting; the computational cost is also reduced by this parameter reduction. Max pooling (see Figure 2.3) and average pooling are the two most common pooling methods (Christlein et al. [5]). Max pooling takes the maximum value from each sub-region, while average pooling computes the mean value of each sub-region.
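A minimal sketch of both pooling variants over non-overlapping 2×2 sub-regions; the helper function is ours, not from the thesis:

```python
import numpy as np

def pool2d(x: np.ndarray, size: int = 2, mode: str = "max") -> np.ndarray:
    """Summarize each non-overlapping size x size sub-region by its max or mean."""
    rows, cols = x.shape[0] // size, x.shape[1] // size
    reduce = np.max if mode == "max" else np.mean
    out = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            out[i, j] = reduce(x[i * size:(i + 1) * size, j * size:(j + 1) * size])
    return out

x = np.array([[1, 3, 2, 0],
              [4, 6, 5, 1],
              [0, 2, 9, 8],
              [1, 1, 7, 3]])
print(pool2d(x, mode="max"))   # [[6. 5.] [2. 9.]]
print(pool2d(x, mode="mean"))  # [[3.5 2.  ] [1.   6.75]]
```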


Figure 2.2: The convolutional operation

Figure 2.3: Max pooling

2.3.3 Fully Connected

A fully connected layer (FC) is connected to the last pooling layer at the end of the CNN structure. The convolution and pooling layers extract features; the fully connected layer flattens the output of the pooling layer into a single vector and gives the final classification decision.

2.4 Word Embeddings

Human beings can easily comprehend and understand the meaning and the context of texts; for a computer, however, corpora are just strings. In order to mine the texts, we need to represent them in ways that computers can process. Word embedding is a state-of-the-art technique of word representation. Word embeddings map words to a high-dimensional vector space; the idea is based on the distributional hypothesis (Harris [12]), which states that words occurring in similar contexts tend to have similar meanings. In other words, words represented by word embeddings are semantically close if they are close in the vector space, where closeness can be measured in Euclidean or cosine distance. There are many approaches to the matter; the method using a co-occurrence matrix is probably one of the most intuitive. Consider the sentences "I like dogs." and "I like cats.". Table 2.1 shows the co-occurrence matrix that records which words occur in the same sentence. Each row of the matrix is a vector for a word; for example, "dogs" can be represented as the four-dimensional vector (1, 1, 0, 0). The dimensionality of the vectors increases as we add more texts.

        I   like   dogs   cats
I       0    2      1      1
like    2    0      1      1
dogs    1    1      0      0
cats    1    1      0      0

Table 2.1: Co-occurrence matrix

There are many popular word embedding training techniques involving neural networks that have been developed and are widely used, such as Tomas Mikolov's word2vec [21], Stanford University's GloVe [24], and AllenNLP's ELMo [25]. The benefit of representing words as vectors is that languages are mapped to vector spaces with precise mathematical properties, which means we can treat the study of language as a mathematical problem in a vector space. For instance, the relationship among the four words king, man, woman, and queen can be expressed by the equation $v(\text{king}) - v(\text{man}) + v(\text{woman}) \approx v(\text{queen})$, where $v(\cdot)$ denotes the vector of a word.
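As a concrete illustration, the following sketch builds the co-occurrence matrix of Table 2.1 from the two example sentences; the variable names are ours:

```python
import numpy as np

sentences = [["I", "like", "dogs"], ["I", "like", "cats"]]
vocab = ["I", "like", "dogs", "cats"]
index = {w: i for i, w in enumerate(vocab)}

# Count, for every pair of distinct words, how often they occur in the same sentence.
M = np.zeros((len(vocab), len(vocab)), dtype=int)
for sentence in sentences:
    for a in sentence:
        for b in sentence:
            if a != b:
                M[index[a], index[b]] += 1

print(M[index["dogs"]])  # [1 1 0 0] -- the vector representing "dogs" (Table 2.1)
```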

2.5 Cross-lingual Word Embeddings

In 2013, Mikolov et al. [22] proposed the Skip-gram model and found that the vector spaces of different languages have similar geometric arrangements, even between substantially different language pairs like English and Vietnamese. Figure 2.4 illustrates word embeddings of English and Swedish words, showing that semantically similar words in different languages have a similar arrangement.

Since continuous word embedding spaces exhibit strong semantic similarity across languages, documents can be represented by similarly structured word vectors, which brings several benefits. For example, we can compare the meaning of words cross-linguistically with cross-lingual word embeddings, which can be applied to machine translation or cross-lingual information retrieval. In addition, a model can be transferred between languages, since cross-lingual word embeddings represent words in a joint embedding space. This means learning a model in one language and then sharing it with another language, without the target language being involved in the model training process (Ruder et al. [26]).

Multilingual Unsupervised and Supervised Embeddings (MUSE)

Since the idea of cross-lingual word vectors was proposed (Faruqui and Dyer [9], Xing et al. [38], Ammar et al. [2], Artetxe et al. [3], Smith et al. [31]), more and more multi-lingual word embedding methods have been developed, but most of them need bilingual dictionaries or parallel corpora as anchor points to learn a linear mapping from the source language to the target language. Conneau et al. [7] therefore presented unsupervised approaches, without parallel data, for learning cross-lingual word embeddings, which matched or outperformed supervised methods on cross-lingual tasks. The Multilingual Unsupervised and Supervised Embeddings (MUSE) library (https://github.com/facebookresearch/MUSE) was built by Conneau et al. [7], who proposed a word translation method that does not use any parallel data.


Figure 2.4: Word vector representations of words in English and Swedish.

Figure 2.5 shows the method of MUSE. Assume that word embeddings for two different languages have been trained independently. (A) X and Y denote the English and Italian word embeddings, respectively. The dots represent words, and the size of a dot reflects the frequency of the word in the training corpus. We now want to align (or translate) these two word embeddings. (B) Adversarial learning is used to train a linear mapping matrix W such that W roughly aligns the distributions of the two word embeddings, $XW \approx Y$. A model called the discriminator is trained on randomly sampled vectors (the green stars, from XW and Y) to identify the origin of an embedding, and the mapping is trained jointly to fool the discriminator.

Discriminator objective: The discriminator loss can be written as

$$\mathcal{L}_D(\theta_D \mid W) = -\frac{1}{n} \sum_{i=1}^{n} \log P_{\theta_D}(\text{source} = 1 \mid W x_i) - \frac{1}{m} \sum_{i=1}^{m} \log P_{\theta_D}(\text{source} = 0 \mid y_i)$$

where $\theta_D$ are the discriminator parameters and $P_{\theta_D}(\text{source} = 1 \mid z)$ is the probability, as determined by the discriminator, that a vector $z$ comes from the source embedding (source = 0 refers to the target embedding).

Mapping objective: The mapping is trained jointly with the discriminator in order to fool it:

$$\mathcal{L}_W(W \mid \theta_D) = -\frac{1}{n} \sum_{i=1}^{n} \log P_{\theta_D}(\text{source} = 0 \mid W x_i) - \frac{1}{m} \sum_{i=1}^{m} \log P_{\theta_D}(\text{source} = 1 \mid y_i)$$


(C) The matrix W is refined by the Procrustes solution with an orthogonality constraint on W (Schönemann [28]). The formula for fine-tuning W is

$$W^\star = \underset{W \in O_d(\mathbb{R})}{\arg\min}\; \|WX - Y\|_F = UV^T, \quad \text{with } U \Sigma V^T = \text{SVD}(Y X^T)$$

where $O_d(\mathbb{R})$ is the space of $d \times d$ orthogonal real matrices, and $X$ and $Y$ are $d \times n$ matrices of the two word embeddings over a parallel vocabulary. During training, W is updated by the rule

$$W \leftarrow (1 + \beta) W - \beta (W W^T) W$$

to stay close to an orthogonal matrix, where $\beta = 0.01$ performs well. (D) With the refined matrix W, word pairs can be found by nearest-neighbor search.

Figure 2.5: The method of MUSE. (Conneau et al. [7])

Conneau et al. [7] used this method to align the pre-trained word vectors from fastText, and then provided pre-trained multilingual word embeddings for 30 languages in the MUSE library. In this thesis, we apply the MUSE pre-trained word vectors in our experiments to implement cross-lingual text classification.
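The refinement step (C) has a simple closed form. The following NumPy sketch implements it under the stated assumption that X and Y are d × n matrices of word vectors over a parallel vocabulary; the function names are ours, and this is not the MUSE library's API:

```python
import numpy as np

def procrustes(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Closed-form refinement: W* = U V^T, where U S V^T = SVD(Y X^T)."""
    U, _, Vt = np.linalg.svd(Y @ X.T)
    return U @ Vt

def orthogonality_update(W: np.ndarray, beta: float = 0.01) -> np.ndarray:
    """One step of W <- (1 + beta) W - beta (W W^T) W, keeping W near-orthogonal."""
    return (1 + beta) * W - beta * (W @ W.T) @ W

# Tiny illustrative example: recover a random rotation from noisy aligned vectors.
d, n = 5, 100
rng = np.random.default_rng(0)
W_true, _ = np.linalg.qr(rng.normal(size=(d, d)))   # a random orthogonal mapping
X = rng.normal(size=(d, n))
Y = W_true @ X + 0.01 * rng.normal(size=(d, n))     # aligned vectors with small noise
W = procrustes(X, Y)
print(np.allclose(W, W_true, atol=0.05))            # True: the mapping is recovered
```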

2.6 Convolutional Neural Network in NLP

The CNN model was originally invented for computer vision, where it has achieved outstanding results. In 2014, Kim [15] transferred the model to text classification, proposing convolutional neural networks for sentence classification. Kim [15] used word2vec pre-trained word vectors with one hidden layer to approach the convolutional neural network (CNN) in NLP; the model structure is shown in Figure 2.6. The architecture in NLP is similar to the simple CNN architecture applied to images that we introduced in Section 2.3 (Figure 2.1), but with slight variations. It contains a convolutional layer, a pooling layer, and a fully connected layer at the end.

Figure 2.7 breaks each step down in detail. The first row of the figure is the input sentence matrix. A sentence matrix represents each word in a sentence as a k-dimensional word vector; Figure 2.8 shows a zoomed-in detail. Let $x_i \in \mathbb{R}^k$ be the k-dimensional word vector of the i-th word in the sentence. A sentence composed of n words is represented as $x_{1:n} = x_1 \oplus x_2 \oplus \dots \oplus x_n$, where the concatenation operator is denoted by $\oplus$. Let $x_{i:i+j}$ be the concatenation of the words from the i-th to the j-th position. Instead of a 2D filter, a 1D filter is applied in the convolution layer of a CNN model for natural language processing. Let the 1D filter be $w \in \mathbb{R}^{hk}$, where h is the window of words used to produce new features. The second row of Figure 2.7 shows different filters with various windows of words. A filter w applied to a window of words $x_{i:i+h-1}$ generates a feature $c_i = f(w \cdot x_{i:i+h-1} + b)$, where b is a bias term and f is an activation function. After moving the 1D filter from the top to the bottom over $x_{1:h}, x_{2:h+1}, \dots, x_{n-h+1:n}$, we obtain a feature map


Figure 2.6: The model architecture of CNN in NLP. (Adapted from Kim [15])

$c = [c_1, c_2, \dots, c_{n-h+1}]$, where $c \in \mathbb{R}^{n-h+1}$. Hence, the convolution operation of every filter with an activation function can be represented by the formula

$$C = f(W \cdot X + b)$$

where f is an activation function and b is a bias. ReLU is a common activation function, defined as the positive part of its argument:

$$f(x) = \max(0, x)$$

Thus, the function can be rewritten as

$$C = f(W \cdot X + b) = \max(0, W \cdot X + b)$$

Max pooling is applied after the convolution layer; the max pooling operation captures the maximum value from each feature map. In Figure 2.7, there are multiple filters with different region sizes, which extract multiple features. Dropout regularization is applied after the max pooling layer. Dropout is a regularization technique proposed by Srivastava et al. [33] for preventing overfitting when training a neural network model; it is especially useful when only a small amount of training data is available. The result passes to the fully connected softmax layer. A softmax layer is commonly used as the final layer of a classification neural network; it squashes the output into values between 0 and 1, a probability distribution over the classes whose probabilities sum to 1. The softmax function, also called softargmax or the normalized exponential function, maps a K-dimensional real vector to another K-dimensional real vector, $\sigma: \mathbb{R}^K \rightarrow \mathbb{R}^K$, and is defined by the formula

$$\sigma_i(z) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \quad \text{for } i = 1, \dots, K$$

Each element $z_i$ of the input vector is exponentiated as $e^{z_i}$, where i is the index of the element in the vector, and the values are divided by the sum of the exponentials $\sum_{j=1}^{K} e^{z_j}$, which normalizes them and ensures $\sum_i \sigma_i(z) = 1$. Therefore, the output of the softmax layer gives a probability for each class, and the model labels the input document with the class of highest probability.


Figure 2.7: Detail of the CNN architecture for text classification. (Adapted from a web page.)

Figure 2.8: Representation of a sentence. Each row is a word, represented by a k-dimensional word vector.
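As a sketch of how such a one-layer text CNN can be assembled, the following uses Keras with the hyperparameter values reported in the thesis tables (number of filters = 128, filter size = 1, input tokens = 500). The vocabulary size and dropout rate here are illustrative assumptions, and in the cross-lingual setting the embedding layer would be initialized with the 300-dimensional MUSE vectors rather than trained from scratch:

```python
from tensorflow.keras import layers, models

VOCAB_SIZE = 20000  # illustrative; in practice, the corpus vocabulary size
EMBED_DIM = 300     # dimensionality of the MUSE pre-trained vectors
SEQ_LEN = 500       # number of input tokens, as in the thesis experiments

model = models.Sequential([
    layers.Input(shape=(SEQ_LEN,)),
    # Embedding weights would be set from the MUSE vectors (and typically frozen).
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    layers.Conv1D(filters=128, kernel_size=1, activation="relu"),  # 1D convolution over word positions
    layers.GlobalMaxPooling1D(),            # max pooling over each feature map
    layers.Dropout(0.5),                    # dropout regularization (rate assumed)
    layers.Dense(2, activation="softmax"),  # probabilities for "lay" and "specialized"
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```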


2.7 Naive Bayes Classifier

The Naive Bayes classifier is one of the models often used as a baseline for other methods in text classification tasks (Wang and Manning [36]), and it has shown performance comparable to other classifiers such as neural network models and decision trees (Mitchell et al. [23]).

The Naive Bayes classifier is a probabilistic classifier based on applying Bayes' theorem with the naive assumption that the features (words) are statistically independent of each other.

Statistics Notation and the Chain Rule

• $P(A)$ refers to the probability that event A occurs.
• $P(A \mid B)$ is a conditional probability: the probability of event A occurring given that event B has already occurred.
• $P(A, B)$ refers to the probability of events A and B occurring at the same time.
• Chain rule: for events A and B, $P(A, B) = P(A \mid B) \cdot P(B)$.

Bayes' Theorem

The mathematical statement of Bayes' theorem is

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

where A and B are events and $P(B) \neq 0$.

Naive Bayes Assumption

Let $X = \{x_i \mid i = 1, \dots, n\}$ be the features and $C = \{c_j \mid j = 1, \dots, m\}$ be the set of classes. Using Bayes' theorem, the conditional probability of class $c_j \in C$ can be decomposed as

$$p(c_j \mid X) = \frac{p(X \mid c_j)\, p(c_j)}{p(X)}$$

With the chain rule, the joint probability in the numerator can be expanded as

$$\begin{aligned}
p(c_j, x_1, \dots, x_n) &= p(c_j)\, p(x_1, \dots, x_n \mid c_j) \\
&= p(c_j)\, p(x_1 \mid c_j)\, p(x_2, \dots, x_n \mid c_j, x_1) \\
&= p(c_j)\, p(x_1 \mid c_j)\, p(x_2 \mid c_j, x_1)\, p(x_3, \dots, x_n \mid c_j, x_1, x_2) \\
&= \dots \\
&= p(c_j)\, p(x_1 \mid c_j)\, p(x_2 \mid c_j, x_1) \cdots p(x_n \mid c_j, x_1, x_2, \dots, x_{n-1})
\end{aligned} \tag{2.1}$$

By the naive conditional independence assumption, the feature $x_k$ is conditionally independent of $x_l$, where $k, l \in \{1, 2, \dots, n\}$ and $k \neq l$. This means

$$p(x_k \mid c_j, x_l) = p(x_k \mid c_j)$$

for $k, l \in \{1, 2, \dots, n\}$, $j \in \{1, 2, \dots, m\}$, $k \neq l$; hence the formula can be simplified as

$$p(c_j, x_1, \dots, x_n) = p(c_j)\, p(x_1 \mid c_j)\, p(x_2 \mid c_j)\, p(x_3 \mid c_j) \cdots p(x_n \mid c_j) = p(c_j) \prod_{i=1}^{n} p(x_i \mid c_j) \tag{2.2}$$

Since $p(X)$ is independent of $C$, it can be treated as a constant. Thus the conditional distribution over classes $c_j \in C$ can be expressed as

$$p(c_j \mid X) = \frac{1}{Z}\, p(c_j) \prod_{i=1}^{n} p(x_i \mid c_j)$$

where $Z = p(X)$ is independent of $C$, i.e., $Z$ is a constant if the values of $p(X)$ are known.

Naive Bayes Classifier

The Naive Bayes classifier assigns the most likely class by the function

ˆy= arg max jPt1,2,...,mu p(cj) n ź i=1 p(xi|cj) where ˆy P C.
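For illustration, here is a minimal bag-of-words Naive Bayes run with scikit-learn; the toy documents and labels are invented and do not come from the thesis corpora:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins for lay (0) and specialized (1) medical texts.
docs = ["lots of sugar in the blood",
        "too much sugar can hurt your heart",
        "diabetes mellitus type 2 pathophysiology",
        "glycemic control in diabetes mellitus"]
labels = [0, 0, 1, 1]

vec = CountVectorizer()               # bag-of-words counts, the features x_i
X = vec.fit_transform(docs)
clf = MultinomialNB().fit(X, labels)  # estimates p(c_j) and p(x_i | c_j) with smoothing
print(clf.predict(vec.transform(["sugar in the blood"])))  # [0], the lay class
```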

2.8 Evaluation

There are several common metrics for evaluating the performance of a classifier, such as accuracy, precision, recall, F1-measure, and the weighted average F1-score. Table 2.2 shows a confusion matrix, from which these evaluation metrics can easily be computed.

                                 True class
                         Positive               Negative
Predicted   Positive   True positive (TP)    False positive (FP)
class       Negative   False negative (FN)   True negative (TN)

Table 2.2: Confusion matrix

Accuracy

Accuracy is the proportion of correctly classified texts among all texts. It is defined as

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{2.3}$$

However, accuracy does not show the details of the incorrectly classified documents, and when the class distribution is skewed it can be misleading. For instance, the value of accuracy does not tell us whether false negatives or false positives are more common. Because accuracy alone is insufficient, precision and recall are evaluation metrics that allow us to zoom in on the details for each class. Below are the definitions of precision and recall.


Precision

The denominator in the precision formula is (TP + FP), the total number of predicted positives (the first row of the confusion matrix), and the numerator is TP. Precision thus measures the accuracy among the predicted positives and reveals when the number of false positive documents is high.

$$\text{Precision} = \frac{TP}{TP + FP} \tag{2.4}$$

Recall

The denominator of the recall formula is (TP + FN), the number of actual positives (the first column of the confusion matrix). Recall thus measures the accuracy on the actual positives.

$$\text{Recall} = \frac{TP}{TP + FN} \tag{2.5}$$

The F1-score considers both precision and recall, balancing between the two, and is a better metric when the class distribution is uneven.

F1-measure

$$F_1 = \frac{2 \cdot TP}{2 \cdot TP + FN + FP} = \frac{2 \cdot \text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}} \tag{2.6}$$

Nevertheless, the F1-score is calculated from the perspective of a single label, since a binary classification model may perform well on one label but poorly on the other. The weighted average F1-score weights the F1-score of the two classes by the number of instances in each class.

Weighted Average F1-score

$$\text{Weighted Average } F_1 = \frac{(F_{1,\text{ClassA}} \times \text{Instances}_{\text{ClassA}}) + (F_{1,\text{ClassB}} \times \text{Instances}_{\text{ClassB}})}{\text{Instances}_{\text{ClassA}} + \text{Instances}_{\text{ClassB}}} \tag{2.7}$$

AUC

AUC stands for "Area Under the Curve", the curve being the ROC (Receiver Operating Characteristic) curve. Figure 2.9 illustrates the ROC curve and AUC. The ROC curve plots the true positive rate ($TPR = \frac{TP}{TP + FN}$) against the false positive rate ($FPR = \frac{FP}{FP + TN}$) at different threshold values, where the threshold can be any value between 0 and 1. It shows the ability of a binary classifier to distinguish between the two classes. The AUC is the area under the ROC curve; its value lies between 0 and 1. The AUC is 0 if the predictions of the classifier are one hundred percent wrong; conversely, an AUC of 1 means the classifier predicts the classes perfectly. An AUC of 0.5 means the model performs like a random classifier (Hanley and McNeil [11]).


Figure 2.9: ROC curve & AUC
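The metrics above can be computed directly from the confusion-matrix counts. A short illustrative helper covering equations 2.3-2.7; the function names and example counts are ours:

```python
def metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall, and F1 from confusion-matrix counts (eqs. 2.3-2.6)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

def weighted_f1(f1_a: float, n_a: int, f1_b: float, n_b: int) -> float:
    """Weighted average F1 over two classes (eq. 2.7)."""
    return (f1_a * n_a + f1_b * n_b) / (n_a + n_b)

m = metrics(tp=40, fp=10, fn=20, tn=30)
print(m)  # accuracy 0.7, precision 0.8, recall ~0.667, f1 ~0.727
```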

2.9 Statistical Hypothesis Testing

A hypothesis test is a statistical method for determining whether an assumption about a population parameter is valid; such methods are called statistical hypothesis testing or significance tests.

Nonparametric Hypothesis Testing

Nonparametric hypothesis testing, in contrast to parametric hypothesis testing, requires no or very limited assumptions about the distribution of the sample (e.g., a normal distribution).

2.9.1 Key Terms and Concepts

Null Hypothesis

The null hypothesis ($H_0$) is a statistical hypothesis about the value of a population parameter. The null hypothesis is always presumed to be true until the evidence indicates that it should be rejected.

Alternative Hypothesis

The alternative hypothesis ($H_1$) is a statistical hypothesis contrary to the null hypothesis. The alternative hypothesis is accepted when the null hypothesis is rejected.

Level of Significance

The level of significance, also called the significance level and denoted by α, is the probability of rejecting the null hypothesis when it is true; 0.05 is a commonly used value for α.

The significance level determines the critical point between the rejection and non-rejection regions, and the decision rule is made with respect to this critical point. The decision rule is a statement of the rejection region, for instance: reject $H_0$ if $T \geq 1.812$. There are three kinds of hypothesis tests:

• Two-Tailed Test: $H_0: \theta = \theta_0$, $H_1: \theta \neq \theta_0$
• Upper-Tailed Test: $H_0: \theta \leq \theta_0$, $H_1: \theta > \theta_0$
• Lower-Tailed Test: $H_0: \theta \geq \theta_0$, $H_1: \theta < \theta_0$

where θ is the population parameter and $\theta_0$ is one possible value of the parameter. Figure 2.10 shows the rejection regions for the different tests.

P-value

The p-value is the probability of obtaining the observed (or more extreme) results when the null hypothesis is true. Therefore, a small p-value indicates strong evidence against the null hypothesis.

The decision rule based on the p-value:

• If p-value ≥ α, fail to reject the null hypothesis $H_0$.
• If p-value < α, reject the null hypothesis $H_0$.


2.9.2 T-test

A t-test is a statistical hypothesis test used to determine whether there is a statistically significant difference between the means of two groups. There are different kinds of t-tests, such as the one-sample t-test, the independent-samples t-test, and the paired-samples t-test.

T-distribution

The t-distribution is a continuous probability distribution, also called Student's t-distribution. It has a similar shape to the normal distribution but with fatter tails. The t-distribution is used for estimating the population mean of a normal distribution when the population standard deviation is unknown and the sample size is small. In other words, if the population standard deviation is known and the sample size is big enough (more than 30 samples), the normal distribution should be applied to estimate the population mean.

The t-distribution is the foundation of the t-test, which applies a significance test to two sample means. It improves on the Z-test by not requiring the population standard deviation in advance and by being applicable to small sample sizes.

T-distribution Description

Let $X_1, X_2, \dots, X_n$ be random samples from the normal distribution $N(\mu, \sigma^2)$. The t statistic can be written as

$$T = \frac{\bar{X} - \mu}{s / \sqrt{n}}$$

where $\bar{X}$ is the mean of the sample $X_1, X_2, \dots, X_n$, s is the sample standard deviation, and n is the number of samples.

Paired Samples T-test

The paired-samples t-test is used to compare the mean difference between two sets of observations. The assumptions for the paired-samples t-test are:

• The dependent variable is continuous.
• The observations (the differences of the pairs) are sampled independently.
• The distribution of the differences in the dependent-variable pairs is normal.

The hypotheses of the paired-samples t-test can be expressed as:

• $H_0: \mu_{\text{diff}} = 0$
• $H_1: \mu_{\text{diff}} \neq 0$
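A minimal paired t-test with SciPy; the paired scores below are invented for illustration:

```python
from scipy.stats import ttest_rel

# Paired observations, e.g. the same metric measured under two conditions.
before = [0.71, 0.74, 0.69, 0.73, 0.70, 0.75, 0.72, 0.74]
after  = [0.76, 0.78, 0.72, 0.79, 0.74, 0.80, 0.75, 0.79]

stat, p = ttest_rel(after, before)  # tests H0: mu_diff = 0
print(f"t = {stat:.3f}, p-value = {p:.4f}")  # reject H0 if p < alpha = 0.05
```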


2.9.3 Wilcoxon Signed-Rank Test

The Wilcoxon signed-rank test is a nonparametric statistical hypothesis test that compares paired data samples. The paired samples might come from evaluating the same algorithm on different data sets or from evaluating different algorithms on the same datasets. The Wilcoxon signed-rank test is an alternative to the paired t-test when one or more of the data sets is not normally distributed; that is, it can be applied when comparing paired data whose normality assumption is violated.

Assumptions

Note that $(x_i, y_i)$, $i = 1, \dots, n$, are the pairs of samples, and $d_i = y_i - x_i$, $i = 1, \dots, n$, are the paired differences. The assumptions for the Wilcoxon signed-rank test are:

• The differences are continuous.
• The distribution of the differences is symmetric.
• The differences are independent.
• The differences all have the same median.

Hypotheses

• $H_0: M_{\text{diff}} = 0$
• $H_1: M_{\text{diff}} \neq 0$
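And the corresponding nonparametric test, as used in the thesis to compare weighted average F1-scores; the paired values below are illustrative, not results from Chapter 5:

```python
from scipy.stats import wilcoxon

# Paired weighted-average F1 scores of two models on the same test splits.
model_a = [0.78, 0.81, 0.75, 0.80, 0.79, 0.83, 0.77, 0.82]
model_b = [0.72, 0.74, 0.71, 0.75, 0.73, 0.76, 0.70, 0.77]

stat, p = wilcoxon(model_a, model_b)  # two-sided test of H0: M_diff = 0
print(f"W = {stat}, p-value = {p:.4f}")  # '*' in Table 5.7 marks p < 0.05
```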


3 Data

This chapter describes the data and the datasets used in our experiments.

3.1 Dataset

The data used in our experiments comes from a domain-specific corpus that contains web documents about diseases. More precisely, we used three eCare subcorpora, one in Swedish and two in English: the labelled Swedish eCare subcorpus, the labelled English eCare subcorpus, and the labelled NHS-PubMed English subcorpus. All the documents in the subcorpora are labelled as "specialized" or "lay". The Swedish eCare contains web documents about chronic diseases; the English eCare includes documents about chronic diseases and disease-related concepts; finally, the English NHS-PubMed collects documents about all kinds of health conditions. The subcorpora were created at different stages. The Swedish and English eCare were crawled from multiple websites during the E-care@home (http://ecareathome.se) project. The English NHS-PubMed was created during the preparation of this thesis, with documents coming from the NHS website and from the PubMed FTP service. The Swedish and English eCare are noisy, because they come from multiple websites and because they contain messy characters that we left in on purpose, to test the noise resistance of the resulting models. The English NHS-PubMed is clean because it comes from two well-defined websites and does not contain messy characters. The datasets used in these experiments were extracted from these three subcorpora. Table 3.1 shows the sizes of the corpora.

                  Swedish eCare   English eCare   NHS-PubMed
Lay               154 (33%)       319 (78%)       964 (6%)
Specialized       308 (67%)       92 (22%)        15,377 (94%)
Total documents   462             411             16,341
Total words       424,457         536,799         3,833,968

Table 3.1: The sizes of the two classes in the three subcorpora.


3.1.1 Swedish eCare Subcorpus

Figure 3.1: Distribution of word count in Swedish subcorpus.

The Swedish eCare contains texts collected from 462 web pages, amounting to 424,457 words. The documents used in our experiments have labels applied by a lay annotator; around 67% of the texts are annotated as "specialized" and 33% as "lay". Each document is normalized to lower case and all punctuation marks are removed (including single quotes, double quotes, etc.), but stopword removal, lemmatization, and stemming are not applied. We kept stopwords and morphological information because we think that they positively affect sublanguage classification. For example, it has been observed that stopwords, which are essentially articles, prepositions, and other function words, are more frequent in lay texts than in specialized texts. An expression like "lots of sugar in the blood", which we can find in lay texts, contains four stopwords; the corresponding specialized expression is "diabetes", a single medical term.
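One way the normalization just described could look in code (the function name is ours, and this is a sketch, not the thesis's exact preprocessing script):

```python
import string

def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation, and tokenize on whitespace; stopwords are kept."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()

print(normalize('Lots of "sugar" in the blood!'))
# ['lots', 'of', 'sugar', 'in', 'the', 'blood']
```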

Figure 3.1 and Table 3.2 show the word count distributions of the specialized and lay classes. Overall, specialized documents are longer than lay documents: the maximum word count of the specialized texts is 9,974 words, while the longest lay document has only 2,879 words. However, as the distribution plot (Figure 3.1) shows, the word count distribution of the specialized class is sparse beyond 4,000 words.

            Specialized   Lay
Documents   308           154
mean        1,084         586
std         1,317         533
min         10            37
25%         326           232
50%         685           420
75%         1,291         721
max         9,974         2,879

Table 3.2: Descriptive statistics of Swedish eCare subcorpus word count.



3.1.2 English eCare Subcorpus

Figure 3.2: Distribution of word count in English eCare subcorpus.

             Specialized     Lay
Documents             92     319
mean               1,891   1,137
std                1,460     943
min                   12      85
25%                  503     452
50%                1,752     875
75%                3,071   1,554
max                4,753   5,104

Table 3.3: Descriptive statistics of English subcorpus word count.

The English eCare used in these experiments contains texts collected from 411 web pages, comprising 319 lay documents and 92 specialized documents and amounting to 536,799 words. A hybrid annotation method was applied to label the subcorpus. First, a text is annotated using the semantics of its URL, an approach that has proven rewarding in other classification tasks (Abramson and Aha [1]). For example, the URL http://physioforall.co.uk has the domain physioforall, which means physio (physiotherapy) for all; it is semantically clear that the articles from this website are for everyone (expert and non-expert), so all the documents from this domain are annotated as “lay”. If a URL is semantically opaque, a human annotator resolves the case. All texts in the subcorpus are lowercased and all punctuation marks are removed, i.e., the same preprocessing as for the Swedish subcorpus.
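As a minimal sketch of the first annotation stage, a domain-to-label lookup might look as follows; the mapping and function names are purely illustrative, not the actual script used to build the corpus.

```python
# A hedged sketch of URL-based annotation: label a document by the
# semantics of its source domain, deferring opaque URLs to a human.
from urllib.parse import urlparse

# Illustrative mapping; a real table would be built by inspecting URLs.
DOMAIN_LABELS = {
    "physioforall.co.uk": "lay",  # "physio for all": aimed at everyone
}

def annotate_by_url(url: str) -> str:
    domain = urlparse(url).netloc.replace("www.", "")
    # fall back to manual annotation when the URL is semantically opaque
    return DOMAIN_LABELS.get(domain, "needs-human-review")

print(annotate_by_url("http://physioforall.co.uk/back-pain"))  # lay
```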

Figure 3.2 shows the word count distribution of the English documents and Table 3.3 gives the descriptive statistics of their word counts. The average word counts for English texts are 1,891 words for the specialized label and 1,137 for the lay label. Figure 3.2 shows that the distribution of specialized documents is dispersed, while lay documents concentrate at smaller word counts.



3.1.3 NHS-PubMed English Subcorpus

Figure 3.3: Distribution of word count in NHS-PubMed English subcorpus.

             Specialized     Lay
Documents         15,377     964
mean                 151   1,561
std                   73   1,349
min                   12      95
25%                   98     587
50%                  140   1,025
75%                  192   2,280
max                  740   8,413

Table 3.4: Descriptive statistics of NHS-PubMed English subcorpus word count.

The NHS website is the biggest health website in the UK. It provides clinical articles with information regarding health conditions and related symptoms, helping people who do not have specialized education to make the best choices about their health. PubMed is a search engine which comprises biomedical literature abstracts and citations. The texts were crawled by a customized Python script from the NHS website (https://www.nhs.uk) and downloaded from the PubMed FTP Service.

Figure 3.3 illustrates the distribution of word count for the NHS-PubMed subcorpus and Table 3.4 shows the details. There are 16,341 documents in total; the lay class and the specialized class have 964 and 15,377 documents, respectively. All documents in both classes consist of a total of 3,833,968 words. The shape of the specialized distribution resembles a gamma distribution, and the lay distribution is also skewed, concentrated at lower word counts with a long right tail. The average number of words for the specialized label is 151, and the average for the lay label is 1,561, over ten times that of the specialized label.

Looking at the details, the distribution of words in the overall vocabulary is extremely skewed: 63.5% of words only occur in specialized texts, 21.3% of words only appear in lay texts, and only 15.2% are shared between the two classes. Furthermore, although 63.5% of words appear only in the specialized class, no word within this 63.5% occurs in more than 30% of all specialized documents, implying a diverse vocabulary among these words.

On the other hand, within the 21.3% of words which occur only in lay documents, “your” appears in 99% of lay documents, and “someone” and “yourself” occur in more than 30% of lay documents; all the remaining words occur in less than 30% of lay documents. This shows that the vocabulary in lay documents is also diverse. Among the 15.2% of words shared between the two classes, there are 42 words that occur in more than 70% of lay documents but appear in less than 30% of specialized documents, and we also notice that some of these words repeat very often within a single document, indicating that they can be highly predictive: a classifier can easily detect such words in a document and predict its class. Thus, in the follow-up experiments we deleted the words that we consider highly predictive.
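A minimal sketch of this document-frequency analysis is given below; `lay_docs` and `spec_docs` are assumed to be lists of tokenized documents, and the 70%/30% thresholds follow the figures quoted above. The function names are our own.

```python
# For every word, compute the share of lay and specialized documents
# containing it, then flag words frequent in one class but rare in the
# other as candidates for removal.
from collections import Counter

def doc_freq(docs: list[list[str]]) -> dict[str, float]:
    # count each word once per document, then normalize by corpus size
    counts = Counter(word for doc in docs for word in set(doc))
    return {w: c / len(docs) for w, c in counts.items()}

def predictive_words(lay_docs, spec_docs, hi=0.7, lo=0.3) -> list[str]:
    lay_df, spec_df = doc_freq(lay_docs), doc_freq(spec_docs)
    return sorted(w for w, f in lay_df.items()
                  if f > hi and spec_df.get(w, 0.0) < lo)

# words such as "your" would be flagged (and later deleted) here
```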

3.1.4 Comparison

In this section we describe in detail the differences between the three subcorpora whose sizes are summarized in Table 3.1.

Swedish eCare vs. English eCare

The Swedish eCare and English eCare corpora are both noisy, and the Swedish subcorpus contains about fifty more documents than the English eCare subcorpus (462 vs. 411). Furthermore, the label distributions (lay vs. specialized) in the two datasets are flipped: the Swedish subcorpus has more specialized data, while the English subcorpus has more lay data, and the class distribution is imbalanced in both corpora. However, the total number of words in the English eCare dataset is about 110,000 more than in the Swedish dataset, and the average word count per document is higher in English eCare than in Swedish eCare for both specialized and lay texts.

Swedish eCare vs. NHS-PubMed English

Because the Swedish eCare and English eCare corpora are imbalanced and noisy, and their label distributions are flipped with respect to each other, we add the NHS-PubMed subcorpus, from which we can manually sample any number of documents per label, to make a comparison between the NHS-PubMed subcorpus and the English eCare subcorpus. To clarify, the purpose of including the NHS-PubMed dataset is to understand whether it is the noise of the English eCare subcorpus and/or its inverted class proportions relative to the Swedish eCare subcorpus that hinders the learning. For the experiments in this thesis, we sample from the NHS-PubMed dataset three class distributions: the same proportion of specialized and lay documents as in the Swedish eCare corpus, the same proportion as in the English eCare corpus, and an equal proportion of the two labels.
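A minimal sketch of such proportion-matched sampling is shown below; it assumes the NHS-PubMed documents are held in a pandas DataFrame with a `label` column, which is an assumption of this sketch rather than a description of the actual pipeline.

```python
# Sample an NHS-PubMed subset that mirrors a target class distribution,
# e.g. the Swedish eCare 67%/33% specialized/lay split.
import pandas as pd

def sample_with_ratio(df: pd.DataFrame, n_total: int, lay_share: float,
                      seed: int = 42) -> pd.DataFrame:
    n_lay = round(n_total * lay_share)
    lay = df[df["label"] == "lay"].sample(n_lay, random_state=seed)
    spec = df[df["label"] == "specialized"].sample(n_total - n_lay,
                                                   random_state=seed)
    # concatenate and shuffle the combined sample
    return pd.concat([lay, spec]).sample(frac=1, random_state=seed)

# e.g. mirroring the Swedish eCare proportions:
# subset = sample_with_ratio(nhs_pubmed_df, 462, 0.33)
```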

3.2 Pre-trained Cross-lingual Word Embeddings

We use the English and Swedish pre-trained multilingual word embeddings provided by the MUSE library. The word embeddings represent each word as a 300-dimensional vector, and there are 200,000 words for each language; in other words, each language is represented by a 200,000 × 300 matrix. Words are represented by their token vectors, so we work with word-level tokens. In NLP tasks, tokenization is the process of splitting a document into atomic elements, and tokens are the output of this process. A token can be a word, a number, a symbol, and so on; in our case, the tokens are the words included in the MUSE pre-trained word embeddings.

Table 3.5 shows a breakdown of unique words across the three corpora, together with their coverage in the word embeddings and the corresponding ratio. From Table 3.5 we can see that the English eCare data has the highest ratio of total unique words included in the MUSE word embeddings, while the NHS-PubMed English data has the lowest; the Swedish eCare ratio is in the middle. However, if we look at the breakdown by lay and specialized class, Swedish has the lowest ratio of unique words covered by the word vectors, while English and NHS-PubMed English have the highest and the second highest, respectively.

In this phase, we labelled each unique word by language and merged all the unique labelled words. The labelling by language was necessary to avoid ambiguities such as en_barn (“a large building on a farm where animals, crops, or machines are kept”) vs. sv_barn (“child/children”), or en_kiss (“to touch someone with your lips”) vs. sv_kiss (“urine”). This operation ended up with 48,457 unique words (Swedish plus English). Each pre-trained word embedding in the MUSE library covers the 200,000 most common words of its language, hence there are 200,000 × 2 word vectors in total for English plus Swedish. In order to run the experiments more efficiently, we only retain these 48,457 word vectors when training the models in all experiments.
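A minimal sketch of this reduction step is shown below. It assumes the MUSE vectors are stored in fastText-style `.vec` text files (e.g. `wiki.multi.en.vec`, the names used in the MUSE release), and the language-prefixing scheme mirrors the `en_`/`sv_` labels above; the function and variable names are our own.

```python
# Load MUSE vectors and retain only the prefixed words that occur in
# our corpora, reducing 200,000 x 2 vectors to the 48,457 we need.
import numpy as np

def load_muse(path: str, lang: str, vocab: set[str]) -> dict[str, np.ndarray]:
    vectors = {}
    with open(path, encoding="utf-8") as f:
        next(f)  # skip the fastText-style header line (count and dim)
        for line in f:
            word, *values = line.rstrip().split(" ")
            key = f"{lang}_{word}"  # e.g. "en_barn" vs. "sv_barn"
            if key in vocab:
                vectors[key] = np.asarray(values, dtype=np.float32)
    return vectors

# embeddings = {**load_muse("wiki.multi.en.vec", "en", corpus_vocab),
#               **load_muse("wiki.multi.sv.vec", "sv", corpus_vocab)}
```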

                                                Swedish eCare   English eCare   NHS-PubMed
Total        unique words                              42,655          30,070       68,272
             coverage in the word embeddings           17,457          17,484       26,651
             ratio                                       0.41            0.58         0.39
Specialized  unique words                              38,329          17,051       53,702
             coverage in the word embeddings           15,755          10,447       22,367
             ratio                                       0.41            0.61         0.42
Lay          unique words                              11,560          21,578       24,934
             coverage in the word embeddings            7,568          14,263       13,682
             ratio                                       0.35            0.66         0.55

Table 3.5: Unique words in the three subcorpora, their coverage in the MUSE word embeddings, and the corresponding ratios.


4 Method

This chapter presents and explains the experimental setup and the methods that are implemented in this thesis.

4.1 Modeling

4.1.1 Documents & Word Embeddings Preprocessing

To differentiate the language of texts, each document is lowercased and stripped of all punctuation marks. Next, all tokens are prefixed with "en" in the English datasets and "sv" in the Swedish datasets. Besides the documents, each word in the word embeddings is also prefixed with "en" or "sv". These prefixes allow tokens and word vectors to be easily matched, and avoid confusing English and Swedish words with the same word form, e.g. men (Swedish for "but"; English plural of "man"). Figure 4.1 illustrates the preprocessing of document representations for the Swedish subcorpus: "sv:men" is a document token with "sv:" prepended, all words in the Swedish word embedding carry the same prefix, and "sv:men" is then represented by the vector found in the word embedding. The English subcorpus undergoes the same preprocessing.

Figure 4.1: Preprocessing: prefixing "sv" for each token in Swedish documents and for every word in the Swedish word embedding. The English subcorpus has the same preprocessing. (Adapted from Laippala et al. [17].)
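A minimal sketch of the prefixing step in Figure 4.1; the function name is our own.

```python
# Prefix every document token with its language tag so that, e.g.,
# "men" in Swedish ("but") and English (plural of "man") stay distinct.
def prefix_tokens(tokens: list[str], lang: str) -> list[str]:
    return [f"{lang}:{tok}" for tok in tokens]

print(prefix_tokens(["men", "socker", "i", "blodet"], "sv"))
# ['sv:men', 'sv:socker', 'sv:i', 'sv:blodet']
```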



4.2 Simple CNN Model

Our mono-lingual and cross-lingual text classification approach is a simple CNN architecture, and the model structure follows Kim [15] and Laippala et al. [17]. Figure 4.2 illustrates the architecture.

Figure 4.3 illustrates how the layers of the model are built in our experiments, using an input length of 500 tokens as an example. The Embedding layer in the figure shows that the dimension of the embeddings is 48,457 × 300: as mentioned in Section 3.2, only 48,457 of the 200,000 × 2 words in the Swedish and English word embeddings occur in our corpora, so we reduced the number of vectors from 200,000 × 2 to 48,457 to make the training process more efficient. The details of document representation and word embeddings have been described in Section 4.1. After the Embedding layer, a document is represented as a (number of input tokens) × 300 matrix, in which each word is represented by its word embedding from the MUSE library. As an example, the model applies 128 filters with a filter size of one word. Hence the kernel (filter) shown at the Conv1D layer in Figure 4.3 is 1 × 300 × 128, meaning there are 128 filters of dimension 1 × 300, and the ReLU activation function is applied to the convolution output. Max pooling is applied after the convolution layer (GlobalMaxPooling1D in the figure). We use dropout regularization with a rate of 0.2 before passing to a fully connected layer with softmax: because we only have limited training data on the Swedish corpus (462 documents), which makes overfitting likely, the dropout layer randomly selects 20% of the nodes to drop in every weight update. In the last layer, we apply softmax as the activation function, which returns a probability for each class, and the document is labelled with the highest-probability class.

Figure 4.2: The approach of the simple CNN. (Adapted from Laippala et al. [17].)
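A minimal Keras sketch of this architecture is given below. The layer names and sizes follow the description above; the optimizer, loss, and the choice to freeze the embedding layer are assumptions of this sketch, not settings confirmed by the text, and `embedding_matrix` stands for the 48,457 × 300 MUSE matrix built earlier.

```python
# A hedged sketch of the simple CNN: Embedding -> Conv1D(128, size 1)
# -> GlobalMaxPooling1D -> Dropout(0.2) -> Dense softmax.
import numpy as np
from tensorflow.keras import layers, models

VOCAB_SIZE, EMB_DIM, MAX_LEN = 48_457, 300, 500
# placeholder; in practice filled with the retained MUSE vectors
embedding_matrix = np.zeros((VOCAB_SIZE, EMB_DIM), dtype="float32")

model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, EMB_DIM, weights=[embedding_matrix],
                     input_length=MAX_LEN, trainable=False),  # frozen: an assumption
    layers.Conv1D(filters=128, kernel_size=1, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dropout(0.2),
    layers.Dense(2, activation="softmax"),  # lay vs. specialized
])
model.compile(optimizer="adam",  # assumed optimizer
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```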

4.3 Mono-lingual Text Classification

Laippala et al. [17] used MUSE pre-trained word embeddings with a simple convolutional neural network for mono-lingual identification of online registers in Finnish and English, respectively. In our experiments, we run mono-lingual text classification on five datasets: two of them are Swedish eCare and English eCare, and the other three are randomly selected subsets of NHS-PubMed English. Since Swedish eCare and English eCare have flipped distributions of the lay and specialized labels, we want to examine the effect of balanced versus imbalanced class distributions; the three NHS-PubMed English subsets therefore differ in the number of documents and in the proportions of the two classes. Thus, there are five mono-lingual classification settings in total.
