Low Supervision, Low Corpus size, Low Similarity! Challenges in cross-lingual alignment of word embeddings

An exploration of the limitations of cross-lingual word embedding alignment in truly low resource scenarios


Andrew Dyer

Uppsala University

Department of Linguistics and Philology Master Programme in Language Technology

Master’s Thesis in Language Technology, 30 ECTS credits. October 25, 2019

Supervisors:


Abstract

Cross-lingual word embeddings are an increasingly important resource in cross-lingual methods for NLP, particularly for their role in transfer learning and unsupervised machine translation, purportedly opening up the opportunity for NLP applications for low-resource languages. However, most research in this area implicitly expects the availability of vast monolingual corpora for training embeddings, a scenario which is not realistic for many of the world’s languages. Moreover, much of the reporting of the performance of cross-lingual word embeddings is based on a fairly narrow set of mostly European language pairs. Our study examines the performance of cross-lingual alignment across a more diverse set of language pairs; controls for the effect of the corpus size on which the monolingual embedding spaces are trained; and studies the impact of spectral graph properties of the embedding space on alignment. Through our experiments on a more diverse set of language pairs, we find that performance in bilingual lexicon induction is generally poor in heterogeneous pairs, and that even using a gold or heuristically derived dictionary has little impact on the performance on these pairs of languages. We also find that the performance for these languages only increases slowly with corpus size. Finally, we find a moderate correlation between the isospectral difference of the source and target embeddings and the performance of bilingual lexicon induction.

We infer that methods other than cross-lingual alignment may be more appropriate in the case of both low resource languages and heterogeneous language pairs.


Contents

Acknowledgements
1. Introduction
   1.1. Structure
2. Background
   2.1. Vector representation of words
   2.2. FastText
   2.3. Cross-lingual alignment of word vectors
   2.4. Vecmap
3. Minimal supervision
   3.1. Motivation for unsupervised or weakly supervised cross-lingual word embeddings
   3.2. Identical words
   3.3. Numbers
   3.4. No bilingual seed (unsupervised setting)
   3.5. Related work
4. Data size
   4.1. Related work
5. Graph properties of embedding spaces
   5.1. Graph isomorphism
   5.2. Graph isospectrality
   5.3. Related work
6. Experiments
   6.1. Experimental setup
      6.1.1. Experimental phase 1: Supervision settings
      6.1.2. Experimental phase 2: Corpus size
      6.1.3. Experimental phase 3: Isospectral difference
   6.2. Evaluation languages
   6.3. Data and resources
      6.3.1. Preparing our embeddings
      6.3.2. Dictionaries
   6.4. Evaluation metric
   6.5. Hypotheses
   6.6. Results
      6.6.1. Overall
      6.6.2. Dictionary settings
      6.6.3. Data size
      6.6.4. Isospectrality
   6.7. Discussion
   6.8. Limitations
7. Further research
8. Conclusion
A. Full results of cross-lingual alignment experiments
   A.1. 5 million
   A.2. 20 million
   A.3. 50 million
   A.4. 100 million


Acknowledgements

This work was conducted under the supervision of Ali Basirat. I am grateful for his guidance, insights and support.

This work was performed on the Abel Cluster, owned by the University of Oslo and Uninett/Sigma2, and operated by the Department for Research Computing at USIT, the University of Oslo IT-department – http://www.hpc.uio.no/. I am grateful to them for maintaining this resource and making it available to researchers and students throughout Northern Europe.

In particular, this work was undertaken under the Nordic Language Processing Laboratory (NLPL) project – NN9447K. I am grateful to the NLPL consortium for the wealth of open source data that they have provided. I was privileged to attend the NLPL Winter School in February 2019, which provided knowledge and inspiration for my research.

Thanks to all the teachers on the Master’s Programme in Language Technology at Uppsala University for the knowledge and wisdom they have imparted. Thanks also to my classmates of the 2017-2019 cohort. Their camaraderie and the collegial environment that we shared throughout the programme have been a sustaining source of joy and inspiration.

Finally, my loving thanks to my parents for their lifelong, unwavering support.


1. Introduction

A major breakthrough in various methods of Natural Language Processing (NLP) has been the representation of words as vectors in a multidimensional space, known as word embeddings. This method of word representation is based on the distributional hypothesis: that the meaning of words can be inferred from the contexts in which they appear (Evert, 2010). By representing words in a matrix whose dimensions are based on the window co-occurrence of a word with other words in a corpus, the latent meanings of words can be represented by their position in the n-dimensional vector space of this matrix, and their relations with other words represented by pairwise spatial relations between vectors, such as their distance from each other. Using linear algebraic operations for reducing dimensionality, these high-dimensional spaces can be reduced to more tractable numbers of dimensions, while still encoding meaning.

With the increasing ubiquity of neural network methods, word embeddings have likewise become ubiquitous in applications such as text classification, part of speech (POS) tagging, dependency parsing, natural language inference, and many more.

Cross-lingual word embeddings refers to the extension of this method to a bilingual or multilingual setting. Given a source language X and a target language Z, the aim is to represent the words of Z with a similar vector to their translation counterparts in X (and vice-versa). By doing this, we enable the transfer of meaning from one language to another. This is attractive, for example, in tasks where there is plentiful annotated data in the source language but little in the target language.

There are many approaches and methods to deriving these bilingual or multilingual spaces, and we refer to Ruder et al. (2019) for a comprehensive listing of these methods.[1]

Our research focuses on an alignment-based method. This is the method whereby, given a source language space X and a target language space Z, the objective is to learn a transformation matrix W that roughly maps the word vectors of X to their translation counterparts in Z, such that XW ≈ Z.[2]

Recently, interest in cross-lingual word embeddings has been piqued due to their use in unsupervised machine translation (Artetxe, Labaka, and Agirre, 2018b; Artetxe, Labaka, Agirre, and Cho, 2018; Lample, Conneau, et al., 2018; Lample, Ott, et al., 2018). In these methods, the aligned embedding spaces form the basis for translation tables in unsupervised statistical machine translation, or shared word vector representation in neural machine translation. This is of interest because it has the potential to overcome a long-running problem in machine translation: the lack of aligned data in the majority of the world’s language pairs.[3]

[1] http://ruder.io/cross-lingual-embeddings/
[2] The mapping does not necessarily have to be monodirectional. It is possible – and common – for the mapping to be performed from both languages to a common space.

Despite this interest, however, most research on cross-lingual alignment of word embeddings has suffered from a very narrow focus on pairs or sets of European languages, and inferences about cross-lingual methods in general are often made on the back of these similar pairs. These sets of European languages are not necessarily Indo-European; non-Indo-European languages such as Finnish and Turkish are common in experiments and are used as examples of algorithms’ performance on typologically different pairs (Artetxe et al., 2017; Glavaš et al., 2019; Søgaard et al., 2018). However, we posit that this is still a limitation, because these languages still share similarities beyond typology, such as their shared script, lexical borrowing, and common topics and cultural references that could reasonably be expected of languages that developed in geographical proximity and long cultural contact.

By contrast, relatively few studies focus on language pairs that are very far apart both typologically and geographically. Unsurprisingly, those that do (Conneau et al., 2017; Hoshen and Wolf, 2018; Joulin et al., 2018) typically find much lower results on these pairs. Moreover, even where non-European languages are used in these studies, it is almost always in a pairing with English as a source or target language. This is understandable as a result of the available dictionary resources for experiments, which are usually derived from machine translation systems.[4][5] However, we maintain that limiting research to only these pairs of similar languages – or pairs consisting of English and a divergent language – limits the set of observations we can make about the interplay of the world’s possible language pairs, in all their diversity.

For example, how does cross-lingual alignment perform when two languages are genetically similar, but have been influenced by different languages? We know that lexical alignment between morphologically simple and morphologically rich languages is a challenge, but what of the case where both languages are morphologically rich in different ways? What is the influence of differences in script, or word segmentation? We miss such observations if we are only prepared to evaluate for a few language pairs.

Another limitation of previous research is that they rarely control for the corpus size on which monolingual embedding spaces are trained. We expand on the matter of “low resource languages” and “data size” more in Chapter 4.

In summary, there is an important distinction to be made in the context of cross-lingual word embeddings between:

(1) Language pairs for which there exists little bilingually annotated corpus data, such as dictionaries or aligned corpora.

(2) Language pairs for which there exists little corpus data monolingually in at least one of the languages.

[3] Possibly the largest collection of publicly available aligned corpora between languages is available from OPUS: http://opus.nlpl.eu
[4] https://github.com/facebookresearch/MUSE
[5] In most cases, state of the art machine translation – of the kind needed to make non-noisy translation pairs – relies on large amounts of curated parallel data, and the most common source and target of translations for any given language is English.


The first of these is addressed by unsupervised and semi-supervised approaches to cross-lingual word embeddings. The second, however, is understudied in alignment-based methods. If we are to conceive of cross-lingual word embeddings as a means of overcoming the barrier of data sparsity, it is unrealistic to consider the first case and not the second.

We define three main directions of our investigation:

1. We investigate the effect of various supervision settings on bilingual lexicon induction (BLI) for 30 source-target language pairs (between six languages).

2. We control for corpus size, taking 50 million tokens as our default corpus size, and investigate the effect of varying the corpus size on BLI accuracy.

3. We investigate the correlation between isospectral difference and BLI accuracy, and track this effect across supervision types and corpus sizes.

The last of these will be explained further in Chapter 5. To briefly summarise here, isospectral difference is a measure of the graph similarity between two embedding spaces. This has been shown to correlate with bilingual lexicon induction accuracy (Søgaard et al., 2018); the greater the isospectral difference, the lower the accuracy we can generally expect in BLI.

1.1. Structure

We structure the rest of this thesis as follows:

Chapter 2 covers the theory behind the approaches and resources that we use. This includes an explanation of the vector representation of words (Section 2.1); an explanation of the embedding system, FastText, that we use to train our word vectors (Section 2.2); an explanation of the theory behind cross-lingual alignment of word embedding spaces (Section 2.3); and finally an explanation of Vecmap, the algorithm which we use to align the word embedding spaces (Section 2.4).

Chapter 3 covers the motivation behind reducing the supervision requirements of cross-lingual alignment methods, and the types of minimal supervision that are available (including the case where there is no supervision). We also outline previous research in this area in Section 3.5.

Chapter 4 discusses the problem of data size and how it applies both to monolingual word embeddings and to cross-lingual word embeddings, as well as other areas of NLP. We also outline previous research in this area in Section 4.1.

Chapter 5 explains the application of graph theory to word embeddings, including graph isomorphism (Section 5.1) and graph isospectrality (Section 5.2). We also outline previous research in this area in Section 5.3.

Chapter 6 describes our experiments. In this chapter we outline our experimental setup (Section 6.1), with its three main experimental phases; the languages that we use for our evaluation and some of their properties (Section 6.2); the data and resources we use for our experiments (Section 6.3); the evaluation metric that we use (Section 6.4); our hypotheses (Section 6.5); the results of the experiments (Section 6.6); and a discussion of the implications of the results (Section 6.7).


Chapter 7 outlines our ideas for further research in the future based on our findings.

Finally, Chapter 8 presents our conclusions.

The Appendix gives the full results of cross-lingual alignment experiments across all corpus sizes (the presentation of the results in Chapter 6 is abridged for clarity).


2. Background

2.1. Vector representation of words

In mathematics and computer science, a vector is a sequence of real numbers in an ordered list, of the form v = [v_0, v_1, ..., v_n].

A vector can also be seen as a point in n-dimensional space. In this view, each element of a vector refers to a point along an axis of the space. A vector of one element corresponds to a point on a single line; of two elements, a point on a two dimensional grid; of three elements, a point in three dimensional space. Figure 2.1 shows a representation of two vectors in two-dimensional space.

Words can also be represented this way. Figure 2.2 shows a toy example of words represented by vectors in two-dimensional space. In this space, the direction and magnitude of each vector corresponds to its latent meaning representation, such that related words such as animal and dog, eat and eating, car and machine are similar in their directions. These vectors can then be used as features in NLP applications, replacing one-hot representations of words with continuous representations. As can be seen in Figure 2.2, the relations between words can be understood in terms of their geometric relations as vectors; for example, the closer cosine distance between car and machine than between car and animal reflects the intuition that the meaning of car is much closer to machine than to animal.

Words in a language can be converted to vector representations based on their appearances in context in a corpus. A typical example is window co-occurrence. For example, in the sentence The cat jumped over the moon, jumped would co-occur with {The, cat, over, the} in a linear window of size 2. There are two predominant categories of methods for converting these observations to vectors:

Figure 2.1.: A pair of vectors in two-dimensional space. The vector v has the elements [1, 1], and so its location is at 1 on the x axis and 1 on the y axis. Its negative, −v, has the elements [−1, −1], and so is exactly opposite to v.


Figure 2.2.: A set of words (dog, animal, machine, car, eat, eating) represented by vectors in two-dimensional space. Note that these are toy illustrative examples, and do not correspond to any particular embedding space.

(1) Count-based methods create a co-occurrence matrix between each word and its context – for example, the words within a linear window around the target word – such that the resultant matrix has the shape n × n, where n is the number of words in the corpus. These counts are then weighted and normalised, and the context dimension of the space is reduced by a dimensionality reduction technique such as singular value decomposition (SVD) or Principal Component Analysis (PCA), such that the shape of the matrix is n × k, where k is a desired number of columns that approximate the variance of the words’ count vectors in a lower matrix rank. Examples of count-based methods include Latent Semantic Analysis (Deerwester et al., 1990), which uses SVD; and Principal Word Vectors, which uses PCA (Basirat, 2018).

(2) Prediction-based methods initialise vectors randomly as parameters in a shallow neural network. The model trains these parameters on the task of predicting either the word given the context, or the context given the word.

Among the most prominent of these methods are the CBOW and SGNS algorithms of word2vec (Mikolov, Sutskever, et al., 2013), which we explain in Section 2.2; and GloVe, which factors global co-occurrence statistics into its shallow neural prediction (Pennington et al., 2014). More recent methods such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019) also use deep neural networks and optimisation on multiple prediction tasks.

In any method, word output embeddings represent words in a matrix of vectors, as shown in Figure 2.3.
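As a concrete illustration of the count-based pipeline (category 1 above), the following sketch builds a window co-occurrence matrix for the toy sentence used earlier and reduces it with truncated SVD. This is our own illustrative example, not code from any cited system; the function name and parameters are ours.

```python
import numpy as np

def cooccurrence_matrix(tokens, window=2):
    """Build a symmetric window co-occurrence count matrix."""
    vocab = sorted(set(tokens))
    index = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)))
    for i, w in enumerate(tokens):
        # Count every token within `window` positions of position i.
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                M[index[w], index[tokens[j]]] += 1
    return vocab, M

tokens = "the cat jumped over the moon".split()
vocab, M = cooccurrence_matrix(tokens)

# Reduce the n x n count space to k dimensions with truncated SVD,
# as in LSA-style count-based embeddings.
U, S, Vt = np.linalg.svd(M)
k = 2
embeddings = U[:, :k] * S[:k]   # one k-dimensional vector per word
```

In a real system the raw counts would first be weighted (e.g. with PPMI) before the reduction; this sketch keeps only the structural steps.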

2.2. FastText

In our study, we use FastText to train our embeddings (see §6.3.1). FastText is

based on the word2vec family of algorithms (Mikolov, Yih, et al., 2013), which


the      2.4    0.58   0.1   ···
cat      0.15  −1.0    0.6   ···
jumped   0.06  −0.19   0.07  ···
over    −0.59  −0.15  −0.56  ···
moon     0.86  −0.61  −0.21  ···

Figure 2.3.: A matrix of vector representations of words. The rows of this matrix each represent a word vector. The columns represent dimensions in n-dimensional space, derived either from a reduced co-occurrence matrix or trained network parameters.

Figure 2.4.: An illustration of the basic architecture of the Continuous Bag of Words (CBOW) and Skip-gram (SG) algorithms. In this diagram, w_t refers to the target word in the window; w_{t±n} denotes the context words. (Mikolov, Yih, et al., 2013)

are prediction-based methods. There are two basic algorithms in word2vec, which are illustrated in Figure 2.4:

• In Continuous Bag of Words (CBOW), the vectors of the words in the linear window context of a word are optimised on the task of predicting the target word.

• In Skip-gram, this is inverted: the vector of the target word is optimised on the task of predicting the words in the linear window context.

The model trains its parameters to maximise the log probability of the observations of target and context words within the window, for each set of target words and contexts. Eq. 2.1 defines this in the case of skip-gram:

arg max_θ  Σ_{t=1..T} Σ_{c ∈ C_t} log p(w_c | w_t)        (2.1)

where t is the target word, T is the number of target words in the corpus, C_t is the set of context words for the given t, and θ is the elements of the word vectors. The model trains the parameters of the context words w_c based on the observation of w_t.

In addition, word2vec uses the Negative Sampling technique, whereby word-context pairs that were not observed in the training data are added as noise.[1] The model’s optimisation is then based not only on correctly positively predicting the correct context words c ∈ C_t given the target word (implicitly maximising the probability), but also on correctly negatively predicting the negative samples n ∈ N. Given the logistic loss function ℓ, and a scoring function s to parametrise the probability of the co-occurrence, the training objective then becomes:

arg max_θ  Σ_{t=1..T} [ Σ_{c ∈ C_t} ℓ(s(w_t, w_c)) + Σ_{n ∈ N} ℓ(−s(w_t, n)) ]        (2.2)

FastText is an extension of these methods which uses sub-word information to form the basis of the scoring function s. The model vectorises both words and character n-grams in separate matrices. When predicting the context words given the target word, the model predicts the context based on each n-gram sub-word of the target word, and then represents the prediction of the context given the word as the sum of these predictions. For example, given the word where as a target word, the model would predict based on the n-gram sub-words

<wh, whe, her, ere, re>

as well as the full word unit <where>. For a more detailed explanation, we refer to Bojanowski et al. (2016).

The benefit to adding sub-word information is that it allows for a better representation of rare words, since each word can be expected to share some similarity with an orthographically similar word. For example, if the word earlier is of low frequency in the corpus, it can at least share some representation with similar words such as early and earliest, whereas it would have a weaker representation alone.
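The sub-word decomposition described above is easy to reproduce. The sketch below is a hypothetical helper of our own, not FastText's actual code; it extracts the character trigrams of where with the boundary markers < and >, matching the example in the text.

```python
def char_ngrams(word, n_min=3, n_max=3):
    """Extract character n-grams from a word, FastText-style:
    the word is wrapped in boundary markers < and > before slicing."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    # FastText also keeps the full word (with markers) as its own unit.
    return grams, wrapped

grams, full = char_ngrams("where")
# grams -> ['<wh', 'whe', 'her', 'ere', 're>'], full -> '<where>'
```

FastText itself uses a range of n-gram lengths (by default 3 to 6) rather than trigrams only; the defaults here are narrowed to reproduce the example.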

2.3. Cross-lingual alignment of word vectors

A linear transformation is the process by which vectors are transformed by their linear combination with a transformation matrix, such that when a matrix or vector space X is passed through a transformation matrix W , it maps to an output space Z.

The central task of cross-lingual mapping methods is to learn a transformation matrix from a source to a target language, W : X → Z, such that the translation pairs of words between X and Z are close to each other in terms of their spatial distance.

Mikolov, Le, et al. (2013) noted that the spatial distances between sets of translation pairs are similar between languages, despite some differences in use. The reason for this, they argue, is that:

[1] Hence Skip-gram is often referred to as Skip-gram with Negative Sampling (SGNS) in the literature.


Figure 2.5.: An illustration of how relations between translation pairs of words are similar between languages. This illustration is post-rotation. (Mikolov, Le, et al., 2013)

“...as all languages share common concepts that are grounded in the real world (such as that a cat is smaller than a dog), there is often a strong similarity between the vector spaces.” (pp. 1-2)

Figure 2.5 gives an illustration of this. This being the case, a linear mapping can be learnt to map a set of word vectors from the source to the target language, while preserving the spatial relations in both.

Other approaches treat this as a formulation of the Orthogonal Procrustes Problem: the problem of learning an orthogonal transformation that maps the source space to the target space with the minimum distance between points (Xing et al., 2015). An orthogonal transformation matrix is one where the basis vectors of all dimensional axes are perpendicular to each other. Figure 2.6 shows an example of such a transformation. The result of an orthogonal transformation matrix W acting on a set of vectors X is that for any pair of vectors in X, their cosine distance in the output space Z will be the same as in X. Because of this constraint, an orthogonal transformation must implicitly take the form of a rotation or a reflection (or a combination thereof).
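The cosine-preservation property of orthogonal transformations described above can be checked numerically. The sketch below is a hypothetical example of our own: it applies a 45-degree rotation to two vectors and verifies that their cosine similarity is unchanged.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity of two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

theta = np.pi / 4  # a 45-degree rotation
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])

# R is orthogonal: R^T R = I, so angles (and hence cosines) are preserved.
assert np.allclose(R.T @ R, np.eye(2))
assert np.isclose(cosine(R @ a, R @ b), cosine(a, b))
```

The same check would pass for a reflection, the other form an orthogonal transformation can take.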

Finding the optimal orthogonal transformation was proven by Schönemann (1966) to be solvable using singular value decomposition (SVD). SVD decomposes a matrix into three components:

SVD(M) = U Σ V^T        (2.3)

U and V^T are left and right orthonormal matrices (where the lengths of all row and column vectors are 1). These can be seen as rotation or reflection matrices, representing, roughly, the directions of the vertical and horizontal vectors respectively. Σ is a diagonal matrix of singular values, which determine the scaling of U and V^T necessary to reconstruct the shape of the original matrix M. The original matrix M can then be seen as the composition of the matrices U Σ V^T.

Figure 2.6.: A rotation is an example of an orthogonal transformation. In this instance, a rotation has been applied to a and b (orange), to make a′ and b′ (cyan). Note that the transformation has preserved the cosine of the two vectors.

Solving the Orthogonal Procrustes problem uses SVD to find the optimal orthogonal transformation. Let X be a source matrix, and Z be a target matrix of equal size and dimensionality. The task of the Orthogonal Procrustes problem is to learn an orthogonal transformation R such that RX ≈ Z. The solution is found by performing SVD on the inner product of X and Z, such that U Σ V^T = SVD(XZ^T); setting each of the scaling singular values in Σ to 1; and then taking the product UV^T, such that R = UV^T. For a formal proof we refer to Schönemann (1966).
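The Schönemann solution described above is short enough to sketch directly. The snippet below is our own minimal implementation, using the row-vector convention (rows of X and Z are dictionary-aligned word vectors, so we seek W with XW ≈ Z, the transpose of the RX ≈ Z formulation); the function name and the synthetic test data are ours.

```python
import numpy as np

def orthogonal_procrustes(X, Z):
    """Closed-form Schönemann (1966) solution: find the orthogonal W
    minimising ||XW - Z||_F, where the rows of X and Z are aligned
    word vectors (a seed dictionary). SVD the cross-covariance, drop
    the singular values (set them to 1), and recompose."""
    U, _, Vt = np.linalg.svd(X.T @ Z)
    return U @ Vt

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Build Z as an exact rotation of X, so recovery should be near-perfect.
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))
Z = X @ Q

W = orthogonal_procrustes(X, Z)
assert np.allclose(W.T @ W, np.eye(5))   # W is orthogonal
assert np.allclose(X @ W, Z)             # the mapping recovers the target
```

Real embedding spaces are only approximately isometric, so in practice XW ≈ Z holds loosely rather than exactly as in this synthetic check.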

Cross-lingual alignment methods – whether regression-based or orthogonal – typically require a bilingual training lexicon of aligned terms, from which the mapping can be learned. Given the terms illustrated in Figure 2.5, for example, the model would learn the linear transformation that most closely aligns the two spaces: by minimising the squared distance Σ_i ‖W x_i − z_i‖² in the case of regression methods such as that of Mikolov, Le, et al. (2013); or by maximising the cosine similarity Σ_i (W x_i)^T z_i in orthogonal methods such as that of Xing et al. (2015). More recent work has reduced the size of the required seed dictionary (Artetxe et al., 2017; Smith et al., 2017) or removed the need entirely.

2.4. Vecmap

The mapping algorithm that we use in our experiments is Vecmap, developed by Artetxe et al. (2016).[2] Vecmap is an orthogonal mapping algorithm that learns a linear transformation from the source language to the target language, and uses a multi-step series of transformations of the source and target spaces that generalises many previous approaches (Artetxe, Labaka, and Agirre, 2018b).

[2] https://github.com/artetxem/vecmap


Vecmap reduces the seed lexicon requirement by employing a semi-supervised, self-learning method. As in other alignment methods, the algorithm starts with a bilingual seed dictionary of source and target words. The algorithm learns an orthogonal transformation W using the procedure defined above. Once this is done, if self-learning is enabled, the algorithm finds the pairs of source and target words that are close to each other according to the retrieval criterion (described in the next paragraph). The dictionary is then re-formed, with words added probabilistically depending on their closeness according to the retrieval criterion. This continues until the convergence criterion is reached: that there are no further gains in terms of a closer orthogonal mapping. For details of the algorithm, including the pseudo-code, we refer to Artetxe et al. (2017).
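The self-learning loop just described can be paraphrased in a few lines. The sketch below is our simplified reconstruction, not Vecmap's actual implementation: it alternates the Procrustes solution with deterministic nearest-neighbour dictionary re-induction under plain cosine, whereas Vecmap adds stochastic dictionary induction, CSLS retrieval, and a convergence criterion.

```python
import numpy as np

def procrustes(X, Z):
    """Orthogonal Procrustes solution on dictionary rows."""
    U, _, Vt = np.linalg.svd(X.T @ Z)
    return U @ Vt

def self_learning(X, Z, seed_pairs, n_iters=10):
    """Simplified self-learning in the spirit of Artetxe et al. (2017):
    alternate (1) solving the orthogonal mapping on the current
    dictionary and (2) re-inducing the dictionary from nearest
    neighbours under that mapping."""
    # Normalise rows so dot products are cosine similarities.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    pairs = list(seed_pairs)
    for _ in range(n_iters):
        src = [s for s, _ in pairs]
        tgt = [t for _, t in pairs]
        W = procrustes(Xn[src], Zn[tgt])
        # Re-induce: pair each source word with its nearest target.
        sims = (Xn @ W) @ Zn.T
        pairs = list(enumerate(sims.argmax(axis=1)))
    return W, pairs

# Toy demo: Z is an exact rotation of X with identity correspondence,
# so self-learning should grow a 5-pair seed into the full identity
# dictionary.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))
Z = X @ Q
W, induced = self_learning(X, Z, seed_pairs=[(i, i) for i in range(5)])
```

The deterministic argmax here is the main simplification: Vecmap instead keeps the induction stochastic to avoid poor local optima.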

The typical retrieval criterion for bilingual lexicon induction – the metric by which the similarity of words is defined – in word embeddings is cosine similarity, the normalised inner product of two vectors:

sim(A, B) = cos(A, B) = (A · B) / (‖A‖ ‖B‖)        (2.4)

Vecmap also has the option to use an alternative retrieval criterion known as Cross-Domain Similarity Local Scaling (CSLS) (Conneau et al., 2017). This is a neighbourhood-based metric aimed at reducing the hubness problem in cross-lingual word embeddings, whereby large clusters of similar words form hubs, diminishing the informativeness of cosine similarities (Dinu and Baroni, 2014).

In CSLS, the cosine similarity cos(W x_s, y_t) between a mapped source language word W x_s and a target language word y_t is weighted by the similarity of each word to its K nearest neighbours in the other language:

CSLS(W x_s, y_t) = 2 cos(W x_s, y_t) − r_T(W x_s) − r_S(y_t)        (2.5)

where r_T(W x_s) is the mean similarity of the mapped source word to its K nearest neighbours in the target space, and r_S(y_t) is likewise the mean similarity of the target word to its K nearest neighbours in the mapped source space. In this way, target words with few neighbours are emphasised, while target words with many neighbours are de-emphasised. For more details on this retrieval criterion, we refer to Conneau et al. (2017).
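Eq. 2.5 translates almost directly into code. The sketch below is our own minimal implementation of CSLS scoring (the function name and the k default are ours), assuming length-normalised embedding rows so that dot products are cosine similarities.

```python
import numpy as np

def csls_scores(WX, Y, k=10):
    """CSLS (Conneau et al., 2017): penalise hub targets by the mean
    cosine similarity of each word to its k nearest cross-lingual
    neighbours. Rows of WX (mapped source) and Y (target) are assumed
    length-normalised."""
    cos = WX @ Y.T                       # pairwise cosine similarities
    k_s = min(k, cos.shape[1])
    k_t = min(k, cos.shape[0])
    # r_T(Wx_s): mean similarity of each mapped source word to its
    # k nearest target neighbours.
    rT = np.sort(cos, axis=1)[:, -k_s:].mean(axis=1, keepdims=True)
    # r_S(y_t): mean similarity of each target word to its k nearest
    # mapped-source neighbours.
    rS = np.sort(cos, axis=0)[-k_t:, :].mean(axis=0, keepdims=True)
    return 2 * cos - rT - rS

# Retrieval: translate each source word as the argmax-CSLS target word.
```

Sorting the full similarity matrix is wasteful for real vocabularies; practical implementations use partial top-k selection, but the scores are the same.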


3. Minimal supervision

In this chapter we explain the types of minimal supervision that are available for mapping our embeddings. As previously stated, most alignment algorithms require some seed dictionary in order to begin the process of alignment, and this is true of Vecmap. These need not necessarily be gold dictionaries, however, and we describe the methods of weak supervision that are available to us.

3.1. Motivation for unsupervised or weakly supervised cross-lingual word embeddings

In their comparison of supervised and unsupervised algorithms for cross-lingual alignment, Glavaš et al. (2019) posit that it is reasonable to expect a fairly large training dictionary of between 1000 and 5000 words. However, while it is true that large dictionaries of several thousand words may be found between English and the languages in Google Translate, this is far from the case in all language pairs – even those that are well-resourced.

For example, a common method of deriving a dictionary between two languages is to simply use available translation tools, for example Google Translate.[1] This is the method used by Glavaš et al., and for convenience we use a similar approach (see Chapter 6). Google Translate is a state of the art system and produces very reliable results between English and a target language, and vice versa. However, it is considerably weaker when translating between pairs of languages that are heterogeneous, geographically distant, or do not have a lot of parallel data.

An example of this is seen when we attempt to translate between two distant languages, neither of which is English. For example, Figure 3.1 shows a translation from Japanese to French. As we can see from this translation, the word karui (light) in Japanese has been mistranslated into French as lumière – mistranslated both to a different word sense and to a different POS tag (from an adjective to a noun). The correct translation into French for this sense of light should be léger – as in une sensation légère. This should not be an ambiguous translation between the two languages; karui is an entirely different word from hikari, which would be the actual Japanese translation of lumière. The mistranslation is easily explained if, instead of translating directly from Japanese to French, the translation used English as a pivot – in which light is an ambiguous word.

Though we are not privy to the details of how Google Translate manages its translations, we strongly suspect that, for most language pairs, its method of translation is simply to translate from the source into English as a pivot, and from there to the target language. We concede that this choice has its advantages

[1] https://translate.google.com/


Japanese: karui kimochi (“a light [weight] feeling”) → French: sensation de lumière (“a feeling of light [luminosity]”)

Figure 3.1.: A sample translation using Google Translate.

fr ru fa zh ja

en 51477/17.8% 17790/3.8% 12490/3.7% 20316/8.5% 18782/6.4%

fr 16436/3.5% 11242/3.3% 17017/6.9% 16674/5.6%

ru 8702/1.8% 13364/3.4% 12551/2.8%

fa 9746/3.7% 9132/2.9%

zh 30292/14.6%

Table 3.1.: The coverage of identical words between language pairs, as defined by the intersection over union of words in language A and B:

A∩BA∪B

.

at the phrase and sentence levels, since most languages pairs lack the parallel sentences necessary for state of the art translation systems.

However, if a state of the art system still makes such elementary errors at the lexical level, we find it unreasonable to then conclude that obtaining a large and well-curated bilingual seed dictionary between any language pair is as trivial as many researchers make it out to be. This is to say nothing of truly low-resource languages which do not appear in Google Translate, and for which bilingual corpora with English are sparse, let alone with other languages. Nor is this necessarily solved by looking through publicly available translation datasets.

Even among the OPUS corpora, translations between many large languages are in short supply, often limited to, for example, technical manuals such as Gnome or Ubuntu documentation.

In other words, we maintain that lexical translation using available machine translation tools is not necessarily trivial. Even where creating a reliable bilingual dictionary is straightforward, it remains an extra step, itself susceptible to error, and obviating it can make the mapping task smoother and less burdensome for non-expert users.

3.2. Identical words

One form of weak supervision that we can use is to simply accept any string that is identical between two languages as a dictionary entry.
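As a minimal sketch (the helper names are our own), this seed and the coverage statistic reported in Table 3.1 can be computed directly from the two vocabularies:

```python
def identical_word_seed(vocab_a, vocab_b):
    """Seed dictionary from strings shared by both vocabularies.

    Each shared string is assumed to translate to itself, giving a list
    of (word, word) pairs usable as weak supervision for alignment.
    """
    shared = set(vocab_a) & set(vocab_b)
    return sorted((w, w) for w in shared)

def coverage(vocab_a, vocab_b):
    """Intersection over union of the two vocabularies, as in Table 3.1."""
    a, b = set(vocab_a), set(vocab_b)
    return len(a & b) / len(a | b)

# Toy example: English- and French-like vocabularies sharing two strings.
en = ["the", "table", "action", "light"]
fr = ["le", "table", "action", "lumière"]
print(identical_word_seed(en, fr))  # [('action', 'action'), ('table', 'table')]
```

In practice the vocabularies would be the word lists of the two monolingual embedding spaces.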

While the supervision gained from identical words is noisy, for certain language pairs it is far from “weak”. Table 3.1 shows the number of identical words found between each language pair in our experiment. Unsurprisingly, the most closely related language pair, English and French, has a very high coverage of identical words. Likewise, Japanese and Chinese, which are genetically different languages but share a lot of words through logographic script, also have a wide coverage of identical words. However, even for distant languages with completely different scripts, we find a fairly high coverage of identical words, which exceeds the highest gold dictionary size that we have available.


Figure 3.2.: The distribution of identical word vectors among the 1000 most frequent words in two language spaces – in this case English and French. The coloured dots represent words that appear in both language spaces; the grey dots represent words that only appear in one language.

The language pair of Japanese and Chinese gives the second largest coverage, and this is attributable to the two languages’ shared use of Chinese characters.

Chinese characters are logographic, and the composition of individual characters, as well as their use in compounds, often indicate their meaning. Japanese uses a mixed script system with both Chinese characters to represent semantic concepts and Japanese syllabic characters to represent the morphology of Japanese. Since Japanese extensively borrows from Chinese, there are many of these shared character representations, and barring some false friends, most of them share the same meaning.

As expected, language pairs which are unrelated and do not share the same script have fewer identical word pairs. Examining the shared vocabulary of these languages, we find that the word pairs are overwhelmingly loanwords in one or both of the languages – primarily English words. Interestingly, we can see this in the spatial locations of identical words throughout the space: whereas in related language pairs the words are distributed fairly evenly throughout the space, in the heterogeneous pairs they are distributed in a much narrower space in at least one of the languages. Figures 3.2 and 3.3 show a projection of two language vector spaces into two dimensions, and the effect is visible in that the points which correspond to identical words in the target language space are dispersed fairly evenly throughout the space in the case of French, but are clustered tightly at the periphery in the case of Japanese.


Figure 3.3.: The identical words between two unrelated languages with different scripts. Note that the identical words are clustered towards the periphery, rather than interspersed with the rest of the vocabulary.

3.3. Numbers

Numbers form a subset of the identical word seed setting. Simply, these are instances of Arabic numeral strings of any length. In their experiments, Artetxe et al. (2017) found that using numbers as a bilingual seed, in practice, provided a wide coverage of the embedding space and improved performance above small dictionary baselines.

Although we expect this to be less noisy than the identical words setting, we have concerns about its use. The main concern is that numbers are already close to each other in the monolingual space, and their similarities can be expected to also be quite close. In Figure 3.4 we can see an example of a cluster of numbers among random words. What we see is that the number vectors occupy a fairly narrow space, compared with the whole space of the embeddings. This is similar to the case of word vectors representing identical strings in heterogeneous languages, and it is intuitive that many of these would be numbers.
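Extracting such a seed is straightforward. A sketch follows; the exact numeral pattern used by Artetxe et al. (2017) may differ, so the regular expression here is our own assumption:

```python
import re

# Matches integer or decimal tokens such as "42", "3.14" or "3,14".
# This pattern is illustrative; the original implementation may differ.
NUMERAL = re.compile(r"^\d+([.,]\d+)?$")

def number_seed(vocab_a, vocab_b):
    """Seed dictionary from numeral strings present in both vocabularies."""
    nums_a = {w for w in vocab_a if NUMERAL.match(w)}
    nums_b = {w for w in vocab_b if NUMERAL.match(w)}
    return sorted((w, w) for w in nums_a & nums_b)

print(number_seed(["1984", "3.14", "cat"], ["1984", "3.14", "chat", "7"]))
# [('1984', '1984'), ('3.14', '3.14')]
```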

3.4. No bilingual seed (unsupervised setting)

The final seed setting that we investigate is simply the case where there is no bilingual seed lexicon. This is a fully unsupervised approach.

As before, the alignment is based on the Orthogonal Procrustes problem, relying on finding an orthogonal transformation that rotates or reflects the source and target embedding spaces such that the difference in the squared norm between the two spaces is minimised. However, as there is no bilingual data in this approach, the algorithm must use a heuristic to generate a seed dictionary from scratch. Its method for doing so relies, again, on the isomorphic assumption: that translation pairs in languages will have the same distributions in their respective languages.

Figure 3.4.: Vectors corresponding to number tokens among the top 10000 words in the English embedding space. Number vectors are labelled in orange; non-number vectors in grey. This projection was made using TSNE.

The heuristic used is the difference between similarity distributions. Let X_sim = XX^T be the similarity matrix of the source language embedding space's unitary matrix X, representing the cosine similarity between each pair of word vectors in X. Let Z_sim = ZZ^T be the same for the unitary matrix of the target space Z. Then sim is the result of the inner product of these two similarity matrices, such that sim = X_sim · Z_sim. For each source language word x_s in X, its closest target language word z_t in Z according to the retrieval criterion is found. Finally, the n source–target word pairs with the highest similarity are added probabilistically to the dictionary. From this point, the self-learning procedure begins as in the semi-supervised settings and runs until a convergence criterion is reached. For more details of Vecmap's unsupervised learning method, we refer to Artetxe, Labaka, and Agirre (2018b).
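The core of this heuristic can be sketched as follows. This is a simplified illustration, not Vecmap's actual code: the full method additionally applies square roots, mean centering, and probabilistic dictionary growth, which we omit here. Words that are translations should have similar *distributions* of similarity to the rest of their vocabulary, so we compare sorted similarity rows:

```python
import numpy as np

def initial_seed(X, Z):
    """Simplified sketch of the similarity-distribution seed heuristic.

    X, Z: embedding matrices (words x dims) over equally sized vocabularies,
    e.g. the top k most frequent words of each language.
    Returns, for each source word index, the best-matching target word index.
    """
    def sim_profile(M):
        M = M / np.linalg.norm(M, axis=1, keepdims=True)
        S = np.sort(M @ M.T, axis=1)        # sorted similarity distribution per word
        return S / np.linalg.norm(S, axis=1, keepdims=True)

    sim = sim_profile(X) @ sim_profile(Z).T  # compare every source/target profile
    return sim.argmax(axis=1)                # closest target word per source word

# Toy check: if Z is a permuted copy of X, the heuristic should recover
# the permutation exactly, since the sorted profiles are then identical.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
perm = np.array([3, 0, 5, 1, 4, 2])
Z = X[perm]
print(initial_seed(X, Z))
```

In the real unsupervised setting the two spaces are only approximately isomorphic, which is why the resulting seed is noisy and must be refined by self-learning.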

3.5. Related work

Early approaches to cross-lingual alignment methods were supervised, assuming the presence of a large bilingual lexicon (Faruqui and Dyer, 2014; Mikolov, Le, et al., 2013; Xing et al., 2015). More recently, various approaches have managed to alleviate this requirement somewhat. Vulić and Moens (2015) induce a seed dictionary using corpora which are document-aligned but not sentence or word-aligned. Weak supervision methods rely on heuristic methods of deriving a seed dictionary, such as identical strings (Hauer et al., 2017; Smith et al., 2017) or numbers (Artetxe, Labaka, and Agirre, 2018a; Zhou et al., 2019).


Fully unsupervised methods of alignment optimise on various heuristic criteria.

Many of these use Generative Adversarial Networks (GANs), with a generator function optimising on the objective of tricking a discriminator function into misclassifying the target language space as the source language space (Lample, Conneau, et al., 2018; Zhang et al., 2017). Other methods, such as that of Artetxe, Labaka, and Agirre (2018b), generate a seed dictionary heuristically, without using a GAN, using similarity measures such as the distance between the two spaces' similarity matrices. Other examples of the latter approach include the Iterative Closest Point (ICP) method of Hoshen and Wolf (2018).

Experiments on types of supervision in alignment-based methods across multiple language pairs are few and far between. Glavaš et al. (2019) do not directly compare supervision types in their experiments, but in their comparison of different algorithms they group algorithms into supervised (with 1000-4500 dictionary pairs as training data) and unsupervised methods (with no training data). They find that supervised methods typically outperform unsupervised methods on bilingual lexicon induction, though non-GAN based algorithms such as VecMap and ICP tend to be more robust across heterogeneous language pairs.

Experimenting using MUSE², a GAN-based system, Søgaard et al. (2018) find that unsupervised learning tends to be vulnerable to variations such as heterogeneous language pairs, domain difference, and differences in the algorithms and hyperparameters with which the monolingual embeddings are trained. They also find that weak supervision in the form of identical strings alleviates these differences to some extent.

² https://github.com/facebookresearch/MUSE


4. Data size

“Low resource” is a commonly used term for languages with relatively low amounts of data available, but in practice whether a language is low-resource is often highly dependent on the task at hand. Take, for example, the task of dependency parsing.

Universal Dependencies (Nivre et al., 2016) is a collection of dependency grammar-annotated treebanks for 83 languages.¹ ² Among these languages, however, the amount of annotated data varies considerably. As of version 2.4, German has over three million annotated tokens, representing hundreds of thousands of annotated sentences. Comparatively, Marathi, a language spoken as a native language in India by around 70 million people (a similar number to the number of native speakers of German), has only 3000 annotated tokens. Likewise, Ancient Greek has 400,000 annotated tokens despite being a dead language with no native speakers; modern Greek, spoken by 13 million native speakers, has only 63,000 annotated tokens. There is therefore a clear mismatch between the languages spoken in the world and the languages annotated for a given task.

The word embedding methods described in Chapter 2 are unsupervised, which is to say that they require only raw text corpora to learn their distributional representations of word meanings. This overcomes the problem of lack of annotated data to some extent. However, it also moves the problem from the size of available annotated data to the size of available unannotated data.

For example, Wikipedia offers millions of articles in hundreds of languages. Among these languages, however, the size of available data varies widely. Currently there are 15 language Wikipedias with more than one million articles, comprising in each case at least around 100 million tokens. The largest of these, of course, is English, with around 3 billion tokens, while, for example, Portuguese has around 391 million tokens.³ However, some widely spoken languages such as Korean have just 110 million tokens; Telugu 36 million tokens; and Swahili 7 million tokens. This presents a problem since, for example, GloVe (Pennington et al., 2014) is trained on billions of tokens of Common Crawl data.

While Grave et al. (2018) have trained embeddings for 157 languages by using language detection and scraping the Common Crawl, and these embeddings are available off the shelf,⁴ thus overcoming this problem to some extent, we nevertheless argue that our target should be to cater for the data sizes that are easily accessible to the common user. There is also the consideration that even a scrape of the Common Crawl may not recover much more data than what is available on Wikipedia.

Some research has gone into generating word embeddings for low resource languages. For example, Jiang et al. (2018) use positive unlabeled learning in a matrix factorisation approach. Other methods include cross-lingual learning of word embeddings (Duong et al., 2016, 2017), whereby a well resourced language and a low resource language share the parameters of a network; and projection based methods, whereby embeddings from one or more well resourced languages are projected through annotations such as word alignments (Basirat et al., 2019).

¹ As of the time of writing, in the v.2.4 release.
² https://universaldependencies.org
³ https://meta.wikimedia.org/wiki/List_of_Wikipedias, accessed 2019/09/01
⁴ https://fasttext.cc/docs/en/crawl-vectors.html

These are beyond the scope of this study.

For cross-lingual alignment methods, however, most work that we are aware of has assumed the largest available corpus size, and we identify this as a problem that needs quantifying. Is there a lower bound beyond which alignment-based methods are no longer effective?

4.1. Related work

Despite the ubiquity of word embeddings in various NLP tasks, relatively little research has been done on the performance of word embeddings in truly low resource languages, with most monolingual and cross-lingual approaches assuming large corpora, sometimes on the order of billions of tokens (and at least tens of millions).

In monolingual embeddings, Jiang et al. (2018) apply positive unlabeled learning to a matrix factorisation method to better utilise the information from unseen co-occurrences. Bojanowski et al. (2016) examine the performance of FastText at lower corpus sizes on rare word similarity tasks, finding that sub-word information helps to induce meaning representations even when occurrences of a particular word are few. For the purpose of simply increasing the amount of available data, Grave et al. (2018) train a language identifier to identify in-language data from the Common Crawl⁵ and supplement Wikipedia dump data with it, finding that the addition of even noisy data greatly increases performance on languages with a small Wikipedia corpus size.

Some research into cross-lingual training of word embeddings has focused on low resource scenarios, assuming low corpus sizes and supervision only at the lexical level. For example, Duong et al. (2016, 2017) include languages with lower corpus size such as Serbian in their experiments, and train using a field lexicon as bilingual supervision in a bilingual CBOW model. Adams et al. (2017) also use this method in their experiments on cross-lingual language modelling as a downstream task. Wada et al. (2019) train embeddings using joint task-based optimisation on language modelling, sharing parameters to learn underlying representations for both the source and target spaces.

As for cross-lingual alignment based methods, we are unaware of any previous research where corpus size was an experimental variable, and most research that we are aware of has either used off-the-shelf embeddings trained on full size Wikipedia and Common Crawl data; or else trained on large corpora in the course of the experiments.

⁵ https://commoncrawl.org


5. Graph properties of embedding spaces

5.1. Graph isomorphism

An underlying assumption in cross-lingual alignment methods is that the source and target languages are isomorphic with respect to each other. In graph theory, two graphs X and Y are isomorphic with respect to one another if and only if there is a bijective function (i.e. a direct, one-to-one mapping) between them, such that each node x ∈ X has one and only one counterpart y ∈ Y. Each pair of nodes x → y must have the same number of edges in and out.

In visual terms, if X can be rearranged (without changing any edges) so that, without labels, it exactly resembles Y, then the two graphs are isomorphic. An example is shown in Figure 5.1.

A matrix of word embeddings can be converted to a graph representation by its nearest neighbour relations. Given a word embedding space X, for each x ∈ X we can form a directed graph whose edges represent the cosine nearest neighbour of x.¹

Let X and Z be a source language embedding space and a target language embedding space respectively. Then, theoretically, we could test for isomorphism by representing both spaces as adjacency matrices and checking whether there exists a bijective function mapping the adjacency matrix A(X) to A(Z).
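As an illustrative sketch (the function name is our own), the directed nearest-neighbour graph can be built as an adjacency matrix:

```python
import numpy as np

def nn_adjacency(E):
    """Directed nearest-neighbour graph of an embedding matrix E (words x dims).

    A[i, j] = 1 iff word j is the cosine nearest neighbour of word i.
    The graph is directed: NN(a) = b does not imply NN(b) = a.
    """
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    S = E @ E.T                        # pairwise cosine similarities
    np.fill_diagonal(S, -np.inf)       # a word is not its own neighbour
    A = np.zeros_like(S)
    A[np.arange(len(S)), S.argmax(axis=1)] = 1.0
    return A

# Tiny example: words 0 and 1 are mutual neighbours; word 2 points to 1,
# but nothing points back to 2, illustrating the asymmetry.
E = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
print(nn_adjacency(E))
```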

In practice, this assumption cannot be tested across the whole embedding spaces, since testing for isomorphism is an NP-hard problem. However, Søgaard et al. (2018) probe the extent to which this assumption is correct on sub-graphs of words in translation. They find that isomorphism does not hold even among direct translations of words in closely related languages. This is easily explained by several linguistic properties of language pairs, including gender, case marking, agglutination, and articles, among others. In the case of English and French, for example, the possessive pronoun my would have mon, ma and mes as translations, affected by gender and number. Likewise, verb conjugations such as goes and go correspond to several more conjugations in French, according to person and number. Even common nouns can be masculine or feminine in French, meaning that for each English source word we can expect to see two target French words.

5.2. Graph isospectrality

In place of isomorphism, which is a binary (true/false) property of a pair of graphs, Søgaard et al. (2018) use isospectrality as a continuous measure of the similarity of two graphs. In graph theory, the spectrum of a graph refers to the multiset of its eigenvalues. The eigenvectors v ∈ V of a matrix A are those vectors of the same dimensionality as A which, when multiplied by A, do not change direction, but are instead scaled by an eigenvalue λ, such that Av = λv.

¹ The graph is directed because the nearest neighbour relation is not symmetric; that is to say that NN(a) = b does not necessarily imply that NN(b) = a. It is possible – and likely – that there exists a node c which is closer to b than a is.

Figure 5.1.: In this illustration, the two graphs are isomorphic, despite their visual appearance, since every node in G can be mapped to a counterpart in G′ to produce the exact same shape. Credit: geeksforgeeks.org/mathematics-graph-isomorphisms-connectivity/

To find the spectrum of a graph (which roughly describes the relatedness between nodes of the graph), we find the eigenvalues of its Laplacian matrix. Let A be an adjacency matrix representing the nearest neighbour relations of the top k most frequent terms in an embedding space. Let D be the degree matrix of this adjacency matrix: a diagonal matrix representing the number of edges into each node in the graph. The Laplacian matrix L is then D − A. Having found this Laplacian matrix, we can then obtain its eigenvalues using eigendecomposition.

We follow the method of Søgaard et al. to compare two matrices: for each matrix j, we take the smallest k such that its k largest non-negative Laplacian eigenvalues sum to just over 90% of the total sum of its eigenvalues:

min k  such that  ( Σ_{i=1}^{k} λ_{ji} ) / ( Σ_{i=1}^{n} λ_{ji} ) > 0.9    (5.1)

We then simply take the sum of squared differences of the two sets of eigenvalues to get the squared eigenvalue difference:

Δ = Σ_{i=1}^{k} (λ_{1i} − λ_{2i})²    (5.2)

Isospectral difference is on a range from 0 to infinity, with 0 representing complete isospectrality (realistically only likely to be the case when source and target space are identical), and the difference approaching infinity as the difference in the spectra of the graphs becomes greater.
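The full computation (Laplacian, eigendecomposition, 90% cutoff, squared difference) can be sketched as below. Note one assumption of our own in this sketch: we symmetrise the directed nearest-neighbour graph before taking the Laplacian, so that the eigenvalues are guaranteed real and non-negative.

```python
import numpy as np

def laplacian_eigs(A):
    """Descending eigenvalues of the Laplacian L = D - A of adjacency matrix A."""
    A = np.maximum(A, A.T)                  # symmetrise the directed NN graph
    D = np.diag(A.sum(axis=1))              # degree matrix
    return np.sort(np.linalg.eigvalsh(D - A))[::-1]

def isospectral_difference(A1, A2, ratio=0.9):
    """Squared difference of the top-k Laplacian eigenvalues of two graphs.

    k follows equation (5.1): for each graph, the smallest k whose largest
    eigenvalues account for just over `ratio` of the spectrum; we use the
    smaller of the two k values, as in Søgaard et al. (2018).
    """
    e1, e2 = laplacian_eigs(A1), laplacian_eigs(A2)

    def smallest_k(e):
        c = np.cumsum(e) / e.sum()
        return int(np.searchsorted(c, ratio)) + 1

    k = min(smallest_k(e1), smallest_k(e2))
    return float(np.sum((e1[:k] - e2[:k]) ** 2))  # equation (5.2)

# Identical graphs are fully isospectral (difference 0); a path graph and a
# complete graph on three nodes are not.
path = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
complete = np.ones((3, 3)) - np.eye(3)
print(isospectral_difference(path, path), isospectral_difference(path, complete))
```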

Our method differs from that of Søgaard et al. in one key way: instead of sampling subgraphs of translation pairs from the source and target language, we sample the top 1000 words (ordered by corpus frequency) of both languages and perform the spectral analysis on these. This is done for convenience, and while we note that this is a noisier method, we can make the naive assumption that the top 1000 words of each language will be similar types of words.²

In their experiments, Søgaard et al. (2018) found a strong negative correlation between the eigenvalue difference of a pair of language spaces and their performance in unsupervised alignment. We expect to find a similar effect in our case.

5.3. Related work

Although common word relations and isomorphism are underlying assumptions in cross-lingual alignment methods, until recently there was no concrete measure for the extent to which this actually applies in reality between pairs of languages. Søgaard et al. (2018) introduced a measure of the geometric similarity of embedding spaces in the form of Laplacian eigenvalue difference, based on the method of Shigehalli and Shettar (2011). To our knowledge this is the first instance of spectral graph theory being used to derive a metric for the match between two word embedding spaces. Relating to the properties of monolingual spaces, Wendlandt et al. (2018) examined the stability of word embeddings across different random initialisations, using the overlap between the top K nearest neighbours for each word in the space as a measure of the stability of the space.

We do not directly examine this measure in our work, but we speculate that this has some relation to the eventual isospectral difference between two spaces – if words in either of the spaces are unstable, this noise will likely increase isospectral difference between the two spaces. We leave this for future research.

To our knowledge, however, no work has tracked the correlation between isospectral difference and bilingual lexicon induction (BLI) accuracy across data sizes.

² We also note that the mapping algorithm itself does not have access to information about the words it is mapping.


6. Experiments

6.1. Experimental setup

We experiment on six languages, performing cross-lingual alignment in both directions for each language pair. This yields 30 mapping directions between the 15 language pairs. We report both directions of mapping (e.g. English to Japanese and Japanese to English).

6.1.1. Experimental phase 1: Supervision settings

For each mapping direction, we experiment with the following supervision set- tings:

• Unsupervised, where a seed dictionary is bootstrapped in the first in- stance from the similarity matrix.

• Identical Words, where the seed dictionary is induced by taking all identical strings between the two languages as dictionary pairs.

• Numbers, where the seed dictionary is induced from Arabic numerals (including floats) in either language (a subset of identical strings).

• Gold supervision, where n dictionary entries from the training dictionary are used as the seed dictionary. We experiment with 50, 100, 500, 1000, 2500, and 4500 entries.

All of these settings also use the iterative self-learning method. We fix the corpus size at 50 million tokens in the first set of experiments to control for data size.

6.1.2. Experimental phase 2: Corpus size

We next perform the same experiment on embedding spaces trained on different corpus sizes: 5 million, 20 million, and 100 million tokens. In order to reduce the number of experiment runs, we only consider the following supervision contexts: Unsupervised, Identical words, 1000 words, and 4500 words.

6.1.3. Experimental phase 3: Isospectral difference

Having obtained the results for all of the previous experiments, we examine the Laplacian eigenvalue difference as described in Chapter 5 to see how well this measure correlates with actual scores.

6.2. Evaluation languages

The languages we chose for our experiments are: English, French, Russian, Persian, Chinese, and Japanese. Some properties of these languages are shown in Table 6.1. Though four of these languages are Indo-European, each is of a different branch of the family, and two of the four do not use Latin script. Two of the six languages have grammatical gender, while the others do not. Three of the languages are isolating, two agglutinative, and one synthetic.

Language   Family         Order   Type            Genders   Script
English    IE, Germanic   SVO     Analytic        None      Latin
French     IE, Romance    SVO     Analytic        2         Latin
Russian    IE, Slavic     SVO*    Fusional        3         Cyrillic
Persian    IE, Iranian    SOV     Agglutinative   None      Arabic
Chinese    Sinitic        SVO     Isolating       None      Chinese
Japanese   Japonic        SOV     Agglutinative   None      Mixed**

Table 6.1.: Languages evaluated in our experiments. “IE” means “Indo-European”. *Russian is nominally SVO in word order but has a high degree of word order flexibility. **Japanese uses a mix of Chinese logographic characters and Japanese syllabic characters.

There are other features of the languages which we expect to have an impact on their isospectral difference and their performance in bilingual lexicon induction.

For example, English and French both have articles, which none of the other languages do; the four Indo-European languages all have number (represented according to their type), whereas Chinese and Japanese typically do not explicitly state number.

6.3. Data and resources

6.3.1. Preparing our embeddings

For the purpose of training our embeddings, we downloaded Wikipedia data from Wikipedia's general dump, which contains text data from content articles, captions, and edit discussions.¹ We extract the text data and remove markup and other errata using WikiExtractor.²

Next, we tokenise the corpora into discrete words. For tokenisation of English, we use the word tokenisation function of NLTK (Loper and Bird, 2002).³ For French, we tokenise using a customised regular expression tokeniser. For Japanese, we use Mecab (Kudo et al., 2004). For the remaining languages – Russian, Persian, and Chinese – we use Polyglot (Al-Rfou et al., 2013).⁴

We train the monolingual embeddings using the implementation of FastText provided in the Gensim software (Řehůřek and Sojka, 2010).⁵ We use the Skip-gram algorithm with 5 negative samples, and a minimum word frequency of 2. All other hyperparameters are left at their default values.

To control for corpus size, we train on 50 million tokens for each language; we do this by making a custom corpus reader that stops iteration when it has read more than 50 million tokens.
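Such a capped reader can be sketched as follows. This is our own illustrative version, not the thesis code; a generator is shown for brevity, whereas in practice Gensim needs a restartable iterable (a class implementing `__iter__`), since training makes several passes over the corpus.

```python
def capped_corpus(sentences, max_tokens=50_000_000):
    """Yield tokenised sentences, stopping once the token budget is exceeded.

    `sentences` is any iterable of pre-tokenised sentences (lists of strings),
    e.g. a streaming reader over an extracted Wikipedia dump.
    """
    seen = 0
    for sentence in sentences:
        if seen > max_tokens:   # stop after the budget has first been exceeded
            break
        seen += len(sentence)
        yield sentence

# Toy example with a budget of 3 tokens: the third sentence is not yielded.
print(list(capped_corpus([["a", "b"], ["c", "d"], ["e"]], max_tokens=3)))
```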

¹ https://dumps.wikimedia.org/backup-index-bydb.html
² https://github.com/attardi/wikiextractor/wiki
³ http://www.nltk.org
⁴ http://polyglot.readthedocs.org
⁵ https://radimrehurek.com/gensim/index.html


For the data size experiments, we also train on data sizes of 5 million, 20 million, and 100 million tokens. In these cases, we stop iteration at the specified number of tokens.⁶

6.3.2. Dictionaries

Facebook Research's MUSE library⁷ provides ground truth bilingual dictionaries between English and 55 other languages (and vice versa), and between six European languages. Unfortunately, dictionaries do not exist between the set of languages that we use in our experiments; these languages only have mappings to English. To derive dictionaries between every pair of languages in our experiments, we take the set of English source words which have at least one unigram translation in each target language as entries, and map the word translations between each pair of languages.
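A sketch of this pivoting procedure (the dictionary structures and function name are our own):

```python
def pivot_dictionaries(en_to_a, en_to_b):
    """Derive an A→B dictionary through English as a pivot.

    en_to_a, en_to_b: dicts mapping an English unigram to a set of unigram
    translations. Two target-language words are paired whenever they share
    an English source word - a naive assumption of transitivity.
    """
    a_to_b = {}
    for en_word, a_words in en_to_a.items():
        b_words = en_to_b.get(en_word)
        if not b_words:
            continue  # keep only entries translatable into both languages
        for a_word in a_words:
            a_to_b.setdefault(a_word, set()).update(b_words)
    return a_to_b

# Toy example showing how polysemy in the pivot conflates senses:
en_fr = {"light": {"léger", "lumière"}}
en_ja = {"light": {"karui", "hikari"}}
print(pivot_dictionaries(en_fr, en_ja))
# Every French word for "light" maps to both Japanese senses.
```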

This is a naive approach, since it assumes transitivity of translations: i.e. that words in two target languages that map to the same English word are therefore correct translations of each other. As we alluded to previously, this property does not hold in reality, owing to differences such as morphology (some languages have gender and case inflections) and polysemy (some English words have multiple meanings, which map to more than one word in a target language).

Recall from Section 3 the example of the polysemous mistranslation of light. We have this problem in our dictionaries, where the weight sense of light maps to the luminosity sense of light. There are also mistranslations such as masculine possessive pronouns to feminine possessive pronouns.

We mitigate this in that, for each source language word in a dictionary, there is a set of possible target language translations rather than only one, and the mapping need only retrieve one to get an accuracy point. For example, the English possessive pronoun my could map to any one of mon (masculine); ma (feminine); or mes (plural) in French. In the case of French to Japanese, léger (light) could map to either hikari (luminosity) or karui (light-weight). Therefore, we can expect the P@1 accuracy score to slightly overestimate the number of correct translations.

Using this procedure, we end up with ≈ 7000 dictionary entries, where a dictionary entry is a unigram in English that maps to one or more unigrams in each target language. We split these entries into a training/development/test split of 4500, 500, and 1500 entries.

It is worth noting that the dictionaries are ordered roughly by word frequency, so the training data contains much more frequent words, and the development and test data much less frequent words.

With these splits, we create training dictionaries in each direction between the evaluation languages containing the entries from the training split, and development and test dictionaries by the same method. We create further, smaller versions of the training dictionary in each direction by sampling the first n entries from the training dictionary.

⁶ In the case of the 100 million token experiment, we found that the Persian Wikipedia dump that we used had only ≈ 90 million tokens available, and this is the amount we used in this experiment. All other languages have more than 100 million tokens, and we used 100 million. Though this is a mismatch, we do not expect it to have a drastic impact on the experimental setting.

⁷ https://github.com/facebookresearch/MUSE

6.4. Evaluation metric

The evaluation metric that we consider is bilingual lexicon induction (BLI) accuracy, measured as Precision at Rank One (P@1). This means that for each word pair x_s and y_t in the test dictionary, y_t must be the nearest word in the target space to x_s in the source space according to the retrieval criterion. The retrieval criterion we use is CSLS, described in Section 2.4.
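A minimal sketch of the P@1 evaluation follows. All names are our own, and for simplicity this sketch retrieves by plain cosine nearest neighbour rather than CSLS; a prediction counts as correct if it matches any of the acceptable translations for the source word.

```python
import numpy as np

def precision_at_1(mapped_src, tgt_emb, test_dict):
    """P@1 for bilingual lexicon induction.

    mapped_src: {source word: vector already mapped into the target space}
    tgt_emb:    {target word: vector}
    test_dict:  {source word: set of acceptable target translations}
    """
    tgt_words = list(tgt_emb)
    T = np.array([tgt_emb[w] for w in tgt_words], dtype=float)
    T = T / np.linalg.norm(T, axis=1, keepdims=True)
    hits = 0
    for src, golds in test_dict.items():
        v = np.asarray(mapped_src[src], dtype=float)
        v = v / np.linalg.norm(v)
        pred = tgt_words[int((T @ v).argmax())]   # cosine nearest neighbour
        hits += pred in golds                     # any acceptable translation counts
    return hits / len(test_dict)

# Toy example: "my" retrieves "mon", which is one of its gold translations.
score = precision_at_1(
    {"my": [1.0, 0.0]},
    {"mon": [0.9, 0.1], "chat": [0.0, 1.0]},
    {"my": {"mon", "ma", "mes"}},
)
print(score)  # 1.0
```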

6.5. Hypotheses

We predict the following:

• In line with previous research (Conneau et al., 2017; Glavaš et al., 2019; Søgaard et al., 2018), we predict that the more distant the language pairs, the lower the results we can expect in the cross-lingual mapping overall, but especially in unsupervised mapping, which relies on the isomorphic assumption. More concretely, we predict that English and French will have high accuracy in all cases, and a narrow gap between the unsupervised and supervised performances.

• We expect that identical words will be a noisy but useful feature in most cases, but especially where the scripts are the same, as these will provide high coverage and a more diverse set of words (whereas in heterogeneous pairs the identical strings in one of the languages will be mostly composed of loan words, technical terms or foreign words, and will mostly be in a separate space).

• We expect that the number context, as a subset of identical words, will provide less noisy coverage, but also less diverse coverage than the identical words setting. We expect that this will perform similarly well with identical words across heterogeneous pairs, but less well across pairs with the same script.

• We expect that eigenvalue difference will correlate negatively with accuracy in unsupervised settings. In supervised – gold or weak – settings, we expect the negative effect of eigenvalue difference to be reduced.

• Regarding data size, we expect accuracy to degrade with smaller corpus sizes in all cases, but especially in unsupervised settings. We expect that more distant languages will show faster degradation at lower data sizes, and that the gap between supervised and unsupervised settings will be greater at lower data sizes.
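One way to operationalise the eigenvalue difference referred to in the fourth hypothesis, loosely following the eigenvector-similarity idea of Søgaard et al. (2018), is sketched below: build a nearest-neighbour graph over each embedding space, compare the spectra of the graph Laplacians. The kNN-graph construction, neighbourhood size, and number of eigenvalues compared here are illustrative assumptions, not the exact procedure used in this thesis.

```python
import numpy as np

def laplacian_eigenvalues(emb, n_neighbors=5):
    """Eigenvalues of the unnormalised Laplacian of a cosine kNN graph."""
    x = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = x @ x.T
    np.fill_diagonal(sims, -np.inf)          # a node is not its own neighbour
    adj = np.zeros_like(sims)
    nearest = np.argsort(-sims, axis=1)[:, :n_neighbors]
    for i, js in enumerate(nearest):
        adj[i, js] = 1.0
    adj = np.maximum(adj, adj.T)             # symmetrise the kNN graph
    lap = np.diag(adj.sum(axis=1)) - adj     # L = D - A
    return np.sort(np.linalg.eigvalsh(lap))

def isospectral_difference(src_emb, tgt_emb, k=10, n_neighbors=5):
    """Sum of squared differences between the k smallest eigenvalues."""
    ev_s = laplacian_eigenvalues(src_emb, n_neighbors)[:k]
    ev_t = laplacian_eigenvalues(tgt_emb, n_neighbors)[:k]
    return float(np.sum((ev_s - ev_t) ** 2))
```

Under this measure, identical spaces have a difference of zero, and larger values indicate less nearly isospectral (and hence less isomorphic) graph structure.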

6.6. Results

6.6.1. Overall

Table 6.2 shows a sample of the results; these are further plotted in Figure 6.1. A full display of the results for each language pair and each supervision setting is available in the Appendix.

Supervision    en→fr  zh→ja  ru→fr  fa→zh  fr→ja    avg
Unsupervised   13.58   3.15   0.6    1.29   0.07  3.738
50 words       13.37   3.15   0.8    1.08   0     3.68
100 words      14.45   2.6    0.33   0.94   0.2   3.704
500 words      14.18   3.28   1.2    0.94   0.27  3.974
1000 words     14.58   3.21   1.34   1.37   0.07  4.114
2500 words     14.31   3.56   1.54   0.79   0.07  4.054
4500 words     14.38   3.08   1.47   1.08   0.27  4.056
Numbers        13.04   3.08   0.4    1.15   0     3.534
Identical      14.78   3.28   0.87   0.86   0     3.958

Table 6.2.: Percent accuracy scores on a sample of language pairs with various supervision types.

Overall, we find that very few language pairs perform well on cross-lingual mapping. Of the 15 language pairs under investigation in this thesis, only English/French and Chinese/Japanese score reliably above zero at the default corpus size of 50 million tokens. The rest of the pairings perform poorly in almost all cases.

6.6.2. Dictionary settings

As expected, the numbers context gives lower results than the identical words context overall, particularly for the English/French pairing, where there are many identical words. Interestingly, this was not the case with the Japanese/Chinese pairing; in fact, for this pairing we see barely any difference at all between supervision contexts.

In heterogeneous pairs such as Persian and French, or French and Japanese, we see the best results when gold dictionary data is used, though the improvement is slight.

6.6.3. Data size

Table 6.3 and Figure 6.2 show the effect of data size on BLI accuracy. Again, full results over all data sizes, language pairs, and supervision contexts are available in the Appendix.

As expected, we find that the accuracy of cross-lingual mapping increases with data size in general, though for most pairs this progression is slow.

When the embeddings are trained on 5 million tokens, all language pairs show an accuracy of close to zero. As the corpus size increases, there is an increase in the performance of some language pairs. However, for most language pairs accuracy remains almost at zero, regardless of the size of the training corpus.

The profile of the language pairs which do show consistent improvement with increased corpus size is in line with our intuitions: a closely related language pair (English and French) and a pair with extensive lexical borrowing (Chinese and Japanese).


Figure 6.1.: The effect of different levels of supervision on BLI accuracy. The x axis shows the number of dictionary entries used as a training dictionary. The dash-dot (−.) and dashed (−−) lines show the accuracy with numbers and identical strings respectively as weak supervision.
