http://www.diva-portal.org

Postprint

This is the accepted version of a paper presented at the Swedish Symposium on Deep Learning (SSDL).

Citation for the original published paper:

Basirat, A. (2019) Random Word Vectors In: Norrköping

N.B. When citing this work, cite the original published paper.

Permanent link to this version:


Random Word Vectors

Ali Basirat

Department of Linguistics and Philology, Uppsala University

Uppsala, Sweden

ali.basirat@lingfil.uu.se

1 Introduction

Word vectors, also known as word embeddings (Mikolov et al., 2013; Pennington et al., 2014; Peters et al., 2018), are one of the key components in neural network-based approaches to natural language processing (NLP). Word vectors are real-valued vectors associated with words in such a way that word similarities are reflected by the similarities of the corresponding vectors. The process of embedding words into a vector space is called word embedding.

The conventional approaches to word embedding (Sahlgren, 2006; Mikolov et al., 2013; Pennington et al., 2014; Basirat, 2018) are based on the frequency with which words appear in different contextual environments. In these approaches, words are distributed into a high-dimensional vector space based on their frequency of occurrence in different contexts, which form the dimensions of the vector space. This vector space undergoes a dimensionality reduction technique, resulting in a set of low-dimensional vectors associated with words. Although the conventional word vectors can encode many linguistic properties of words, they do not have enough capacity to model polysemy and homonymy in natural languages (Basirat and Tang, 2019). Polysemy and homonymy refer to the fact that a word (or, to be more accurate about the linguistic definition of a word, a word form) can convey different meanings depending on the context it appears in.1

The conventional approaches to word embedding associate a polyseme, a word with multiple meanings, with a vector that points somewhere between the possible meanings of the word. Although this single point carries some information about the multiple meanings of the word, it does not tell us the exact meaning of the word in a given context. In practice, we need a model that associates a word with different vectors depending on the context of the word. Such word embedding methods have recently been developed as part of a language model in which a recurrent neural network (RNN) is trained to predict a word given a sequence of words (i.e., a context) (McCann et al., 2017; Peters et al., 2018).

1 Polysemy refers to a single word that has several meanings, whereas homonymy refers to multiple words with different meanings that are written or pronounced in the same way.

Although the context-aware approaches to word embedding have been successfully used in many NLP applications, they do not give us any clear information about how the word vectors are generated, since, as discriminative models, they are trained to classify between multiple classes (words, in the case of language models). This lack of information can be mitigated by using generative methods that learn the joint distribution between contexts and words.

In this abstract, we introduce a random word vector as an extension of a word vector in its conventional form. A random word vector is a random vector associated with a word. It is a generative model that generates word vectors with respect to the contextual environment of words. In this approach, a word is associated with a distribution, as opposed to the conventional approach, in which a word is associated with a single vector.

Random word vectors are similar to context-aware word vectors in the sense that both generate word vectors with respect to the contextual environments of words. The main difference between the two approaches is that random word vectors are based on the generative approaches of machine learning and are trained as part of a joint distribution with contexts, whereas context-aware word vectors are based on the discriminative approaches of machine learning and are trained to classify between words with respect to contexts.


A joint distribution of words and contexts can be modeled by a restricted Boltzmann machine (RBM) (Smolensky, 1986) that takes a vector representation of a context as its input and generates a feature vector on its hidden units (Coates et al., 2011). The feature vector is considered as a random word vector. Using the contrastive divergence algorithm, the Kullback-Leibler divergence between the data distribution and the distribution defined by the model is expected to be minimized (Hinton, 2002). Hence, the feature vector provides us with a dense summary of the context. In the following, we present preliminary results obtained from some small-scale experiments with a basic RBM architecture.
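
To make the training procedure concrete, the sketch below implements a minimal Bernoulli-Bernoulli RBM trained with one-step contrastive divergence (CD-1) in NumPy. It is a generic illustration of the technique rather than the implementation used in our experiments; the class and variable names, the learning rate, and the toy data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    """Minimal Bernoulli-Bernoulli RBM trained with 1-step contrastive divergence."""

    def __init__(self, n_visible, n_hidden, lr=0.05):
        self.W = rng.normal(0.0, 0.01, (n_visible, n_hidden))  # visible-to-hidden weights
        self.b = np.zeros(n_visible)                            # visible biases
        self.c = np.zeros(n_hidden)                             # hidden biases
        self.lr = lr

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.c)

    def visible_probs(self, h):
        return sigmoid(h @ self.W.T + self.b)

    def cd1_step(self, v0):
        # Positive phase: hidden probabilities and samples given the data.
        ph0 = self.hidden_probs(v0)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        # Negative phase: one Gibbs step back to the visible layer and up again.
        pv1 = self.visible_probs(h0)
        v1 = (rng.random(pv1.shape) < pv1).astype(float)
        ph1 = self.hidden_probs(v1)
        # CD-1 approximation to the log-likelihood gradient.
        batch = v0.shape[0]
        self.W += self.lr * (v0.T @ ph0 - v1.T @ ph1) / batch
        self.b += self.lr * (v0 - v1).mean(axis=0)
        self.c += self.lr * (ph0 - ph1).mean(axis=0)

    def transform(self, v):
        # Hidden activation probabilities serve as the random word vector.
        return self.hidden_probs(v)

# Toy usage: 20 visible units, 5 hidden units, random binary "context vectors".
rbm = RBM(n_visible=20, n_hidden=5)
data = (rng.random((64, 20)) < 0.5).astype(float)
for _ in range(200):
    rbm.cd1_step(data)
print(rbm.transform(data).shape)  # (64, 5)
```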

2 RBM Architecture

A restricted Boltzmann machine (RBM) is an undirected graphical model with a set of hidden and visible units. We use a standard RBM with Bernoulli variables as visible and hidden units. The inputs to the RBM are context vectors whose elements are binarized with respect to their signs: elements larger than or equal to zero are set to 1 and elements smaller than zero are set to 0.

Linguistically, the context of a word can be defined in different ways. It can be defined as a sequence of surrounding words, or as the words that stand in certain syntactic and semantic relationships with a target word. If we limit the context of a word to its surrounding words, i.e., a limited number of words before and after a target word, then the context vector of a target word is simply a vector built by concatenating the word vectors of its surrounding words. This type of context is called a window-based context, and the number of surrounding words is referred to as the length of the context.
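
The sketch below illustrates how such a binarized window-based context vector can be assembled by concatenating the vectors of the surrounding words. The zero-padding at sentence boundaries and the dictionary lookup for the word vectors are assumptions made for illustration; the abstract does not specify how these cases are handled.

```python
import numpy as np

def context_vector(sentence, position, word_vectors, window=4, dim=300):
    """Binarized window-based context vector for the word at `position`.

    Concatenates the vectors of the `window` words before and after the
    target word (positions outside the sentence are zero-padded) and
    binarizes every element on its sign: >= 0 becomes 1, < 0 becomes 0.
    """
    parts = []
    for offset in list(range(-window, 0)) + list(range(1, window + 1)):
        i = position + offset
        if 0 <= i < len(sentence):
            parts.append(word_vectors.get(sentence[i], np.zeros(dim)))
        else:
            parts.append(np.zeros(dim))   # pad beyond sentence boundaries
    ctx = np.concatenate(parts)           # 2 * window * dim elements
    return (ctx >= 0).astype(float)       # sign-based binarization

# Example: with window=4 and 300-dimensional word vectors the context
# vector has 2 * 4 * 300 = 2400 elements.
vecs = {"cat": np.random.randn(300), "sat": np.random.randn(300)}
print(context_vector(["the", "cat", "sat", "down"], 2, vecs).shape)  # (2400,)
```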

3 Preliminary Experiments

We train a Bernoulli RBM as explained above with a window-based context of length 8 (4 words before and 4 words after a target word) on a fairly small corpus containing 1000 English sentences. The context vectors are built from a set of 300-dimensional word vectors collected by Grave et al. (2018). This results in a long context vector with 2400 elements, equal to the number of visible units of the RBM. We train two RBMs, with 10 and 300 hidden units respectively, which determine the dimensionality of the random word vectors.
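
For reference, a comparable setup can be reproduced with the off-the-shelf BernoulliRBM from scikit-learn; this is not necessarily the implementation used here, and the training hyperparameters and the placeholder data are illustrative only.

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM

# Placeholder for the binarized context vectors: 1000 contexts of 2400
# elements (length-8 window over 300-dimensional word vectors).
X = (np.random.default_rng(0).random((1000, 2400)) >= 0.5).astype(float)

# Two RBMs whose hidden layers give 10- and 300-dimensional random word vectors.
rbm_10 = BernoulliRBM(n_components=10, learning_rate=0.05, n_iter=20, random_state=0)
rbm_300 = BernoulliRBM(n_components=300, learning_rate=0.05, n_iter=20, random_state=0)

rwv_10 = rbm_10.fit(X).transform(X)    # shape (1000, 10)
rwv_300 = rbm_300.fit(X).transform(X)  # shape (1000, 300)
```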


Figure 1: Parsing scores obtained from conventional word vectors (CWV) and their concatenation with 10- and 300-dimensional random word vectors (CWV + 10d/300d RWV).

The RBMs can be embedded as independent feature extractor components in NLP systems to generate random word vectors with respect to the context of the words. We use the RBM in a dependency parser (de Lhoneux et al., 2017), which uses a BiLSTM as its core feature extractor. The BiLSTM takes a conventional word vector in its input layer and encodes it into a feature vector that is passed to a multi-layer perceptron to make a decision between the different possible actions of the parser. We concatenate a random word vector, generated by the RBM model, with the conventional word vector and feed it into the BiLSTM. The parser is trained on a small corpus consisting of 100 sentences from the English part of the Universal Dependencies corpus (Nivre et al., 2016).
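
Building on the earlier sketches, the following hypothetical function shows how the input representation of a single token could be assembled before it is fed to the BiLSTM: the conventional word vector of the token is concatenated with the random word vector generated by the RBM from the token's window-based context. It reuses `context_vector` and the RBM `transform` from the previous code blocks and is a sketch of the integration, not the parser's actual code.

```python
import numpy as np

def parser_input(sentence, position, word_vectors, rbm, window=4, dim=300):
    """Per-token parser input: conventional word vector concatenated with
    the RBM-generated random word vector of the token's context."""
    cwv = word_vectors.get(sentence[position], np.zeros(dim))
    ctx = context_vector(sentence, position, word_vectors, window, dim)
    rwv = rbm.transform(ctx.reshape(1, -1))[0]   # hidden activations of the RBM
    return np.concatenate([cwv, rwv])            # fed to the BiLSTM input layer
```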

Figure 1 shows the learning curves of the parser trained with the conventional word vectors (CWV) and with the conventional word vectors concatenated with the random word vectors (CWV + 300d RWV and CWV + 10d RWV). The figure shows that the random word vectors are not beneficial to the parsing task. With 10-dimensional vectors, the parsing results are as good as our baseline model trained with conventional word vectors alone. Does this observation mean that random word vectors do not capture meaningful information for parsing? What are the effective parameters of a random word vector to be improved? These are among the questions that we plan to answer in future work.


4 Future Work

In future work, we will use large-scale standard data sets to train the RBM and the parser. We will also use more advanced architectures, such as deep belief networks with multiple layers and different types of hidden units. In terms of word embeddings, different types of contexts and initial word vectors are expected to be explored. The context representation is another important parameter that we are going to study in detail. It can be built by a simple vector concatenation operator, as we showed in this abstract, or through more complicated operators such as convolutional neural networks or different variants of recurrent neural networks.

Conclusion

We introduced a random word vector as an extension of a conventional word vector. Random word vectors are generative models that associate a distribution with each word of a language. The generative nature of random word vectors provides us with a mathematical model to simulate polysemy and homonymy in natural languages, i.e., the fact that a word can convey different meanings depending on the context it appears in. We proposed a restricted Boltzmann machine for this aim that takes a vector representation of a context on its visible units and maps it onto a feature vector on its hidden units. This feature vector is considered as a random word vector. The proposed model is tested in a dependency parsing task with small-scale training data. The preliminary results on a fairly small data set do not show any improvement over our baseline results. In future work, we will train our model on sufficiently large data sets and explore different techniques to improve the quality of random word vectors.

References

Ali Basirat. 2018. Principal Word Vectors. Ph.D. thesis, Uppsala University.

Ali Basirat and Marc Tang. 2019. Linguistic information in word embeddings. In Agents and Artificial Intelligence, pages 492–513, Cham. Springer International Publishing.

Adam Coates, Honglak Lee, and Andrew Y. Ng. 2011. An analysis of single-layer networks in unsupervised feature learning. In International Conference on Artificial Intelligence and Statistics (AISTATS).

Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. 2018. Learning word vectors for 157 languages. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).

Geoffrey E. Hinton. 2002. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800.

Miryam de Lhoneux, Yan Shao, Ali Basirat, Eliyahu Kiperwasser, Sara Stymne, Yoav Goldberg, and Joakim Nivre. 2017. From raw text to Universal Dependencies – Look, no tags! In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 207–217.

Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6294–6305. Curran Associates, Inc.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Dan Zeman. 2016. Universal Dependencies v1: A multilingual treebank collection. In The 10th edition of the Language Resources and Evaluation Conference.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In The 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics.

Magnus Sahlgren. 2006. The Word-Space Model. Ph.D. thesis, Stockholm University.

Paul Smolensky. 1986. Information processing in dynamical systems: Foundations of harmony theory. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1, pages 194–281.
