(1)

Machine Learning for NLP Lecture 4: Neural networks

UNIVERSITY OF GOTHENBURG

Richard Johansson

(2)

the deep learning tsunami

I in several fields, such as speech and image processing, neural network or deep learning models have led to dramatic improvements

I Manning: 2015 seems like the year when the full force of the [deep learning] tsunami hit the major NLP conferences

I out of the machine learning community: NLP is kind of like a rabbit in the headlights of the deep learning machine, waiting to be flattened

I so, what's the hype about?

(3)

overview

I neural networks (NNs) are systems that learn to form useful abstractions automatically

I learn to form larger units from small pieces

I appealing because it can reduce the feature engineering effort

I image borrowed from Josephine Sullivan:

I NNs are excellent for noisy problems such as speech and image processing

(4)

causes of the NN resurgence

I NNs seem to have a hype cycle of about 20 years

I there are a number of reasons for the one we're currently in

I the most important is increasing computational capacity

I for instance, the famous cat paper by Stanford/Google required 1,000 machines (16,000 CPU cores)

I Le et al: Building high-level features using large scale unsupervised learning, ICML 2012.

I much of the recent research is coming out of Google (DeepMind), Microsoft, Facebook, etc.

I using GPUs (graphics processors) can speed up training

I also, a number of new methods proposed recently

(5)

overview

I today: NNs for classification

I Monday: NNs for sequences (e.g. tagging, translation)

(6)

overview

basic ideas in neural network classifiers

overview of neural network libraries

word embeddings and distributional semantics

(7)

recap: linear separability

I some datasets can't be modeled with a linear classifier!

I a dataset is linearly separable if there exists a w that gives a perfect classification of the training set

(8)

example: XOR dataset

import numpy
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

X = numpy.array([[1, 1], [1, 0], [0, 1], [0, 0]])
Y = ['no', 'yes', 'yes', 'no']
clf = LinearSVC()
clf.fit(X, Y)
# linear inseparability, so we get less than 100% accuracy
print(accuracy_score(Y, clf.predict(X)))

(9)

abstraction by forming feature combinations

I recall from a previous lecture that we can add useful combinations of features to make the dataset separable:

very good   very-good   Positive
very bad    very-bad    Negative
not good    not-good    Negative
not bad     not-bad     Positive

(10)

example: XOR dataset with a combination feature

# feature1, feature2, feature1&feature2
X = numpy.array([[1, 1, 1],
                 [1, 0, 0],
                 [0, 1, 0],
                 [0, 0, 0]])
Y = ['no', 'yes', 'yes', 'no']
clf = LinearSVC()
clf.fit(X, Y)
# now we have linear separability, so we get 100%
print(accuracy_score(Y, clf.predict(X)))

(11)

expressing feature combinations as sub-classifiers

I instead of defining a rule, such as x3 = x1 AND x2, we could imagine that the combination feature x3 would be computed by a separate classifier, for instance LR

I we could train a classifier using the output of sub-classifiers (a small sketch follows below)
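As a minimal sketch of this idea (my own illustration, not from the slides): a logistic "sub-classifier" with hand-picked weights computes an approximate x1 AND x2, and its output is appended as a third feature, just like the combination feature on the previous slide. In a neural network, these sub-classifier weights would be learned instead of hand-picked.

import numpy
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

def logistic(scores):
    return 1 / (1 + numpy.exp(-scores))

X = numpy.array([[1, 1], [1, 0], [0, 1], [0, 0]])
Y = ['no', 'yes', 'yes', 'no']

# a logistic "sub-classifier" with hand-picked weights that approximates x1 AND x2
w_sub = numpy.array([10.0, 10.0])
bias_sub = -15.0
x3 = logistic(X.dot(w_sub) + bias_sub)        # close to 1 only for [1, 1]

# append its output as an extra feature, as in the previous XOR example
X_extended = numpy.hstack([X, x3.reshape(-1, 1)])
clf = LinearSVC()
clf.fit(X_extended, Y)
print(accuracy_score(Y, clf.predict(X_extended)))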

(12)

neurons

I historically, NNs were inspired by how biological neural systems work, hence the name

I as far as I know, modern NNs and modern neuroscience don't have much in common

I Andrew Ng: A single neuron in the brain is an incredibly complex machine that even today we don't understand. A single `neuron' in a neural network is an incredibly simple mathematical function that captures a minuscule fraction of the complexity of a biological neuron. So to say neural networks mimic the brain, that is true at the level of loose inspiration, but really artificial neural networks are [...]

(13)

recap: the logistic or sigmoid function

def logistic(scores):
    return 1 / (1 + numpy.exp(-scores))

(14)

a multilayered classifier

I a feedforward neural network or multilayer perceptron consists of connected layers of classifiers

I the intermediate classifiers are called hidden units

I the final classifier is called the output unit

I let's assume two layers for now

I each hidden unit hi computes its output based on its own weight vector whi:

hi = f(whi · x)

I and then the output is computed from the hidden units:

y = f(wo · h)

I the function f is called the activation function

(15)

two-layered feedforward NN: figure

(16)

implementation in NumPy

I recall that a sequence of dot products can be seen as a matrix multiplication

I in NumPy, the NN can be expressed compactly with matrix multiplication

h = logistic(Wh.dot(x))
y = logistic(Wo.dot(h))
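To make the shapes concrete, here is a small self-contained version of the two lines above; the dimensions (4 input features, 3 hidden units) and the random weights are made up purely for illustration.

import numpy

def logistic(scores):
    return 1 / (1 + numpy.exp(-scores))

rng = numpy.random.RandomState(0)
Wh = rng.normal(size=(3, 4))     # one weight vector per hidden unit, as rows
Wo = rng.normal(size=3)          # output weight vector
x = numpy.array([1.0, 0.0, 2.0, 0.5])

h = logistic(Wh.dot(x))          # shape (3,): one activation per hidden unit
y = logistic(Wo.dot(h))          # a single output score between 0 and 1
print(h, y)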

(17)

expressivity of feedforward NNs

I Hornik's universal approximation theorem shows that feedforward NNs can approximate any (bounded) mathematical function

I Hornik (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2), 251–257.

I and this is true even with a single hidden layer!

I however, this is mainly of theoretical interest

I the theorem does not say how many hidden units we need

I and it doesn't say how the network should be trained


(19)

deep learning

I why the deep in deep learning?

I although a single hidden layer is sufficient in theory, in practice it can be better to have several hidden layers

(20)

training feedforward neural networks

I training a NN consists of finding the weights in the layers

I so how do we find those weights?

I exactly as we did for the SVM and LR!

I state an objective function with a loss

I log loss, hinge loss, etc

I and then tweak the weights to make that loss small

I again, we can use (stochastic) gradient descent to minimize the loss


(22)

example

I let's use two layers with logistic units, and then the log loss:

h = σ(Wh · x)
y = σ(Wo · h)
loss = − log(y)

I so the whole thing becomes

loss = − log σ(Wo · σ(Wh · x))

I now, to do gradient descent, we need to compute gradients w.r.t. the weights Wh and Wo

I ouch! it looks completely unwieldy!


(24)

the chain rule of derivatives/gradients

I NNs consist of functions applied to the output of other functions

I the chain rule is a useful trick from calculus that can be used in such situations

I assume that we apply the function f to the output of g

I then the chain rule says how we can compute the gradient of the combination:

gradient of f (g(x)) = gradient of f (g) · gradient of g(x)
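As a quick worked step (my own illustration, in the notation of the earlier slides): take f = − log σ and g(wo) = wo · h, so that f(g(wo)) is the output-layer part of the loss above. The derivative of − log σ(z) with respect to z is −(1 − σ(z)), and the gradient of wo · h with respect to wo is h, so the chain rule gives

gradient w.r.t. wo of − log σ(wo · h) = −(1 − σ(wo · h)) · h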

(25)

the general recipe: backpropagation

I using the chain rule, the gradients of the weights in each layer can be computed from the gradients of the layers after it

I this trick is called backpropagation

I it's not difficult, but involves a lot of book-keeping

I fortunately, there are computer programs that can do the algebra for us!

I in NN software, we usually just declare the network and the loss, then the gradients are computed under the hood
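For a concrete (and entirely optional) illustration, here is a minimal NumPy sketch of backpropagation for the two-layer logistic network from the earlier slides, trained with stochastic gradient descent on the XOR data. It is my own sketch, not code from the course: bias terms (not shown in the slide formulas) are added by appending a constant 1 to the input and hidden layers, the loss is the binary cross-entropy (the log loss for both classes), and the learning rate, epoch count, and random initialization are arbitrary choices; whether the net actually solves XOR can depend on the initialization.

import numpy

def logistic(scores):
    return 1 / (1 + numpy.exp(-scores))

# XOR data; targets coded as 1 = 'yes', 0 = 'no'
X = numpy.array([[1, 1], [1, 0], [0, 1], [0, 0]])
T = numpy.array([0, 1, 1, 0])

rng = numpy.random.RandomState(0)
n_hidden = 3
Wh = rng.normal(scale=1.0, size=(n_hidden, 3))   # hidden weights (last column: bias)
wo = rng.normal(scale=1.0, size=n_hidden + 1)    # output weights (last entry: bias)
eta = 1.0                                        # learning rate

for epoch in range(5000):
    for x, t in zip(X, T):
        x_b = numpy.append(x, 1.0)               # constant 1 acts as a bias input
        # forward pass
        h = logistic(Wh.dot(x_b))
        h_b = numpy.append(h, 1.0)
        y = logistic(wo.dot(h_b))
        # backward pass: the chain rule, layer by layer
        delta_o = y - t                          # gradient of the log loss w.r.t. the output score
        grad_wo = delta_o * h_b
        delta_h = delta_o * wo[:n_hidden] * h * (1 - h)
        grad_Wh = numpy.outer(delta_h, x_b)
        # stochastic gradient descent step
        wo -= eta * grad_wo
        Wh -= eta * grad_Wh

for x in X:
    x_b = numpy.append(x, 1.0)
    h_b = numpy.append(logistic(Wh.dot(x_b)), 1.0)
    print(x, logistic(wo.dot(h_b)))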

(26)

optimizing NNs

I unlike the linear classifiers we studied previously, NNs have non-convex objective functions with a lot of local minima

I so the end result depends on initialization

[figure: contour plot of a non-convex objective function with several local minima]

(27)

training efficiency of NNs

I our previous classiers took seconds or minutes to train

I NNs tend to take minutes, hours, days, weeks . . .

I depending on the complexity of the network and the amount of training data

I NNs use a lot of linear algebra (matrix multiplications), so it is worth putting effort into speeding up the math (see the sketch after this list):

I parallelize as much as possible

I use optimized math libraries

I use a GPU
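As a small sketch of the "speed up the math" point (my own illustration, not from the slides): computing the hidden layer for a whole batch of examples as a single matrix multiplication gives the same result as looping over the examples in Python, but lets the optimized math library (or a GPU) do the work. The sizes below are made up.

import numpy

def logistic(scores):
    return 1 / (1 + numpy.exp(-scores))

rng = numpy.random.RandomState(0)
X = rng.normal(size=(10000, 100))     # 10,000 examples with 100 features
Wh = rng.normal(size=(50, 100))       # 50 hidden units

# one example at a time: a Python loop over 10,000 small dot products
H_loop = numpy.array([logistic(Wh.dot(x)) for x in X])

# the whole batch as a single matrix multiplication: same result, much faster
H_batch = logistic(X.dot(Wh.T))

print(numpy.allclose(H_loop, H_batch))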

(28)

overview

basic ideas in neural network classifiers

overview of neural network libraries

word embeddings and distributional semantics

(29)

neural network software: Python

I scikit-learn currently has very limited support for NNs

I the main NN software in the Python world used to be Theano

I developed by Yoshua Bengio's group in Montréal

I http://deeplearning.net/software/theano

I last year, Google released their NN library called TensorFlow

I https://www.tensorflow.org

(30)

neural network software: Python (2)

I Theano and TensorFlow do a lot of useful math stuff and integrate nicely with the GPU, but they can be a bit low-level

I so there are a few libraries that package Theano or TensorFlow in a more user-friendly way, similar to scikit-learn

I Keras: https://github.com/fchollet/keras

I skflow: now included in TensorFlow

(31)

other neural network software

I Caffe: http://caffe.berkeleyvision.org/

I Torch: http://torch.ch/

(32)

coding example with Keras

from keras.models import Sequential
from keras.layers import Dense, Activation

keras_model = Sequential()
n_hidden = 3
keras_model.add(Dense(input_dim=X.shape[1], output_dim=n_hidden))
keras_model.add(Activation("sigmoid"))
keras_model.add(Dense(input_dim=n_hidden, output_dim=1))
keras_model.add(Activation("sigmoid"))
keras_model.compile(loss='binary_crossentropy',
                    optimizer='sgd')   # optimizer not shown on the slide; 'sgd' assumed
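The slide stops at the compile call. A possible continuation (my own sketch, assuming the XOR data X, Y from the earlier slides and the Keras 1.x API of the time, where the epoch argument was called nb_epoch) would train the network and inspect its predictions:

Y_binary = numpy.array([1 if label == 'yes' else 0 for label in Y])
keras_model.fit(X, Y_binary, nb_epoch=2000, verbose=0)   # 'epochs' in newer Keras versions
print(keras_model.predict(X))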

(33)

coding example with (high-level) TensorFlow

# imports for the old tf.contrib "learn" API used at the time (assumed)
from tensorflow.contrib.layers import real_valued_column
from tensorflow.contrib.learn import DNNClassifier

cols = [real_valued_column("", dimension=2)]
classifier = DNNClassifier(feature_columns=cols,
                           hidden_units=[3],
                           n_classes=2,
                           model_dir="/tmp/tftest_model")
classifier.fit(x=X, y=Y)

(34)

overview

basic ideas in neural network classifiers

overview of neural network libraries

word embeddings and distributional semantics

(35)

representing words in NNs

I NN implementations tend to prefer dense vectors

I this can be a problem if we are using word-based features

I recall the way we code word features as sparse vectors:

tomato → [0, 0, 1, 0, 0, . . . , 0, 0, 0]

carrot → [0, 0, 0, 0, 0, . . . , 0, 1, 0]

I the solution: represent words with low-dimensional vectors, in a way so that words with similar meaning have similar vectors

tomato → [0.10, −0.20, 0.45, 1.2, −0.92, 0.71, 0.05]

carrot → [0.08, −0.21, 0.38, 1.3, −0.91, 0.82, 0.09]

(36)

discovering meaning automatically

I there's a growing interest in methods that pick up some sort of word meaning simply by observing raw text

I these methods require large amounts of text but little or no investment in knowledge engineering

I you can go home after this talk and try out the software I'll mention, while building an ontology would take you years

I text is cheap nowadays

(37)

vector space models of lexical meaning

I in a word vector space, the meaning of a word is represented as a vector

[figure: a two-dimensional word space with clusters of related words: pizza, sushi, falafel, spaghetti; rock, techno, soul, funk, jazz, punk; router, touchpad, laptop]

(38)

distances/similarities in a word space

I in a word space, similarity of words corresponds to geometry

I being near each other in the space

I . . . or pointing in a similar direction

I pizza is kind of like sushi, but not so much like touchpad

I on the other hand, it seems that we have lost the knowledge structure: we don't know how pizza and sushi are similar

(39)

how could this work? the distributional hypothesis

I you shall know a word by the company it keeps

[Firth, 1957]

I two words probably have a similar meaning if they tend to appear in similar contexts

I the distributional hypothesis: the distribution of contexts in which a word appears is a good proxy of the meaning of that word

I this is the weak DH: we use the distribution as a practical representation of meaning because we think it can be useful

I the strong DH, on the other hand, claims that the distribution is not just a practical proxy but an account of what word meaning actually is

(40)

so what are contexts?

I two words probably mean about the same thing if they

I . . . appear in the same documents?

I . . . tend to have the same words around them?

I . . . are illustrated with similar images?

(41)

example: most frequent verbs near cake and pizza

I what are the activities we do with cakes and pizzas?

I cake: eat, bake, throw, cut, buy, get, decorate, garnish, make, serve, order

I pizza: eat, bake, order, munch, buy, serve, garnish, name, get, make, heat

(42)

example: tårta and pizza in Swedish text

(43)

the context is the document

D1: The stock market crashed in Tokyo.

D2: Cook the lentils gently in butter.

            D1   D2
stock        1    0
market       1    0
crashed      1    0
Tokyo        1    0
cook         0    1
lentils      0    1
gently       0    1
butter       0    1
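As a small sketch (not from the slides), a word-by-document count matrix like the one above can be built with scikit-learn's CountVectorizer; note that the default tokenizer lowercases and also keeps words like "the" and "in", which the slide's table leaves out.

from sklearn.feature_extraction.text import CountVectorizer

docs = ["The stock market crashed in Tokyo.",
        "Cook the lentils gently in butter."]
vec = CountVectorizer()
counts = vec.fit_transform(docs)              # documents x words
word_by_doc = counts.T.toarray()              # words x documents, as in the table
for word, row in zip(vec.get_feature_names(), word_by_doc):   # get_feature_names_out in newer scikit-learn
    print(word, row)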

(44)

using a context based on syntax

         has subject I   has object book   has object banana
read           1                1                   0
wrote          1                1                   0

I for instance

[Padó and Lapata, 2007, Levy and Goldberg, 2014a]

(45)

using a multimodal context

The black cat stared at me.

          black   stared   banana   (image contexts)
cat         1       1        0        1     1     0

(46)

similarity of meaning vectors

I given two word meaning vectors, we can apply a similarity or distance function

I Euclidean distance: multidimensional equivalent of our intuitive notion of distance

I it is 0 if the vectors are identical

I cosine similarity: multiply the coordinates, divide by the lengths

I it is 1 if the vectors are identical, 0 if completely different

I . . . and a whole bunch of other similarities and distances: see J&M or one of the survey articles
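Both measures are easy to write down in NumPy; as a small sketch (my own, not from the slides), the vectors below are the illustrative tomato/carrot vectors from the earlier slide.

import numpy

def euclidean_distance(v, w):
    return numpy.sqrt(numpy.sum((v - w) ** 2))

def cosine_similarity(v, w):
    return v.dot(w) / (numpy.linalg.norm(v) * numpy.linalg.norm(w))

tomato = numpy.array([0.10, -0.20, 0.45, 1.2, -0.92, 0.71, 0.05])
carrot = numpy.array([0.08, -0.21, 0.38, 1.3, -0.91, 0.82, 0.09])
print(euclidean_distance(tomato, carrot))   # close to 0 for similar vectors
print(cosine_similarity(tomato, carrot))    # close to 1 for similar vectors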

(47)

example: the nearest neighbors of Berlin

I simple vector space just counting the words before and after

I computed using a subset of Europarl

I the nearest neighbors according to the cosine similarity:

0.933 Cairo
0.925 Johannesburg
0.925 Stockholm
0.924 Madrid
0.922 Bonn
0.919 Abuja
0.918 Thessaloniki
0.917 Helsinki

(48)

example: different types of context (cosine similarity)

Reinfeldt: 1              Reinfeldt: 1
Skavlan: 0.897105         Bildt: 0.973508
Ludl: 0.878873            Sahlin: 0.960694
Lindson: 0.874008         rödgröna: 0.960072
Gertten: 0.871961         Reinfeldts: 0.958742
Stillman: 0.871375        Juholt: 0.958644
Adolph: 0.86191           uttalade: 0.956048
Ritterberg: 0.852531      rådet: 0.954971
Böök: 0.848459            statsministern: 0.952898
Kessiakoff: 0.834909      politiker: 0.952712
Strage: 0.82995           Odell: 0.952376
Rinaldo: 0.825585         Schyman: 0.952065

(49)

weighting the context counts

I the useful signal may be drowned out by general function words

I we can use an association measure (e.g. the PMI) to amplify the relative importance of the useful contexts

raw counts:

             city   fish   borsch   and   the
Gothenburg    26     14       0     450   675
Moscow        31      2      12     389   712

weighted (e.g. with PMI):

             city   fish   borsch   and    the
Gothenburg   8.5    9.8      0      0.01   0.005
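A minimal sketch of PPMI weighting applied to the raw counts above (my own illustration; the resulting numbers will not reproduce the illustrative weighted values on the slide):

import numpy

# rows: Gothenburg, Moscow; columns: city, fish, borsch, and, the
counts = numpy.array([[26.0, 14.0,  0.0, 450.0, 675.0],
                      [31.0,  2.0, 12.0, 389.0, 712.0]])

total = counts.sum()
p_wc = counts / total
p_w = counts.sum(axis=1, keepdims=True) / total
p_c = counts.sum(axis=0, keepdims=True) / total

with numpy.errstate(divide='ignore'):
    pmi = numpy.log2(p_wc / (p_w * p_c))
ppmi = numpy.maximum(pmi, 0)     # positive PMI: zero counts and negative PMI become 0
print(ppmi)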

(50)

dealing with high-dimensional vectors

I typically, there are many possible contexts, so the distributional vectors will have a high dimensionality

I so they are costly to process in terms of time and memory

I dimensionality reduction: an operation that transforms a high-dimensional matrix into a lower-dimensional one

I for instance: 1 million → 100

I the idea of the dimensionality reduction is to find a low-dimensional representation that preserves as much as possible of the information in the original matrix

(51)

examples of dimensionality reduction methods

I the popular Latent Semantic Analysis (LSA)

[Landauer and Dumais, 1997] uses a matrix operation called singular value decomposition

I random projection [Kanerva et al., 2000] is a very efficient alternative to advanced matrix operations
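As a minimal LSA-style sketch (my own illustration), scikit-learn's TruncatedSVD performs a singular value decomposition truncated to a chosen number of dimensions; the random count matrix below is just a stand-in for a real word-context matrix.

import numpy
from sklearn.decomposition import TruncatedSVD

rng = numpy.random.RandomState(0)
counts = rng.poisson(0.1, size=(1000, 5000))     # 1,000 "words" x 5,000 "contexts" (made up)

svd = TruncatedSVD(n_components=100)
reduced = svd.fit_transform(counts)              # 1,000 words x 100 dimensions
print(reduced.shape)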

(52)

vector-space models derived from machine learning

I so far, we have built the distributional models by counting the contexts: context-counting methods

I counting + weighting + dimensionality reduction

I another, more recent family of vector-space models is derived from machine learning: context-predicting methods

I we train a classifier or language model that learns to recognize typical patterns that we see in a corpus

I the vectors come out as a by-product of the training

I in this world, the vectors are often referred to as distributed representations or embeddings, for historical reasons

(53)

what is best? counting or predicting?

I context-predicting methods have generated a bit of hype in the last couple of years

I there has been some debate about the pros and cons of the two types of methods:

I the terms context-counting and context-predicting come from a recent paper by [Baroni et al., 2014], which argued strongly in favor of the latter

I a more skeptical viewpoint was taken by [Levy and Goldberg, 2014b]

I the most common view today is probably that the two types of methods both work fairly well: see this post by Magnus

(54)

example: skip-gram with negative sampling (word2vec)

I the skip-gram with negative sampling (SGNS) model [Mikolov et al., 2013a]:

I we have one set of vectors for the target words, and another for the contexts

I e.g. one target word vector for pizza, and a context vector for is the object of eat

I for each word–context pair, generate some random negative examples of contexts

I e.g. pizza + is the object of persuade

I SGNS trains a classifier similar to logistic regression so that

I VT(pizza) · VC(object of eat) is high

I VT(pizza) · VC(object of persuade) is low

(55)

SGNS: pseudocode

for each word token w in the corpus
    find the context C of that occurrence
    for each part c in the context C
        update the context vector Vc for w
        update the word vector Vw for c
        for each random negative example context n
            update the context vector Vn for w
            update the word vector Vw for n

(56)

some software (a small sample)

I word2vec: the software by Mikolov when he was at Google

I implements the SGNS model (and a few others)

I includes a word space built by Google using a huge collection of news text

I gensim: a nice Python library by Řehůřek

I includes a reimplementation of SGNS but also several other useful algorithms, such as LSA and LDA
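A minimal gensim sketch of SGNS training (my own toy example, not from the slides; the parameter names assume an older gensim, where the vector size argument was called size rather than vector_size):

from gensim.models import Word2Vec

# toy corpus; in practice you would use millions of tokenized sentences
sentences = [["we", "ate", "pizza", "for", "dinner"],
             ["we", "ate", "sushi", "for", "lunch"],
             ["the", "laptop", "has", "a", "touchpad"]] * 100

# sg=1 selects the skip-gram model, negative=5 turns on negative sampling (SGNS)
model = Word2Vec(sentences, size=50, window=2, sg=1, negative=5, min_count=1)
print(model.wv.most_similar("pizza"))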

(57)

word analogies and relational similarity

I relational similarity: how similar is the relation between cat:tail to that between car:tyre?

I word analogy (Google test set): Moscow is to Russia as Copenhagen is to X?

I in some vector space models, we can get a reasonably good answer by a simple vector operation:

V(X) = V(Copenhagen) + (V(Russia) − V(Moscow))

I then find the word whose vector is closest to V(X)

I see [Mikolov et al., 2013b] and [Levy and Goldberg, 2014b]
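With a gensim word vector model (for instance the pretrained Google News vectors mentioned on the word2vec slide), this vector arithmetic is a single call; the sketch below is my own and assumes the three words are in the model's vocabulary.

# assumes `model` is a trained gensim word2vec model containing these words
print(model.wv.most_similar(positive=["Copenhagen", "Russia"],
                            negative=["Moscow"], topn=3))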

(58)

gender in the word space (example by Mikolov)

(59)

example: countries and cities

[figure: two-dimensional projection of word vectors for capitals and countries: Berlin, Tyskland, Stockholm, Sverige, Paris, Frankrike, Moskva, Ryssland, Köpenhamn, Danmark, Oslo, Norge, Tallinn, Estland, Rom, Italien]

(60)

distributional models as simple semi-supervised learning

I distributional models often give an improvement when added as features to standard machine learning-based NLP systems

I this can be seen as a simple approach to semi-supervised learning:

I we have a small hand-labeled corpus (for instance, for training a parser or named entity recognizer)

I we have a large unlabeled corpus: how can we do something useful with it?

I the classical paper is by [Turian et al., 2010]; several more recent ones [Guo et al., 2014]

(61)

intuition (for instance, named entity recognition)

I instead of

Gothenburg → [0, 0, . . . , 0, 1, 0, . . .]

Hargeisa → [0, 0, . . . , 0, 0, 1, . . .]

I . . . we can have

Gothenburg → [0.010, −0.20, . . . , 0.15, 0.07, −0.23, . . .]

Hargeisa → [0.015, −0.05, . . . , 0.11, 0.14, −0.12, . . .]

(62)

adding the distributional information

I straightforward approach [Turian et al., 2010]:

I [Guo et al., 2014] and [Bansal et al., 2014] report that it's better if we cluster the vectors before adding them as features

I example: word2vec vectors computed from a Spanish corpus clustered by k-means into 1000 clusters

282: juan, lázaro, joaquín, jaime, hugo, gastón, ...

441: mosca, mono, perro, pájaro, mamut, mandril, ...

844: estocolmo, gotenburgo, gotemburgo, dinamarca, ...

mostly trial and error: what is best depends on the problem
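A sketch of the clustering step (my own illustration): cluster the word vectors with k-means and use each word's cluster identity as a feature. The slide's example used 1000 clusters; a small number is used here, and `model` is assumed to be a trained gensim word2vec model as in the earlier sketch.

import numpy
from sklearn.cluster import KMeans

# assumes `model` is a trained gensim word2vec model (model.wv.key_to_index in gensim 4)
words = list(model.wv.vocab)
vectors = numpy.array([model.wv[word] for word in words])

kmeans = KMeans(n_clusters=10, random_state=0).fit(vectors)
word_cluster = dict(zip(words, kmeans.labels_))   # cluster id to use as a feature
print(word_cluster)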

(63)

next lecture: predicting sequences and trees

United Nations official Ekeus heads for Baghdad .

B-ORG I-ORG O B-PER O O B-LOC O

A C A T G G T C T G A A

N N C C C C C C C C C N

(64)

references I

I Bansal, M., Gimpel, K., and Livescu, K. (2014). Tailoring continuous word representations for dependency parsing. In ACL.

I Baroni, M., Dinu, G., and Kruszewski, G. (2014). Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In ACL.

I Firth, J. (1957). Papers in Linguistics 1934–1951. OUP.

I Guo, J., Che, W., Wang, H., and Liu, T. (2014). Revisiting embedding features for simple semi-supervised learning. In EMNLP.

I Kanerva, P., Kristoffersson, J., and Holst, A. (2000). Random indexing of text samples for latent semantic analysis. In Proceedings of the 22nd Annual Conference of the Cognitive Science Society.

I Landauer, T. K. and Dumais, S. T. (1997). A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction and representation of knowledge. Psychological Review, 104(2).

(65)

references II

I Lenci, A. (2008). Distributional semantics in linguistic and cognitive research. Rivista di Linguistica, 20(1).

I Levy, O. and Goldberg, Y. (2014a). Dependency-based word embeddings. In ACL.

I Levy, O. and Goldberg, Y. (2014b). Linguistic regularities in sparse and explicit word representations. In Proceedings of CoNLL.

I Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. (2013a). Distributed representations of words and phrases and their compositionality. In NIPS.

I Mikolov, T., Yih, W.-t., and Zweig, G. (2013b). Linguistic regularities in continuous space word representations. In Proc. NAACL.

I Padó, S. and Lapata, M. (2007). Dependency-based construction of semantic space models. Computational Linguistics, 33(2).

(66)

references III

I Turian, J., Ratinov, L.-A., and Bengio, Y. (2010). Word representations: A simple and general method for semi-supervised learning. In Proc. ACL.
