
Using a Character-Based Language Model for Caption Generation


Linköpings universitet SE–581 83 Linköping

Linköping University | Department of Computer and Information Science

Master thesis, 30 ECTS | Datateknik

2019 | LIU-IDA/LITH-EX-A--19/095--SE

Using a Character-Based Language Model for Caption Generation

Användning av teckenbaserad språkmodell för generering av bildtext (Use of a character-based language model for generating image captions)

Simon Keisala

Supervisor: Rita Kovordanyi
Examiner: Marco Kuhlmann



Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Abstract

Using AI to automatically describe images is a challenging task. The aim of this study has been to compare the use of character-based language models with one of the current state-of-the-art token-based language models, im2txt, to generate image captions, with focus on morphological correctness.

Previous work has shown that character-based language models are able to outperform token-based language models in morphologically rich languages. Other studies show that simple multi-layered LSTM-blocks are able to learn to replicate the syntax of their training data.

To study the usability of character-based language models an alternative model based on TensorFlow im2txt has been created. The model changes the token-generation architecture into handling character-sized tokens instead of word-sized tokens.

The results suggest that a character-based language model could outperform the current token-based language models, although due to time and computing power constraints this study fails to draw a clear conclusion.

A problem with one of the methods, subsampling, is discussed. When using the original method on character-sized tokens this method removes characters (including special characters) instead of full words. To solve this issue, a two-phase approach is suggested, where training data first is separated into word-sized tokens on which subsampling is performed. The remaining tokens are then separated into character-sized tokens.

Future work where the modified subsampling and fine-tuning of the hyperparameters are performed is suggested to gain a clearer conclusion of the performance of character-based language models.


Acknowledgments

I would like to thank my supervisor, Rita Kovordanyi, for all the guidance throughout the thesis work; without her help the study would never have been completed. I also wish to give my gratitude to my examiner, Marco Kuhlmann, who inspired me to perform this study.


Acronyms

API – Application Programming Interface
CBOW – Continuous-Bag-of-Words
CNN – convolutional neural network
COCO – Common Objects in Context
GloVe – Global Vectors
LCS – Longest Common Subsequence
LSA – Latent Semantic Analysis
LSTM – long-short term memory
NLP – Natural Language Processing
NN – neural network
OOV – out-of-vocabulary
PMI – Pointwise Mutual Information
POS – Part-of-Speech
RNN – recurrent neural network
SOTA – state-of-the-art
SVD – Singular Value Decomposition
TF-IDF – Term Frequency Inverse Document Frequency


Contents

Abstract
Acknowledgments
Acronyms
Contents
List of Figures
List of Tables

1 Introduction
1.1 Motivation
1.2 Aim
1.3 Research Questions
1.4 Delimitations

2 Theory
2.1 Natural Language Processing
2.2 Image Captioning: From Image to Text
2.3 Artificial Neural Networks
2.4 2D Convolutional Neural Network
2.5 Recurrent Neural Networks
2.6 Word Representation
2.7 Caption Evaluation Metric Systems

3 Method
3.1 Architectures
3.2 Evaluation Methodology
3.3 Hardware

4 Results
4.1 General
4.2 Sample Images

5 Discussion
5.1 Results
5.2 Method
5.3 The Work in a Wider Context

6 Conclusion


List of Figures

2.1 Common activation functions
2.2 Kernel matrix with stride set to 1
2.3 Kernel matrix with stride set to 2
2.4 Pooling of data
2.5 Input padding
2.6 Inception-v3 illustration
2.7 Visualization of an LSTM-Block
2.8 LSTM-Block with peepholes
3.1 General architecture of the models
3.2 Token generation structure
3.3 MultiRNNCell with two LSTM-Blocks, unfolded two iterations
4.1 High-scoring word100k
4.2 High-scoring char100k
4.3 High-scoring char200k
4.4 Low-scoring word100k
4.5 Low-scoring char100k


List of Tables

2.1 Skip-Gram word pairs
3.1 Model hyperparameters
3.2 Hardware specification
4.1 Automated scores
4.2 Model dictionary and sentences
4.3 Trainable weights of the two models
4.4 Top 1000 shared image captions


1 Introduction

Automated text generation is a challenging task to perform. The aim of this study is to compare two different models' performance at generating descriptive text for images. The quality of the descriptive texts will be measured by automated metrics. Another comparison is made where the accuracy of the lexical selection of words is measured between the two models. One of the models uses tokens describing individual words, while the other uses character-sized tokens to spell words and to create captions describing images.

Current state-of-the-art architectures use word embedding, where each word gets embedded into a fixed-length array of numbers and each word is represented as a unique number array. Since each word has its own unique number array, word embedding models have to re-learn lexical rules of identical prefixes for each occurrence of the rule. By using other embedding methods where the full words are not embedded it is possible to combine smaller morphological blocks when generating words.

1.1 Motivation

Natural Language Processing (NLP) is highly complex, and for a machine to use correct grammar and to generate understandable sentences it is important for the model to learn both an in-depth semantic structure and, in some cases, multiple meanings of identical words.

NLP is a branch within computer science focusing on interpreting and understanding human language. Research within NLP helps in analysing text, speech and languages. Apple, Amazon and Google have developed voice assistants, named Siri, Alexa and Google Assistant, which allow the user to ask questions, initialize phone calls, and take notes and appointments, all by speech. These tools must have an understanding of a language and recognize speech in order to translate speech into words and sentences into commands or instructions [16]. Hatsune Miku, a Japanese Vocaloid pop artist, uses the same text-to-speech technique as a voice assistant for singing [12, 28]. NLP has also contributed to improving tools distinguishing between spam mails and non-spam mails [35].

One of the current state-of-the-art (SOTA) methods used to create dense vector representations of words¹, word2vec, utilizes the Skip-Gram approach proposed by Mikolov et al.

1A dense vector for representing words uses fewer values in its vector than the words in its dictionary (known


[23, 25]. Other successful alternatives are Global Vectors (GloVe) and Latent Semantic Analysis (LSA), which in some cases surpass word2vec in accuracy [25]. A model trained using word2vec, GloVe or LSA can be used as a lookup dictionary for words. Each word in the dictionary gets represented as a dense vector. Compressing words into a dense vector both helps to reduce the input dimension to a model and allows a non-uniform distance between word vectors.

The alternative method to dense vector representation is one-hot vector representation. The one-hot vector representation represents each word as a vector of the same length as the dictionary it uses. One of the values is set to one (1), and the others are set to zero (0). This method scales poorly as the size of each word vector increases linearly with the number of words in the dictionary, and each word is of equal distance to any other word.

A common problem for all word-sized token-based models is that the trained model becomes limited to only the set of words it is trained with. Any word outside of the dictionary can neither be encoded to nor decoded from its dense vector representation. In some cases the problem is that a morphological version of the word has not yet been seen; another case of encountering a word not in the dictionary is when a word is misspelled. This problem is referred to as becoming out-of-vocabulary (OOV).

Word-based language models (word-sized token-based language models) represent each morphological variation with its own dense vector. The words “cat” and “cats”, for example, are treated as their own unique vectors. The word-based models learn the relationships between words by training on large corpora. Rare variations are still problematic for word-based language models though, and could be handled better using character-based models.

Both the problem of the model becoming OOV and the fact that semantically similar – but morphologically different – words are treated as different dense vectors are limitations when using a word-based representation of words, which is also why studies using character-based language models continue. Since character-based models such as Kim et al.'s CharCNN [19] and Intuition Engineering's very recent chars2vec [2] do not learn words as whole chunks, it is possible for these models to learn patterns of blocks within words.

1.2 Aim

The current SOTA methods involve embedding words into a multi-dimensional vector using Skip-Gram (earlier using Continuous-Bag-of-Words (CBOW)) or methods performing global document analysis (e.g. GloVe or Singular Value Decomposition (SVD)). These methods limit the model's ability to group morphological variations of words. A trained character-based language model could in theory generate varieties of words never encountered before, as long as each morpheme has been fully learned and understood by it.

The goal of this thesis work is to test whether a character-based language model (character-sized token-based language model) would maintain a similar performance in text generation tasks as a word-based model, and to see whether character-based models exceed word-based models in using correct morphemes. By using a character-based language model the words are not represented as distinct entries in a table; instead the model would be able to learn words in smaller building blocks, morphemes, which include prefixes, root words and suffixes.

1.3 Research Questions

Due to the known weaknesses associated with using a word-sized token-based language model – both OOV² and duplicate representation from morphological differences – we plan

2out-of-vocabulary (OOV) will not be the focus of this work, although it is one additional property of


to perform research on whether a character-based language model could improve the text quality. More specifically, an analysis of morphological correctness will be performed.

Previous work using a 1-dimensional convolutional neural network (CNN) has been made where each input to the 1-dimensional network corresponds to one character in a word. Each word is furthermore zero-padded to always have the same length as the longest word in the current dictionary [39, 19]. The work shows that character-based language models can outperform token-based language models on morphologically rich languages, e.g. Arabic, Czech, French, German, Spanish and Russian. Andrej Karpathy uses a long-short term memory (LSTM) architecture in his experiments [18]. These experiments show that an LSTM-based architecture is able to reproduce the syntax of different texts, including essays, poems and programming language syntax, although the results lack any underlying meaning.

Two questions are asked to compare both general performance and morphological correctness of word choices:

1. How well do character-based language models perform compared to the current well-performing token-based models?

This question will be answered by comparing the score of a character-based language model and a token-based language model using the Microsoft Common Objects in Context (COCO) Caption Evaluation software, where the general score from different metrics is compared, along with a metric identifying when a correct stem word is used even though the exact word is incorrect. Both the token-based and the character-based model are trained using images from the MSCOCO 2014 captioning data set, which consists of images with hand-written descriptions.

2. Will a character-based language model make fewer morphological mistakes when creating captions compared to a token-based model?

A key property of character-based language models is their flexibility in learning and reusing morphemes when generating words. To answer this question we want to know whether a character-based model actually can make use of this flexibility and learn the meaning of various prefixes, suffixes and bases, or if the architecture learns each word as a whole, i.e. whether the architecture gains an understanding of the morphological rules. The work will involve investigating the performance of the current token-based language models considering their morphological mistakes, and comparing them with an alternative character-based language model by generating captions for images.

1.4 Delimitations

This research is limited to using encoder-decoder architectures performing full image captioning. Two different approaches will be used:

1. the TensorFlow im2txt model used for caption generation, and

2. a character-based language model based on Andrej Karpathy's LSTM experiments³

The comparison between the models is made using the Microsoft COCO Caption Evaluation, and the results will not be compared against other pre-trained language models. This is to reduce the influence of computational differences and allow the models to have the same basis. Image features are extracted using a pre-trained image recognition model based on inception-v3⁴.

The models will only be trained on English sentences. The COCO dataset includes five annotations in English for each sample in the training data and validation data.

3http://karpathy.github.io/2015/05/21/rnn-effectiveness/


2 Theory

This chapter provides some background information about architectures and concepts necessary for the reader. Firstly, a small introduction to NLP is presented, mentioning some earlier established concepts. Secondly, a short description of the process a computer performs to translate images into sentences is provided. This is followed by a more in-depth analysis of the different types of neural network architectures used in the image captioning architectures, starting with the basics of how neural networks work in general and continuing with the two specialized neural network architectures used in this work. Lastly, a few different structures for how words can be represented so that a computer can understand them are given, followed by a description of how the metrics used to automatically score sentences work.

2.1 Natural Language Processing

Natural Language Processing is a research field focusing on computerized analysis of text and recognition of speech; it allows machines to automatically translate between different languages and to react to spoken commands.

The first research within NLP began in the late 1940s, when some core concepts and ideas were established. In the early 1950s the focus was on performing automatic translation from Russian to English [21, 17], and in 1954 a first demonstration was made. The earlier versions of machine translation used hard-coded rules, the translation was performed by looking up words from one language in another, and the focus remained on language syntax and semantics.

During this early period of NLP the researchers began to develop strategies and established a baseline from which the development could continue. The majority of the researchers during this period had a background in linguistics and language studies, and there were huge limitations on computers, both in performance and memory capacity and in programming languages and programming methodology; the machines of this time were programmed with punched cards in an assembler-like programming language.

Around 2010 a shift in methodology occurred within NLP research. Models changed from using linear rules¹ into non-linear neural network (NN) models [10].

¹ A linear rule is a rule which activates when a threshold is met. Example: if (value < thr


2.1.1 Levels of Natural Language Processing

Since NLP is the computerized method of processing human language (natural language), the procedure itself tries to correlate to the methods we humans use. Seven different “levels of language” have been defined, one considering only oral language and the other six regarding the understanding of a language [21]. The greater understanding a system has of these levels, the more human-like its understanding is.

For written text, the different levels are:

• Morphology

This level handles the smallest units of a word: morphemes. Morphemes are the building blocks which make up each word: prefix, root and suffix.

• Lexicology

The lexical phase of NLP combines morphemes to create words. This step embeds a word with prefixes and suffixes to use the correct tense, possessive form, to change a word’s Part-of-Speech (POS) and more.

• Syntax

In this level the words are combined into sentences with an underlying meaning. Sentences with the exact same words within them will, depending on the order, result in different outcomes, sometimes grammatically correct while other times resulting in grammatically invalid sentences. The two sentences “Tom watched my bird” and “My bird watched Tom” both contain the same set of words. The order (syntax), however, differs.

• Semantics

The semantic level handles words with multiple meanings. By taking into account multiple words within a sentence, and getting a broader context, it is possible to correctly map these words. An example covering semantic understanding is the word hit, which has multiple meanings depending on the context:

“The song “Bohemian Rhapsody” by Queen is one of the greatest hits of all time.”
“Earlier today I hit my head”
“He managed to hit the ball without any problem”

are three examples with different meanings of the word “hit”.

• Discourse

This level of natural language handles multiple sentences. It is necessary to link information existing between sentences. A good example by Hal Daumé [5] is the following two sentences:

1. I only like traveling to Europe. So I submitted a paper to ACL.

2. I only like traveling to Europe. Nevertheless, I submitted a paper to ACL.

From his example, where only a single word differs between the two sentences, it is possible to extract additional information not directly written in the sentence: clues about the location where ACL will be held. The first sentence in his example indicates that ACL is held inside of Europe. The second sentence, where only the word so is replaced by nevertheless, changes the interpretation into assuming that ACL is held outside of Europe.

• Pragmatics

The pragmatic level uses world based knowledge to accurately parse sentences. In NLP, the pragmatic level is still a challenge since this level of NLP relies on information outside of the sentence itself. Following are two examples where knowledge outside of the sentence is required to make the right assumption:


1. “The teachers denied the students access because they were never informed in advance.”

2. “The teachers denied the students access because they lack experience.”

In the first example “they” most likely refers to the teachers, who were never informed in advance about accepting a group of students entering, while in the second example “they” refers to the students, who lack the experience to be inside the area.

On top of these six levels for written language, another level exists for spoken language: phonology, where oral words are translated into their written counterpart.

Each level of NLP becomes more difficult than the previous one, since the amount of external knowledge increases for each level.

2.2 Image Captioning: From Image to Text

There exist several methods to translate an image into text. Hossain et al. did a thorough study of the methods in 2018 [15]. Their study discusses different deep learning based image captioning algorithms. Deep learning is a machine learning concept where the core idea is to hierarchically extract abstract features [31]. In the case of neural networks a deep neural network is defined as a neural network with two or more hidden layers.

The method covered in this study involves using an Encoder-Decoder architecture. The Encoder-Decoder architecture consists of two parts: a convolutional neural network acting as the encoder and a recurrent/LSTM network as the decoder². The goal of the encoder is to translate information into a feature space. The decoder then uses the encoder's features to decode them into another medium (or the same medium – which can be used to compress data).

Vinyals et al. use an Encoder-Decoder architecture where an encoder encodes an image into a 2048-dimensional feature vector [37]. The vector is then stitched onto a decoder using a compressor, which is a fully connected neural network layer with 2048 inputs and N outputs, where N is determined by the decoder's dimensions. The decoder used by Vinyals et al. is a recurrent architecture, where the features from the encoder are processed over multiple iterations. This results in an architecture taking in a single data sample – an image – and outputting multiple samples – a sequence of words.
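To make this encoder-decoder wiring concrete, the following minimal sketch (not the actual im2txt code; it assumes TensorFlow 2.x with Keras layers and uses arbitrary example sizes) projects a 2048-dimensional image feature vector into the decoder's embedding space and feeds it to an LSTM decoder as the first input, before the caption tokens.

# Minimal encoder-decoder caption sketch (illustrative only, not the im2txt implementation).
# Assumes TensorFlow 2.x; vocabulary size and layer widths are arbitrary examples.
import tensorflow as tf

VOCAB_SIZE = 10000   # hypothetical token vocabulary
EMBED_DIM = 512      # decoder embedding size (the "N" chosen for the decoder)
FEATURE_DIM = 2048   # size of the encoder's image feature vector

# "Compressor": fully connected layer mapping 2048 image features to N values.
image_projection = tf.keras.layers.Dense(EMBED_DIM, name="image_projection")
token_embedding = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)
decoder_lstm = tf.keras.layers.LSTM(EMBED_DIM, return_sequences=True, return_state=True)
output_logits = tf.keras.layers.Dense(VOCAB_SIZE)

def decode_step(image_features, token_ids):
    """One teacher-forced pass: image features prime the LSTM, caption tokens follow."""
    # Prime the LSTM state by feeding the projected image as the first "word".
    primer = image_projection(image_features)[:, tf.newaxis, :]   # (batch, 1, EMBED_DIM)
    _, h, c = decoder_lstm(primer)
    # Feed the caption tokens and predict the next token at each step.
    embedded = token_embedding(token_ids)                          # (batch, T, EMBED_DIM)
    outputs, _, _ = decoder_lstm(embedded, initial_state=[h, c])
    return output_logits(outputs)                                  # (batch, T, VOCAB_SIZE)

# Example with random data: a batch of 2 images and 5-token captions.
logits = decode_step(tf.random.normal([2, FEATURE_DIM]),
                     tf.random.uniform([2, 5], maxval=VOCAB_SIZE, dtype=tf.int32))
print(logits.shape)  # (2, 5, 10000)

The projection layer plays the role of the "compressor" described above; at inference time the same decoder would instead be fed its own previously generated token at each step.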

In the following sections the neural network architectures will be described in detail, starting with the original neural network followed by different architectures improving the performance on image analysis and time-dependent information.

2.3 Artificial Neural Networks

The neural network³ architecture is inspired by the biological brain. Many of the core elements a brain consists of are also visible in the artificial representation. The neural network consists of neurons (nodes) connected into a network, where each neuron receives data (input) from other neurons (or itself), and is itself an input to other neurons in the network.

The most simplistic neural network is a layer-based network, where neurons are separated into different layers and each neuron's input comes from the previous layer in the network (also called a dense feed-forward neural network). The weights of the neurons in a dense layer can be defined as a matrix, where each column in the matrix corresponds to one neuron and each row corresponds to the weights for a given input.

f(\vec{x}, W, \vec{b}) = \vec{x} \cdot W + \vec{b} \quad (2.1)

² CNN and LSTM architectures are typical for an image captioning Encoder-Decoder architecture, although an Encoder-Decoder does not have to use these specific architectures.


\vec{x} \in \mathbb{R}^{d_{in}}, \quad W \in \mathbb{R}^{d_{in} \times d_{out}}, \quad \vec{b} \in \mathbb{R}^{d_{out}}

Formula 2.1 describes a layer in a dense neural network. \vec{x} and \vec{b} are vectors, where \vec{x} is the input and \vec{b} is a bias vector – offsetting the result – and W is a matrix weighting the inputs for each node. d_in is the input dimension and d_out the resulting output dimension.

The resulting vector from one layer becomes the input \vec{x} for the next layer in the network. Between each layer it is common to have an activation function. The activation function translates each value in the output vector. The primary goal of an activation function is to do a non-linear translation of the data, although in some cases a linear function is still used. Some commonly used activation functions are: Linear, ReLU (Rectified Linear Unit), Sigmoid and Gaussian. Figure 2.1 plots the different functions. [6, Chapter 2]

[Figure 2.1 plots: Linear f(x) = x, ReLU f(x) = max(x, 0), Sigmoid f(x) = 1/(1 + e^{-\lambda x}), Gaussian f(x) = e^{-\lambda x^2}]

Figure 2.1: Activation functions commonly used with neural networks. To the left: Blue - ReLU activation function, Red - Linear activation function. To the right: Blue - Gaussian activation function (λ=2), Red - Sigmoid activation function (λ=2).
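As a concrete illustration of formula 2.1 and the activation functions of figure 2.1, the following NumPy sketch (illustrative only; the dimensions are arbitrary examples) implements one dense layer and the four activations.

# Dense layer f(x) = x·W + b and the activation functions from figure 2.1 (NumPy sketch).
import numpy as np

def dense_layer(x, W, b):
    # x: (d_in,), W: (d_in, d_out), b: (d_out,)  ->  (d_out,)
    return x @ W + b

linear   = lambda x: x
relu     = lambda x: np.maximum(x, 0)
sigmoid  = lambda x, lam=2.0: 1.0 / (1.0 + np.exp(-lam * x))
gaussian = lambda x, lam=2.0: np.exp(-lam * x**2)

rng = np.random.default_rng(0)
x = rng.normal(size=3)            # input vector, d_in = 3
W = rng.normal(size=(3, 4))       # weight matrix for 4 neurons
b = np.zeros(4)                   # bias vector
z = dense_layer(x, W, b)
print(relu(z))                    # layer output after the ReLU activation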

Different neural network architectures have since then been designed to improve and speed up the performance on structured data. Some of them are:

• Convolutional neural network (CNN)

The input is structured as an N-dimensional matrix⁴ and weights are shared. This architecture is useful when you want to find the same features throughout the entire input data.

• Recurrent neural network (RNN)

Recurrent neural networks (RNNs) are networks which include their own output as input. These networks allow actions to be influenced by previous states/inputs and not only the current state/input. In the case of sequential events some kind of influence from older states is needed.

• Long-short term memory (LSTM)

LSTM architecture is a special recurrent architecture with gated storing, reading and forgetting data. LSTM allows the network to store data for a long period before using it, unlike RNNs which feed the data each iteration.

2.4 2D Convolutional Neural Network

Convolutional neural network (CNN) architectures are useful when data is arranged such that nearby values are more relevant than values far apart [38]. This architecture is constructed using kernel matrices and channels. A kernel feature is an n-dimensional matrix of weights, which is applied to each area of the input data.

4Examples of applications for different dimensions: 1D – Text/Speech, 2D – Image, 3D – Movie or volumetric


A red-green-blue image (coloured image) with a fixed amount of pixels (width and height) would be translated as input to a 2D convolutional neural layer as having 2 dimensions (width, w, and height, h) and 3 channels, c (one for each colour), giving a total of w × h × c values. One kernel matrix for this input would have x × y × 3 weights, where x and y are the kernel's width and height. This kernel matrix is then overlapped with each subsection of the input, resulting in a feature map with the dimensions

(w − x + 1) × (h − y + 1). Each CNN layer is defined by a set of N kernel matrices, all stacking their results on top of each other, giving a final output of (w − x + 1) × (h − y + 1) × N, which itself also has 2 dimensions and N channels.

F_{r,c} = \sum_{i=0}^{w} \sum_{j=0}^{h} K_{i,j} \cdot I_{i+r,\, j+c} \quad (2.2)

Each cell value in the feature map populated by the kernel matrices corresponds to the sum of an element-wise multiplication between the kernel matrix and the subset of the input matrix for the corresponding cell, as in formula 2.2, where F is the resulting feature map, K the kernel matrix and I the input matrix. r and c are the current row and column of the feature map, and w and h are the width and height of the kernel matrix. Figure 2.2 illustrates a kernel matrix with sample input and the resulting feature map it would create.

Input (4×4):              Kernel matrix (2×2):    Feature map (3×3):
-0.2  +0.3  -0.3  +0.5    +0.9  +0.5              -0.72  +1.01  +0.00
+0.6  -0.9  -0.1  -0.8    -1.0  +0.1              -0.30  +0.00  -0.08
+0.3  -0.9  -0.4  +0.1                            -1.18  -1.03  -0.18
+1.0  +0.0  -0.2  -0.7

Figure 2.2: Sample of a kernel matrix and its resulting feature map when applying a kernel matrix on a 4x4x1 input using kernel size: 2x2 and stride: 1.

The convolutional layer has, on top of this, a parameter called stride. The stride parameter defines how many steps the kernel moves across the input matrix between neighbouring cells of the feature map.

F_{r,c} = \sum_{i=0}^{w} \sum_{j=0}^{h} K_{i,j} \cdot I_{i+r \cdot s,\, j+c \cdot s} \quad (2.3)

With the extra stride parameter, the final formula can be seen in 2.3, where s is the stride parameter.

The same example as above, using stride 2 instead of 1, is seen in figure 2.3 (same 4×4 input and 2×2 kernel matrix as in figure 2.2, giving a 2×2 feature map).

Figure 2.3: Sample of a kernel matrix and its resulting feature map when applying a kernel matrix on a 4x4x1 input using kernel size: 2x2 and stride: 2.

A final method used in convolutional networks is pooling. Pooling is used on the resulting feature maps to reduce their size. Nearby pixels in an image closely correspond to a similar area; pooling methods either combine or select a feature from a region of cells into a single cell, with close to no data loss. Some advantages of pooling are:


• Reduction in size directly translates into improved computational performance.

• Positional data of features is removed. When pooling a region, a feature is triggered no matter where in the region it was found.

• Noisy input has its influence reduced in the result.

Two different pooling methods are mainly used: max-pooling and average-pooling. Max-pooling uses the maximum value of a given region while average-pooling averages the values of the region. The pooling method, whether max-pooling or average-pooling, uses two parameters: kernel size and stride. These two parameters have the same usage as for the kernel matrices.

Figure 2.4 shows the result from the two different pooling methods.

Input (4×4):                  Max-pooling (2×2):    Average-pooling (2×2):
+0.87  +0.32  +0.82  +0.29    +0.87  +0.82          +0.26  +0.36
+0.37  -0.54  +0.68  -0.36    +0.78  +1.00          -0.09  +0.30
+0.78  -0.40  +1.00  +0.79
-0.13  -0.62  -0.69  +0.09

Figure 2.4: Pooling of data, kernel size: 2x2 and stride: 2. To the left: the feature map, middle: max-pooling, right: average-pooling

In some cases it is required or beneficial to maintain the same dimensionality of both the input and the output. Depending on the kernel size and stride the dimensionality of the output is altered. One method to maintain the same width and height on the output is to add padding around the input to counteract the reduction from using a certain kernel size.

Figure 2.5 shows the changes to the input when adding a 1 or 2 layer thick zero-padding (zeros are added around the border of the same 4×4 input).

Figure 2.5: Padding of input data.

Saha provides a clear visualization of each component and of the benefits of convolutional neural networks [30].
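The naive NumPy sketch below (written for clarity, not speed) applies formula 2.3 and the two pooling methods to the example input used in figures 2.2-2.4, so the printed feature maps can be checked against the figures.

# Naive 2D convolution (formula 2.3) and pooling, applied to the figure 2.2 example.
import numpy as np

def conv2d(inp, kernel, stride=1):
    kh, kw = kernel.shape
    oh = (inp.shape[0] - kh) // stride + 1
    ow = (inp.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for r in range(oh):
        for c in range(ow):
            patch = inp[r*stride:r*stride+kh, c*stride:c*stride+kw]
            out[r, c] = np.sum(kernel * patch)   # element-wise product, then sum
    return out

def pool2d(inp, size=2, stride=2, op=np.max):
    oh = (inp.shape[0] - size) // stride + 1
    ow = (inp.shape[1] - size) // stride + 1
    return np.array([[op(inp[r*stride:r*stride+size, c*stride:c*stride+size])
                      for c in range(ow)] for r in range(oh)])

inp = np.array([[-0.2,  0.3, -0.3,  0.5],
                [ 0.6, -0.9, -0.1, -0.8],
                [ 0.3, -0.9, -0.4,  0.1],
                [ 1.0,  0.0, -0.2, -0.7]])
kernel = np.array([[ 0.9, 0.5],
                   [-1.0, 0.1]])

print(conv2d(inp, kernel, stride=1))                    # 3x3 feature map (figure 2.2)
print(conv2d(inp, kernel, stride=2))                    # 2x2 feature map (figure 2.3)
print(pool2d(inp, op=np.max), pool2d(inp, op=np.mean))  # max- and average-pooling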


2.4.1 Inception – A Convolutional Neural Network for Image Recognition

The inception model (inception-v3) consists of 42 layers in total. The network is fully convolutional, except for the final classification occurring at the end [33]. The convolutional layers in inception-v3 use optimized methods both to minimize the amount of weights and to reduce the size of each kernel matrix without any loss in performance.

A pre-trained model of inception-v3 for image recognition exists which can be used in other applications where image-analysis is done. The inception model was developed by Google, and is one of the most successful models for image recognition currently available. Figure 2.6 shows the structure of the entire inception network.

Figure 2.6: Illustration of inception-v3 network5

Each inception module in the network consists of different scaled kernel matrices. Each scale of an inception module is zero-padded to maintain the same width and height on both the input and output. This allows each module to be in a series after one another, and for each different scale to be concatenated.

One of the main problems with deep neural networks is the vanishing gradient problem, which occurs since the error gradient, i.e. the gradient between the actual output and the expected output, from the final layer diminishes the further it is back-propagated through the network [6, Chapter 11]. As a result, the very first layers in a deep neural network model do not receive much of the initial error gradient to correct themselves with, leaving the earlier layers almost unchanged.

The Auxiliary Classifier is added to enhance the training in the deeper layers. The initial idea was to “push useful gradients to the lower layers to make them immediately useful and improve the convergence during training by combating the vanishing gradient problem in very deep networks” [32]. Trial and error showed that the auxiliary classifiers do not change the model's training rate during early and mid training. However, the effect was noticeable at the end of the training, where models using auxiliary classifiers surpassed the ones that did not, gaining a slightly higher accuracy [33].

⁵ Adapted image of the inception-v3 network by SH Tsang, original available at: https://cloud.google.com/tpu/docs/inception-v4-advanced. Image licensed under CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/).


2.5 Recurrent Neural Networks

A recurrent neural network layer behaves similarly to the traditional (fully connected) artificial neural network layer. The core difference between the recurrent layer and a fully connected neural network layer is the input. On top of the normal input being fed to the layer, the output of the layer is fed back as an input, allowing an old output to influence future iterations.

This study will be using the extended recurrent structure called Long-Short Term Memory, which will be described more thoroughly in the following subsection.

2.5.1 Long-Short Term Memory Neural Network

An LSTM layer (LSTM-Block) has the tools necessary to internally store information for an arbitrary time and output it in a later iteration. The layer consists of gates controlling how an internal cell state stores, outputs and forgets information. The original version of the LSTM architecture was introduced in 1997 by Hochreiter and Schmidhuber [13]. It extended the traditional RNN layer by including a cell state and gates to both protect the state and protect the consecutive layers from unwanted and uncontrolled output. The internal cell state accumulated new information, both positive and negative, and in 1999 the "forget" gate was introduced by Gers et al. [9]. The forget gate solved a problem where cell states continued to grow uncontrolled. Although it was still possible for the LSTM-Block to reset its internal cells by feeding the cells with negated values, this process is difficult for the block to learn. Figure 2.7 illustrates how a cell in Gers et al.'s variation of the original LSTM was configured.

Figure 2.7: One LSTM-Block with all the inputs to it. X is the new input to the layer, C the internal cell states of the network and H is the layer's output. t − 1 is the previous time step and t the current time step.

Gers continued to improve the LSTM architecture, and in 2000 Gers and Schmidhuber introduced "peepholes" [8]. The peepholes allow the LSTM architecture to count and keep track of how many time steps have passed since a previous event. The peepholes added to the architecture can be seen in figure 2.8.


Figure 2.8: An LSTM-Block with peepholes added as extra data to the weights of the Forget, Input and Output gates. The red lines illustrate which step of the internal cell state that is included when gating the data.
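As a compact summary of the gating described above, the following NumPy sketch (randomly initialized weights, no peepholes; a simplification for illustration, not the exact formulation in [9] or [8]) performs a single step of an LSTM block with forget, input and output gates.

# One step of a basic LSTM block (forget/input/output gates, no peepholes) in NumPy.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # W, U, b hold the parameters of the three gates f, i, o and the block input g.
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])   # forget gate
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])   # input gate
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])   # output gate
    g = np.tanh(W["g"] @ x_t + U["g"] @ h_prev + b["g"])   # candidate cell input
    c_t = f * c_prev + i * g                               # update internal cell state
    h_t = o * np.tanh(c_t)                                 # gated output
    return h_t, c_t

rng = np.random.default_rng(0)
d_in, d_hid = 4, 3
W = {k: rng.normal(size=(d_hid, d_in)) for k in "fiog"}
U = {k: rng.normal(size=(d_hid, d_hid)) for k in "fiog"}
b = {k: np.zeros(d_hid) for k in "fiog"}
h, c = np.zeros(d_hid), np.zeros(d_hid)
h, c = lstm_step(rng.normal(size=d_in), h, c, W, U, b)
print(h, c)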

2.6 Word Representation

Machines can represent words and text in several ways. The most commonly used method to represent words is a token-based approach. Each word is in this case represented by a fixed-length vector. In some cases a token represents multiple words, such as “Eiffel Tower” and “New York” or “New York City”, since the group of these words yields a different meaning compared to the words individually. Tokens representing multiple words are defined as multi-tokens. The second method to represent a word is by its character sequence. The main downside of representing a word by its character sequence is that the length of the word vector is not static, but changes with the length of the word.

Token-based word representation can use two different methods: one-hot or dense representation [10, p. 89-92], which will be described in the following sections.

2.6.1 One-Hot Word Representation

For a one-hot word representation, the vector representing the target word has the same size as the known dictionary. Assume that the known dictionary contains the following 5 words, with these vector representations:

“the”      [1,0,0,0,0]
“was”      [0,1,0,0,0]
“chasing”  [0,0,1,0,0]
“dog”      [0,0,0,1,0]
“cat”      [0,0,0,0,1]

The sentence "The dog was chasing the cat" become:

[[1,0,0,0,0], [0,0,0,1,0], [0,1,0,0,0], [0,0,1,0,0], [1,0,0,0,0], [0,0,0,0,1]].

The advantage of using a one-hot word representation is its simplicity: each word in the dictionary is assigned its own dimension. The downside, however, is that a one-hot dictionary does not scale well when the dictionary grows, since each word is given its own dimension.


Another major downside is that a one-hot word representation does not have any correlation between words; the dissimilarity between the words “cat” and “dog” is identical to that between “the” and “dog”.

2.6.2 Dense-Vector Word Representation

Instead of using a one-hot representation, words can be represented using a dense vector. A dense vector representation does not give each word its own dimension, instead it represents each word using multiple values.

Following the same example words as above, these words could instead be represented using fewer dimensions, where each dimension contains some information about the word. Instead of representing each word as a 5-dimensional one-hot vector, they could be represented by a 2-dimensional vector:

“the”      [0,0.5]
“was”      [0,-0.5]
“chasing”  [-1,1]
“dog”      [1,1]
“cat”      [1,-1]

Even though the above example is reduced to 2 dimensions it is still possible to represent each word as a separate vector. The same example sentence using a dense vector becomes: [[0,0.5], [1,1], [0,-0.5], [-1,1], [0,0.5], [1,-1]].

Using a dense word representation is a huge improvement for tokenized words, since, instead of giving each word its own dimension, the dimensionality can be reduced to a fixed size. A dictionary of 40,000 items, which would require a 40,000-dimensional one-hot vector, can instead be represented using a 100 or 200-dimensional dense vector [10, p.89].

If the representation of the dense vector is chosen carefully it is possible to remove the uniform distance between words found in the one-hot version and position similar words closer to each other, and dissimilar words far from each other. The two words “dog” and “dogs” would in this case be positioned closer to one another, while “the” would be positioned further away. Dense vectors are also able to encode relationships between words, allowing the representation to have a similar correlation between “dog” and “dogs” as between “cat” and “cats”.

2.6.3 Word2vec

Word2vec is one of the current state-of-the-art methods when it comes to word embedding [11, 23, 29]. The method uses the Skip-Gram⁶ approach, and is trained either on a large text corpus or together with its application.

The idea of the Skip-Gram training method is to get a good word vector representation by requiring the network to predict the context given a word during training. Skip-Gram uses pairs of words as its training data, where a word is matched up with each nearby word of the corpus within a certain window size, i.e. how many words before and after the given word a pair should be generated for, to find good correlations between the occurrences of words. Further improvements on the algorithm were made, which include: Negative Sampling, Subsampling of Frequent Words and the generation of multi-tokens.

Negative Sampling extends the training samples by introducing false pairs. A separate set of pairs is generated, where the goal is to minimize the probability of these words correlating. Mikolov et al. [24] concluded that 2 to 5 negative samples for each positive training sample work well for large datasets, while smaller datasets benefit from using 5 to 20 negative samples for each positive.


Word     Context
quick    the
quick    brown
brown    quick
brown    fox
fox      brown
fox      jumps
jumps    fox
jumps    over
over     jumps
over     the
the      over
the      lazy
lazy     the
lazy     dog

Table 2.1: The full set of word pairs of the sentence “the quick brown fox jumps over the lazy dog” with window size set to 1.

These negative samples are added to require the model to distinguish between in-context words and out-of-context words.

Subsampling of frequent words removes certain words from the set. Words that are very frequent (such as “the”, “of”, “a/an”) do not contain much contextual information compared to the rare words. To reduce their effect, and to reduce their dominance, a method to partially discard these words was introduced.

P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}} \quad (2.4)

Each word in the dataset is discarded with the probability given by formula 2.4, where t is a given threshold and f(w_i) is the frequency of the word in the dataset. Mikolov et al. used

t = 10⁻⁵ when comparing their method. Subsampling helps to avoid overfitting of the model. Overfitting occurs when the performance of the model used outside of training is reduced with more training, while the performance on the training data increases.
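A small sketch of how formula 2.4 can be applied is shown below (plain Python; the threshold t and the toy corpus are arbitrary example values, and the thesis's modified two-phase variant is not shown):

# Subsampling of frequent words (formula 2.4): discard probability per word.
import math, random
from collections import Counter

def subsample(tokens, t=1e-5, rng=random.Random(0)):
    counts = Counter(tokens)
    total = len(tokens)
    kept = []
    for w in tokens:
        freq = counts[w] / total                         # f(w_i): relative frequency
        p_discard = max(0.0, 1.0 - math.sqrt(t / freq))  # formula 2.4
        if rng.random() >= p_discard:
            kept.append(w)
    return kept

corpus = "the quick brown fox jumps over the lazy dog the the the".split()
print(subsample(corpus, t=0.05))   # the very frequent "the" is dropped with high probability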

Multi-tokens can be created by comparing the probability of two words occurring in sequence with the probability of the two words occurring individually. If the probability P("<a> <b>") is greater than P("<a>") · P("<b>"), a multi-token can be created, replacing each occurrence of "<a> <b>" with a new token. This method can replace word sequences such as "Great Wall", "White House" and "New York" by their own tokens, which improves the quality of these word sequences [14].

When the model is trained the embedding can be used as a dictionary. Each row in the embedding represents the dense vector of a word. Each word is initially represented as a one-hot vector; multiplying the word with the embedding matrix effectively results in fetching the vector in the row of the active word.

The word pairs from the sentence “the quick brown fox jumps over the lazy dog”, when using window size 1, can be seen in table 2.1. When training the word embedding matrix a batch of pairs is used as input data, the word being the input and the context being the nearby words of the input based on window size. Both negative sampling and subsampling are applied during the training.
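The pair generation itself can be sketched in a few lines of Python (illustrative only; note that this version also emits pairs for the sentence-initial and sentence-final words):

# Generate Skip-Gram (word, context) training pairs for a given window size.
def skipgram_pairs(tokens, window=1):
    pairs = []
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((word, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps over the lazy dog".split()
for word, context in skipgram_pairs(sentence, window=1):
    print(word, context)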


2.6.4 Alternative Word-Embedding Methods

Other methods to generate the dense vectors have been proposed since Mikolov et al. published their Skip-Gram with Negative Sampling (SGNS), word2vec. Two of these methods are Global Vectors (GloVe) by Pennington et al. [27] and SVD_PPMI by Levy et al. [20].

The Global Vectors (GloVe) method tries to generate word embedding vectors that capture a global relationship instead of a local context within a window size. GloVe starts by generating a matrix, X, where the width and height are the size of the vocabulary. The matrix consists of the co-occurrence counts of every word-to-word pair, such that X_{ij} defines the number of times the word j occurs in the context of word i. They further define X_i as the sum \sum_k X_{ik}, i.e. the number of times any word co-occurs together with the word i. The probability of word j in the context of word i is thus defined as P_{ij} = P(j|i) = X_{ij} / X_i.

To find relationships between certain words, i and j, GloVe introduces a third word, named the probe word. Using the probe words it is possible to find the ratio between different word triples, which is what GloVe utilizes to generate its word vectors. The authors provide a clear example where they compare the relationship between the words i = ice and j = steam along with the probe words k = ["solid", "gas", "water", "fashion"]. The ratio is found by dividing the probability P_{ik} by P_{jk}. The ratio is expected to be high if k relates to i but not j, low if the opposite holds, and close to 1 if it relates to both i and j or to neither i nor j. Following is the result they found using the mentioned values of i, j and k:

Probability and Ratio    k = solid     k = gas       k = water     k = fashion
P(k|ice)                 1.9 × 10⁻⁴    6.6 × 10⁻⁵    3.0 × 10⁻³    1.7 × 10⁻⁵
P(k|steam)               2.2 × 10⁻⁵    7.8 × 10⁻⁴    2.2 × 10⁻³    1.8 × 10⁻⁵
P(k|ice)/P(k|steam)      8.9           8.5 × 10⁻²    1.36          0.96

SVD_PPMI (Singular Value Decomposition) uses a variant of Pointwise Mutual Information (PMI), a measure introduced in 1990 by Kenneth Church and Patrick Hanks [20, 3]. PMI defines the mutual information I(x, y) of the two words x and y as:

I(x, y) = \log_2 \frac{P(x, y)}{P(x) P(y)} \quad (2.5)

where P(x) and P(y) are the probabilities of a word being x or y in a corpus C, obtained by counting the number of occurrences of the word and dividing it by the length, N, of the corpus. P(x, y) is estimated by “counting the number of times that x is followed by y in a window of w words, f_w(x, y), and normalizing by N”, which has the consequence of maintaining order in the data, since P(x, y) would give a different result than P(y, x).

The version used by Levy et al. alters PMI(x, y) (denoted as I(x, y) by Church and Hanks) into PPMI (positive PMI): PPMI(x, y) = max(0, PMI(x, y)), since PMI(x, y) would result in −∞ in the case where P(x, y) = 0. They use PPMI to create a matrix M^{PPMI} and perform SVD to create word and context representations.

SVD is a method which factorizes a matrix into the product of three matrices:

M = U \cdot \Sigma \cdot V^T \quad (2.6)

where the matrices U and V^T are orthonormal and \Sigma is diagonal, with each value greater than or equal to zero. \Sigma is ordered in descending order: \Sigma_{0,0} > \Sigma_{1,1} > ... > \Sigma_{n-1,n-1} > \Sigma_{n,n}.

Since the rows and columns are ordered such that the diagonal matrix is in descending order, the most relevant features can be kept by only keeping the first d rows and columns, giving us:

M_d = U_d \cdot \Sigma_d \cdot V_d^T \quad (2.7)


The word and context matrices are then created using:

W^{SVD} = U_d \cdot \Sigma_d \quad (2.8)

C^{SVD} = V_d \quad (2.9)

Unlike both word2vec and GloVe, SVD_PPMI does not rely on randomization, and gives identical matrices each time it is performed using the same hyperparameters, i.e. the parameters set prior to execution which adjust/weight the behaviour of the algorithm in question.
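A minimal NumPy sketch of the PPMI-plus-SVD pipeline (formulas 2.5-2.9) is shown below; the co-occurrence counts are toy values, and the context-window handling and smoothing choices of Levy et al. are omitted.

# PPMI matrix from co-occurrence counts, followed by truncated SVD (formulas 2.5-2.9).
import numpy as np

def ppmi_matrix(cooc):
    total = cooc.sum()
    p_xy = cooc / total
    p_x = cooc.sum(axis=1, keepdims=True) / total
    p_y = cooc.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_xy / (p_x * p_y))
    return np.maximum(pmi, 0.0)             # PPMI(x, y) = max(0, PMI(x, y))

def svd_embeddings(M, d):
    U, S, Vt = np.linalg.svd(M)
    W = U[:, :d] * S[:d]                    # W_SVD = U_d · Σ_d  (word vectors)
    C = Vt[:d].T                            # C_SVD = V_d        (context vectors)
    return W, C

# Toy co-occurrence counts for a 4-word vocabulary (rows: words, columns: contexts).
cooc = np.array([[0., 4., 1., 0.],
                 [4., 0., 2., 1.],
                 [1., 2., 0., 3.],
                 [0., 1., 3., 0.]])
W, C = svd_embeddings(ppmi_matrix(cooc), d=2)
print(W)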

Later analysis of the various algorithms (word2vec, GloVe and LSA⁷) came to the conclusion that word2vec and GloVe are more effective than LSA and that word2vec provides the best word vector representation with few dimensions [25]. Each method did however give high-end results and outperformed the other methods in some measures, making the choice very application-dependent.

2.6.5 Character-Based Embedding

A character-based language model does not use a dictionary with tokens representing each word; instead the tokens are character-sized and text is generated one character at a time. For a character-based model it is possible, but not necessary, to represent the characters as one-hot vectors, since the number of possible characters to write is limited to a vastly lower amount than the number of words.

Andrej Karpathy performed some experiments using a character-based network on various text structures with interesting results [18]. The results from his experiments show that a character-based network is able to learn the syntax of English text, text from Shakespeare and even the syntax of XML code (with correct opening and closing of scopes) and C programming code.

The experiments from Karpathy do not, however, learn an accurate semantic or contextual representation. Since all the experiments train on replicating the structure of the text from the training data, the model also never learns the underlying meaning of the words it uses, only the structure. The following text is the output from a model trained using essays from Paul Graham⁸:

The surprised in investors weren’t going to raise money. I’m not the company with the time there are all interesting quickly, don’t have to get off the same programmers. There’s a super-angel round fundraising, why do you can do. If you have a different physical investment are become in people who reduced in a startup with the way to argument the acquirer could see them just that you’re also the founders will part of users’ affords that and an alternation to the idea. [2] Don’t work at first member to see the way kids will seem in advance of a bad successful startup. And if you have to act the big company too.

The text shows that a character-based language model is able to learn several words, as well as that each sentence begins with a capital letter, and the model is also, surprisingly, able to learn how to write citations. Although some of the sentences make sense, most of them do not. The words from his model are not selected based on their meaning, but by their learnt grammatical role in a sentence, leading to a lack of connection between the sentences.

There is one main benefit of embedding words using a character-sized token-based language model compared to a word-sized model, which is that the word-based (word-sized token-based) model is limited to a fixed dictionary, while a character-based (character-sized token-based) model is not.

If the word-based model encounters a word not in the dictionary, it has no method to “guess” what the word may mean. Instead, the model becomes OOV and the word is marked

⁷ LSA: Latent Semantic Analysis – SVD is one of these methods
⁸ http://www.paulgraham.com/articles.html


as “unknown”. A character-based language model could, in theory, correctly interpret or represent a word it has never encountered before, as long as the word contains rules which the model has learned. The incorrect word “unprobable” (a misspelling of the word “improbable”) uses the prefix “un-” to negate the adjective “probable”. If a character-based model has learned both the meaning of the prefix and the word, this model could correctly interpret the word. A word-sized tokenized model, on the other hand, would either not have the word in its dictionary, or have encountered it too rarely to correctly learn its meaning.
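A character vocabulary is small enough that one-hot character tokens remain practical; the sketch below (illustrative, with a hypothetical alphabet and marker tokens) encodes a word as a sequence of one-hot character vectors.

# One-hot encoding of character-sized tokens (illustrative sketch).
import numpy as np

# A small character vocabulary suffices, unlike word vocabularies with tens of thousands of entries.
alphabet = list("abcdefghijklmnopqrstuvwxyz ") + ["<start>", "<end>"]
char_to_id = {ch: i for i, ch in enumerate(alphabet)}

def encode_word(word):
    """Return a (len(word), |alphabet|) matrix with one one-hot row per character."""
    out = np.zeros((len(word), len(alphabet)))
    for pos, ch in enumerate(word):
        out[pos, char_to_id[ch]] = 1.0
    return out

print(encode_word("unprobable").shape)   # (10, 29): one row per character token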

2.7 Caption Evaluation Metric Systems

To evaluate the performance of automated text generation it is necessary to have an automated evaluation of a generated text string. Having automated methods allows a fair judgement of the quality and is the only realistic approach when millions of lines of text must be scored in a short duration.

The 2015 MSCOCO captioning competition also provides an Application Programming Interface (API) which was used when evaluating the captioning algorithms [1]. This API was also available before the competition and is still available after the competition finished. The API uses several of the commonly used captioning metrics: BLEU [26], ROUGE [22], METEOR [7] and CIDEr [36]. At the time of the competition, the top scoring algorithms were also evaluated by human judges, although the criteria the judges used have not been described.

2.7.1 BLEU Evaluation

The BLEU⁹ algorithm is a word matching algorithm that matches a candidate string with one or several reference strings.

The simplest form of BLEU, also called BLEU(1), checks how many of the words in the candidate string also exist in any of the reference strings. The score of this candidate string is then the fraction between matching words and total words. BLEU also includes another two rules to avoid falsely accurate candidates:

1. A word cannot be matched more times than the highest occurrence count of the word in any reference string

2. A candidate string’s score is penalized if the length is shorter than the shortest reference string

Why are these two extra rules necessary? Take a look at the following candidate and reference strings¹⁰:

Candidate: the the the the the the the
Reference1: The cat is on the mat
Reference2: There is a cat on the mat

If rule #1 were not used, the candidate string would have 7 matching words out of 7, giving it a perfect score. With the rule, however, the maximum count of the word "the" is 2, and the score drops to 2/7.

The second rule, to penalize short candidate strings, hinders the algorithm from exploiting absurdly short candidates.

Given another example, consider the following candidate and reference strings¹⁰:

Candidate: of the

9BLEU: bilingual evaluation understudy


Reference1: It is a guide to action that ensures that the military will forever heed Party commands.

Reference2: It is the guiding principle which guarantees the military forces always being under the command of the Party.

Reference3: It is the practical guide for the army always to heed the directions of the party.

The candidate would be given a perfect score, 2 out of 2, although it is not a good candidate. BLEU penalizes candidate strings by:

1 \quad \text{if } c \geq r \quad (2.10)

e^{(1 - r/c)} \quad \text{if } c < r \quad (2.11)

where c is the candidate string length and r is the shortest reference string length.

BLEU can also compare a consecutive group of words between the candidate and reference strings. Instead of matching a single word in the candidate string against the reference strings, each set of n consecutive words, also called an n-gram, is checked against the reference strings.
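Following the description above (clipped unigram counts and a penalty based on the shortest reference), BLEU(1) can be sketched as below; this is an illustration, not the official implementation used by the COCO evaluation software.

# BLEU(1): clipped unigram matching with the brevity penalty (formulas 2.10-2.11).
import math
from collections import Counter

def bleu1(candidate, references):
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    cand_counts = Counter(cand)
    # Rule 1: a word may match at most its highest count in any single reference.
    max_ref_counts = Counter()
    for ref in refs:
        for word, n in Counter(ref).items():
            max_ref_counts[word] = max(max_ref_counts[word], n)
    clipped = sum(min(n, max_ref_counts[word]) for word, n in cand_counts.items())
    precision = clipped / len(cand)
    # Rule 2: penalize candidates shorter than the shortest reference.
    c, r = len(cand), min(len(ref) for ref in refs)
    brevity = 1.0 if c >= r else math.exp(1 - r / c)
    return brevity * precision

print(bleu1("the the the the the the the",
            ["The cat is on the mat", "There is a cat on the mat"]))  # 2/7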

2.7.2 ROUGE Evaluation

The ROUGE¹¹ algorithm is a package of five different evaluation functions: ROUGE-N, ROUGE-L, ROUGE-W, ROUGE-S and ROUGE-SU.

ROUGE-N uses n-gram matching, similar to BLEU. The main difference is that ROUGE-N's denominator is the number of n-grams of the reference string instead of the candidate string.

ROUGE-L counts the Longest Common Subsequence (LCS) between the candidate and a reference string. The subsequence does not have to be contiguous in the candidate string, but the words must occur in matching order within the candidate string.

The string "a b c d" has LCS = 4 in "a b e c d" and LCS = 3 in "a b e d c".

ROUGE-L calculates both Recall, i.e. the percentage of reference words that reoccur in the candidate string, and Precision, the percentage of candidate words that reoccur in the reference string. Writing every word in the dictionary would result in a high recall but low precision, whereas writing only a single word, for example the most common one ("a"), would in the majority of cases give a high precision but low recall.

ROUGE-L’s formula is:

$R_{lcs} = \frac{LCS(X, Y)}{m}$ (2.12)

$P_{lcs} = \frac{LCS(X, Y)}{n}$ (2.13)

$F_{lcs} = \frac{(1 + \beta^2) R_{lcs} P_{lcs}}{R_{lcs} + \beta^2 P_{lcs}}$ (2.14)

where X is the reference string of length m, Y the candidate string of length n, and β a weighting factor.
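A minimal sketch of the LCS computation and the ROUGE-L formulas above; the helper names and the β value are illustrative choices, not the COCO evaluation code:

```python
def lcs_length(x, y):
    """Length of the longest common subsequence between two word lists."""
    table = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            table[i][j] = (table[i - 1][j - 1] + 1 if xi == yj
                           else max(table[i - 1][j], table[i][j - 1]))
    return table[len(x)][len(y)]

def rouge_l(reference, candidate, beta=1.2):
    x, y = reference.split(), candidate.split()
    lcs = lcs_length(x, y)
    recall = lcs / len(x)        # R_lcs, equation 2.12
    precision = lcs / len(y)     # P_lcs, equation 2.13
    if recall == 0 or precision == 0:
        return 0.0
    return ((1 + beta ** 2) * recall * precision
            / (recall + beta ** 2 * precision))  # F_lcs, equation 2.14

print(lcs_length("a b c d".split(), "a b e c d".split()))  # 4
print(lcs_length("a b c d".split(), "a b e d c".split()))  # 3
print(rouge_l("a b c d e f g", "a b c d h i j"))
```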

ROUGE-W is a weighted LCS. Using ROUGE-L, the following two candidates would be given the same score against the reference string:

reference: a b c d e f g
candidate 1: a b c d h i j
candidate 2: a h b i c j d


ROUGE-W weights consecutive matches without breaks higher than matches with breaks, so candidate 1 would be given a higher score than candidate 2.

ROUGE-S matches occurrences of pairs of words, but unlike ROUGE-N the two words in a pair do not have to be adjacent. A constraint is set to limit the distance between the words in a pair, which avoids obscure word pairs such as “the the”.

The authors noticed a potential problem with ROUGE-S: a candidate sentence whose words are in reversed order relative to the reference string does not contain any matching pair and is thus scored zero. For this reason they created an extension of ROUGE-S, called ROUGE-SU, which combines ROUGE-S with ROUGE-N, with N = 1.
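As a small illustration of skip-bigram matching with a distance limit (the window size and the example sentences below are arbitrary choices, not taken from the ROUGE package):

```python
from itertools import combinations

def skip_bigrams(words, max_skip=4):
    """All ordered word pairs whose positions are at most max_skip apart."""
    return {(words[i], words[j])
            for i, j in combinations(range(len(words)), 2)
            if j - i <= max_skip}

ref = "police killed the gunman".split()
cand = "gunman the killed police".split()      # reversed word order
print(skip_bigrams(ref) & skip_bigrams(cand))  # empty set: no ordered pair matches
```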

2.7.3 METEOR Evaluation

METEOR (Metric for Evaluation of Translation with Explicit Ordering) uses both Precision and Recall on 1-gram word matching. By using a dictionary of synonyms and word stems, a word in the candidate string can be given a partial score if it is a synonym or a stem match of a reference word. The candidate sentence is then penalized based on how many chunks the matched words are separated into.

The METEOR algorithm has eight configurable parameters: four configure the distribution of influence, and the other four weight whether a word is an exact match, has a matching stem (base word), is a synonym, or is a paraphrase.

The four distribution parameters are:

α – Harmonic mean between Recall and Precision: $F = \frac{R \cdot P}{\alpha \cdot P + (1 - \alpha) \cdot R}$

β and γ – Penalty factor for the number of chunks: $Pen = \gamma \left(\frac{ch}{m}\right)^{\beta}$, where ch = number of chunks and m = number of matches

δ – Weighting between content words and function words. Each matched word is separated into being a ’function’ or ’content’ word. A function word is a word with a relative frequency above $10^{-3}$; any other word is considered a content word.

Since METEOR is able to give a partial score to words having a correct stem or being a synonym, it is more flexible toward dissimilarities between the model output and the ground truth where the model still provides a correct sentence. Out of all the metrics in the Microsoft COCO caption evaluation software, METEOR is the only one that partially scores words that are not an exact match.

Thanks to the extra dictionary comparison and stemming, the METEOR metric is able to provide a more accurate scoring of sentences and does not rely on exact matches for each word.
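A sketch of how these pieces can be combined into a single score; the parameter values are placeholders, and the final combination F · (1 − Pen) follows the standard METEOR definition, which is assumed here rather than stated in this section:

```python
def meteor_like_score(matches, cand_len, ref_len, chunks,
                      alpha=0.9, beta=3.0, gamma=0.5):
    """Combine unigram precision/recall with a fragmentation penalty,
    following the F and Pen formulas above (parameter values are illustrative)."""
    if matches == 0:
        return 0.0
    precision = matches / cand_len
    recall = matches / ref_len
    f_mean = (recall * precision) / (alpha * precision + (1 - alpha) * recall)
    penalty = gamma * (chunks / matches) ** beta
    return f_mean * (1 - penalty)   # assumed combination, as in standard METEOR

# 6 matched words out of an 8-word candidate and a 7-word reference, in 2 chunks:
print(meteor_like_score(matches=6, cand_len=8, ref_len=7, chunks=2))
```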

2.7.4 CIDEr Evaluation

CIDEr (Consensus-based Image Description Evaluation) is a two-step process. Before evaluating candidate sentences, CIDEr computes Term Frequency–Inverse Document Frequency (TF-IDF) weights over the entire dataset for each n-gram being used.

The TF-IDF algorithm consists of two values, Term Frequency (TF) and Inverse Document Frequency (IDF).

Term Frequency is the percentage of times a given word occurs in a specific "document", where a document stands for a limited text, such as a caption, blog post or news post.

Inverse Document Frequency checks globally, among a set of documents, how common a word is. The IDF term uses a formula where words found across many documents (boolean counting: existing or not existing in a document) return a low value (down to 0), while rarely found words return a high value (up to 1).
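A minimal sketch of a generic TF-IDF computation over caption "documents"; the log-based IDF below is a common textbook form and is not CIDEr's exact formulation:

```python
import math

captions = [
    "a man riding a skateboard on a ramp",
    "a man flying a kite on the beach",
    "a plate with a pizza on a table",
]

def tf_idf(word, document, documents):
    tokens = document.split()
    tf = tokens.count(word) / len(tokens)                    # term frequency
    df = sum(1 for d in documents if word in d.split())      # boolean document count
    idf = math.log(len(documents) / df) if df else 0.0       # rare words -> high IDF
    return tf * idf

print(tf_idf("a", captions[0], captions))           # occurs everywhere -> weight 0
print(tf_idf("skateboard", captions[0], captions))  # rare and informative -> higher
```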


CIDEr weights n-gram pairs using the TF-IDF value instead of the direct Precision and Recall. This results in scoring each candidate string based on how many meaningful words it contains rather than on words that do not provide any information. Words such as “a”, “and”, “in”, “with”, “to” and “of” provide close to no information compared to words such as “skateboard”, “kite”, “beach” and “pizza”, and are thus weighted lower than the latter. To account for a lower average maximum score the CIDEr metric adds a constant multiplier to the end result. The metric uses a constant multiplier of 10, which results in a theoretical maximum value of 10, although it is highly unlikely that all words in a sentence are the rarest words. The constant multiplier was chosen to put the metric within a similar range as the other metrics.


3 Method

In this chapter the various architectural models for caption generation are described in detail. The chapter is separated into two sections. The first section describes the architecture of the different models, including the baseline model and the alternative model for caption generation. The second section defines the methodology used to evaluate the different methods.

3.1 Architectures

To perform the experiment, a baseline (word-based) model and alternative character-based models were trained on the MS COCO 2014 training data set and annotations.

For the word-based language model, im2txt was used. Im2txt is the name of the architecture used by Google during the MS COCO captioning competition. The model is built with TensorFlow, an open-source platform and library to design, train and share neural network models [34]. The character-based model is a modification of the im2txt architecture where the recurrent LSTM-Block and the tokenization are modified. Tokens in the character-based model are split at each character instead of at each word; this includes tokens for special characters such as whitespace, numbers and other non-alphabetic characters.

The models consist of two parts, image recognition and word generation. Figure 3.1 shows the general architecture of the different models.


Figure 3.1: General architecture of the models (input image, 3×299×299 → image recognition → 1×2048 feature vector → token generation → token, M×N).

Each part of the architecture will be further described in its own subsection.

3.1.1 Image Recognition

The image recognition used a pre-trained inception-v3 network. The model was downloaded from http://download.tensorflow.org/models/inception_v3_2016_08_28.tar.gz. There is support for a second phase of training where the image embedding layer and the token generation layers are trained together with the inception network, although this was not performed since such co-training would take several weeks. The inception model translates each image into a 2048-dimensional feature vector. Since the inception-v3 network is unchanged during training, each image was pre-processed once, and the 2048-dimensional output vector from the inception-v3 network was used throughout the training.
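As an illustration of this pre-processing step, the sketch below extracts a 2048-dimensional feature vector per image. It uses the Keras InceptionV3 wrapper with global average pooling as an approximation of the TF-Slim checkpoint referenced above, and "example.jpg" is a placeholder path:

```python
import numpy as np
import tensorflow as tf

# Pre-trained Inception-v3 without the classification head; global average pooling
# turns the final convolutional feature map into a single 2048-dimensional vector.
feature_extractor = tf.keras.applications.InceptionV3(
    weights="imagenet", include_top=False, pooling="avg")

def extract_features(image_path):
    """Load one image, resize it to 299x299 and return its (1, 2048) feature vector."""
    image = tf.keras.utils.load_img(image_path, target_size=(299, 299))
    array = tf.keras.utils.img_to_array(image)[np.newaxis]          # (1, 299, 299, 3)
    array = tf.keras.applications.inception_v3.preprocess_input(array)
    return feature_extractor(array)                                  # (1, 2048)

# features = extract_features("example.jpg")   # placeholder image path
```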

3.1.2 Token Generation

During the training two data sets were used. The first is the set of 2048-dimensional feature vectors generated by the image recognition step. The second is a sequence of one-hot vectors corresponding to each word in the ground-truth captions matching each image. Figure 3.2 shows the general structure of the token generation used for both of the models. Each of the “LSTM-Block(s)” in the figure represents the same blocks, though at different time steps.



Figure 3.2: Structure of the token generation part of each model. S0 is a special “start_of_sentence” token, while the tokens S1 to Sn−1 are words from the reference sentence. The final token Sn is the “end_of_sentence” token used to mark the end of a sentence. This structure follows the configuration used by Vinyals et al. [37]. The matrices We and Wd are used to embed a word into its dense representation – WeSt – and to translate the dense representation back to a one-hot probability distribution – WdHt.

In the first iteration the image features from an inception-v3 model are passed to the image embedding layer, which changes the dimensionality from 2048 to the LSTM-Block cell size. The embedded, resized features are then fed to a zeroed LSTM-Block, i.e. the cells in the block contain no data and the LSTM-Block's input Ht−1 is zero (see chapter 2.5.1).

After the first iteration the model's input is the sequence of tokens (S0 to Sn), where each value St corresponds to a one-hot representation over the model's dictionary (known tokens). The multiplication WeSt embeds the one-hot representation into its dense representation. Likewise, WdHt+1 reverses the translation from the dense representation back into a one-hot probability distribution. The values S0 and Sn are two special tokens: S0 indicates the start of a sentence, and Sn the end of a sentence. These two tokens are included in the model's dictionary.

Except during the first iteration, when the state of the LSTM-Block is set using image information, each iteration produces a probability distribution Pt over all words in the dictionary. The output Ht is multiplied by a weight matrix, which outputs the probability of St being a certain token in the dictionary.

During the training the image embedding, encoding (We) and decoding (Wd) weights were updated, as well as the weights in the LSTM-Block(s).

When the model generates its own descriptions, the token with the highest probability in Pt is chosen as input in place of St during the next iteration, i.e. argmax(Pt) is chosen. This continues until argmax(Pt) is the end_of_sentence token.
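The inference loop described above can be sketched as follows. This is a simplified TF 2 / Keras approximation of the greedy decoding step, not the im2txt code; the layer names, dimensions and special-token ids are illustrative assumptions:

```python
import tensorflow as tf

VOCAB_SIZE, CELL_SIZE = 12000, 512
START_ID, END_ID = 0, 1            # illustrative ids for the special tokens

image_embedding = tf.keras.layers.Dense(CELL_SIZE)                   # 2048 -> cell size
word_embedding  = tf.keras.layers.Embedding(VOCAB_SIZE, CELL_SIZE)   # W_e
lstm_cell       = tf.keras.layers.LSTMCell(CELL_SIZE)
decoder         = tf.keras.layers.Dense(VOCAB_SIZE)                  # W_d

def generate_caption(image_features, max_len=20):
    """Greedy decoding; image_features has shape (1, 2048)."""
    # First iteration: the embedded image features initialize a zeroed LSTM state.
    state = [tf.zeros([1, CELL_SIZE]), tf.zeros([1, CELL_SIZE])]
    _, state = lstm_cell(image_embedding(image_features), state)

    token, caption = tf.constant([START_ID]), []
    for _ in range(max_len):
        h, state = lstm_cell(word_embedding(token), state)          # feed W_e S_t
        logits = decoder(h)                                         # W_d H_t -> P_t
        token = tf.argmax(logits, axis=-1, output_type=tf.int32)    # argmax(P_t)
        if int(token[0]) == END_ID:
            break
        caption.append(int(token[0]))
    return caption

print(generate_caption(tf.random.normal([1, 2048])))   # random features, untrained weights
```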

3.1.3 LSTM-Block Configuration and Tokenization

Two different LSTM-Block configurations and tokenization methods were used during this study. The configurations were applied to two different models: one word-based language model and one character-based language model.


The word-based language model used an unaltered im2txt model, which was created by Vinyals et al. and was used during the MSCOCO 2015 captioning competition [37]. The model was made public in 2016 and is currently available from TensorFlow's Git repository (https://github.com/tensorflow/models/tree/master/research/im2txt). The im2txt model uses a single LSTM-Block consisting of 512 cells and a dictionary consisting of word-sized tokens. The dictionary was limited to 12,000 unique tokens, leading to at most 11,998 words plus the start_of_sentence and end_of_sentence tokens.

To choose which words should exist in the dictionary, all annotations are scanned and the most common words are kept in the dictionary.
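A simplified sketch of how such a most-common-words dictionary can be built; the marker token names <S> and </S> are illustrative, and the real im2txt preprocessing tokenizes the captions more carefully:

```python
from collections import Counter

def build_vocabulary(captions, vocab_size=12000):
    """Keep the most common words, reserving two slots for the sentence markers."""
    counts = Counter(word for caption in captions
                     for word in caption.lower().split())
    words = [w for w, _ in counts.most_common(vocab_size - 2)]
    return ["<S>", "</S>"] + words          # start_of_sentence, end_of_sentence

vocab = build_vocabulary(["A man riding a skateboard .", "A cat on a mat ."])
print(vocab[:6])
```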

The model trained for 100,000 epochs, where one epoch is a training iteration. Each epoch trained on the 586,363 image-caption pairs, where the images consist of the extracted image features and not the raw image input. Afterwards the MS COCO validation data set and annotations (http://cocodataset.org/#download) were used to evaluate the performance of captions generated for images the model had not seen before.

The character-based model alters the word-based model in two ways. The first alteration is the LSTM-Block of the model. Instead of a single LSTM-Block the model uses a MultiRNNCell (https://www.tensorflow.org/api_docs/python/tf/nn/rnn_cell/MultiRNNCell), which combines multiple recurrent layers/blocks into behaving as a single block. This model uses two LSTM-Blocks in the MultiRNNCell, both with 512 cells, instead of a single LSTM-Block. The choice of using two LSTM-Blocks was based on previous work by Andrej Karpathy [18]. His work used three LSTM-Blocks, although that model was meant to generate multiple paragraphs instead of one or two descriptive sentences. Since the character-based model has an added second task (to write words on a character basis), a second LSTM-Block was added for this purpose. Figure 3.3 shows how the MultiRNNCell is constructed. The design changes between the two models were minimized to provide a clear comparison between generating sentences with tokens of different sizes.

The second alteration is the model's dictionary, which is changed into containing character-sized tokens, including numbers, whitespace and special characters (. , ' " / : ; ? & ! ( ) $ #), a total of 55 unique tokens when including the start_of_sentence and end_of_sentence tokens. This change is the core difference between the two models.
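The thesis configuration uses tf.nn.rnn_cell.MultiRNNCell as linked above; a roughly equivalent sketch with the TF 2 / Keras API (an approximation, not the thesis code) stacks two 512-cell LSTM blocks like this:

```python
import tensorflow as tf

CELL_SIZE = 512

# Two stacked LSTM blocks behaving as a single recurrent layer; passing a list of
# cells to tf.keras.layers.RNN stacks them internally, mirroring the MultiRNNCell
# configuration described above.
stacked_lstm = tf.keras.layers.RNN(
    [tf.keras.layers.LSTMCell(CELL_SIZE), tf.keras.layers.LSTMCell(CELL_SIZE)])

# Example: a batch of one sequence of 10 embedded character tokens.
dummy_sequence = tf.zeros([1, 10, CELL_SIZE])
print(stacked_lstm(dummy_sequence).shape)   # (1, 512) -- output of the top block
```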


Figure 3.3: MultiRNNCell with two LSTM-Blocks, unfolded two iterations.

The character-based language model trained for 200,000 epochs, with an intermediate evaluation after 100,000 epochs.

All code is published on GitHub.


References
