
Linköpings universitet SE–581 83 Linköping

Linköping University | Department of Computer and Information Science

Master’s thesis, 30 ECTS | Datateknik

2020 | LIU-IDA/LITH-EX-A--20/026--SE

Automated Image Suggestions for News Articles

An Evaluation of Text and Image Representations in an Image Retrieval System

Automatiska bildförslag till nyhetsartiklar

Pontus Svensson

Supervisor : Marco Kuhlmann Examiner : Arne Jönsson


Upphovsrätt

Detta dokument hålls tillgängligt på Internet – eller dess framtida ersättare – under 25 år från publiceringsdatum under förutsättning att inga extraordinära omständigheter uppstår.

Tillgång till dokumentet innebär tillstånd för var och en att läsa, ladda ner, skriva ut enstaka kopior för enskilt bruk och att använda det oförändrat för ickekommersiell forskning och för undervisning. Överföring av upphovsrätten vid en senare tidpunkt kan inte upphäva detta tillstånd. All annan användning av dokumentet kräver upphovsmannens medgivande. För att garantera äktheten, säkerheten och tillgängligheten finns lösningar av teknisk och administrativ art.

Upphovsmannens ideella rätt innefattar rätt att bli nämnd som upphovsman i den omfattning som god sed kräver vid användning av dokumentet på ovan beskrivna sätt samt skydd mot att dokumentet ändras eller presenteras i sådan form eller i sådant sammanhang som är kränkande för upphovsmannens litterära eller konstnärliga anseende eller egenart.

För ytterligare information om Linköping University Electronic Press se förlagets hemsida http://www.ep.liu.se/.

Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Abstract

Multimodal machine learning is a subfield of machine learning that aims to relate data from different modalities, such as texts and images. One of the many applications that can be built upon this technique is an image retrieval system that, given a text query, retrieves suitable images from a database. In this thesis, a retrieval system based on canonical correlation is used to suggest images for news articles. Different dense text representations produced by Word2vec and Doc2vec, and image representations produced by pre-trained convolutional neural networks, are explored to find out how they affect the suggestions. Which part of an article is best suited as a query to the system is also studied, and experiments are carried out to determine if an article's date of publication can be used to improve the suggestions. The results show that Word2vec outperforms Doc2vec in the task, which indicates that the meaning of article texts is not as important as the individual words they consist of. Furthermore, the queries are improved by rewarding words that are particularly significant.


Acknowledgments

First of all, I would like to thank Consid and Saab for finding a thesis project completely in line with my preferences. A special thanks goes to my supervisor and mentor at Consid,

Jesper Bäck, for always being supportive and available when I needed him.

I would also like to thank Marco Kuhlmann, my supervisor at the university, for his valuable feedback and answers to all my questions. This spring came to be an odd time for us all, with social distancing and reduced contact with the outside world, but our weekly meetings always cheered me up.

Next in the cavalcade of thank-yous is my examiner, Arne Jönsson, for his feedback on my draft reports and his fun interactions at the seminars. Speaking of seminars, my final acknowledgments are addressed to the rest of the group who wrote their theses at NLPLAB this spring. You gave a lot of helpful feedback and were a great source of inspiration. All the best.


Contents

Abstract iii

Acknowledgments iv

Contents v

List of Figures vi

List of Tables vii

1 Introduction 2
1.1 Motivation 2
1.2 Aim 3
1.3 Research Questions 3
1.4 Background 4
1.5 Delimitations 4

2 Theory 5
2.1 Natural Language Processing 5
2.2 Computer Vision 8
2.3 Transfer Learning 9
2.4 Multimodal Machine Learning 10

3 Method 15
3.1 Candidate Datasets 15
3.2 Implementation 17
3.3 Evaluation 21

4 Results 23
4.1 Evaluation 23
4.2 Example suggestions 25

5 Discussion 29
5.1 Results 29
5.2 Method 32
5.3 The Work in a Wider Context 34

6 Conclusion 35


List of Figures

2.1 Word2vec's training algorithms. 6
2.2 A typical CNN architecture. 9
2.3 Identity block with shortcut connection. 9
2.4 The ResNet-50 architecture. 10
2.5 Different types of correlation. 12
2.6 Canonical Correlation Analysis (CCA). The input data, X and Y, are transferred by a and b into U and V that share the same vector space. 13
2.7 Kernelized Canonical Correlation Analysis (KCCA). The input data, X and Y, are kernelized into ϕ_X(X) and ϕ_Y(Y). The new representations are then transferred by a and b into U and V that share the same vector space. 14
3.1 An overview of the system pipeline. Texts and images are translated into vector format by a text embedding model and an image classification model. The vectors are then transferred into a joint vector space where related texts and images are placed close to each other. 18
3.2 An example tokenization. 20
3.3 The date representations. 20
4.1 Two successful image suggestions for texts from the SaabNews test set. 27


List of Tables

3.1 Example evaluation of the system where a text and image from the same pair share id. 21
4.1 SaabNews experiments using CCA. △ denotes that higher is better, ▽ denotes that lower is better. The best scores in each column are marked in bold. 24
4.2 SaabNews experiments using KCCA with linear kernel and regularization set to 0.001. △ denotes that higher is better, ▽ denotes that lower is better. The best scores in each column are marked in bold. 24
4.3 BreakingNews experiments using CCA. △ denotes that higher is better, ▽ denotes that lower is better. The best scores in each column are marked in bold. 25
4.4 BreakingNews (subset) experiments using CCA. △ denotes that higher is better, ▽ denotes that lower is better. The best scores in each column are marked in bold. 26
4.5 BreakingNews (subset) experiments using KCCA with linear kernel and regularization set to 0.001. △ denotes that higher is better, ▽ denotes that lower is better. The best scores in each column are marked in bold. 26


Acronyms

BoVW Bag-of-Visual-Words
BoW Bag-of-Words
CBOW Continuous Bag-of-Words
CCA Canonical Correlation Analysis
CNN Convolutional Neural Network
IDF Inverse Document Frequency
KCCA Kernelized Canonical Correlation Analysis
MRR Mean Reciprocal Rank
NLP Natural Language Processing
PV-DBoW Distributed Bag-of-Words
PV-DM Distributed Memory
R@K Recall at K
RBF Radial Basis Function
TF Term Frequency


1 Introduction

With today's rapidly growing databases, the famous John Naisbitt quote from 1982, "We are drowning in information but starved for knowledge" [1], might be more appropriate than ever. As databases grow larger, they consequently become more difficult to navigate. This is a problem for the news publicists at Saab, who regularly have to search an inadequately annotated image database to find relevant images for articles that are about to be published. In recent years, machine learning has been used to extract knowledge from data in a number of areas, such as computer vision, text mining and Natural Language Processing (NLP). More specifically, neural networks can be trained to recognize objects in images and to find the semantic meaning of words, sentences and longer pieces of text. In this thesis, such techniques will be utilized for suggesting suitable images for a given text in what is called an image retrieval system.

1.1 Motivation

To represent text as numbers for further computations is a fundamental part of NLP. The usual process for this is to let a neural network generate word embeddings – a way to represent words as vectors – where words with similar meaning are placed close to each other in a vector space. A typical example that shows what can be achieved with word vectors is a calculation such as

X = vector(“king”) − vector(“man”) + vector(“woman”),

where a search in a well-trained vector space reveals that X is most similar to vector(“queen”). In some applications, it is interesting to create vector representations of longer pieces of text. Bag-of-Words (BoW) is a simple yet sometimes surprisingly effective method that serves this purpose. Here, all words in a set of documents are stored in a vocabulary, and each document is represented as a histogram of the words it consists of. As the histogram representations tend to become large and sparse, they are not optimal in terms of computational complexity and memory requirements. On top of that, the BoW representation does not take word order into account, which leads to a potential loss of valuable semantic and syntactic information. Hence, plenty of research on sentence and document embeddings has been carried out during the last decade. It has been shown that such embeddings can be constructed similarly to how word embeddings are constructed.
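To make the analogy concrete, the following is a minimal sketch using Gensim's KeyedVectors interface; the file name of the pre-trained Google News vectors is illustrative, and any sufficiently well-trained Word2vec model would do.

```python
# Minimal sketch of the king - man + woman analogy, assuming a pre-trained
# Word2vec model in word2vec binary format (the file name is illustrative).
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

# most_similar computes vector("king") - vector("man") + vector("woman")
# and returns the words closest to the result in the vector space.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# Expected in a well-trained space: [('queen', <similarity>)]
```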

Text is not the only modality that can be interpreted with machine learning methods. The large amount of images available online has enabled well-performing neural network models that can recognize


both objects and landscapes in images. Some of these models are pre-trained on large image datasets and distributed online, ready to classify out-of-the-box. Given an input image, these models first extract its characteristic features, which are subsequently used to determine the image's class probability distribution. Since both the features and the probability distribution are represented as vectors, this kind of neural network model is suitable for creating vector representations of images for further mathematical computations. Both text and image vectors are cornerstones in the proposed image retrieval system.

Image retrieval systems are often used with keyword queries, where one of the most well-known examples is Google Images. However, the progress in machine learning has allowed for "smarter" solutions that could, for instance, make use of the previously mentioned text and image vectors. That is, longer pieces of text with semantic meaning could be used as search queries, with possibly higher precision. A thorough scan of the relevant literature showed that many studies in image retrieval use sentence queries. In this thesis, image retrieval using both sentences and longer texts, with and without word order information, will be examined in order to conclude which makes the better query.

1.2 Aim

Editors in the news industry occasionally need to use images that are not specifically taken for the news article that is to be published. This is probably even more common for company news, where a certain product has a limited set of concept images taken in different environments. As the contents of this kind of image database have been added over the years by multiple users, there is a high risk that the quality of the metadata tags is lacking.

This thesis aims to propose an end-to-end solution for receiving appropriate image suggestions from a weakly annotated image database by providing a news text as input. The main focus will be to investigate how different design choices related to the text and image representations affect the quality of the suggestions. The outcome will help Saab in the further development of a system that fits their data.

1.3 Research Questions

1. In an image suggestion system for news articles, how do the results obtained using headlines as search queries compare to the results obtained using both headlines and running texts as queries?

Two of an article's main ingredients are the headline and the running text. Often the headline summarizes the article in a single sentence. Hence, the headline could intuitively be a suitable and sufficient query to an image retrieval system. On the other hand, newspaper headlines are often formulated to grab the reader's attention rather than to be informative. Headlines also tend to be formulated very differently depending on the person and newspaper behind them. This question aims to evaluate whether article headlines are suited as queries in image retrieval or if the extra dimensions in the full texts add valuable information to the system.

2. How does a system based on averaged word embeddings compare to a system based on paragraph embeddings?

An informative text representation is crucial for the system at hand. In this work, two different methods are evaluated: averaged Word2vec embeddings and Doc2vec embed-dings. Doc2vec embeddings probably better capture the sentiment of a text. However, it is not certain that the meaning of a text is important in the retrieval task.


3. To what extent does the date of publication of articles improve a search query?

In addition to the aforementioned queries, it could also be interesting to see if the date of publication could be used to further improve the image suggestions. For instance, if the date of publication is during the winter months this feature could potentially help the system to decide between an image with or without snow.

4. Is an image classifier trained on objects, scenes or a combination of the two the better option for the system?

Article images are widely spread across domains and styles. At the same time, there are a few prominent image datasets available for training image classifiers. Two of these are ImageNet for object classification and Places for scene classification. Which of the two is better suited as a basis for news image classification will be examined. A combination of the two classifiers will also be evaluated.

1.4 Background

Saab is a large Swedish company with 17,000 employees in total.¹ More than 6,000 of these work in Linköping, which makes it the company's largest site. The company's news department and publicists in other departments publish news both internally and externally on a regular basis. Each article is assigned one or more images that are either specifically captured for the article or fetched from Saab's large image database with inadequate metadata tags. To simplify the process of finding suitable images for their articles, Saab now requests a system that automatically suggests images to the publicists.

1.5 Delimitations

No project-specific image classification models will be used in the project. Instead, the system will rely on pre-trained models, which could impair the results. Furthermore, the datasets used will be reduced so that each article has only one assigned image.


2 Theory

This chapter covers the theory related to the different concepts used in the thesis. Since the system needs to handle both text and image data, the chapter begins with a presentation of the important NLP concepts. It then proceeds to the area of computer vision before finally describing how knowledge from the two areas can be utilized together in an image retrieval system.

2.1 Natural Language Processing

NLP is a research area in computer science and linguistics that focuses on analyzing natural language with the help of computers. There are two main techniques used to perform this kind of analysis: syntactic analysis and semantic analysis. Syntactic analysis focuses on the order in which words appear in a sentence and whether that order aligns with grammatical rules or not. Semantic analysis is used to understand the meaning of natural language and represent it in a computationally efficient way, which is considered to be a difficult problem. The research in NLP has advanced with the progress of machine learning algorithms, which have proven to be effective in natural language analysis. In addition, the amount of text data that is now available online has further contributed to the progress.

Nowadays, NLP is used in several modern applications, often to aid people in their everyday life. Applications for language translation, spell and grammar checking and virtual assistance are a few examples that are built upon the technique. So are some information retrieval systems, such as the Google search engine that uses semantic analysis on the input queries in order to improve the search results.

In this thesis, several NLP concepts are employed in order to build representations of article texts. These concepts are described in detail below.

2.1.1 Term Frequency-Inverse Document Frequency

Term Frequency (TF), denoted by tf(t, d), is a measure of how often a term t appears in a document d [2]. This measure could give a sense of the context of a document. However, it considers all terms in the document to be of equal importance. That is, it could give high scores to stop words and words that occur frequently in the document domain, which may not be of any particular interest when the objective is to find words that are characteristic of one



Figure 2.1: Word2vec’s training algorithms. w(t) denotes the current word and the window size is set to 2. Image adapted from [3].

specific document. For this reason, term frequency is seldom used as a measure on its own. Document frequency, denoted by df(t), is defined as the number of documents in a dataset that contain the term t. Furthermore, Inverse Document Frequency (IDF) is defined as

idf(t) = log10(N / df(t)),

where N is the total number of documents in the dataset. TF and IDF are often used together to form the weighting scheme Term Frequency-Inverse Document Frequency (TF-IDF), which measures the importance of a term in a document, given a collection of documents. The weighting scheme combines TF and IDF according to

tf-idf(t, d) = tf(t, d) · log10(N / df(t)).

Terms that appear frequently in few documents are given higher scores, while lower scores are given to terms that appear less frequently, or in more documents.
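As an illustration of the weighting scheme above, the sketch below computes TF-IDF directly from raw counts; the toy documents are made up for the example and are not part of any dataset used in the thesis.

```python
import math
from collections import Counter

# Toy corpus of tokenized "documents", made up for illustration only.
docs = [
    "saab unveils new radar system".split(),
    "new fighter jet presented at air show".split(),
    "radar system installed on fighter jet".split(),
]

N = len(docs)
df = Counter(term for doc in docs for term in set(doc))  # document frequency df(t)

def tf_idf(term, doc):
    tf = doc.count(term)              # raw term frequency tf(t, d)
    idf = math.log10(N / df[term])    # idf(t) = log10(N / df(t))
    return tf * idf

# "radar" appears in two of the three documents and gets a modest weight;
# "unveils" appears in only one and is therefore weighted higher.
print(tf_idf("radar", docs[0]), tf_idf("unveils", docs[0]))
```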

2.1.2 Word Embeddings

Word embeddings are a language modelling technique in NLP where words are mapped to high-dimensional vectors of real numbers. The vectors capture words in their context, which is one explanation for their successful contribution to syntactic and sentiment analysis. There are currently three dominant models for producing word embeddings: Facebook's fastText, Stanford's GloVe and Google's Word2vec. The three models are all unsupervised and more or less interchangeable. In this thesis, Word2vec is used.

Mikolov et al. [3] presented the first work on Word2vec in 2013 and managed to beat the state of the art at the time in similarity tasks. The model is a shallow neural network that utilizes either of two model architectures to produce word embeddings: Continuous Bag-of-Words (CBOW) or continuous skip-gram. In short, CBOW uses a sliding window in order to determine the probability that a word appears in the context of the words in its proximity. The continuous skip-gram model works the other way around: given a word, it predicts probabilities for its adjacent words. The two models are illustrated in Figure 2.1.


Further development by Mikolov et al. [4] added subsampling, negative sampling and support for phrases to improve Word2vec's performance. Subsampling refers to the process of smoothing out imbalances between common and uncommon words in the data. Words are randomly discarded with a probability of

P(w_i) = 1 − √(t / f(w_i)),

where w_i is the current word, t is a sample threshold and f(w_i) is the frequency of the current word. Mikolov et al. conclude that subsampling improves training speed and the representations of uncommon words.

Negative sampling is a technique that is used to further reduce the training time of the network. During traditional neural network training, every weight is updated for each training sample. Since the number of weights in Word2vec scales with the size of the word vocabulary, Word2vec models tend to be very complex. Thus, negative sampling selects only a few¹ negative samples to update for each training sample, instead of updating them all. This drastically reduces the training time without notably affecting the performance of the model.

Finally, support for phrases was added, which means that words that appear frequently together, for example “New York” and “Bill Gates”, get their own embeddings that are not necessarily related to the vectors of the individual words in the phrases.

The Word2vec model allows the user to specify how many dimensions a word should be represented by. Intuitively, high-dimensional vectors contain more information than their low-dimensional counterparts. At the same time, they also require more training data in order to take advantage of the additional dimensions. In [3] it was shown that the accuracy in their semantic and syntactic experiments increased with the vector size. Their models were trained on the Google News dataset that contains about six billion tokens and they found that a dimensionality of 640 performed well in their experiments. However, it is concluded in [4] that optimal hyperparameters are task-specific. Moreover, vector size, subsampling rate and window size are pointed out as the most crucial hyperparameters to tune.

2.1.3 Combination of Word Vectors

Representations of phrases and longer pieces of text are also of great interest in NLP. As mentioned in Chapter 1, BoW is a common technique that serves this purpose. Such representations can also be composed from word vectors in surprisingly simple ways. Each vector that corresponds to a word in a text can be either summed or averaged element-wise into a single representation. That is, averaged word embeddings are the same as summed word embeddings except that they are divided by the number of summed words. Wieting et al. [5] propose an approach along these lines that relies on data from the Paraphrase pairs dataset² for additional knowledge. It was shown that this simple method, quite surprisingly, managed to outperform more sophisticated neural network models on out-of-domain data in similarity and entailment tasks. Arora et al. [6] managed to further improve the performance by modifying these embeddings with principal component analysis and singular value decomposition. The result is presented as "a simple but tough-to-beat baseline for sentence embeddings".

Each word in Word2vec is treated equally. Since all words in a piece of text are not of equal importance for its meaning and context, weighting techniques can be applied to the embeddings. One popular weighting technique in BoW is TF-IDF weighting. That is, instead of the raw counts of words in a text, the representation contains the TF-IDF weight of each word. The same idea can be applied to averaged word embeddings, where each embedding is multiplied by its weight. The method was successful in [6] and [7], with increased performance in some experiments compared to unweighted averaged Word2vec embeddings.

1 In [4], 5-20 words for smaller datasets and 2-5 words for large datasets are suggested.
2 http://paraphrase.org/. Last accessed February 14, 2020.


Neither of these methods takes word order into account, which means that sentences like "John likes Mary" and "Mary likes John" will have the same representation although they have different meanings.

2.1.4 Paragraph Embeddings

An extension to Word2vec was proposed by Le and Mikolov [8] in 2014 and later popularised by Gensim [9] as Doc2vec. Doc2vec is an unsupervised model that finds vector representations of sentences, paragraphs and longer pieces of texts, which in [8] are called paragraph embeddings. Doc2vec captures both word ordering and semantics, which was probably a key to its state-of-the-art performance in paragraph classification tasks at the time of its release. This was one of the motivations behind why the model was invented – it was supposed to avoid the disadvantages of the sparse BoW representations that do not take word order into account [8]. Like Word2vec, Doc2vec utilizes one of two model architectures: Distributed Bag-of-Words (PV-DBoW) and Distributed Memory (PV-DM). PV-DBoW is similar to Word2vec’s skip-gram algorithm and PV-DM is similar to CBOW. It is concluded in [8] that PV-DM gives consistently better results than PV-DBoW, but also that a concatenation of the output from the two models is recommended.

A thorough literature study has shown that Gomez et al. [7] is one of the few that have used Doc2vec embeddings in image retrieval. The method was unsuccessful in their experiments, which could be explained by their use of very short text queries.

2.2 Computer Vision

Computer vision is a field of research that focuses on computers' understanding of images and videos. One major part of computer vision is to analyse visual media, which is done by, for instance, image classification, object detection and motion tracking. The articles used in this thesis contain still images that will be analysed with image classification.

2.2.1 Image Classification

One primary process in computer vision is to classify the contents of an image, similarly to how images are interpreted by humans. This could, however, be a difficult task for a computer. The concept of the previously mentioned BoW representation for NLP could actually come in handy in image classification as well. In order to interpret an image, the computer could break it down into visual features of low and high levels. Low-level features such as dots, lines and edges build high-level features such as objects. The contents of an image could be represented as a Bag-of-Visual-Words (BoVW) where the extracted features from a set of images are stored in a vocabulary. Then each image is represented as a histogram of the features it contains. Such a histogram could be used to classify an image. Nowadays BoVW are often outperformed by deep Convolutional Neural Networks (CNNs) in image classification tasks.

2.2.2 Convolutional Neural Networks

CNNs are a kind of deep neural network commonly used in computer vision. Since images can be of high resolution, they typically need to be scaled down to a computationally efficient size. This is handled in CNNs with convolution and pooling layers. In the convolution layers, the input is convolved with a kernel in order to produce features. Pooling layers downsample the input by partitioning adjacent pixels, typically 2 × 2 pixels, and outputting the max or average of each partition. A CNN typically consists of a number of convolution and pooling layer pairs followed by fully connected layers. The fully connected layers flatten the output from the previous layers and give the probabilities for each class defined in the dataset. A typical CNN architecture is illustrated in Figure 2.2.



Figure 2.2: A typical CNN architecture.


Figure 2.3: Identity block with shortcut connection.

2.2.3 ResNet

Residual Network (ResNet) is a CNN architecture that was presented by Microsoft in 2015 [10] and has since then become established in image classification. It was the state-of-the-art architecture in computer vision in the year of its release and is still one of the most well-performing architectures.

A common problem with deep networks is that they easily suffer from vanishing gradients. The ResNet architecture is constructed to handle this problem by utilizing shortcut connections (see Figure 2.3). The input, x, to a set of layers is added via the shortcut connection to the output, F(x), of those layers before going through the activation function. By doing so, the risk of getting small gradients is reduced. Since the connections perform identity mapping, neither additional parameters nor computational complexity are added to the model [10].

ResNet is typically constructed in versions with 34, 50, 101 or 152 layers. The architecture of ResNet with 50 layers (ResNet-50) is illustrated in Figure 2.4. Stages 2 to 5 consist of a convolutional block and multiple identity (ID) blocks. ID blocks (see Figure 2.3) are the standard building blocks in ResNet and consist of a main path and a shortcut path. Here, the input and output share dimensions, which is what distinguishes ID blocks from convolutional blocks. They share the same main path, but convolutional blocks have a convolutional and batch normalization layer in the shortcut in order to handle the dimension difference.
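A minimal PyTorch sketch of such an identity block is given below. It uses two 3x3 convolutions for brevity; the actual ResNet-50 blocks are three-convolution bottleneck blocks, so the exact layer layout here is a simplification.

```python
import torch
import torch.nn as nn

class IdentityBlock(nn.Module):
    """Simplified residual block: Conv-BN-ReLU-Conv-BN plus an identity shortcut."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + x          # shortcut connection: add the unchanged input
        return self.relu(out)  # activation is applied after the addition

block = IdentityBlock(channels=64)
print(block(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```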

Some of the aforementioned ResNet architectures are, sometimes with minor differences, available with pre-trained weights in machine learning libraries such as PyTorch [11] and Tensorflow [12]. These models are typically trained on the ImageNet dataset with 1000 classes.

2.3 Transfer Learning

Transfer learning is a method that takes knowledge from one domain and transfers it into a similar target domain. As an example, a model that is trained to detect cars in daylight could



Figure 2.4: The ResNet-50 architecture.

be used as the basis for a model that detects cars in the dark. This method is well-suited for model training in a domain with little data. In some cases, more general models could also improve models designed for specific tasks. Transfer learning could potentially increase a model’s performance in three different ways. The initial performance using transferred knowledge should be better than the performance of an untrained model. The training time needed to learn a target task is reduced when using the transferred knowledge. Lastly, the overall performance of a transfer-learned model is usually better.

The images that will be analysed in this thesis are not annotated with any tags that could be used to train an image classifier in a supervised manner. This is a case where transfer learning comes in handy.

2.4 Multimodal Machine Learning

A modality is a way of interpreting surroundings, which is most commonly done by our senses [13]. Thus, sight, hearing and touch are all modalities. It has been shown that these modalities can also be interpreted by computers. Previously, machine learning research has primarily been unimodal, which means that it has focused on one modality at a time. More recent research has been extended into what is called multimodal machine learning, where multiple modalities are interpreted in parallel. Common tasks for multimodal machine learning-based systems are to relate, for example, audio and image, text and audio, or text and image. What seems to be a simple task for humans has proven to be complex for computers. Baltrusaitis et al. have compiled five core technical challenges in multimodal machine learning [13].

1. Representation - The modalities are often represented in different formats – text by symbols and audio, images and video by signals. Representing these modalities efficiently can be challenging.

2. Translation - The different modalities need to be translated between each other. How this mapping should be performed is not trivial; several different methods have been presented in the literature [14], [15], [16]. The relationship between the modalities is often subjective, which further aggravates the translation. In this thesis, the aim is to translate article texts into images.

3. Alignment - The direct relations between elements within modalities are found with alignment. For example, different parts of a text may refer to different parts of an image.

4. Fusion - In some multimodal machine learning tasks, there may be a need to fuse data from different modalities in order to perform prediction. These modalities may be of different significance for the predictions and the data can be heterogeneous, which makes the fusion more challenging.

5. Co-learning - Sometimes, especially when there is little annotated data in a modality, it can be useful to transfer knowledge between modalities. Co-learning can be used to help a model trained on one modality by using knowledge from another modality. However, this transfer of knowledge may not be trivial.


2.4.1 Multimedia Retrieval

Multimedia retrieval (also known as cross-modal retrieval) is one use case in multimodal machine learning. This research topic can face four out of the five challenges listed above, leaving only fusion behind. In image retrieval, which is sometimes referred to as text illustration, images are typically retrieved from text queries. Here, there is a challenge in how to represent the modalities in a way that utilizes their complementarity. When the two representations have been set, they need to be translated into the same kind of representation in order to be comparable. The translation can be both example-based and generative, where the former retrieves the best translation from a dictionary of translations, for instance text-image pairs, while the latter can generate a suitable translation that might not already exist in the dictionary. The next step is to align the translations in such a way that words or sentences in a text are related to certain parts of an image. To improve the text and image representations, non-parallel data can be used to train the modality representations separately. This is a kind of co-learning [13] that is utilized when using pre-trained text and image models.

In this thesis, the example-based approach is considered because existing images are supposed to be suggested. Several different takes on this kind of image retrieval system have been proposed in recent years. Socher et al. [14] proposed a retrieval system for sentences and images. They used a recurrent neural network that learned embeddings from the sentences' dependency trees and the top layer of an autoencoder model to generate image embeddings. The multimodal representations were trained and evaluated on a dataset of images with five descriptive sentences each. A max-margin objective function was used to train correct image-sentence vector pairs to have high inner products and incorrect pairs to have low inner products. The vectors were mapped into a new multimodal embedding space with their corresponding images. This space was then able to relate text to images and vice versa. The model managed to outperform baselines and other models built for the same purpose. Later the same year, a similar system was proposed by Karpathy et al. [15]. This system embedded fragments of images and sentences into the same space instead of the aforementioned global representations of sentences and images. The extension managed to further improve the results.

The max-margin objective can also be replaced by a CCA objective that strives to find a joint latent vector space in which the correlation between the gold-standard text-image pairs is maximal. CCA models are often used as strong baselines in multimodal retrieval research. Regular CCA and an extension to the method are described in Section 2.4.2.

Relation to Image Captioning

Another popular task in multimodal machine learning is image captioning, which aims for describing images with text. The task can be seen as the inverse of image retrieval as images are used as queries and textual descriptions as results. In the early days of image captioning, example-based models were used to find appropriate textual descriptions of images [17], [18]. The method is similar to the one used in image retrieval, where a dictionary is used to translate between the modalities. Some of the example-based solutions are bidirectional, meaning that they work for both captioning and retrieval. However, the image captioning research has now progressed to generative models with the advent of deep learning. In the case of image captioning this implies, ideally, that descriptive and grammatically correct captions could be generated from an image query [19], [20], [21]. This adds extra complexity to the problem compared to the example-based methods.

2.4.2 Canonical Correlation Analysis

Correlation is a concept in statistics that is used to measure how strongly a pair of variables are linearly related. Correlation is often measured by the Pearson correlation coefficient, which ranges from 1 to −1, where the former implies a perfect positive correlation and the latter


implies a perfect negative correlation (see Figure 2.5). A correlation coefficient of 0 means that there seems to be no correlation between the variables. When applied to a population, the correlation coefficient ρ_{X,Y} between two random variables X and Y is calculated by

ρ_{X,Y} = cov(X, Y) / (σ_X σ_Y),    (2.1)

where cov is the covariance and σ_X and σ_Y are the standard deviations of X and Y, respectively. For a series of n measurements {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} (a statistical sample), this coefficient can be estimated by the sample correlation coefficient according to

r_{xy} = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / √( Σ_{i=1}^{n} (x_i − x̄)² · Σ_{i=1}^{n} (y_i − ȳ)² ),    (2.2)

where x̄ and ȳ are the sample means.

It is important to note that correlation does not imply causation. Thus, it is not certain that, in correlated data, one thing causes the other. Correlation could equally well depend on a third mutual factor, or be a sheer coincidence. A classic example is that in New York, the number of ice creams sold has been shown to be positively correlated with the number of homicides committed in the city. Since the discovery, studies have clarified that it is a third factor, the increasing temperature in the city, that causes the correlation.

Figure 2.5: Different types of correlation: perfect positive, positive, none, negative and perfect negative correlation.

Correlation is the basis for CCA, which is a statistical method and optimization problem for finding linear relationships between multidimensional datasets. In the current context, the article texts and images are seen as two different datasets. In short, data from the two sets are transformed into a joint vector space in which their correlation is maximal. CCA was proposed by Hotelling [22] back in 1936 and is still relevant for certain types of statistical problems. Say the observations to analyse are stored in two data matrices X ∈ R^{n×p} and Y ∈ R^{n×q}, where n is the number of samples and p and q are the dimensions of the respective datasets. The goal in CCA is to find weight vectors a ∈ R^p and b ∈ R^q that transform the data into the same vector space (see Figure 2.6). The process is performed iteratively by first finding the canonical components u_1 = a_1^T X and v_1 = b_1^T Y that maximize ρ (see Equation 2.1), then u_2 = a_2^T X and v_2 = b_2^T Y that maximize ρ while being uncorrelated with u_1 and v_1, and so on. The number of canonical components that can be found is limited by the dimensionality of the datasets according to min(p, q). However, the more canonical components are kept, the more likely it is that the solution is overfitted [23]. Overfitting may also be avoided by regularization, which can be enabled in the objective function.

CCA could be used to find correlation between data from different modalities, which makes it suitable for multimedia retrieval. There is also no need for the modalities to have the same dimensionality, which is often the case when handling different kinds of data. Furthermore, it finds bidirectional relationships, which, for example, make an image retrieval system easily extendable into an image description system and vice versa.
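Scikit-learn's CCA implementation gives a feel for the procedure; in the sketch below, random matrices stand in for the text matrix X and the image matrix Y, and the dimensions are chosen arbitrarily.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Random data standing in for text vectors X (n x p) and image vectors Y (n x q).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))   # e.g. 50-dimensional text representations
Y = rng.normal(size=(200, 30))   # e.g. 30-dimensional image representations

cca = CCA(n_components=10)       # number of canonical components, <= min(p, q)
U, V = cca.fit_transform(X, Y)   # projections of X and Y into the joint space

# Correlation of the first pair of canonical components.
print(np.corrcoef(U[:, 0], V[:, 0])[0, 1])
```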


Figure 2.6: CCA. The input data, X and Y , are transferred by a and b into U and V that share the same vector space.

This kind of CCA is limited in the way that it can only describe linear relations between the modalities, which in many cases is not enough. Thus, CCA has been further improved by different techniques such as kernelization.

2.4.3 Kernelized Canonical Correlation Analysis

KCCA uses a method called the kernel trick to extend linear CCA to a general nonlinear setting [24]. The kernel trick first gained attention due to its high performance in support vector machines and has since been adapted to several linear machine learning techniques. In short, a kernel is a similarity function between data points that can take low-dimensional data that is not linearly separable into a high-dimensional space. By doing so, high-dimensional relationships can be found without actually transforming the data; hence it is called a "trick". Accordingly, KCCA takes the data X and Y and calculates their new representations ϕ_X(X) and ϕ_Y(Y), respectively (see Figure 2.7). Their relationship can then be calculated with CCA as described in Section 2.4.2. Three of the most common kernel functions are the linear, Radial Basis Function (RBF) and polynomial kernels.

• The linear kernel is a simple kernel given by the inner product between points in the data plus an optional constant c. As opposed to the other kernels, the linear kernel reduces the dimensions in the input data. Because of this, linear kernels tend to perform very well in cases where the number of features is large compared to the number of samples. This kernel is by far the fastest of the three.

K(x, x′) = ⟨x, x′⟩ + c

• The polynomial kernel is a linear kernel raised to the power of d. This kernel is rather popular in NLP, often with d = 2 since larger degrees tend to overfit.

K(x, x′) = (⟨x, x′⟩ + c)^d

• The Gaussian RBF kernel is one of the most popular kernels in machine learning. It projects the data into an infinite-dimensional space. σ is a free parameter that needs to be tuned to the problem at hand. An overestimation of σ can lead to an almost linear behaviour of the kernel, while an underestimation can make it sensitive to noise.

K(x, x′) = exp(−‖x − x′‖² / (2σ²))
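For reference, the three kernels can be written out directly; the NumPy sketch below evaluates them for a single pair of points (c, d and σ are free parameters to tune).

```python
import numpy as np

def linear_kernel(x, y, c=0.0):
    return np.dot(x, y) + c                      # <x, y> + c

def polynomial_kernel(x, y, c=1.0, d=2):
    return (np.dot(x, y) + c) ** d               # (<x, y> + c)^d

def rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

x, y = np.array([1.0, 2.0, 3.0]), np.array([0.5, 1.5, 2.5])
print(linear_kernel(x, y), polynomial_kernel(x, y), rbf_kernel(x, y))
```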


Figure 2.7: KCCA. The input data, X and Y, are kernelized into ϕ_X(X) and ϕ_Y(Y). The new representations are then transferred by a and b into U and V that share the same vector space.

KCCA has been shown to give good results in image retrieval [14], [25], [26], [27]. The major drawback with KCCA is the training time. It is a nonparametric method and thus does not scale well with the size of the training set. The kernel matrices tend to become very large for large datasets, more precisely n × n dimensions, where n denotes the number of samples. For large datasets, this might also lead to memory-related problems.


3 Method

The pre-study with related work and theory that was presented in Chapter 2 eventually led to the implementation of a custom image retrieval system, specifically designed to answer the research questions. In this chapter, the method for implementing that system is presented, along with other information needed for the evaluation.

3.1 Candidate Datasets

Multimodal data is needed to learn the similarity between two modalities (in this case texts and images). The pre-study has shown that there are a few datasets of this kind available, but datasets with news articles and associated images are uncommon. All datasets relevant to the thesis are described in this section, divided into categories for image, text and multimodal data.

3.1.1 Image Datasets

There are multiple resources distributed online that aim for increased performance in image classification tasks. Some of these are large and general while others are specialized in a certain topic. Since news articles are the main focus of this work, the selected image datasets primarily contain photos of objects and scenes.

ImageNet

ImageNet is a research project that was first introduced in 2009 [28], which consists of a large collection of images organized according to the WordNet hierarchy, where synonyms are grouped into synonym sets. The aim with ImageNet at the time of its release was to have an average of 500 to 1000 images for each synonym set [28]. To increase the reliability of the dataset, each image is quality controlled and human-annotated. The authors state that their goal with ImageNet is that it should contain around 50 million annotated high-resolution images. Because of its size and high quality, ImageNet has come to be one of the most popular datasets for image classification model training.


Places

In 2017, Zhou et al. [29] first presented the Places dataset for scene classification. The latest version of the dataset is called Places365 since it has a total of 365 classes with more than 5000 images each. This makes Places the largest publicly available scene-centric dataset to date. The categories are widely spread, spanning from “aquarium” to “fire station” and “sushi bar”. Places has helped to achieve state-of-the-art results in several scene classification tasks [29]. Furthermore, Zhou et al. have provided CNN models pre-trained on Places as well as Places and ImageNet together.1

3.1.2 Text Dataset

Language models require text data in order to learn the semantics of words and texts. There is a myriad of text data available online which has contributed to increased performance of these models. In this thesis, only one pure text dataset was used.

Google News

Google provides Word2vec vectors trained on the Google News dataset, which is a large internal Google dataset of news articles [4]. Since the dataset is large (three million words and phrases²), it is a helpful aid for creating very rich word embeddings. The fact that the data is in the news domain is fortunate given the system being implemented.

3.1.3 Multimodal Datasets

To learn how to map texts to images, there is also a need for multimodal datasets with texts and related images. Even though text-image pairs appear everywhere on the internet there are seemingly few distributed datasets available online with texts longer than a sentence. The current system was evaluated on two multimodal datasets.

SaabNews

For this work, Saab provided their internal news archive, which in this thesis is referred to as SaabNews. All 6,455 articles consist of a title, running text, date of publication and an image. The average lengths of the titles and running texts are 6 and 330 words, respectively. The articles' contents are varied in subject and quality, and the texts are written by several authors from different departments. The images illustrate a mixture of Saab products, person portraits, events, etc. The texts contain many Saab-specific words, for instance product names and internal terms, that probably complicate the suggestion process when using pre-trained word vectors. The dataset was randomly split into training, validation and test sets of 60%, 20% and 20%, respectively.

BreakingNews

The BreakingNews dataset [30] was collected by Ramisa et al. in 2017 with the aim to improve the results in text-image tasks. Apart from text and images, each article also contains additional meta-data such as GPS coordinates and popularity metrics. The dataset consists of approximately 100,000 articles with an average length of 630 words, published by highly ranked news media agencies between January 1 and December 31, 2014. As mentioned in [30], image retrieval using BreakingNews is much more complex than with the more commonly used datasets for image retrieval, such as MS COCO [31]. The texts are less tailored to the corresponding images and the vocabulary is much richer. Furthermore, 46% of the images in

1 https://github.com/CSAILVision/places365. Last accessed April 25, 2020.
2 https://code.google.com/archive/p/word2vec/. Last accessed April 24, 2020.


BreakingNews contain faces [30]. Splits to divide the data into training, validation and test sets of 60%, 20% and 20%, respectively, are available on the dataset website.3

3.2 Implementation

Every step in the implementation, from dataset to suggestion, is described in this section.

3.2.1 Development and Replicability

The development, training and evaluation were performed in Windows 10 on a desktop computer with an AMD Ryzen 7 3700X CPU, an NVIDIA GeForce RTX 2080 SUPER GPU and 16 GB RAM. NVIDIA CUDA 10.2 was used for calculations on the GPU. Furthermore, all code was written in Python 3.7.6, and the modules PyTorch 1.4.0 [11], Gensim 3.8.1 [9], NLTK 3.4.5 [32] and Scikit-learn 0.22.2 [33] were used for the work related to machine learning.

3.2.2 System Overview

The system pipeline is illustrated in Figure 3.1. Texts and their associated images are fed to a text embedding model and an image classification model, respectively. At this stage, the text embedding model and the image classification models are already trained. The new representations of each text and each image are saved in two matrices, say X and Y, where X_i and Y_i belong to article i. The matrices are then sent to the CCA module, which in turn finds the weight vectors a and b (see Section 2.4.2). With the weight vectors available, the three system modules have been trained and image retrieval can be performed. A text is translated by the text embedding model and the output is multiplied by a^T in order to be transferred into the joint space. The same goes for all available images through multiplication by b^T. The images that are closest to the input text in the joint space are retrieved and suggested to the user.

3.2.3 Image Representations

The image representations are derived using pre-trained CNNs, of which there are plenty available online. In this system, PyTorch's ResNet-50 model, trained on ImageNet data, is used. ResNet-50 is one of the most well-performing models in terms of error rate and complexity⁴. In addition, a second ResNet-50 model is used, this one pre-trained on the Places dataset. This model is available for download on Places365's GitHub page⁵. Another reason for choosing ResNet-50 was that it is one of few architectures available online that have been pre-trained on both ImageNet and Places. The two models should complement each other, since the ImageNet model is primarily trained to classify images of objects while the Places model is trained to classify landscapes and scenes. ImageNet and Places consist of 1000 and 365 classes, respectively.

The classification models are trained to output the pre-defined class that corresponds to each image. However, the actual classes are not of particular interest in the system at hand, but rather a vector representation of features at a slightly lower level. Feature extraction from a pre-trained network is commonly performed by removing the fully connected layers that reduce the dimensionality at the end of the network [34, p. 143]. Thus, the output from the last average pooling layer of 2048 dimensions in the ResNet model (see Figure 2.4) is used to represent an image. All experiments are run using the ImageNet and Places models separately, in order to conclude whether object or scene descriptors are more relevant for this specific task. Furthermore, the output from the two models is also concatenated into a single representation, similarly to what is done in [35], which results in an image feature vector of 4096 dimensions.
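A sketch of this feature extraction with torchvision's pre-trained ImageNet model is shown below. The Places model is used analogously, but it is loaded from the checkpoint distributed on the Places365 GitHub page, which is not shown here.

```python
import torch
import torchvision.models as models

# Pre-trained ResNet-50 with ImageNet weights. Dropping the final fully
# connected layer leaves the 2048-dimensional output of the last average
# pooling layer as the image representation.
resnet = models.resnet50(pretrained=True)
feature_extractor = torch.nn.Sequential(*list(resnet.children())[:-1])
feature_extractor.eval()

with torch.no_grad():
    image_batch = torch.randn(1, 3, 224, 224)   # stands in for a preprocessed image
    features = feature_extractor(image_batch)   # shape (1, 2048, 1, 1)
    features = features.flatten(start_dim=1)    # shape (1, 2048)

# ImageNet and Places features can then be concatenated into a single
# 4096-dimensional vector, e.g. torch.cat([imagenet_feats, places_feats], dim=1).
print(features.shape)
```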

3 http://www.iri.upc.edu/people/aramisa/BreakingNews/. Last accessed March 24, 2020.
4 https://pytorch.org/docs/stable/torchvision/models.html. Last accessed April 28, 2020.
5 https://github.com/CSAILVision/places365. Last accessed April 25, 2020.



Figure 3.1: An overview of the system pipeline. Texts and images are translated into vector format by a text embedding model and an image classification model. The vectors are then transferred into a joint vector space where related texts and images are placed close to each other.

In order to improve the classifications on the in-domain images, the models could be fine-tuned with such images. However, [35] did not notice any significant improvements by doing so. Moreover, the images in SaabNews do not have any classes assigned to them, which would be needed to perform fine-tuning. For this reason, it was omitted in this thesis.

Preprocessing

Before the images are fed into the classification model they need to be transformed into floating-point tensors of a fixed size that is computationally feasible. The following preprocessing steps are used, which are common for networks trained on ImageNet.

1. Resize the image proportionally so that the smallest of width and height is scaled to 256 pixels.

2. Crop the image at the center to 224 x 224 pixels.

3. Normalize the tensor image with a per-channel mean and standard deviation: the mean values (0.485, 0.456, 0.406) are subtracted from the RGB channels, which are then divided by the standard deviations (0.229, 0.224, 0.225), respectively.
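In torchvision, these three steps correspond to a standard transform pipeline; the sketch below assumes PIL images as input.

```python
from torchvision import transforms

# Standard ImageNet preprocessing matching the steps above.
preprocess = transforms.Compose([
    transforms.Resize(256),        # 1. scale the smaller side to 256 pixels
    transforms.CenterCrop(224),    # 2. crop the center 224 x 224 region
    transforms.ToTensor(),         # convert to a floating-point tensor in [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),  # 3. normalize per channel
])

# Usage: tensor = preprocess(pil_image).unsqueeze(0)  # add a batch dimension
```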

3.2.4 Text Representations

A common way to represent word meanings mathematically is by creating embeddings for each individual word in a dataset. There are also methods that aim to capture meanings of phrases, sentences and even longer texts. This section describes how the embeddings are retrieved in the current system.


Combined Word Embeddings

Word2vec is utilized for generating word embeddings using the implementation in Gensim. Models are trained with the CBOW algorithm, 5 negative samples, vector sizes 100, 300, 500, 700 and 900, and window sizes 5 and 10 to find the best option for the system. Although the subsampling rate is pointed out as an important parameter to tune (see Section 2.1.2), it is left at Gensim's default value of 0.001 due to time constraints. The rest of the Word2vec parameters are also left at their default values.
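One of these Word2vec configurations might be set up as in the sketch below; the two-sentence corpus is made up for illustration, and min_count is lowered only because the toy corpus is tiny (keyword names follow Gensim 3.x, where the dimensionality argument is called size).

```python
from gensim.models import Word2Vec

# Tokenized, preprocessed article texts; these two are made up for illustration.
corpus = [
    ["saab", "unveils", "new", "radar", "system"],
    ["new", "fighter", "jet", "presented"],
]

model = Word2Vec(
    sentences=corpus,
    sg=0,          # 0 = CBOW training algorithm
    negative=5,    # 5 negative samples
    size=300,      # one of the evaluated vector sizes (vector_size in Gensim 4.x)
    window=5,      # one of the evaluated window sizes (5 or 10)
    sample=0.001,  # Gensim's default subsampling rate
    min_count=1,   # lowered here only because the toy corpus is tiny
)
print(model.wv["radar"].shape)  # (300,)
```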

Recall that Word2vec only builds vectors for individual words and not for whole texts. Because of this, the word embeddings are averaged into a single representation. Averaging is chosen over summation since it better handles texts of different lengths. A slight modification of the averaged Word2vec model is also evaluated. To favor important words, the word vectors are weighted by their IDF scores according to

v_t = (1 / |t|) Σ_{w∈t} IDF_w · v_w,

where t is a text, w is a word and IDF_w is the IDF score of w. This representation will further be denoted Word2vec IDF. The weighting scheme is the same as weighting all unique words in a text by their TF-IDF scores. TF-IDF weighted word vectors have previously been used in the same way in [6], and in [7] on a keyword level. Scikit-learn's TfidfVectorizer class with default settings is used to calculate the IDF weight of each word in the dataset.
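A sketch of the Word2vec IDF representation defined above is given below; model is assumed to be a trained Gensim Word2vec model, and idf_weights a dictionary mapping each word to its IDF score (for instance taken from TfidfVectorizer's idf_ attribute).

```python
import numpy as np

def average_embedding(tokens, model, idf_weights=None):
    """Average the word vectors of a text, optionally weighted by IDF scores."""
    weighted_vectors = []
    for word in tokens:
        if word in model.wv:                       # skip out-of-vocabulary words
            weight = idf_weights.get(word, 1.0) if idf_weights else 1.0
            weighted_vectors.append(weight * model.wv[word])
    if not weighted_vectors:
        return np.zeros(model.wv.vector_size)
    return np.mean(weighted_vectors, axis=0)       # divide by the number of words |t|
```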

Paragraph Embeddings

In [8], experiments in both sentiment analysis and information retrieval were performed in order to understand the behaviour of paragraph embeddings. In the experiments, three differently sized datasets were used, including the IMDB dataset by Maas et al. [36], which consists of 100,000 movie reviews with an average length of 230 words. Moreover, a follow-up paper by Dai et al. [37] performs experiments on document similarity with Wikipedia and arXiv datasets consisting of 4,490,000 and 886,000 articles, respectively. Paragraph vectors perform as well as or better than baselines such as BoW in all experiments in [8] and [37], which suggests that Doc2vec can produce high-quality vectors over a large range of dataset sizes.

Despite a thorough web search, no up-to-date pre-trained Doc2vec model was found. For this reason, the Doc2vec models used in this thesis are trained on the domain data. Since the BreakingNews dataset is of the same order of magnitude as the previously mentioned IMDB dataset, this is considered a sufficient amount of data to produce rich paragraph vectors. Gensim's Doc2vec implementation is used to generate the paragraph vectors. The training algorithm is set to PV-DM, the threshold for minimal word occurrences is set to 5 and both window sizes 5 and 10 are tested. The training is run for 20 epochs on BreakingNews and 40 epochs on SaabNews, to compensate for the latter's smaller size. As with Word2vec, separate models are trained for vector sizes 100, 300, 500, 700 and 900. The other parameters are left as Gensim's default values.
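
A minimal sketch of this Doc2vec set-up with Gensim is given below; the variable corpus (a list of tokenized articles) and the chosen vector size are illustrative assumptions.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

documents = [TaggedDocument(words=tokens, tags=[i])
             for i, tokens in enumerate(corpus)]

model = Doc2Vec(documents,
                dm=1,             # PV-DM training algorithm
                vector_size=300,  # one of 100, 300, 500, 700, 900
                window=10,        # window sizes 5 and 10 are tested
                min_count=5,      # threshold for minimal word occurrences
                epochs=20)        # 40 epochs on the smaller dataset

article_vector = model.infer_vector(corpus[0])   # embed an article text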

Preprocessing

Before the article texts are fed to the embedding models, they go through a few preprocessing steps. Different NLP models prefer different preprocessing methods, but general practices for text preprocessing are lowercasing, tokenization and stop word removal. Lowercasing is performed so that words are treated equally, regardless of capitalization. Tokenization translates a string representation into an array where all words and punctuation marks are separated (see Figure 3.2). NLTK's word tokenizer is used for the task in the current system. In sentiment analysis, stop words might dilute the context of a text and thus decrease an algorithm's effective window size. Hence, these words – in this case from NLTK's stop word list – are removed.


'Hi, this is a text.'  --tokenization-->  ['Hi', ',', 'this', 'is', 'a', 'text', '.']

Figure 3.2: An example tokenization.

Finally, words that are not recognized by the embedding model are removed, since they do not have any proper representations in the vector space. This can occur for words that did not reach the threshold for the minimal number of word occurrences set in the models.

Mikolov et al. do not mention any preprocessing for Word2vec in [3]. Likewise, Le and Mikolov do not seem to preprocess their data before performing their experiments with Doc2vec [8]. Lau and Baldwin [38] use minimal preprocessing – only tokenization and lowercasing – in their Doc2vec model with satisfactory results. The preprocessing steps used in this thesis are listed below, followed by a short code sketch.

1. Lowercase all words in the text.

2. Tokenize the text.

3. Remove stop words.

4. Remove words that are not in the model vocabulary.
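
The four steps can be expressed compactly as follows, assuming the NLTK tokenizer and stop word data have been downloaded and that vocabulary is the set of words known to the embedding model; the function name is illustrative.

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))

def preprocess(text, vocabulary):
    tokens = word_tokenize(text.lower())                  # steps 1 and 2
    tokens = [t for t in tokens if t not in stop_words]   # step 3
    return [t for t in tokens if t in vocabulary]         # step 4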

3.2.5 Date Representations

In an attempt to determine if the date of publication matters for the selected image, experiments are performed where the text representation is extended with a date representation. It is assumed that the date only has an impact on a monthly level, rather than on a daily one. Thus, only the month is described in the new representation. As illustrated in Figure 3.3, the month is represented by a one-hot encoded vector. In attempts at improving the representation further, it is also extended to include the month number as well as the quarter in which the month appears.

Figure 3.3: The date representations: a one-hot encoded month, the month number and a one-hot encoded quarter (shown here for May, i.e. month 5 in the second quarter).
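
A small sketch of how such a date feature vector could be constructed is given below; the function name and the exact concatenation order are assumptions made for the example.

import numpy as np

def date_features(month):
    # month is given as a number from 1 (January) to 12 (December).
    one_hot_month = np.zeros(12)
    one_hot_month[month - 1] = 1.0
    one_hot_quarter = np.zeros(4)
    one_hot_quarter[(month - 1) // 3] = 1.0
    return np.concatenate([one_hot_month, [month], one_hot_quarter])

# date_features(5) -> one-hot for May, the value 5 and the one-hot for Q2.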

3.2.6 Translation and Alignment

When both the text and image models have been trained, their outputs should be transformed so that it is possible to compare the two modalities. This translation is usually the key to multimodal retrieval. When approaching the retrieval problem with CCA, text embedding models and image classification models can be trained independently, without the influence of translation and alignment. The Python module Pyrcca [23] features implementations of both CCA and KCCA with regularization. The methods are used to find the weights used for transformation into the joint vector space. The only hyperparameters that are set manually in the models are the number of canonical components and the regularization parameter. When training the KCCA model on BreakingNews, one of the problems stated in Section 2.4.3 is encountered – the prospective kernel matrix becomes too large for the computer memory. This is handled by reducing the dataset to the size of SaabNews, which works fine. Pyrcca's KCCA implementation supports three kernels, linear, Gaussian and polynomial, which are all tested. Since the CCA methods prefer the input data to be of zero mean, the representations (X and Y in Section 2.4.2) are processed accordingly before both training and retrieval.

Table 3.1: Example evaluation of the system where a text and image from the same pair share id. MRR = 1/3 (1/3 + 1/4 + 1/1) = 19/36 ≈ 0.53 according to Equation 3.1.

Text query id   Suggested image ids   Rank   Reciprocal rank
24              1, 67, 24, 33         3      1/3
6               235, 9, 86, 6         4      1/4
73              73, 102, 244, 4       1      1/1
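
The Pyrcca usage described in this section can be sketched as follows; the arrays X and Y (mean-centred text and image representations), the number of canonical components and the regularization value are illustrative placeholders rather than the final tuned settings.

import rcca

Xc = X - X.mean(axis=0)    # centre the text representations
Yc = Y - Y.mean(axis=0)    # centre the image representations

# Linear CCA; for KCCA, set kernelcca=True and choose a kernel via ktype.
cca = rcca.CCA(kernelcca=False, reg=0.1, numCC=100)
cca.train([Xc, Yc])

# Project both modalities into the joint space with the learned weights.
text_projection = Xc @ cca.ws[0]
image_projection = Yc @ cca.ws[1]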

Distance Measure

When the two modalities are represented in the same space, their similarity can be measured. This is done with cosine similarity, which measures the angle between two non-zero vectors. The cosine similarity between two vectors with the same orientation is 1, which follows from cos(0) = 1. Similarly, the cosine similarity between two vectors with opposite orientations is −1. Cosine distance is the complement of cosine similarity according to D_c(x, y) = 1 − S_c(x, y), where D_c and S_c denote the cosine distance and cosine similarity between the vectors x and y, respectively. Cosine similarity is defined as

S_c(x, y) = \cos(\theta) = \frac{x \cdot y}{\|x\| \, \|y\|} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \sqrt{\sum_{i=1}^{n} y_i^2}}.
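
Expressed in code, the two measures reduce to a few lines of NumPy; the function names are illustrative.

import numpy as np

def cosine_similarity(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def cosine_distance(x, y):
    return 1.0 - cosine_similarity(x, y)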

3.3 Evaluation

SaabNews and BreakingNews are both used for evaluation, where the former represents a niche type of company news while the latter represents newspaper articles written for the public. The evaluation process is performed using cross-validation. In all experiments, the text embedding models are trained on both title and running text. The CCA models are first trained on the training data and evaluated on the validation data to find a beneficial set of hyperparameters. In the next step, these parameters are used when training the system once more, this time with both the training and validation data. Evaluation with this final model is performed on the test data. The produced results are presented in Chapter 4.

3.3.1 Evaluation Metrics

Three metrics are used to determine the performance of the system: Mean Reciprocal Rank (MRR), Recall at K (R@K) and median rank. A slight problem with the evaluation of this kind of system is that only one document is considered to be relevant to each query, no matter how well other, possibly higher ranked, suggestions fit the input text from a subjective point of view. However, it is necessary to use the previously assigned images as ground-truth data to measure the system's performance.

Mean Reciprocal Rank

MRR is an evaluation metric that is commonly used in information retrieval tasks where only one document is relevant. MRR is calculated as

\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{K_i},    (3.1)

where Q is a sample of queries and K_i is the rank of the relevant document for query i.
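
A direct implementation of Equation 3.1 is shown below; ranks is assumed to hold, for each query, the rank of its ground-truth image.

def mean_reciprocal_rank(ranks):
    return sum(1.0 / k for k in ranks) / len(ranks)

# The example in Table 3.1: mean_reciprocal_rank([3, 4, 1]) == 19/36, about 0.53.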


Recall at K

R@K is an evaluation metric for information retrieval and recommender systems that determines if the relevant item is among the K first suggestions. The results are then averaged for all queries. R@K is calculated as

R@K = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{\text{number of relevant items in top } K}{\text{number of relevant items}},

where the number of relevant items, in this case, is always equal to 1. K is set to 1, 10 and 20.
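
With a single relevant image per query, R@K is simply the share of queries whose ground-truth image appears among the top K suggestions, as in this small sketch (the function name is illustrative).

def recall_at_k(ranks, k):
    return sum(1 for r in ranks if r <= k) / len(ranks)

# recall_at_k([3, 4, 1], k=1) -> 1/3, recall_at_k([3, 4, 1], k=10) -> 1.0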

Median Rank

Another common evaluation metric in information retrieval is median rank, which simply calculates the median of a collection of ranks. In comparison to MRR, median rank has the advantage of being less sensitive to outliers.

3.3.2 Hyperparameter Tuning

Before running the final tests, the hyperparameters are tuned. Both the text embedding models and the CCA models have parameters that are evaluated with a grid search over all combinations; a sketch of such a search is given after the list. When needed, the experiments are run a second time on a more fine-grained grid around the points that give the best results. The hyperparameters investigated in this process are listed below.

• Vector size - The vector size denotes the number of dimensions the output from the text embedding models should consist of. Both Word2vec and Doc2vec are evaluated for 100, 300, 500, 700 and 900 dimensions. As stated in Section 2.1.2, the best vector size is task-specific, and larger vectors require more training data to make the embeddings informative.

• Window size - As mentioned in Section 2.1, a sliding window is used during training of both Word2vec and Doc2vec. It was concluded in [8] that Doc2vec performs well with a window size between 5 and 12 and that the difference in performance within the span was minimal. Both Word2vec and Doc2vec are evaluated for sizes 5 and 10.

• Number of canonical components - The dimensionality of the joint space is essentially determined by this parameter. Evaluation is performed on a grid of values between 1 and the vector size of the text model.

• Regularization - The regularization parameter is evaluated for the values 0.001, 0.1, 10 and 1000 in the KCCA experiments. In CCA it is set to a constant 0.1 to avoid errors that otherwise sometimes occur during training. The value is not tuned thoroughly since it does not seem to affect the end results notably in CCA.

• Kernel type - KCCA is evaluated with the linear, polynomial and RBF kernel described in Section 2.4.3.
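
The grid search mentioned above could look roughly like the sketch below, where evaluate_r_at_20 is a hypothetical helper that trains the system for a given configuration and returns R@20 on the validation data; the step size of the canonical-component grid is also an assumption.

from itertools import product

vector_sizes = [100, 300, 500, 700, 900]
window_sizes = [5, 10]

best_score, best_config = -1.0, None
for size, window in product(vector_sizes, window_sizes):
    for num_cc in range(50, size + 1, 50):      # grid up to the text vector size
        score = evaluate_r_at_20(size, window, num_cc)
        if score > best_score:
            best_score, best_config = score, (size, window, num_cc)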

For all experiments, R@K, MRR and median rank are all calculated. However, these metrics do not always agree on which combination of hyperparameters is the best. The purpose of the image suggestion system is to suggest a collection of about 20 images that might fit the article. Since R@20 tells whether the correct image is within the top 20 suggestions, this seems to be a suitable metric to focus on during both the hyperparameter tuning and the final evaluation. The mutual order within the 20 suggestions is of no importance, neither for the metric nor for the final system in action.


4 Results

This chapter presents the evaluation results produced by the proposed system.

4.1 Evaluation

Before running the tests, some different model setups were evaluated. Three Word2vec alternatives were compared: a model trained on texts from the respective datasets, the pre-trained GoogleNews model on its own, and a combination of the two. It turned out that there was no significant difference in performance between the three alternatives. Therefore, the model trained on the domain data was used in the coming tests. This choice is motivated by a lighter model (GoogleNews is significantly larger than the datasets used in this project) and the fact that Word2vec can then be directly compared with the Doc2vec model, as they are trained on the same data. The hyperparameter tuning showed that a window size of 10 performed slightly better than 5 for both Word2vec and Doc2vec. Moreover, KCCA's linear kernel was clearly better than both the polynomial and the Gaussian kernel in the task at hand, and KCCA performed best with the regularization parameter set to 0.001. Finally, it was shown that the extended month feature did not perform better than the original version. The remaining hyperparameters – vector size and number of canonical components – did not show the same consistent behaviour, which is why they are presented in the results below.

After each combination of models, queries and parameters had been evaluated on the validation set, the best performing setup was used to run the experiments again, this time on the test set. The results achieved from the latter experiments are the ones presented in this chapter. All text and image models in combination with the two different text queries were evaluated together with CCA. The experiments were then run with the KCCA model, but this time only with the best performing text embeddings from the CCA experiments to reduce the runtime. As a final step, the text query was also extended to include the month of publication.

4.1.1 SaabNews Experiments

Table 4.1 presents the results that were produced with SaabNews using the CCA model. On the left-hand side of the table, the configurations of text query, embedding, vector size and number of canonical components (CCs) are laid out for each image model. On the right-hand side, the results for each model are presented in terms of the metrics described in Section 3.3.1. The
