
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2018

Using Deep Learning to Answer Visual Questions from Blind People

DENIS DUSHI


Using Deep Learning to Answer Visual Questions from Blind People

DENIS DUSHI

Master’s Thesis at ICT

Academic Supervisor: Sarunas Girdzijauskas

Examiner: Henrik Boström


Abstract

A natural application of artificial intelligence is to help blind people overcome their daily visual challenges through AI-based assistive technologies. In this regard, one of the most promising tasks is Visual Question Answering (VQA): the model is presented with an image and a question about this image, and must then predict the correct answer. Recently, the VizWiz dataset was introduced: a collection of images and questions originating from blind people. Being the first VQA dataset deriving from a natural setting, VizWiz presents many limitations and peculiarities. More specifically, the characteristics observed are the high uncertainty of the answers, the conversational aspect of the questions, the relatively small size of the dataset and, ultimately, the imbalance between the answerable and unanswerable classes. These characteristics can be observed, individually or jointly, in other VQA datasets as well, and they represent a burden when solving the VQA task. Data science pre-processing techniques are particularly suitable to address these aspects of the data.

Therefore, to provide a solid contribution to the VQA task, we answered the research question "Can data science pre-processing techniques improve the VQA task?" by proposing and studying the effects of four different pre-processing techniques. To address the high uncertainty of answers, we employed a pre-processing step in which the uncertainty of each answer is computed and used to weight the soft scores of our model during training. The adoption of an "uncertainty-aware" training procedure boosted the predictive accuracy of our model by ∼10%, providing a new state of the art when evaluated on the test split of the VizWiz dataset. In order to overcome the limited amount of data, we designed and tested a new pre-processing procedure able to augment the training set and almost double its data points by computing the cosine similarity between answer representations. We also addressed the conversational aspect of questions collected from real-world verbal conversations by proposing an alternative question pre-processing pipeline in which conversational terms are removed. This led to a further improvement: from a predictive accuracy of 0.516 with the standard question processing pipeline, we were able to achieve a predictive accuracy of 0.527 when employing the new pre-processing pipeline. Ultimately, we addressed the imbalance between answerable and unanswerable classes when predicting the answerability of a visual question. We tested two standard pre-processing techniques to adjust the dataset class distribution: oversampling and undersampling. Oversampling provided an, albeit small, improvement in both average precision and F1 score.

Keywords: Visual Question Answering; VizWiz; Deep Learning.


Sammanfattning

Using Deep Learning to Answer Visual Questions from Blind People

A natural application of artificial intelligence is to help blind people with their daily visual challenges through AI-based assistive technology. In this regard, one of the most promising tasks is Visual Question Answering (VQA): the model is presented with an image and a question about this image, and must then predict the correct answer. Recently, the VizWiz dataset was introduced, a collection of images and associated questions from blind people. As this is the first VQA dataset originating from a natural setting, it has many limitations and peculiarities. More specifically, the observed characteristics are: high uncertainty in the answers, an informal conversational tone in the questions, a relatively small amount of data and, finally, an imbalance between the answerable and unanswerable classes. These characteristics can also be observed, individually or together, in other VQA datasets, which poses particular challenges when solving the VQA task. Pre-processing techniques from the field of data science are particularly suitable for handling these aspects of the data. To contribute to the VQA task, we therefore answered the question "Can pre-processing techniques from the field of data science contribute to solving the VQA task?" by proposing and studying the effect of four different pre-processing techniques. To handle the high uncertainty in the answers, we used a pre-processing step in which we computed the uncertainty of each answer and used this measure to weight the model's output scores during training. The use of an "uncertainty-aware" training procedure boosted the predictive accuracy of our model by ∼10%. With this we reached a state-of-the-art result when the model was evaluated on the test split of the VizWiz dataset. To overcome the problem of the limited amount of data, we designed and tested a new pre-processing procedure that almost doubles the number of data points by computing the cosine similarity between the answers' vectors. We also handled the problem of the informal conversational tone of the questions, collected from real-world verbal conversations, by proposing an alternative way of pre-processing the questions in which conversational terms are removed. This led to a further improvement: from a predictive accuracy of 0.516 with the standard way of processing the questions, we were able to achieve a predictive accuracy of 0.527 when using the new way of pre-processing the questions. Finally, we handled the imbalance between answerable and unanswerable classes when predicting whether a visual question can be answered. We tested two standard pre-processing techniques to adjust the dataset's class distribution: oversampling and undersampling. Oversampling gave a small improvement in both average precision and F1 score.


Acknowledgments

I would like to thank all my colleagues at SAP SE for fostering a creative and stimulating working environment. I wish to express a sincere thank you to my academic supervisor Sarunas Girdzijauskas for his guidance and support. I would like to express my gratitude to my examiner Henrik Boström for giving helpful and valuable feedback and for professionally addressing all my doubts and concerns.

Finally, I want to express my profound gratitude to my family for their unconditional love and support.

Stockholm, February 25, 2019

Denis Dushi


Contents

1 Introduction
  1.1 Background
  1.2 Problem
  1.3 Purpose
  1.4 Objectives
    1.4.1 Benefits, Ethics and Sustainability
  1.5 Methodology
  1.6 Delimitations
  1.7 Outline

2 Extended Background
  2.1 Convolutional Neural Networks
    2.1.1 Convolution
    2.1.2 Convolutional Layer
    2.1.3 CNN Architecture
    2.1.4 Literature of CNNs
    2.1.5 ResNet
  2.2 Recurrent Neural Networks
    2.2.1 Long Short Term Memory
    2.2.2 Gated Recurrent Unit
  2.3 Word Embeddings
  2.4 Skip-thoughts
  2.5 The VizWiz Dataset
    2.5.1 Introduction
    2.5.2 Peculiarities
    2.5.3 VizWiz vs. VQA v2.0
  2.6 VQA Model
    2.6.1 High level structure
    2.6.2 Feature Extraction
    2.6.3 Text Encoder
    2.6.4 Attention Mechanism
    2.6.5 Multimodal Pooling
    2.6.6 Classifier
    2.6.7 Loss
  2.7 VQA Related Work

3 Methodologies
  3.1 Research Methods
  3.2 Data Methods
  3.3 Dataset Structure
  3.4 Annotations Format
  3.5 Challenge Submissions Format
  3.6 Evaluation Metrics
  3.7 Technologies and Code

4 Results
  4.1 Multi-class Model
    4.1.1 Baseline
    4.1.2 Attention Mechanism
    4.1.3 Uncertainty-aware Training
    4.1.4 Data Augmentation
    4.1.5 Customized Question Processing
    4.1.6 Skip-Thought Off-the-shelf Vectors
    4.1.7 Final Submission
  4.2 Answerability Model
    4.2.1 Dataset Imbalance

5 Discussion and Future Work

6 Conclusions
  6.1 Answer to the Research Question

Bibliography


List of Figures

2.1 Illustration of a 2D convolution. The kernel (gray matrix) convolves across the width and height of the 2D input (white matrix on the left). At each position, the dot product between the kernel and its receptive field is computed in order to construct the feature map (white matrix on the right). Images from [34].
2.2 Convolution of a 5 × 5 × 3 kernel over an image.
2.3 Example of CNN architecture adapted from LeCun et al. [37]. The network alternates between convolutional layers with non-linear activation function and pooling function. The feature maps of the last pooling layer are fed to three fully-connected layers that ultimately output the class probabilities.
2.4 Residual building block. Image adapted from [44].
2.5 A recurrent neural network. (Left) The RNN summarizes the history of inputs x in the internal state h. (Right) The RNN unfolded in multiple time steps. The weight matrices U, V, W are shared across time steps. Image adapted from [35].
2.6 LSTM memory cell. Adapted from [51].
2.7 Skip-thoughts architecture. (Right) the encoder creates a fixed-length representation of the sentence s_i: "I could see the cat on the steps". (Left) the two decoders generate, one word at a time, the target sentences s_{i-1} ("I got back home") and s_{i+1} ("This was strange"). The unattached arrows symbolize the output of the encoder f(s_i), given at each time step, together with the previous target word, to the decoders. Image adapted from [59].
2.8 Examples of visual questions and corresponding ground truth in the VizWiz dataset. The examples include answerable visual questions (top row) and unanswerable or unsuitable visual questions (bottom row).
2.9 Distribution of the first words of all the questions in VQA (a) and VizWiz (b). Images from [3], [5].
2.10 Distribution of samples with respect to the number of unique answers in VQA v2.0 and VizWiz. Images from [69].
2.11 High level structure of the multi-class VQA model.
2.12 Feature extraction employing ResNet-152. (a) no-attention features vector; (b) attention features tensor that retains spatial information.
2.13 Text encoder. W_embedding is the embedding matrix exploited to encode words in the vector space. The last hidden state of the Recurrent Neural Network (RNN) is selected as the vectorized representation of the input question.
2.14 Question processing pipeline.
2.15 Choice of ground truth when minimizing Cross-Entropy loss. In the example illustrated, bottle is the most frequent answer and therefore the label used to compute the loss. Consequently, the other answers like tv or office are not used during training.
3.1 Format of the annotations json files. The image indicates the ground truth information for the challenge tasks.
3.2 Format of the json result files accepted by the evaluation server. Image from [83].
4.1 Illustration of the Soft Cross-Entropy loss. The loss is computed as the dot product between negative log-probabilities and a vector of weights, where the weights are the number of times each unique answer appears in the ground-truth set divided by the total number of answers (10).
4.2 An example of how the artificial augmentation samples are created. (a) The most similar answer to "ramen noodles" in VQA v2.0 is "noodles". Therefore we select a VQA v2.0 sample that has "noodles" as its most frequent answer and (b) construct a new sample using the image and question of the VQA v2.0 sample but the VizWiz answer "ramen noodles".
4.3 The histogram illustrates how the 14269 augmentation samples are distributed based on the similarity of their original answer and their current "artificial" answer.


List of Tables

2.1 Number of visual questions in VizWiz and VQA v2.0.
2.2 First 20 most frequent answers in VQA v2.0 and VizWiz.
2.3 Number and percentage of samples covered by using the top-k answers.
2.4 Distribution of unanswerable among the samples. Example: in 1878 samples, unanswerable appears exactly twice in the 10 annotations.
4.1 Comparison of the performances of the state-of-the-art and of our baseline.
4.2 Performance obtained when employing the attention mechanism described in Section 2.6.4.
4.3 Accuracy of the uncertainty-aware model using N classes in prediction.
4.4 Randomly selected pairs of similar VizWiz-VQA v2.0 answers observed when constructing the augmentation set. Examples are drawn from the head (table on the left) and the tail (table on the right) of the similarity distribution. Pairs with cosine similarity 0 or 1 have not been included since they are trivial.
4.5 Comparison of accuracies obtained when training on the original VizWiz train split and when training on the augmented set.
4.6 Performance obtained when employing the customized question processing.
4.7 Comparison of accuracies: LSTM-based encoder trained from scratch vs. pre-trained Skip-Thought encoder.
4.8 Performances of our final multi-class model when evaluated on the val, test-dev and test-challenge splits. We also report the accuracies for each answer type (columns 2-5).
4.9 Comparison of the performances on test-challenge of the state-of-the-art and of our final model.
4.10 Comparison of the performances on test-dev of the different answerability models trained.
4.11 Average Precision and F1 score of our final answerability model on the test-challenge split compared to the state-of-the-art performances.
5.1 Rows 2-5: Number and percentage of samples/annotations filtered on the basis of the top-N answers (row 1). Last row: Model accuracy using N classes in prediction.
5.2 Confusion matrix from predictions on the val split. The two most frequent answers ("unanswerable" and "unsuitable") are grouped together as the positive class; all the other answers are considered to be the negative class.


Acronyms

AI Artificial Intelligence.

AP Average Precision.

CNN Convolutional Neural Network.

GRU Gated Recurrent Unit.

LSTM Long Short Term Memory.

NLP Natural Language Processing.

NN Artificial Neural Network.

ReLU Rectified Linear Unit.

RNN Recurrent Neural Network.

VQA Visual Question Answering.


Chapter 1

Introduction

Artificial Intelligence (AI) is not just making our lives easier by automating mundane and tedious tasks; it is unlocking myriad possibilities for people with disabilities and promising them unique ways of experiencing the world. Thanks to AI-powered technologies, the assistive technology industry has advanced well beyond wheelchairs, prostheses or vision and hearing aids. For instance, visual assistive technologies like Seeing AI [1] or OrCam [2] are helping blind people overcome daily challenges by facilitating simple everyday tasks and breaking down accessibility barriers. Computer Vision (CV) methods hold great promise in making the lives of blind people much easier by helping them overcome their daily challenges and break social accessibility barriers [3]. For example, OrCam's small and lightweight wearable device employs face recognition, object recognition and optical character recognition in order to help visually impaired people in their daily life. In this regard, one of the most promising CV tasks is Visual Question Answering (VQA). In the most common form of the Visual Question Answering task, the model is presented with an image and a free-form, open-ended textual question about this image. It must then predict the correct answer [4]. VQA constitutes a truly AI-complete task [5]: it employs Computer Vision (CV) techniques for acquiring and processing images and Natural Language Processing (NLP) methods for processing and understanding textual questions. A major distinction between VQA and other tasks in computer vision is that the questions are not structured and do not have a predefined set of terms, but are free-form and open-ended. The content of the visual question and the set of reasoning operations required to answer it are unknown until run-time and can be very diverse [4]. One of the biggest claims of the research community, when highlighting the importance of the task, is that VQA could drastically improve the quality of life of blind people [5]. The research community has been working on using computer vision models to advance visual assistive technologies for blind people. In 2010, Bigham et al. [6] developed the VizWiz app, which enabled blind users to take pictures with their phones, ask questions about these pictures and receive almost real-time spoken answers from remote sighted employees. However, these systems are constrained because they rely on humans to provide the answers. An automated VQA solution could be cost-efficient, low-latency and scalable [3]. In light of recent advances in VQA models, Gurari et al. [3] took advantage of the data collected through the VizWiz app and put together the VizWiz dataset, using over 31,000 questions collected from blind people. VizWiz presents many unique peculiarities compared to existing VQA datasets. Its unique characteristics make VizWiz a very challenging dataset for today's VQA architectures. The 2018 ECCV (European Conference on Computer Vision) conference featured a VizWiz Grand Challenge with the aim of urging the research community to join forces, solve the challenges of the VizWiz dataset and the VQA task at large, and come up with new approaches that meet the needs of blind people. In this thesis, we present our deep learning solutions to the VizWiz Grand Challenge, which allowed us to position among the top three performing teams in both tasks of the challenge. Ultimately, we elaborate on our solutions and highlight the shortcomings and limitations of current VQA models and evaluation metrics.

1.1 Background

The work of this thesis contributes to the broad field of study of Artificial Intelligence (AI). The term "AI" was coined by John McCarthy, who in 1956 organized the famous Dartmouth Conference, recognized as the birth of the research field. Back in 1978, the applied mathematician Richard Bellman defined AI as:

«The automation of activities that we associate with human thinking, activities such as decision-making, problem solving, learning».

A broader definition presents AI as the study of "artificial agents" [7]. AI is a multi-disciplinary research field that has inherited many viewpoints and ideas from traditional disciplines: psychology, linguistics, mathematics, computer science, philosophy and many others [8]. In the past, the field has experienced several hype cycles in which phases of enthusiasm and interest have alternated with periods of funding cuts and pessimism, also known as "AI winters" [9]. However, in recent years AI has experienced a wave of optimism and enthusiasm. Nowadays, AI applications are deeply embedded in the infrastructure of almost every industry, thanks to the increasingly available computational power and to new theoretical understandings. The major challenge that artificial intelligence has to face is solving the tasks that are easy for people to perform but hard for people to describe formally [10]. Answering questions about visual content has been considered an "AI-complete" task [5]. Humans are able to answer visual questions intuitively; however, to solve the task an AI agent is required to have advanced multi-modal reasoning capabilities beyond a single sub-domain. The Visual Question Answering task was initially proposed to connect Computer Vision (CV) and Natural Language Processing (NLP) and push the boundaries of both research fields [4].

Typically, Convolutional Neural Networks (CNNs) are employed to extract feature representations of the input images, while Recurrent Neural Networks (RNNs) are used to process and encode the textual questions. The two modalities are then merged, and the resulting multi-modal embedding is exploited to generate an answer to the input visual question. The design of a VQA model goes beyond simply merging the two modalities. Previous literature illustrates how it is possible to integrate probabilistic attention distributions over vision and text [11]–[13], how to exploit pre-trained vector representations [14], [15] and how to enhance the expressive power of the multi-modal embedding [14]–[16]. We provide an exhaustive background study in Chapter 2.
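To make this generic CNN+RNN pipeline concrete, the following is a minimal, hypothetical sketch in PyTorch. It is not the model used in this thesis (that model is described in Section 2.6); the layer sizes and the fusion by simple concatenation are illustrative assumptions only.

```python
import torch
import torch.nn as nn

class MinimalVQA(nn.Module):
    """Toy VQA model: pre-extracted CNN image features + RNN question encoding -> answer logits."""
    def __init__(self, vocab_size, num_answers, img_dim=2048, emb_dim=300, hid_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)           # word embeddings
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)   # question encoder
        self.classifier = nn.Sequential(                             # naive multimodal fusion + classifier
            nn.Linear(img_dim + hid_dim, hid_dim),
            nn.ReLU(),
            nn.Linear(hid_dim, num_answers),
        )

    def forward(self, img_feats, question_tokens):
        # img_feats: (batch, img_dim) image features, e.g. from a pre-trained CNN
        # question_tokens: (batch, seq_len) integer word indices
        emb = self.embedding(question_tokens)
        _, (h_n, _) = self.encoder(emb)                  # last hidden state summarizes the question
        fused = torch.cat([img_feats, h_n[-1]], dim=1)   # fusion by concatenation (illustrative choice)
        return self.classifier(fused)                    # logits over candidate answers
```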

1.2 Problem

The research community has claimed many times that VQA could empower blind people to ask questions about the surrounding environment, thereby breaking many accessibility barriers and allowing them to live a healthy and independent life [3], [5].

However, previous work has focused on designing deep learning architectures which ex- hibit state-of-the-art performances only on artificially created datasets. Typically, the datasets employed for training and testing VQA models (e.g. VQA v1.0 [5] and VQA v2.0 [17]) present questions crowd-sourced from sighted workers and images that are ei- ther synthetic or gathered from the web. It is reasonable to think that methods tailored for these datasets would be inadequate and perform poorly when employed to answer visual questions originating from blind users. There is a large domain shift that should be considered when designing VQA solutions to empower blind people.

Recently, Gurari et al. [3] introduced VizWiz, the first publicly available dataset capturing a real-world interest of real users of a VQA system. Differently from any other VQA dataset, VizWiz data originates from blind users that were authentically asking questions about the surrounding environment to overcome their daily challenges [3].

Gurari et al. [3] benchmarked nine methods on the VizWiz dataset. Three of them [13], [17], [18] are top-performing methods on VQA v2.0. However, on VizWiz these methods perform poorly mainly because they have been designed exclusively for datasets created in artificial settings.

When compared with other artificially created datasets (e.g. VQA v2.0 [17]), VizWiz presents many peculiar characteristics not addressed by previous solutions, which can be interpreted as the reason for the poor results obtained. The following characteristics can appear, individually or jointly, in many other datasets, and data science pre-processing techniques seem to be the right tool to address them:

High uncertainty of answers In VizWiz, as in other VQA datasets, 10 crowd-sourced answers are available for each visual question. Differently from datasets created in artificial settings, VizWiz presents a high disagreement between annotators. This may be due to the images, which are often blurry, of low quality and fail to correctly frame the content of interest.

Several pre-processing techniques have been adopted to select the ground truth for the optimization of VQA models; however, none of them clearly addresses the answer uncertainty. Antol et al. [5] and Lu et al. [12], [19] count how many times a unique answer appears in the annotation set of the visual question and select the answer with the highest count as ground truth. Ben-younes et al. [15] instead pick the answer to use as ground truth with a probability proportional to the number of times that answer appears in the annotation set. Ilievski and Feng [20] introduced a soft cross-entropy loss that allows optimizing for all the possible answers by weighting the contribution of each answer. This objective function was introduced to bridge the gap between the loss function and the evaluation metric while improving the convergence of the training procedure. Our intuition is that by employing a soft cross-entropy loss, while computing as a pre-processing step a weight for each answer that reflects its uncertainty level, we would be able to increase the performance of the model trained on VizWiz. If supported by experimental results, this could be a simple, yet efficient, way to tackle the high uncertainty of VQA datasets arising from natural settings.
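A minimal sketch of this idea is given below, assuming 10 crowd-sourced answers per sample and a fixed answer vocabulary; the exact weighting scheme and loss used in this thesis are described in Chapter 4.

```python
from collections import Counter
import torch
import torch.nn.functional as F

def answer_weights(answers, answer_to_idx, num_classes):
    """Map the 10 crowd-sourced answers of one sample to per-class soft scores."""
    counts = Counter(a for a in answers if a in answer_to_idx)
    weights = torch.zeros(num_classes)
    for ans, c in counts.items():
        weights[answer_to_idx[ans]] = c / len(answers)  # answer frequency reflects its (un)certainty
    return weights

def soft_cross_entropy(logits, soft_targets):
    """Dot product between negative log-probabilities and the per-answer weight vector."""
    log_probs = F.log_softmax(logits, dim=1)
    return -(soft_targets * log_probs).sum(dim=1).mean()
```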

Conversational questions VQA datasets constructed by collecting data "in the wild" can present uninformative noise in both the visual and textual modalities. Usually, the questions in a dataset arise from an artificial setting in which annotators are instructed to compose simple and short sentences. However, when trying to solve real-world problems, like helping blind people in their daily life, it may be necessary to collect data from real verbal interactions. Questions asked verbally are often conversational and on average longer, presenting several uninformative terms. It is reasonable to think that a pre-processing step in which uninformative terms are removed could allow the VQA model to learn a better representation of the question, and therefore achieve better performance when predicting the exact answer or the answerability of a visual question. A minimal sketch of such a step is shown below.
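The sketch assumes a hand-picked list of conversational fillers purely for illustration; the actual question-processing pipeline and term list used in this work are described in Section 4.1.5.

```python
import re

# Illustrative (hypothetical) list of conversational fillers; not the list used in the thesis.
CONVERSATIONAL_TERMS = {"hi", "hello", "please", "thank", "thanks", "you", "can", "could",
                        "tell", "me", "i", "would", "like", "to", "know"}

def strip_conversational(question: str) -> str:
    """Lowercase, tokenize and drop conversational terms before encoding the question."""
    tokens = re.findall(r"[a-z0-9']+", question.lower())
    kept = [t for t in tokens if t not in CONVERSATIONAL_TERMS]
    return " ".join(kept)

# strip_conversational("Hi, can you tell me what is this? Thank you.")  ->  "what is this"
```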

Relatively small size The collection of data from real-world settings is usually more expensive and time consuming; therefore, datasets composed of data collected "in the wild" are typically smaller than datasets originating from artificial settings. Deep VQA models require large amounts of data to learn underlying concepts and refine high-level reasoning capabilities [21]. This is one of the reasons why all the previously benchmarked methods have achieved extraordinary results on VQA v1.0/v2.0 but perform poorly when trained on VizWiz, which is orders of magnitude smaller than commonly used datasets. Gurari et al. [3] tried to overcome the limitations deriving from the small size of VizWiz by pre-training the three methods [13], [17], [18] on VQA v2.0 and fine-tuning them on VizWiz. However, this experiment did not produce the desired results. The authors attribute the poor improvement to the small overlap between the set of answers available in VizWiz and the set of answers available in VQA v2.0. In this regard, we propose a pre-processing procedure able to augment the answerable samples in the training set while maintaining the same answer distribution as the initial dataset. To the best of our knowledge, this is the first augmentation method for Visual Question Answering.
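As a rough sketch of the matching step behind such an augmentation, suppose each answer is represented by a word vector (or an average of word vectors for multi-word answers, see Section 2.3). The threshold and the helper `vec` are illustrative assumptions; the procedure actually used is described in Section 4.1.4.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def build_augmentation_pairs(vizwiz_answers, vqa_answers, vec, threshold=0.6):
    """For each VizWiz answer, find the most similar VQA v2.0 answer by cosine similarity.

    `vec` maps an answer string to its (averaged) word-vector representation.
    VQA v2.0 samples whose most frequent answer matches can then be relabelled with the
    VizWiz answer and added to the training set (cf. Figure 4.2).
    """
    pairs = []
    for a in vizwiz_answers:
        best = max(vqa_answers, key=lambda b: cosine(vec(a), vec(b)))
        if cosine(vec(a), vec(best)) >= threshold:   # threshold is an illustrative choice
            pairs.append((a, best))
    return pairs
```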

Imbalance between answerable and unanswerable classes Artificially created VQA datasets are constructed such that each visual question is answerable and has at least one correct answer from which the model can learn. In a real-world scenario, a visual question may not always have a correct answer. For example, blind photographers often fail to frame the content of interest and to take pictures of satisfying quality. This results in a large percentage of unanswerable visual questions. No previous methods are available in the literature to predict whether a visual question is answerable, since, before VizWiz, none of the available datasets presented unanswerable samples. Gurari et al. [3] benchmarked on VizWiz the only method [22] available for predicting when a question is not relevant for an image (a specific case of an unanswerable visual question). The authors also provide the performance of their best multi-class model, using its output probability that the predicted answer is "unanswerable". Both methods perform poorly and, moreover, do not consider the imbalance in the dataset distribution. In fact, in VizWiz the number of answerable visual questions is much larger than the number of unanswerable ones. We believe that training a binary classifier on a balanced version of the dataset, obtained through an oversampling/undersampling pre-processing step, could achieve a higher average precision when predicting if a visual question is answerable or not.
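A minimal sketch of random oversampling, assuming a binary answerability label per sample; the balancing procedures evaluated in this thesis are reported in Section 4.2.1.

```python
import random

def oversample(samples, label_fn, seed=0):
    """Randomly duplicate minority-class samples until both classes have equal size."""
    rng = random.Random(seed)
    pos = [s for s in samples if label_fn(s)]        # e.g. unanswerable samples
    neg = [s for s in samples if not label_fn(s)]    # e.g. answerable samples
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = samples + extra
    rng.shuffle(balanced)
    return balanced
```

Undersampling is the symmetric operation: instead of duplicating minority samples, majority samples are randomly discarded until the classes are balanced.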

1.3 Purpose

Together with the dataset, the VizWiz Grand Challenge has been proposed in order to encourage a larger community to collaborate on developing algorithms for assistive technologies. Designing an architecture able to achieve state-of-the-art performance on data collected "in the wild" is not only an opportunity to contribute to dismantling accessibility barriers for blind people, but also an important deep learning research contribution. The challenge addresses two tasks:

• Task 1: Given an image and question about it, the task is to predict an accurate answer.

• Task 2: Given an image and question about it, the task is to predict if the visual question cannot be answered (providing a confidence score).

The VizWiz Grand Challenge allows us to benchmark our deep learning solutions designed to overcome the limitations and address the peculiar characteristics of VQA datasets arising from natural settings, such as VizWiz.

More specifically, the purpose of this thesis is to propose and evaluate data science pre-processing techniques to address the following aspects that could be observed in certain VQA datasets:

• High uncertainty of answers

• Conversational questions

• Relatively small size

• Imbalance between answerable and unanswerable classes

By doing so, we aim at improving the VQA task by providing a solution to the problems stated in the previous section (Section 1.2) and experimental results to evaluate the degree of improvement. Given the previous considerations, our research question is:


Can data science pre-processing techniques improve the VQA task?

1.4 Objectives

In order to answer the previously stated research question, we propose in our research study the following goals:

1. Provide an in-depth analysis of the VizWiz dataset and an extensive comparison with the most used Visual Question Answering dataset (VQA v2.0).

2. Design and implement a model that, given a visual question, predicts the correct answer. The design of the components of the model should be guided by the intuitions gathered on the VizWiz dataset.

3. Design and implement an augmentation technique in order to double the data points in the training set.

4. Design and implement a model that, given a visual question, predicts whether it can be answered or not.

5. Elaborate on our solutions and examine the limitations of our models.

1.4.1 Benefits, Ethics and Sustainability

Benefits

Our environment often assumes the ability to see. We rely on colors and labels to distinguish food products, clothes and many other objects with which we have to interact daily [6]. Blind people face many daily visual challenges that collectively lead to decreased independence. Our research contributes to the rise of AI-based assistive technologies, which in the near future promise to transform the lives of blind people by facilitating the simple tasks of everyday life and granting them more independence and freedom. Nowadays, deep learning models can be easily deployed on smartphones and wearable devices thanks to the increasing availability of computational resources and to power-efficient implementations. A portable and automated VQA system would drastically increase the quality of life of blind people while being a cost-efficient, low-latency and scalable solution. Ultimately, our solution is open-source. As suggested in the "Good Citizen of CVPR" event [23], we made our implementation available to everyone on GitHub. We believe that sharing our code has tremendous value for the research community and, more specifically, for those who, in the future, would like to build upon our method.

Ethics

The main ethical concern when designing visual assistive technology is related to the privacy and safety of the users. Ahmed et al. [24] interviewed many visually impaired participants to investigate their privacy concerns and needs. The authors discovered that visually impaired people are often willing to compromise their privacy and reveal personal information to a stranger in exchange for assistance. The VizWiz dataset has been anonymized; the visual questions containing personally identifying information, locations or suspicious scenes have been filtered out [3]. Another concern regarding AI-based visual assistive technology is adversarial attacks. Previous literature [25]–[28] has demonstrated that, by applying small but intentionally worst-case perturbations to the input samples, it is possible to trick a neural network into predicting a wrong answer. Brown et al. [29] illustrated how it is possible to fool a deep classifier by printing and adding small physical adversarial stickers to real-world scenes. Attackers can also generate adversarial commands against speech recognition models [30]. A VQA system that enables blind people to ask visual questions is particularly susceptible to these techniques. Adversarial attacks could lead to extremely dangerous consequences for its users. For instance, a malicious agent could apply an adversarial sticker to a commonly used object in order to deceive or defraud a visually impaired user. Therefore, when deploying a new AI-based assistive technology, it is necessary to design a robust architecture, considering and anticipating a variety of possible malicious attacks.

Sustainability

In 2015, world leaders agreed, in the United Nations General Assembly, on 17 goals with the aim of achieving a better and more sustainable future. From a sustainability perspective, this thesis contributes to accomplishing goal number 3, "Ensure healthy lives and promote well-being for all at all ages", and goal number 10, "Reduce inequalities within and among countries". In fact, this work contributes to the research in assistive technology with the aim of improving the quality of life of blind people and granting them more independence. By being an active part of the research community through participation in the VizWiz workshop [31], we also contribute to fostering innovation, thereby pursuing goal number 9.

1.5 Methodology

In order to answer the research question we employ an experimental research method [32].

The experimental setup that we adopt is incremental. We start from a simple baseline and add further complexity to our models by adding or replacing modules of the architecture or by changing the data on which it is trained. This approach allows us to keep in our architectures only the design choices that contribute to improving the performance. Architectural design choices are driven by previous literature (e.g. the attention mechanism) and/or by the VizWiz peculiarities observed during the preliminary study in Section 2.5 (e.g. the Soft Cross-Entropy loss). We design a multi-class model for solving Task 1 and an answerability model for Task 2. The multi-class model is evaluated using the VQA evaluation metric, while the answerability model is evaluated using Average Precision (AP). We design and evaluate an augmentation technique in which we double the number of samples in the training set. We evaluate several configurations and pre-processing techniques locally and, ultimately, we submit the predictions of the best performing multi-class and answerability models to the evaluation server.
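For reference, the commonly used form of the VQA accuracy metric credits a prediction in proportion to how many of the ten crowd-sourced answers it matches, capped at 1. The sketch below is a simplified version only; the exact definition used in this work, including answer normalization and averaging over annotator subsets in the official evaluation code, is given in Section 3.6.

```python
def vqa_accuracy(predicted: str, human_answers: list) -> float:
    """Simplified VQA accuracy: min(#annotators agreeing with the prediction / 3, 1)."""
    matches = sum(1 for a in human_answers if a == predicted)
    return min(matches / 3.0, 1.0)

# Example: if 2 of the 10 annotators answered "phone", predicting "phone" scores ~0.67.
```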

1.6 Delimitations

The research question previously stated has a deliberately broad scope, in order to capture all the pre-processing techniques proposed to improve the VQA task. These techniques are essentially very different, and they address several aspects of the data used to train a VQA model. We propose and test the use of pre-processing techniques that address the high uncertainty of answers, the conversational aspect of questions, the small size of the dataset and the imbalance between the answerable/unanswerable classes.

VQA datasets arising from natural settings also present many challenges regarding the visual modality. For example, in VizWiz many images are blurry, of poor quality and fail to frame the content of interest. In this project we do not consider pre-processing techniques addressing the image content. In the research question we ask whether pre-processing techniques can improve the VQA task; in this project we address only the improvement in predictive performance, without considering latency, complexity or other aspects of our solutions.

1.7 Outline

The rest of this thesis is structured as follows. In Chapter 2, we introduce relevant theory and provide an extensive review of the related work. We report a detailed preliminary analysis of the VizWiz dataset, providing a comparison with the commonly used VQA v2.0 dataset. Moreover, we present an overall view of the components of a VQA model. In Chapter 3, we present the research methods and methodologies used in our degree project. In Chapter 4, we illustrate the pre-processing techniques adopted and their experimental results. In Chapter 5, we elaborate on our solutions and on the limitations of VQA models, providing our perspective on future directions to improve the VQA task. Finally, in Chapter 6, we conclude the thesis, summarizing our contributions and directly answering our research question.


Chapter 2

Extended Background

The Visual Question Answering (VQA) task was initially proposed by Malinowski and Fritz [33] to connect Computer Vision (CV) and Natural Language Processing (NLP).

The aim was to create a holistic architecture that could pass the visual Turing challenge in open domains. A VQA model is presented with an image and a question about it; it must then predict the correct answer. Therefore, it is required to process and understand text and visual content and to join the representations of the two modalities in a common feature space. In this chapter we provide an extended background of the modules typically belonging to a VQA model and a summary of the most relevant approaches adopted to solve the VQA task.

2.1 Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are a class of deep feed-forward neural networks specialized in processing data with a grid-like topology (e.g. images). VQA models use CNNs to process the visual content and extract generic features from it. The name derives from "convolution", a mathematical operation that plays a major role in CNNs.

Before further discussing CNNs we need to introduce the convolution operation.

2.1.1 Convolution

The convolution operation, denoted with '∗', requires two arguments. The first one is typically called the input, the second one the kernel (or filter). The output, instead, is referred to as the feature map [10]. In Computer Vision applications, the input is a multidimensional array of real values (the image) and the kernel is a multidimensional array of network parameters that are optimized during training [10]. Given a 2D input, the filter slides (i.e. convolves) across the width and height of the input. At each step, the dot product between the filter and the input section covered by it, also called the receptive field, is computed. The results of the dot products computed while sliding the filter across the input generate a 2-dimensional activation map. The process is illustrated in Figure 2.1.


Figure 2.1. Illustration of a 2D convolution. The kernel (gray matrix) convolves across the width and height of the 2D input (white matrix on the left). At each position, the dot product between the kernel and its receptive field is computed in order to construct the feature map (white matrix on the right). Images from [34].

Convolution can also be applied to tensors (3D or more), like images, which come in the form of 3 channels (RGB), each represented by a 2D matrix [35]. Kernels are usually small in terms of width and height, but they always have the same depth as the input tensor. For example, if we consider images of size 32 × 32 × 3, we could use a kernel of size 5 × 5 × 3, where 3 is the depth of the kernel and the number of channels of the input image [36]. An example of a convolution having an image as input is illustrated in Figure 2.2.


Figure 2.2. Convolution of a kernel 5 × 5 × 3 over an image.

The basis of the success of CNNs is their robustness to shifts and distortions of the input. This is due to a key property of the convolution operation called equivariance to translation [10]. Basically, if the input is subject to a small translation, the output translates accordingly. When convolving a kernel across the input, we are more interested in whether some feature is present than in its exact location [10]. This characteristic of the convolution operation allows CNNs to learn to detect features even when they appear in slightly different locations than in the training data.
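A minimal sketch of the sliding dot product in NumPy (stride 1, no padding). Note that, as in most deep learning frameworks, the kernel is not flipped, so this is strictly speaking cross-correlation; the distinction does not matter when the kernel is learned.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution: slide the kernel over the input and take dot products."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    feature_map = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            receptive_field = image[i:i + kh, j:j + kw]
            feature_map[i, j] = np.sum(receptive_field * kernel)  # dot product at this position
    return feature_map

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3)) / 9.0            # simple averaging filter
print(conv2d(image, kernel).shape)        # (3, 3) feature map
```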

2.1.2 Convolutional Layer

A convolutional layer is composed of many kernels, each containing distinct parameters. The set of kernels extracts several features from multiple locations of the input [37]. During the forward pass of a training iteration, each filter is convolved across the input tensor. The resulting feature maps are stacked along the depth to form the output of the layer. During training, the kernels are optimized to detect different types of features. Simple features like edges, corners and end points are detected by the earlier layers of the CNN. Subsequent convolutional layers convolve on the previous maps in order to detect higher-order features [37]. When processing a high-dimensional input like an image, it is impractical to use fully-connected neural networks, since this would require a matrix multiplication resulting in a huge number of parameters to be learned [37]. For example, a VizWiz image of shape 448 × 448 × 3 would lead to 448 × 448 × 3 = 602,112 parameters for the first neuron of the first fully-connected layer alone. When kernels are smaller than the input, convolutional networks have sparse interactions [10]. This allows CNNs to efficiently describe complex interactions while limiting the number of free parameters to be learned [10]. Ultimately, convolutional layers are characterized by shared (or tied) weights: each parameter of a kernel is used at every position of the input. This makes CNNs dramatically more memory efficient than fully-connected neural networks [10].


2.1.3 CNN Architecture

Convolutional neural networks are typically composed of several hidden layers. Each layer consists of three stages. The first one is the convolutional stage, in which a set of kernels is convolved across the input and, consequently, several feature maps are constructed and piled up. In the second stage a non-linear activation function (e.g. ReLU) is applied to the feature maps. The third stage introduces a pooling function [10]. A pooling function summarizes a local patch of units in one feature map. For example, max pooling computes the maximum value of its receptive field [35]. Pooling contributes to making the representation invariant to small shifts and distortions of the input [37]. Ultimately, pooling progressively reduces the dimension of the representation, reducing the number of parameters required in subsequent layers. In the majority of cases, the hidden layers are followed by a series of fully-connected layers that have the role of inter-connecting all the activations, allowing high-level reasoning. The output of the last fully-connected layer is fed to a softmax function in order to generate a probability distribution over the classes. An illustration of a simple CNN introduced by LeCun et al. [37] is available in Figure 2.3.

Figure 2.3. Example of CNN architecture adapted from LeCun et al. [37]. The network alternates between convolutional layers with non-linear activation function and pooling function. The feature maps of the last pooling layer are fed to three fully-connected layers that ultimately output the class probabilities.
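To make this structure concrete, here is a minimal, hypothetical PyTorch network following the same conv, ReLU, pooling pattern ending in fully-connected layers; the layer sizes are illustrative and not taken from [37].

```python
import torch.nn as nn

class SmallCNN(nn.Module):
    """Tiny CNN for 32x32 RGB inputs: two conv/ReLU/pool stages followed by fully-connected layers."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),   # stage 1
            nn.Conv2d(16, 32, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),  # stage 2
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 5 * 5, 120), nn.ReLU(),
            nn.Linear(120, num_classes),   # softmax is applied inside nn.CrossEntropyLoss
        )

    def forward(self, x):                  # x: (batch, 3, 32, 32)
        return self.classifier(self.features(x))
```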

2.1.4 Literature of CNNs

The idea of convolutional neural networks was first proposed in 1998 by LeCun et al. [37], who used the architecture to recognize handwritten digits. However, CNNs started being largely adopted after Krizhevsky et al. [38] won the ILSVRC-2012 (ImageNet Large Scale Visual Recognition Competition 2012) by largely outperforming methods built on hand-crafted features. After the introduction of AlexNet [38], many new architectures have been proposed. Szegedy et al. [39] won ILSVRC-2014 with their GoogLeNet. The authors introduced a new module, the Inception module, capable of drastically reducing the number of free parameters in the network. With their follow-up work [40] they were able to further improve the performance of their architecture. Simonyan and Zisserman [41] proposed VGGNet, showing that it is essential to have very deep networks in order to achieve good performance.

2.1.5 ResNet

Previous works [41] have shown that deep CNNs are able to achieve top performance in visual challenges. However, in deeper networks a degradation problem has been exposed [42], [43]: when the network depth increases, the accuracy first saturates and then rapidly degrades [44]. He et al. [44] formulated an experiment in which they consider a shallow architecture and an equivalent deep architecture that builds upon the shallow one by adding identity mapping layers. The authors show that the deep architecture performs worse than its counterpart even though, by design, the two networks should perform equivalently. Driven by this intuition, He et al. [44] realized a convolutional neural network with "shortcut connections". Based on the assumption that neural networks can asymptotically approximate any arbitrarily complex function, instead of letting the network fit the desired underlying mapping $H(x)$, every few stacked layers they make the network fit the residual function $F(x) = H(x) - x$. This is possible by introducing shortcut connections that simply perform identity mapping. The authors propose a building block that is formally described as:

$$ y = W_2\,\sigma(W_1 x) + x \qquad (2.1) $$

The residual building block is also illustrated in Figure 2.4.

Figure 2.4. Residual building block. Image adapted from [44].
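In code, such a block can be sketched as follows; this is a simplified residual block (no batch normalization, equal input/output channels assumed) meant only to illustrate Eq. 2.1, not the exact ResNet implementation.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = W2 * relu(W1 * x) + x: the stacked layers learn the residual F(x) = H(x) - x."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        residual = self.conv2(self.relu(self.conv1(x)))
        return self.relu(residual + x)   # identity shortcut connection, then non-linearity
```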

Experimental results in [44] show that deep residual nets are easier to optimize than the counterpart "plain" nets that stack layers without shortcut connections. "Plain" nets are subject to accuracy degradation when the depth increases. On the other hand, residual networks produce substantially better results while not introducing any additional parameters or computational cost [44]. The ResNet developers were able to implement the deepest CNN seen at the time, with 152 layers. Their solution based on ResNet-152 achieved 1st place in ILSVRC-2015. Additional information on residual networks and experimental results is available in the official paper [44]. In this degree project, we use ResNet-152 to extract two types of features from the images of the VizWiz dataset. The process is described in Section 2.6.2.
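As a rough, hypothetical illustration of how such features can be obtained with torchvision (not the exact extraction pipeline of Section 2.6.2), a pre-trained ResNet-152 can be truncated before its pooling and classification layers:

```python
import torch
import torch.nn as nn
from torchvision import models

# Pre-trained ResNet-152; newer torchvision versions use the weights= argument instead.
resnet = models.resnet152(pretrained=True).eval()

# Spatial features: everything up to (and including) the last residual stage.
spatial_extractor = nn.Sequential(*list(resnet.children())[:-2])

with torch.no_grad():
    image = torch.randn(1, 3, 448, 448)      # stand-in for a preprocessed image batch
    spatial = spatial_extractor(image)       # (1, 2048, 14, 14): retains spatial layout
    pooled = spatial.mean(dim=(2, 3))        # (1, 2048): global "no-attention" feature vector
```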

2.2 Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are a family of neural networks able to process sequential inputs, like time series, speech or language, thanks to their temporal dynamic behavior [35]. Differently from convolutional neural networks, RNNs store an internal state $h_t$ that summarizes the task-relevant aspects of the past sequence of inputs up to time step $t$ [10]. The state at time $t$, $h_t$, is computed as a function of the previous state $h_{t-1}$ and the current input:

$$ h_t = f(h_{t-1}, x_t, \theta) \qquad (2.2) $$

where $\theta$ are the parameters to be learned. The output $o_t$ is computed given the current state $h_t$. Therefore, the current state captures the whole previous sequence of inputs $(x_0, x_1, \ldots, x_{t-1}, x_t)$. The RNN learns to selectively memorize the aspects of the sequence that are important for solving the task of interest. A graph representation of an RNN is shown in Figure 2.5.

Figure 2.5. A recurrent neural network. (Left) The RNN summarizes the history of inputs $x$ in the internal state $h$. (Right) The RNN unfolded in multiple time steps. The weight matrices $U$, $V$, $W$ are shared across time steps. Image adapted from [35].

The recurrent structure of the graph can be unfolded to show multiple time steps of the sequence processing. The parameters $\theta$, however, are shared across time steps. The unfolded graph shows intuitively how the RNN can be trained with backpropagation. Werbos [45] introduced backpropagation through time (BPTT), a procedure specifically designed for training RNNs through backpropagation. However, training recurrent neural networks is challenging due to the long-range dependencies that the unfolded graph presents [46]. When propagating the gradient across many time steps, it either grows exponentially or shrinks close to zero [47], [48]. These problems are called, respectively, exploding and vanishing gradients. The problem is caused by the fact that the weights are tied across time steps: a long sequence implies repeated multiplication by the same weight matrix $W$ [10]. Therefore, the longer the sequence, the exponentially faster the gradient grows or goes to zero [46].
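A minimal sketch of the unrolled forward pass, assuming a simple tanh RNN; the closing comment points at why the shared weight matrix causes the gradient problems just described.

```python
import numpy as np

def rnn_forward(x_seq, W, U, b):
    """Unrolled vanilla RNN: h_t = tanh(W h_{t-1} + U x_t + b), with the same W at every step."""
    h = np.zeros(W.shape[0])
    states = []
    for x_t in x_seq:                      # one step per element of the input sequence
        h = np.tanh(W @ h + U @ x_t + b)
        states.append(h)
    return states

# Backpropagating through T steps multiplies the gradient by factors involving the same W
# at every step, which is why long sequences make gradients vanish or explode.
```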

2.2.1 Long Short Term Memory

To provide a solution to the exploding/vanishing gradient problem, Hochreiter and Schmidhuber [49] proposed the Long Short Term Memory (LSTM) architecture. The basic idea of the LSTM is to endow the state unit with a self-loop in which the gradient can flow backwards without vanishing or exploding. The term "long short-term memory" is strictly related to this idea: RNNs are equipped with a short-term memory, consisting of the activations of the recurrent link, and with a long-term memory, consisting of the weights that change slowly during training [49]. The same work [49] introduced a gated unit called the memory cell. The memory cell has a linear unit with a self-loop and two other multiplicative units, the output gate and the input gate. The input gate is designed to protect the stored information of the current unit from uninformative inputs, while the output gate protects the next unit from irrelevant information saved by the current unit.

The architecture was further improved by Gers et al. [50], who proposed a forget gate that learns which stored information to flush away. An illustration of the LSTM cell is provided in Figure 2.6.

Figure 2.6. LSTM memory cell. Adapted from [51].

Many variations of the LSTM cell have been proposed in the past [52], [53], but the standard version is composed of:

Forget gate: This unit is a sigmoidal layer that looks at the concatenation of the input $x_t$ and the previous internal state $h_{t-1}$ and learns which information of the previous cell state $C_{t-1}$ is not important and should therefore not be included in the current cell state $C_t$ [46], [51].

$$ f_t = \sigma(W_f\,[h_{t-1}, x_t] + b_f) \qquad (2.3) $$

Input gate: This unit decides which new information is going to be included in the new cell state $C_t$. A candidate value $\tilde{C}_t$ is created by a tanh layer. $\tilde{C}_t$ is later weighted by the input gate value $i_t$ [51]. Eq. 2.6 describes how the current cell state is computed.

$$ i_t = \sigma(W_i\,[h_{t-1}, x_t] + b_i) \qquad (2.4) $$

$$ \tilde{C}_t = \tanh(W_C\,[h_{t-1}, x_t] + b_C) \qquad (2.5) $$

$$ C_t = f_t * C_{t-1} + i_t * \tilde{C}_t \qquad (2.6) $$

Output gate: This unit learns which information of the current cell state not to include in the internal state [51].

$$ o_t = \sigma(W_o\,[h_{t-1}, x_t] + b_o) \qquad (2.7) $$

$$ h_t = o_t * \tanh(C_t) \qquad (2.8) $$
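Equations 2.3-2.8 translate directly into code; a minimal NumPy sketch of one LSTM step (the `params` dictionary holding the gate weights and biases is an illustrative assumption):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM step implementing Eqs. 2.3-2.8 on the concatenation [h_{t-1}, x_t]."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])          # forget gate   (2.3)
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])          # input gate    (2.4)
    C_tilde = np.tanh(params["W_C"] @ z + params["b_C"])      # candidate     (2.5)
    C_t = f_t * C_prev + i_t * C_tilde                        # cell state    (2.6)
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])          # output gate   (2.7)
    h_t = o_t * np.tanh(C_t)                                  # new state     (2.8)
    return h_t, C_t
```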

2.2.2 Gated Recurrent Unit

Gated Recurrent Unit (GRU) is another type of gated recurrent neural network, introduced by Cho et al. [54]. Similarly to the LSTM, it employs gates to learn which information to store in the internal state and which to ignore. More specifically, the GRU adopts an update gate and a reset gate that control the flow of information coming from the previous internal state. The reset gate is used to compute the candidate internal state, while the update gate decides how much the candidate will impact the current state. For a detailed mathematical formulation we refer the reader to the original work of Cho et al. [54] and to Section 3 of [55].
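For reference, a common way of writing the GRU update is given below, with notation analogous to the LSTM equations above and biases omitted; note that some formulations in the literature swap the roles of $z_t$ and $1 - z_t$.

$$ z_t = \sigma(W_z\,[h_{t-1}, x_t]) \qquad \text{(update gate)} $$

$$ r_t = \sigma(W_r\,[h_{t-1}, x_t]) \qquad \text{(reset gate)} $$

$$ \tilde{h}_t = \tanh(W\,[r_t * h_{t-1},\, x_t]) \qquad \text{(candidate state)} $$

$$ h_t = z_t * h_{t-1} + (1 - z_t) * \tilde{h}_t $$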

2.3 Word Embeddings

A word embedding is a semantically meaningful vector representation of a word [56]. The underlying idea is that, instead of representing words in a space having one dimension per word (one-hot encoding), a dense representation that preserves semantic relationships should be learned for each word [56]. The first to propose the idea were Bengio et al. [57]; however, the major breakthrough was the introduction of Word2Vec by Mikolov et al. [58]. Word2Vec is a word embedding model whose training objective is to discover word representations that are useful for predicting the surrounding words in a document [58].


Specifically, Mikolov et al. [58] proposed a novel architecture called the Skip-gram model, which is trained to predict the surrounding words within a certain range given the current word. Given a sequence of words $w_1, w_2, \ldots, w_T$, the learning objective to maximize can be formalized as:

$$ \frac{1}{T} \sum_{t=1}^{T} \; \sum_{-c \le i \le c,\; i \ne 0} \log p(w_{t+i} \mid w_t) \qquad (2.9) $$

An interesting aspect of this approach is that simple arithmetical operations can be applied to Word2Vec vectors and they result in meaningful representations. For example, adding the vector of the word "Germany" to that of the word "capital" results in a vector very close to the representation of the word "Berlin" [56]. We exploit this property in our augmentation experiment, described in Section 4.1.4, to combine vector representations of multi-word answers.
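A minimal sketch of how this additive property can be used to represent multi-word answers, assuming `word_vectors` is a pre-trained Word2Vec lookup (e.g. a dictionary from word to NumPy vector); the exact combination used in the augmentation experiment is described in Section 4.1.4.

```python
import numpy as np

def normalize(v):
    return v / (np.linalg.norm(v) + 1e-9)

def answer_vector(answer, word_vectors):
    """Represent a (possibly multi-word) answer by summing and normalizing its word vectors."""
    vecs = [word_vectors[w] for w in answer.split() if w in word_vectors]
    return normalize(np.sum(vecs, axis=0)) if vecs else None

def cosine(u, v):
    """Cosine similarity between two answer vectors."""
    return float(np.dot(normalize(u), normalize(v)))

# The analogy property: word_vectors["germany"] + word_vectors["capital"] is expected to lie
# close (by cosine similarity) to word_vectors["berlin"] in a well-trained Word2Vec space.
```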

2.4 Skip-thoughts

Skip-Thoughts is an unsupervised approach for learning generic distributed sentence representations. Inspired by the Skip-gram model [58], Kiros et al. [59] designed an architecture that optimizes the encoding of a sentence to predict the sentences around it in a corpus of contiguous text. Formally, given three contiguous sentences $(s_{i-1}, s_i, s_{i+1})$, the model encodes the sentence $s_i$ and tries to reconstruct the previous and next sentences, respectively $s_{i-1}$ and $s_{i+1}$.

Skip-thoughts employs an encoder-decoder framework [53]. Instead of LSTM units, it employs GRU units as encoder and decoder. The architecture uses one encoder, which encodes the full current sentence $s_i$ into a fixed-length representation $f(s_i)$, and two decoders, which take $f(s_i)$ as input and try to generate, one word at a time, the target sentences $s_{i-1}$ and $s_{i+1}$. At each time step, the decoders also take as input the target word of the previous time step. The process is illustrated in Figure 2.7.

Figure 2.7. Skip-thoughts architecture. (Right) the encoder creates a fixed-length representation of the sentence $s_i$: "I could see the cat on the steps". (Left) the two decoders generate, one word at a time, the target sentences $s_{i-1}$ ("I got back home") and $s_{i+1}$ ("This was strange"). The unattached arrows symbolize the output of the encoder $f(s_i)$, which is given at each time step, together with the previous target word, to the decoders. Image adapted from [59].


Skip-thought vectors can be used as off-the-shelf sentence representations for several NLP tasks, as shown in the original paper [59]. In our research we performed an experiment in which we incorporate the Skip-thoughts pre-trained encoder [60] in a VQA model to compute dense representations of the questions.

2.5 The VizWiz Dataset

2.5.1 Introduction

Historically, progress in many Computer Vision fields has been a consequence of the introduction of large publicly available datasets that have catalyzed the attention of the research community toward specific problems. For example, the introduction of ImageNet [61] and its annual challenge has significantly pushed research in object detection and classification, providing immense value to the entire Deep Learning community [38]. Over the last few years, many VQA datasets have been proposed [5], [17], [33], [62]–[65]. The most remarkable, in terms of attention drawn to the task, have been the VQA dataset [5] and its balanced version VQA v2.0 [17]. The VQA dataset (we refer to the real images split) has been constructed using images from the MS COCO dataset [66] and crowdsourcing questions and answers. Visual content in MS COCO originates from web-based image search and is typically high quality. Moreover, the VQA authors instructed the crowdsourced workers so as to collect interesting, diverse and well-posed questions.

Back in 2010, Bigham et al. [6] introduced VizWiz, a phone application aimed at helping blind people with their daily visual problems. VizWiz allowed visually impaired users to take a picture and verbally ask a question that they would like answered about it. It then enabled the visual questions to be answered by recruiting multiple workers from existing online marketplaces (e.g. Amazon Mechanical Turk).

Ultimately the project was shut down but the data collected did not remain unused.

Eight years later, in fact, Gurari et al. [3] gathered the data and fed it through a rigorous filtering and anonymization process aimed at protecting the privacy of any individual associated with it. For each image-question pair, they crowdsourced answers to support training and evaluation of AI models. At the Conference on Computer Vision and Pattern Recognition 2018 (CVPR18) they proposed the first “goal oriented” VQA dataset: not constructed in an artificial setting but built upon data originating from blind people.

Unlike other Visual Question Answering datasets, VizWiz arises from a natural setting, reflecting a use case where a person asks questions about the surrounding world. The blind person who takes the picture is the same one who subsequently asks the question. Spoken questions, recorded through the application, are transcribed by turkers, while the answers to the visual questions originate from crowdsourced workers. Ultimately, VizWiz is the first publicly available dataset originating from blind people and therefore addressing the specific goal of helping visually impaired people in their daily challenges [3].


(a) Q: "Hi is there a Speedo logo on the side of the swimming hat?" A: "yes" (x10)

(b) Q: "What is this?" A: "iphone charging", "phone" (x3), "smartphone", "cell phone", "iphone being charged", "iphone" (x2), "cell phone charger attached"

(c) Q: "What is this?" A: "scissors" (x9), "black scissors"

(d) Q: "What is the date on this milk?" A: "feb 08 13" (x5), "feb 09", "feb 8 2013", "feb 08 2013" (x2), "february 8 2013"

(e) Q: "How many calories are in this bottle of beer?" A: "145" (x2), "unanswerable" (x5), "300", "unsuitable", "1925 caloriesmodeloy"

(f) Q: "What is on this card?" A: "unanswerable" (x2), "numbers" (x2), "unsuitable" (x5), "tartan"

(g) Q: "Guy look like in this picture." A: "unanswerable" (x4), "sky", "no guy in picture", "unsuitable" (x2), "no picture", "0"

(h) Q: "This is a medication bottle and I'd just like to know if there is a label and dosage information visible. Thank you." A: "yes" (x2), "no" (x2), "unsuitable" (x2), "not visible", "unanswerable", "has label not sure about dosage", "but image in blurry"

Figure 2.8. Examples of visual questions and corresponding ground truth in the VizWiz dataset. The examples include answerable visual questions (top row, (a)–(d)) and unanswerable or unsuitable visual questions (bottom row, (e)–(h)).


2.5.2 Peculiarities

VizWiz is a challenging dataset for modern vision algorithms. Since blind people take the pictures and ask the questions verbally:

• images are often characterized by poor quality due to poor lighting, focus, and framing of the content of interest (e.g. bottom row of Figure 2.8);

• questions are on average more conversational and are sometimes incomplete due to audio recording imperfections, such as clipping the question at the end or recording background audio content (e.g. (g) and (h) in Figure 2.8).

Moreover, blind people are not able to verify the correctness of the picture they captured, and this often results in a mismatch between the content of interest and the question asked. The characteristics described above result in a large number of visual questions that are deemed unanswerable by crowdsourced workers. Indeed, turkers have been instructed to answer “unsuitable” if the image quality is inadequate and “unanswerable” if the question cannot be answered given the image content. A visual question can be unsuitable if the quality of the image is extremely poor and therefore it is not possible to infer an answer. A visual question can be unanswerable if, for example, the blind photographer fails to frame the content of interest. From the evaluation perspective, “unanswerable” and “unsuitable” are two different answers: predicting “unanswerable” while all annotators provide “unsuitable” as the answer results in zero accuracy. However, in the dataset “unanswerable” and “unsuitable” often overlap and express the same wider intuition, which is “the question cannot be answered”. This information is also available in the binary field “answerable” (see Figure 3.1). Turkers often use the two terms interchangeably, and this makes it difficult to discriminate between them (e.g. (g) and (f) in Figure 2.8).
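To see why this matters for evaluation, consider a simplified sketch of the standard VQA-style accuracy, which scores a predicted answer against the ten crowdsourced answers (the official protocol additionally averages over subsets of annotators, which is omitted here):

def vqa_accuracy(prediction, human_answers):
    """Simplified VQA accuracy: an answer gets full credit only if at least
    three of the ten annotators gave exactly the same string."""
    matches = sum(answer == prediction for answer in human_answers)
    return min(matches / 3.0, 1.0)

print(vqa_accuracy("unanswerable", ["unsuitable"] * 10))  # 0.0
print(vqa_accuracy("unsuitable", ["unsuitable"] * 10))    # 1.0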

2.5.3 VizWiz vs. VQA v2.0

In this section we compare the VizWiz dataset with the real images split of VQA v2.0.

Since most of the VQA models in the literature (e.g. [12], [13], [15], [16], [18]) are designed to achieve state-of-the-art performance on the VQA dataset, it is essential to know in what respects VizWiz differs from the most commonly used dataset. The peculiarities and unique aspects of VizWiz guided the design of the models proposed in Section 2.6. Moreover, benchmarking VizWiz against VQA v2.0 is useful to understand if and how it would be possible to augment the training set with samples from VQA v2.0.

Size

The first aspect we notice when comparing the two datasets is the large difference in size: VQA v2.0 contains almost 35x more visual questions than VizWiz.


dataset      n. samples
VizWiz       31,173
VQA v2.0     1.1 M

Table 2.1. Number of visual questions in VizWiz and VQA v2.0.

Questions

Figure 2.9. Distribution of the first words of all the questions in VQA (a) and VizWiz (b). Images from [3], [5].

Questions in VQA v2.0 have a small set of common initial words (e.g. “What”, “How”, “Is”, . . . ) (Figure 2.9a), while in VizWiz they often start with a rare word: 28% of VizWiz questions start with a term that occurs less than 5% of the time [3] (Figure 2.9b). This is attributable to the conversational nature of questions in VizWiz. Questions like “Hi, can you please tell me which is . . . ” in VizWiz would simply be “Which is . . . ” in VQA v2.0. From a qualitative analysis, we can infer that VizWiz users are more prone to ask simple, generic questions, often related to objects in the foreground. The most common question in the dataset is “What is this?” [3]. VQA crowdsourced questions, instead, are often very specific, since annotators are able to see the content in the image. Finally, blind users were aware of communicating with a human when using the VizWiz application [6]. Therefore, not only do they often use conversational terms like “Thank you”, “Hi”, “Please”, “Okay”, etc., but sometimes they also give verbose information not related to the question asked.
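As an illustration of how such conversational terms could be stripped before feeding questions to a model, the sketch below uses a small hypothetical term list based on the examples just mentioned; it is not the exact pre-processing pipeline evaluated in this work.

import re

# Hypothetical list of conversational terms, based on the examples above.
CONVERSATIONAL_TERMS = ["thank you", "thanks", "hello", "hi", "please", "okay"]

def strip_conversational(question):
    q = re.sub(r"[^\w\s]", " ", question.lower())      # drop punctuation
    for term in CONVERSATIONAL_TERMS:
        q = re.sub(r"\b" + re.escape(term) + r"\b", " ", q)
    return re.sub(r"\s+", " ", q).strip()              # collapse whitespace

print(strip_conversational("Hi, can you please tell me which is the red can? Thank you."))
# -> "can you tell me which is the red can"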
