

Multimodal Model for Construction Site Aversion Classification

Michael Appelstål

September 2020

Abstract

Aversions on construction sites can be anything from missing material to fire hazards or insufficient cleaning. These aversions appear very often on construction sites, and the construction company needs to report and take care of them in order for the site to run correctly. The reports consist of an image of the aversion and a text describing it. Report categorization is currently done manually, which is both time- and cost-ineffective.

The task for this thesis was to implement and evaluate an automatic multimodal machine learning classifier for the reported aversions that utilized both the image and text data from the reports. The model presented is a late-fusion model consisting of a Swedish BERT text classifier and a VGG16 for image classification.

The results showed that an automated classifier is feasible for this task and could be used in practice to make the classification task more time- and cost-efficient. The model scored 66.2% top-1 accuracy and 89.7% top-5 accuracy on the task, and the experiments revealed some areas of improvement in the data and model that could be explored further to potentially improve performance.


Sammanfattning

Aversions on construction sites can be anything from missing material to fire hazards or insufficient cleaning. Aversions like these occur often on construction sites, and the construction companies must report and handle them for the workplace to function safely and correctly. A report consists of an image of the aversion and a text describing what the aversion concerns. The categorization of these reports is currently done manually, which is both time-consuming and expensive.

The task of this project was to implement and evaluate an automatic multimodal machine learning classifier for the reported aversions. This model utilizes both the text and image data from the reports. The model presented is a late-fusion model consisting of a Swedish BERT for text classification and a VGG16 for image classification.

The results show that an automatic classifier is feasible for this task and could be used to make the reporting process both faster and cheaper. The model achieved a top-1 accuracy of 66.2% and a top-5 accuracy of 89.7%. The project also revealed potential areas of further development in both the data and the model that could increase performance.


Contents

1 Introduction
2 Aims and purpose
3 Background
  3.1 Construction site aversions
  3.2 Artificial Neural Networks
    3.2.1 Supervised Classification
    3.2.2 Forward pass
    3.2.3 Backwards pass
  3.3 Transfer Learning
  3.4 Natural Language Processing
  3.5 Text classification
    3.5.1 Representing text as features
    3.5.2 Word Embedding
    3.5.3 BERT
    3.5.4 Swedish BERT models
    3.5.5 Tokenization
  3.6 Convolutional Neural Networks
  3.7 Image Classification
    3.7.1 The VGG16 architecture
  3.8 Multimodal learning
4 Method
  4.1 Model overview
  4.2 Tools used
  4.3 Data set
  4.4 Preprocessing Data
  4.5 Text classification
    4.5.1 Preprocessing text
    4.5.2 Text classifier
  4.6 Image classification
    4.6.1 Preprocessing images
    4.6.2 Image classifier
  4.7 Fusion of classifiers
5 Experiments and Results
  5.1 Text Classifier
  5.2 Image classifier
  5.3 Fused model
6 Qualitative Results
7 Discussion
  7.1 Research Questions
  7.2 Data discussion
  7.3 Early fusion
8 Future Work
  8.1 Dataset


1 Introduction

Construction sites are full of risks and aversions in a wide range of categories. There can be less impactful faults like missing material or insufficient cleaning somewhere on the construction site, but there can also be higher-impact risks such as missing handrails on an elevated walkway or improperly secured scaffolding, which can cause serious injuries or even death. For construction companies it is important to report and act on these risks in the correct way, to minimize work-related injuries and to make sure that their sites operate correctly and efficiently.

Today most of the aversion reporting is done via a mobile phone application. With a mobile phone the construction workers can take a photo of the aversion, write a comment explaining it, and then choose a suitable aversion category from a menu of different aversion categories. This report is then sent to another unit that identifies what the aversion is, classifies it again, and contacts the appropriate person or unit to handle the aversion. This process is time- and cost-ineffective. As a result, workers often do not report smaller risks or do not have time to fill out reports in a detailed way, which in turn makes the work of the person classifying the risks much harder and more time-consuming.

Reporting faults and risks should be easy to do, and it should never be something anyone skips because it takes time or is difficult.

Automatic classification of these aversions could be done with machine learning. The classification made by the construction worker could then be skipped if the automatic classification worked sufficiently well. The report could also skip the human proxy checking the report and go directly to the unit responsible for solving the aversion.

This project tackled the problem by implementing a machine learning classifier that combines the text input with the image of the aversion to classify which category the risk belongs to, thus simplifying the task of reporting by reducing the number of steps. This would also be cost-efficient because the human proxy doing the classification is no longer needed.

The thesis was done at the consulting company Decerno, for one of their customers, a leading Swedish construction company.


2 Aims and purpose

The aim of this thesis is to create a model which classifies the aversions on a construction site as accurately as possible, based on an input consisting of an image and a sentence. This will be achieved by assessing the performance of different image and text classifiers, as well as how well these classifiers can be fused into one multimodal model.

The purpose of this thesis is to explore the possibility of automated classification of construction site aversions. By providing information on how well an automated model could work, the construction company can decide whether or not they want to update their current manual system.

The questions this thesis aims to answer are:

• How well can a machine learning model perform the task of classifying construction site aversions?

• Is the performance of the model increased if multiple modalities are used as input?

• Can this model replace the manual classification that is being done today?


3 Background

This section aims to provide the necessary background knowledge for the report. First, a description of what construction site aversions are and how they can look is given; then some technical background is brought up. An introduction to Machine Learning and Artificial Neural Networks and how they work is presented. The more specific theories and methods used in this thesis, such as image and text classification, are then explained.

3.1 Construction site aversions

A construction site aversion is anything out of the ordinary occurring at a construction site that needs to be dealt with. As previously stated, these aversions fall into a wide range of categories such as "General Order", "fire hazard" or "High elevation work". In this section some examples of reported aversions are brought up to give a good picture of what the reports can look like. The comments and the categories of the reported aversions obtained from the construction company were originally in Swedish. The model implemented in this thesis uses the Swedish comments and category names; however, for this report both the comments and the category names have been translated to English to make them easier to read and understand.

Figure 1 shows two examples of reported aversions for the largest class, "General order". In the figure, both the image of the aversion and the description of the aversion are given.


Figure 1: Two examples of reported aversions in the class "General order". The image shows the aversion, and the comment under it is the description given when it was reported.

Figure 2 shows two more examples of reported aversions, this time from the class "Fire hazard". Similarly, both the image of the aversion and the description of the aversion are given.


Figure 2: Two examples of reported aversions in the class "Fire hazard". The image shows the aversion, and the comment under it is the description given when it was reported.

These examples show that information can be obtained from both the image and the comment. This thesis aims to make use of both of these data sources when classifying the aversions.

3.2 Artificial Neural Networks

Artificial Neural Networks (ANNs) are an area of Machine Learning (ML) aiming to solve problems in a way inspired by how a biological brain would. The biological brain is made up of millions of neurons connected in a big network. These neurons are simple cells which are only able to solve a small part of a problem, but since the neurons are connected, the result of one neuron affects other neurons, and together they are able to perform very complex tasks [3].

Similarly, ANNs are made up of small nodes called perceptrons. A perceptron receives a number of inputs, each multiplied by a weight; if the sum of the weighted inputs is sufficiently large, it will send out an output signal y. To determine whether a signal should be sent and how large that signal should be, an activation function is used. An example activation function is ReLU (Rectified Linear Unit). Using a single perceptron limits the network to only performing simpler binary tasks. To tackle more complex problems, ANNs use many perceptrons in different layers, where the output of each perceptron in one layer is connected as an input to all the perceptrons in the following layer, until the output layer is reached. These intermediate layers are called hidden layers, and the more hidden layers an ANN uses, the deeper the network is. A deep network with many perceptrons contains a large number of connections between them, each having a specific weight w_i. These weights are the values being updated during the training of a model, to better fit the output of the model to the desired output [3].
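To make the mechanics concrete, here is a minimal sketch (not from the thesis) of a single perceptron with a ReLU activation, written in plain NumPy:

```python
import numpy as np

def relu(z):
    # ReLU activation: passes positive signals through, blocks negative ones
    return np.maximum(0.0, z)

def perceptron(x, w, b):
    # Weighted sum of the inputs plus a bias, followed by the activation
    return relu(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])  # example inputs
w = np.array([0.8, 0.2, 0.5])   # weights w_i, updated during training
b = 0.1                         # bias term
print(perceptron(x, w, b))      # output signal y
```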

In this thesis the ANNs were trained for supervised classification tasks. Training for that kind of task is done with a forward and a backward pass, often referred to as backpropagation [3]. This method was introduced by Rumelhart et al. in 1986 in their article Learning representations by back-propagating errors [20]. How this works is explained in sections 3.2.2 and 3.2.3.

3.2.1 Supervised Classification

Supervised classification is a machine learning task in which a model learns how to classify input data while knowing which output class the data corresponds to. This is accomplished by labeling the input data before training the model and using both the input data x_i and the corresponding labels y_i when training the model. This way the model knows the correct class for each input and can be updated according to how far off the prediction was.

3.2.2 Forward pass

In the forward pass the input x_i is fed into the first layer of the model g. The perceptrons in this layer take the inputs and apply their weights w_i. The activation function then determines whether the signal should be sent forward to the perceptrons in the next layer. This continues until the output layer is reached. The output is given as g(x_i), and since the target output y_i is known in supervised learning, the loss or error of the prediction can be calculated.

3.2.3 Backwards pass

The backward pass, or backpropagation, is the part of the training where the weights of the model are updated. This is done by calculating the loss gradient of the model with respect to each of the weights, layer by layer, starting from the end and propagating backwards through the model. By calculating the partial derivatives of the error C, the loss gradient is obtained. The chain rule is then used on these partial derivatives to get the effect each node has on the output. The weights in the nodes are then updated to give a better answer for the next input and reduce the loss of the model [20].

How much and in what direction the weights are updated is decided by an optimization algorithm, which aims to reduce the loss of the model and thus make the model better optimized. One popular optimization algorithm, and the one used in this thesis, is Adam [12].
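In Keras, the library used later in this thesis, attaching the Adam optimizer to a model is a one-liner. The sketch below is illustrative only; the layer sizes are arbitrary and not the thesis configuration:

```python
from tensorflow import keras

# A minimal dense network compiled with the Adam optimizer
model = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(10,)),
    keras.layers.Dense(5, activation="softmax"),
])
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss="categorical_crossentropy",  # gradients of this loss drive the backward pass
    metrics=["accuracy"],
)
```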

3.3 Transfer Learning

Transfer Learning is a technique where training is done for one task and the trained model is then reused as a base for a second task. The technique utilises the knowledge learnt in the first task to get a head start in the training of the second task. This makes training a model easier and faster, while also utilising a data source not originally meant for the specific task.

A common use for this is in computer vision, where popular models have been trained on very rich data sources such as ImageNet [6], which contains close to 1.3 million images in 1000 classes. These models can then be used as a base for your own classifier by adding extra layers to fit your problem. In section 3.7.1 the VGG16 model is described; for this model, pretrained weights can easily be obtained and used for other classification tasks. Normally it would take a very long time to train a VGG16 on that amount of data, but by using transfer learning that time can be heavily reduced.
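As a sketch of the idea (assuming Keras; the classifier head shown here is illustrative and not the exact architecture used later in the thesis), a pretrained VGG16 can be reused like this:

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

# Load VGG16 with weights pretrained on ImageNet, without its classification head
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # freeze the pretrained convolutional layers

# Add a new head for the target task (55 classes, as in this thesis)
model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(55, activation="softmax"),
])
```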

3.4 Natural Language Processing

Natural Language Processing is a field in AI which aims to make computers interpret and understand natural human language. The field dates back all the way to the 1950s, when Alan Turing published the paper Computing Machinery and Intelligence [25], and has since been applied to a variety of tasks such as speech recognition, question answering, text classification and many more.

3.5 Text classification

Text classification is one of the many topics in Natural Language Processing. It aims to assign a sequence of text to a category. It can be applied to classify larger segments such as documents or websites, but also at a smaller level, classifying sentences or comments. There exists a multitude of solutions and models for classifying text. This section discusses some of the methods that have proven most successful, and the preprocessing work necessary to apply those methods effectively.

3.5.1 Representing text as features

In order to classify a sequence of text, the sequence needs to be represented by features in some way. Unlike most other forms of data, text does not carry much information if it is used in its raw, unprocessed form. For example, there is nothing indicating to a computer that the word 'ship' is related to the word 'boat', because the words are made up of completely different characters. The words are therefore very different in their raw byte form, while having a similar meaning. In this case the word 'shop' is significantly more similar to 'ship' on a byte level than the word 'boat' is.

The same problem is found when using longer sequences of text, such as sentences or documents. Because of this, another way to represent text is necessary in order to apply ML algorithms to it efficiently.

3.5.2 Word Embedding

Word embeddings are a way of representing words as vectors in a d-dimensional space. This is done by mapping each word of a model's vocabulary to an individual vector representing that word in the d-dimensional space. In this vector space, words that are similar will appear close to each other and words that are different will be far apart. These vector representations are learnt by different ML models, often using rich data sources for training, which results in accurate representations that can easily be used in other ML models. Popular word embeddings are for example BERT [7], word2vec [15] or GloVe [18], which all use different methods to find the word vector representations. Word embeddings have been shown to give good performance across numerous NLP tasks [2][24][5]. The embeddings used in this thesis are from BERT and represent each word in the model's vocabulary as a 768-dimensional vector.
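A toy sketch (invented vectors, not from any real model) of why this representation helps: similarity between embedding vectors can be measured with the cosine of the angle between them, and semantically related words end up close together:

```python
import numpy as np

# Toy 3-dimensional embeddings; real models such as BERT use 768 dimensions
embeddings = {
    "ship": np.array([0.9, 0.1, 0.3]),
    "boat": np.array([0.85, 0.15, 0.25]),
    "shop": np.array([0.1, 0.8, 0.7]),
}

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["ship"], embeddings["boat"]))  # high: related words
print(cosine_similarity(embeddings["ship"], embeddings["shop"]))  # lower: unrelated words
```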

3.5.3 BERT

BERT stands for Bidirectional Encoder Representations from Transformers and is a deep learning model for Natural Language Processing tasks developed by Google. Described in Devlin et al. 2019 [7], BERT is a deep neural network (DNN) aimed at creating pretrained feature representations for words by considering the context of each word bidirectionally at each layer of the network. It is based on the Transformer model introduced in Vaswani et al. [26], and it has shown state-of-the-art performance on a variety of downstream NLP tasks [7] while remaining conceptually simple. Bidirectional in this model means that the feature representation trained for a word depends not only on the words preceding it in a sentence, but also on the words appearing after it. The bidirectionality gives a more accurate embedding representation for words since more context is taken into account.

This is achieved through how the word representations are pretrained. Pretraining BERT is done on two different tasks, Masked Language Modeling (MLM) and Next Sentence Prediction (NSP), and uses a rich data source of Wikipedia articles and BookCorpus for the training.

The MLM task is an extension of previously existing Language Modeling (LM) approaches. In previous LMs, such as Radford et al. [19], representations are trained by predicting the next word of a sequence unidirectionally, left-to-right or right-to-left. In these LMs the prediction of a word only depends on either the words appearing before or the words appearing after the word to predict. The reason for using only preceding or following words was that if the full sentence were given, the word to predict would also be given, making the prediction trivial. MLM solves this by masking out a percentage of the words in a sequence at random, replacing each with a [MASK] token, and then trying to predict the words that were masked out. BERT uses this and masks out 15% of the word tokens in a sequence, and can therefore obtain a bidirectional representation of the word tokens [7].

The second training task used when pretraining BERT is NSP. This task takes two sentences, A and B. The first sentence A is fed through the model, resulting in a prediction of whether the second sentence B is the next sentence in a text. B is the actual next sentence in only 50% of the cases, and a random sentence from the text source otherwise. This task aims to train the model to understand the relation between sentences [7].

After the pretraining is done, the BERT model can be applied to a number of downstream tasks such as sequence classification, question answering or sequence prediction. This is done by simply adding a task-specific extra output layer to the model. When training the downstream task, all the pretrained values from BERT are fine-tuned to better fit the specific task. Figure 3 shows an example of BERT being pretrained and then fine-tuned for three different tasks: MNLI, NER and SQuAD. The pretraining and fine-tuning use the same architecture, except that the fine-tuning has an extra output layer specific to the downstream task. The downstream tasks in Figure 3 are some of the tasks on which the performance was measured in [7]; they are commonly used tasks for measuring performance in NLP.

Figure 3: The pretraining and fine-tuning done with BERT. Both use an identical architecture except for the output layer, where the fine-tuning uses an output specified for its downstream task. Credit: [7]

3.5.4 Swedish BERT models

For BERT to be effective it has, as previously stated, been pretrained on rich data sources; the English BERT model is trained on BookCorpus and the English Wikipedia [7]. Through this training the model has learnt good feature vector representations for most English words and will work well even when used on non-English NLP tasks. But in order for BERT to achieve its full potential in other languages, models can be trained specifically for those languages. For the Swedish language, a model (KB-BERT) has been presented by Kungliga Biblioteket (KB) [14]. This model has been trained on a variety of Swedish text sources such as the Swedish Wikipedia, digitized newspapers, official reports of the Swedish government and more. Pretraining and fine-tuning are done the same way as for the regular BERT model. KB-BERT has been shown to outperform both the original BERT model and other multilingual BERT models across a range of Swedish NLP tasks [14].

Figure 4: The Transformer model used by BERT. The architecture is based solely on attention mechanisms. Credit: [26]
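As an aside (the thesis itself works through Keras), KB-BERT is also distributed via the Hugging Face hub; assuming the model id "KB/bert-base-swedish-cased", the pretrained model could be loaded as:

```python
from transformers import AutoTokenizer, AutoModel

# Load the Swedish KB-BERT model published by Kungliga Biblioteket
tokenizer = AutoTokenizer.from_pretrained("KB/bert-base-swedish-cased")
model = AutoModel.from_pretrained("KB/bert-base-swedish-cased")

inputs = tokenizer("Städa upp kring arbetsplatsen", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, number_of_tokens, 768)
```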

3.5.5 Tokenization

Before a text sequence can be represented as features, it needs to be split up into workable pieces. A popular way to do this is tokenization. Tokenization takes a text sequence and splits it into tokens; these tokens can be sentences, words or even smaller byte-level units.

The BERT model described in 3.5.3 uses WordPiece tokenization (Wu et al. [27]), which splits the text sequence into subword tokens, or wordpieces. An example sentence tokenized with the Swedish BERT version described in 3.5.4 could be:

Sentence: Städa upp kring arbetsplatsen ("Clean up around the workplace")

Tokenized sentence: 'stad', '##a', 'upp', 'kring', 'arbetsplatsen'

The sentence is now split up into smaller tokens which exist in the model's vocabulary and can be mapped to embeddings. This example sentence is taken from the data used in this thesis.

3.6 Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are a type of ANN that uses at least one convolutional layer in addition to its other layers. Convolutional layers were first introduced by Kunihiko Fukushima in 1980 when he proposed the Neocognitron [9], a neural network model for visual pattern recognition based on how the visual nervous system in mammals works. A convolutional layer, much like a normal dense layer, receives an input and produces an output. The difference is that a convolutional layer does not react to each input individually, but instead has a number of filters of fixed size that convolve over the input, reacting to a chunk of the data at once. This way the spatial information of an input is preserved while heavily reducing the number of connections in the network [17]. These filters are updated with backpropagation during training to identify local patterns in the data.

New architectures and models using CNNs have since seen a lot of improvements and variations. This has led to CNNs being able to solve a variety of tasks such as image classification [13], video classification [10], sentence classification [11] and many more.
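A minimal sketch (layer sizes are arbitrary, not from any model in this thesis) of a convolutional layer in Keras: the filters slide over the image, preserving spatial structure, and pooling then shrinks the resulting feature maps:

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    # 32 filters of size 3x3 convolve over the 224x224 RGB input
    layers.Conv2D(32, kernel_size=(3, 3), activation="relu",
                  input_shape=(224, 224, 3)),
    layers.MaxPooling2D(pool_size=(2, 2)),  # downsample the feature maps
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])
model.summary()
```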

(20)

3.7 Image Classification

Image classification is the task of classifying an image or an object appearing in an image. An example of this could be to classify what kind of animal appears in an image, or, as in this thesis, what construction site aversion appears in the image. CNNs have in recent years consistently been among the most used and best performing models for image classification. Different CNN architectures often top the charts in classification challenges such as the ImageNet Large Scale Visual Recognition Challenge [21].

3.7.1 The VGG16 architecture

The VGG16 is a 16-layer deep CNN introduced by Simonyan et al. [22] for the ImageNet Large Scale Visual Recognition Challenge 2014 [21], where it performed well on the classification and localization tasks [22]. The VGG16 uses small filter sizes (3x3) with a stride of 1, stride meaning how much the filter moves with each step. Smaller filters result in fewer parameters, and because of this the number of convolutional layers in the model and the number of filters for each layer could be increased [22]. This resulted in a very deep model, which in turn made the model more accurate [22]. The full architecture of the VGG16 model is shown in Figure 5. The model takes an input of size 224x224x3 and consists of 13 convolutional layers followed by three fully connected layers at the end. The model also uses five max pooling layers, after some of the convolutional layers, to reduce the size of the model.


Figure 5: The VGG16 architecture. Credit: [8]; the image is based on the theory described in [22].

3.8 Multimodal learning

Data obtained from the real world often comes in multiple types, or modalities. For example, an image could come with a text description or a title, and a video could also contain audio. When training a single-modality model, only features from one modality are learnt. In contrast, multimodal learning aims to learn features from more than one modality to complete its task.

An example of this is Ngiam et al. [16], where multiple techniques for training deep multimodal networks are presented. There, the models are trained on both audio and video in order to do speech recognition.

There exist many techniques and methods to train a multimodal model. Two popular techniques for combining modalities are early fusion and late fusion. The two techniques are well explained and compared in Snoek et al. [23], where a multimodal model is trained for semantic video analysis. For the tasks in Snoek et al. [23], the late-fusion models outperformed the early-fusion ones most of the time, at the cost of more computation, but for some tasks early fusion outperformed late fusion as well.

Early fusion combines the modalities on a feature level, and a model is then trained on the combined features of both modalities. An example of how early fusion could look for a classification task is shown in Figure 6.

Figure 6: An example of early fusion.

Late fusion, as the name suggests, fuses the modalities later on in the model. For example, separate models could first be trained on each modality and have their outputs fused together. The model then makes a decision on the combined output. This decision could be made either by a policy or by another network trained to make the decision. An example of how late fusion could look for a classification task is shown in Figure 7.
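A toy sketch (invented numbers, not from the thesis) of the difference between the two strategies, using simple averaging as the late-fusion decision policy:

```python
import numpy as np

text_features  = np.array([0.2, 0.7, 0.1])   # output of a text model
image_features = np.array([0.3, 0.5, 0.2])   # output of an image model

# Early fusion: concatenate the features and train ONE model on the result
early_input = np.concatenate([text_features, image_features])

# Late fusion: each model predicts on its own; a policy (here, averaging)
# or a small third model then combines the per-model predictions
late_prediction = int(np.argmax((text_features + image_features) / 2))
```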


4 Method

This section explains the methods used to solve the multimodal classification task and the reasoning for using these methods. The model overview in section 4.1 aims to give a high-level explanation of the model architecture and how the different parts of the model interact. After that, the parts of the model are explained separately and in more depth.

4.1 Model overview

The model developed in this thesis consists of three main parts: the text classifier, the image classifier and the fusion of the two classifiers. An overview of the model architecture is shown in Figure 8.

The flow of the model can be described in the following steps, where each number corresponds to the same number in Figure 8:

1. An input consisting of an image and a sentence is fed into the model.

2. The input is divided into two separate inputs: one used as the input for the text classifier and the other used as the input for the image classifier.

3. The inputs are fed into the two classifiers, which each generate a probability vector as output. The probability vector generated by the text classifier describes how probable it is that the sentence belongs to a certain class. Similarly, the probability vector generated by the image classifier describes how probable it is that the image belongs to a certain class.

4. The two probability vectors are then concatenated into one vector which is used as the input to a third classifier.

5. The third classifier generates a final probability vector based on the concatenated input. This output is the final prediction of the model and gives the predicted class.

The image and text classifiers were trained separately from each other and then frozen while training the third classifier.


Figure 8: An overview of the model architecture and the data flow. The image classifier and the text classifier are trained separately and their outputs are fused. The combined outputs are used to train a third classifier, which gives the final prediction of the model.


4.2 Tools used

The implementation was done in Python 3, chosen because of the powerful existing machine learning libraries that Python offers. The models were built with the Keras [4] library, a neural network library for Python built on top of TensorFlow [1]. Keras was chosen because it provides the functionality needed for this thesis, such as building ANNs, preprocessing images, preprocessing text and more. With Keras it is also easy to load pretrained models such as the VGG16 and BERT models used in this thesis.

Training deep neural networks can be computationally heavy and time-consuming, especially when working on high-dimensional data such as images. To tackle the computation and training time, a Google Cloud virtual machine was used. Google Cloud VMs were chosen since they are easy to set up for deep learning tasks, and system specifications such as GPU, memory and disk can be chosen and changed to fit a specific task. The machine used in this task had an NVIDIA Tesla K80 GPU and 16 GB of RAM.

4.3 Data set

The data set obtained for this thesis contained 115,372 data points, each representing one reported aversion on a real-life construction site. Each data point was labeled with one class and had a comment describing the aversion. Out of all the data points, 65,456 (nearly 60%) also contained one or more images of the aversion. In addition, the data points contained some metadata about the reported aversion, such as the date, the location of the report, which company made the report and which project the aversion was reported in.

The reports were distributed unevenly over 207 classes, where the largest class, "General order", accounted for over 30% of all data points, or around 35,000 aversions. In contrast, there existed classes such as "Workclothes" or "permissions" with as few as one reported aversion. Other larger classes were for example "Fire hazard", "Work at height" and "fences". In Figure 9 the class distribution over all the classes is plotted to further show the class distribution of the data set.


Figure 9: A class distribution chart where the bars correspond to the number of aversions per class. The largest class, "General order", has 32,000 reported aversions, which corresponds to 31% of the total reported aversions in the data set.

4.4 Preprocessing Data

Figure 9 shows that the data set is imbalanced and contains several small classes. The performance of the model depends on the quality and size of the data set and how many occurrences each class has. Therefore, only classes with more than 50 reports were used in the project; the rest were removed from the data set and not used for training the model. Filtering out the smallest classes decreased the number of classes from 207 to 90.

After filtering out the smallest classes, the data set was still imbalanced, and several classes were found to be almost identical. For example, both "Machines - Tools" and "Machines tools" existed in the data set as two different classes while referring to the same kind of aversion. In order to remove the overlapping classes, classes that were deemed similar were concatenated. This reduced the number of classes in the model from 90 to 55.
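A sketch of how this filtering and merging might look with pandas (the file name, column name and merge mapping are assumptions for illustration; the thesis does not specify them):

```python
import pandas as pd

df = pd.read_csv("aversion_reports.csv")  # hypothetical export of the report data

# Drop classes with 50 or fewer reported aversions
counts = df["category"].value_counts()
df = df[df["category"].isin(counts[counts > 50].index)]

# Merge near-duplicate classes into one (mapping shown is illustrative)
merge_map = {"Machines - Tools": "Machines tools"}
df["category"] = df["category"].replace(merge_map)
```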


The preprocessing of the data reduced the number of classes from 207 to 55 and the number of data points from 115,000 to 90,000, while also increasing the quality of the data significantly.

The new class distribution of the preprocessed data is shown in Figure 10. The data was still imbalanced, but less so than before.

Figure 10: A class distribution chart for the dataset after the preprocessing of the data had been done. The bars correspond to the number of aversions per class.

4.5 Text classification

The text classification part of the model aims to classify the input sentences received for this task. The goal was to have a model that worked well on the specific data obtained for this task, and design choices were made to achieve that. The text classifier consisted of three steps: preprocessing, embeddings from the pretrained Swedish BERT model (KB-BERT) and a dense ANN. Figure 11 shows how an example sentence goes through these three steps before a prediction is given.


Figure 11: The text classifier. The sentence "Gas canister should be stored in container" is sent as input. It goes through preprocessing and then into a fine-tuned BERT model, which gives a class prediction as output. In this case the predicted class was "Fire hazard".

4.5.1 Preprocessing text

The first step was to preprocess the sentences in order to make them suitable for training. The preprocessing done during this step was:

• All comments that provide no value were removed. This included empty comments or non-descriptive comments such as "ok" or "övrigt" ("other"). Since these comments appeared in almost all classes and provided no valuable information, it would be almost impossible for a model to classify them correctly; therefore they were removed.

• All non-Swedish comments were removed. For this project, 95% of the comments were in Swedish, and the model used for classifying them is pretrained on a Swedish vocabulary.

• All sentences were tokenized, i.e. divided into wordpiece tokens. This was done using the BERT tokenizer pretrained for Swedish.

• The tokenized sentences were given a fixed length of 30 tokens to fit the model, by padding short sequences with zeroes and cutting off the longest sequences after 30 tokens (see the sketch after this list). The length 30 was chosen to keep the input size down while removing as little useful data as possible: a large majority (99.3%) of the sentences used 30 or fewer tokens, while the longest sentence was 284 tokens.

• The inputs were split into training and validation data with a 90/10 split.
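A minimal sketch of the tokenize-and-pad step (assuming the Hugging Face tokenizer for KB-BERT; the thesis used Keras tooling, but the behavior is the same):

```python
from transformers import BertTokenizer

# Tokenizer id is an assumption; the thesis used the Swedish KB-BERT tokenizer
tokenizer = BertTokenizer.from_pretrained("KB/bert-base-swedish-cased")

encoded = tokenizer(
    ["Städa upp kring arbetsplatsen"],  # example comment from the data set
    padding="max_length",   # pad short sequences with zeroes
    truncation=True,        # cut off sequences longer than 30 tokens
    max_length=30,
    return_tensors="np",
)
print(encoded["input_ids"].shape)  # (1, 30)
```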


4.5.2 Text classifier

The model used as the text classifier is a combination of the BERT embeddings described in 3.5.3 and a dense ANN. The layers of the model are shown in Figure 12: an embedding layer and three dense layers, followed by the output layer. The embedding layer was extracted from the embeddings of the pretrained Swedish version of BERT developed by KB, described in 3.5.4. These embeddings were chosen since they had shown good performance on a range of Swedish NLP tasks. This way the pretrained embeddings from BERT were utilized without having to fine-tune the full BERT model. The reason for only using the embeddings was that the full BERT model performed at a similar level to the dense ANN while taking 3.5 hours to train. In comparison, the dense network took between 15 and 20 minutes to train.

The training was conducted on the preprocessed input sentences, and the model was updated during training with backpropagation to minimize the loss. The model used the Adam optimizer with an initial learning rate of 0.001. This learning rate was lowered automatically every time the model did not improve the validation loss for 5 epochs in a row. Training was continued until the validation loss had converged and a minimum was reached.
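A sketch of how such a classifier could be assembled in Keras. The hidden-layer sizes and the embedding matrix below are placeholders, not the thesis values; only the input length (30 tokens), embedding dimension (768), class count (55), optimizer and learning-rate schedule follow the text:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

num_tokens, embedding_dim, num_classes = 30, 768, 55
vocab_size = 50000                                        # placeholder vocabulary size
embedding_matrix = np.zeros((vocab_size, embedding_dim))  # stand-in for the KB-BERT embeddings

model = keras.Sequential([
    # Frozen embedding layer initialized with the extracted BERT vectors
    layers.Embedding(vocab_size, embedding_dim,
                     embeddings_initializer=keras.initializers.Constant(embedding_matrix),
                     input_length=num_tokens, trainable=False),
    layers.Flatten(),
    layers.Dense(256, activation="relu"),   # hidden sizes are assumptions
    layers.Dense(128, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(num_classes, activation="softmax"),
])

model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# Lower the learning rate when validation loss has not improved for 5 epochs
reduce_lr = keras.callbacks.ReduceLROnPlateau(monitor="val_loss", patience=5)
```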

The last layer, the output layer, gives a probability vector as output. The probability vector is an n-dimensional vector where each entry represents the probability of one outcome. The entries represent all the possible outcomes and do not overlap, so the entries always sum to 1. In this case, each entry represented the probability that a sentence belonged to a certain class. The entry with the highest probability was the most likely class for the sentence, so that was the model's prediction.


Figure 12: The model architecture of the implemented text classifier. The first layer is an embedding layer extracted from the Swedish BERT model trained by KB. The following layers are hidden dense layers, and the last layer is the output layer of the model. The output is a probability vector over the classes for a given input.

4.6 Image classification

For image classification, VGG16 was used. This choice was made based on the demonstrated performance of the VGG16 on similar tasks and its performance metrics compared with other image classifiers. Since the image classifier was trained separately from the text classifier, all images could be used, even the ones corresponding to sentences that were removed from the text classification data. However, some other preprocessing steps were done to fit the data.

4.6.1 Preprocessing images

The different preprocessing steps used for the image classifier are listed below, with the reasoning for why each step was used.


• The data received was sorted by project, but in order to feed it to the model it needed to be sorted by class. A script was made which looked up the class of each image by finding the corresponding report in a CSV file. The image was then moved into a directory with the class name.

• The images were resized to 224x224 in order to fit the VGG16 architecture. This was done with the Keras preprocessing tool. Originally the images varied heavily in size, the largest being 4128x3096 and the smallest 60x40, with a majority of images being around 1920x1080.

• Image augmentation techniques were used to artificially expand the image dataset by providing slightly altered versions of all images. This was done using the ImageDataGenerator in Keras (a sketch follows the list). The augmentations used were:

  – Horizontal and vertical shifts, moving the image slightly in one direction

  – Horizontal and vertical flips

  – Random zoom, which zooms either in or out

  – Random brightness alterations, which change the brightness and colors of an image

Adding these steps makes the model learn more generalized features and patterns in images, which is desirable since the model should be able to recognize any item, for example a fence, even if the fence is upside down or a different color from any fence the model has seen before.
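A sketch of the augmentation pipeline with the Keras ImageDataGenerator (the parameter values are assumptions; the thesis does not state the exact ranges used):

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    width_shift_range=0.1,       # horizontal shifts
    height_shift_range=0.1,      # vertical shifts
    horizontal_flip=True,
    vertical_flip=True,
    zoom_range=0.2,              # random zoom in/out
    brightness_range=(0.8, 1.2), # random brightness alterations
)

# Read images from class-named directories (path assumed), resized for VGG16
train_gen = datagen.flow_from_directory(
    "data/train", target_size=(224, 224), batch_size=32, class_mode="categorical"
)
```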

4.6.2 Image classifier

To initialize this model, weights were transferred from a VGG16 trained on ImageNet data [13]. This was done to take advantage of the rich data sources available in order to improve performance. The ImageNet dataset contains, as previously stated, close to 1.3 million images in 1000 classes. This is magnitudes more data than the 60,000 images obtained for this task.

After the last layer of the VGG16, an output layer was added. This layer was a dense layer using a softmax activation and gave the probability vector of the image classifier. The softmax activation normalizes the output of the model and turns it into a probability vector. This probability vector is the same size and works the same way as the probability vector of the text classifier: each entry in the vector represents the probability that an image belongs to a certain class.


Supervised training was done since all the images had a corresponding label. The weights of the VGG16 network and the additional dense layers were updated through backpropagation.

An image depicting the flow of the model is shown in Figure 13; it shows the steps the input image goes through before a class prediction is made for that image.

Figure 13: The image classifier's steps. An image is sent as input. It goes through preprocessing and then into a VGG16 model, which gives a class prediction as output. In this case the predicted class was "Fire hazard".

4.7 Fusion of classifiers

The final step of the model is to combine the text classifier and the image classifier and make a prediction based on both their outputs. This was done by pretraining the text classifier and the image classifier separately and concatenating their last layers into one. This concatenated output was then used as the input to a third and final classifier. When training the third classifier, both the previous two were frozen and not updated. The third model was thus trained to predict the class of an input based solely on what the previous classifiers predicted and how certain they were of their own predictions. This follows the architecture of a late-fusion model.

Since the outputs of both the image and the text classifier were probability vectors of the same size, the two were easily concatenated into a twice-as-long vector containing both classifiers' predictions. The model implemented for this last step was a 3-layer dense ANN outputting a probability vector with length equal to the number of possible classes. This model's architecture is shown in Figure 14. Training this model was supervised, and the network was updated through backpropagation.
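A sketch of the fusion network in Keras (the hidden-layer sizes are assumptions; the inputs, outputs and concatenation follow the description above):

```python
from tensorflow import keras
from tensorflow.keras import layers

num_classes = 55

# The two frozen classifiers each output a probability vector over the classes;
# the fusion network is trained on their concatenation
text_probs  = keras.Input(shape=(num_classes,), name="text_probabilities")
image_probs = keras.Input(shape=(num_classes,), name="image_probabilities")

fused = layers.Concatenate()([text_probs, image_probs])  # length 2 * num_classes
x = layers.Dense(128, activation="relu")(fused)          # hidden sizes assumed
x = layers.Dense(64, activation="relu")(x)
output = layers.Dense(num_classes, activation="softmax")(x)

fusion_model = keras.Model(inputs=[text_probs, image_probs], outputs=output)
fusion_model.compile(optimizer="adam", loss="categorical_crossentropy",
                     metrics=["accuracy"])
```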


5 Experiments and Results

In this section, the different experiments done on the model are described and the model's performance on them is shown. The results presented in this section are from the models trained as described in the Method section. As previously stated, the different parts of the model were trained and evaluated separately, and this section is therefore divided into subsections for each part of the full model.

The metrics measured in the experiments include top-1 accuracy, top-5 accuracy, precision and recall. Top-1 accuracy is the most common accuracy metric and is calculated as the number of correct predictions divided by the total number of predictions made:

\[ \text{Accuracy} = \frac{\text{Correct predictions}}{\text{Total predictions}} \]

Top-5 accuracy is calculated as the number of times the correct class appeared among the top-5 most probable classes for an input, divided by the total number of predictions. This is used to see how often the model was close in its prediction.

Precision and recall are calculated individually for each class, and a weighted mean can then be calculated over all the classes:

\[ \text{Precision} = \frac{TP}{TP + FP} \qquad \text{Recall} = \frac{TP}{TP + FN} \]

where TP = true positives, FP = false positives and FN = false negatives. Precision is a good measurement for an imbalanced dataset, since it says how well a prediction can be trusted for a specific class, and the mean precision says how well the system can be trusted overall. The weights of the weighted mean depend on how many occurrences of each class appear in the input data.
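A small sketch (not from the thesis) of how top-k accuracy can be computed from a matrix of predicted class probabilities:

```python
import numpy as np

def top_k_accuracy(probabilities, labels, k=5):
    # probabilities: (n_samples, n_classes) model outputs; labels: true class ids
    top_k = np.argsort(probabilities, axis=1)[:, -k:]  # k most probable classes per row
    hits = [label in row for label, row in zip(labels, top_k)]
    return float(np.mean(hits))
```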

5.1 Text Classifier

The text classification model was trained on 55 classes and used 79,304 sentences as input data. This data was divided into training and validation sets with a 90/10 split, resulting in 70,492 sentences for training and 8,812 sentences for validation. Because the data was so imbalanced, stratified sampling was used to make the split. Instead of taking 10% of the data for validation at random, stratified sampling takes 10% from each class to make the validation set. The validation set will be the same size as with random sampling, but gives a better estimate of how well the model performs. The accuracy of the model is shown in Figure 15. The model converged fast and achieved approximately 63% top-1 accuracy on the validation data and 84% top-1 accuracy on the training data. The top-5 accuracy, meaning the probability that the correct class was one of the top-5 predicted classes for an input, was higher at approximately 86%.
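A sketch of such a stratified split with scikit-learn (the dummy data below is invented to mirror the imbalance; the thesis does not state which library performed its split):

```python
from sklearn.model_selection import train_test_split

# Dummy stand-ins for the report comments and their class labels
sentences = [f"comment {i}" for i in range(100)]
labels = ["General order"] * 70 + ["Fire hazard"] * 30  # imbalanced, like the real data

# Stratified 90/10 split: 10% of EACH class goes to validation, so the
# validation set mirrors the class distribution of the full data set
X_train, X_val, y_train, y_val = train_test_split(
    sentences, labels, test_size=0.10, stratify=labels, random_state=42
)
print(len(X_train), len(X_val))  # 90 10
```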

Figure 15: Training and validation accuracy of the text classifier over the number of epochs. Both the top-1 and top-5 accuracies are shown for training and validation data.

Looking at the per-class performance of the text classifier showed that the performance for the individual classes was imbalanced. Some classes got an accuracy of over 80%, while others got close to 0%. Looking closer at which classes performed best and worst showed that, generally, the larger classes performed best while the smaller classes performed worse. The best performing class was "General order", which got approximately 82% top-1 accuracy; this is also the class with the most reported aversions.

To try to address the problem of the uneven dataset, three methods were tested: upsampling, downsampling and weighted loss. All these methods aim to make the model give more attention to small classes.


• Upsampling duplicates random inputs of smaller classes; this artificially inflates the number of observations for those classes and thus makes the dataset more balanced.

• Downsampling reduces the number of inputs of larger classes, which makes the class distribution more even.

• Weighted loss adds a weight for each class to the loss function of the model, with larger weights for small classes than for large classes. This makes the loss for misclassifying a small class bigger and thus makes the model pay more attention to such classes (a sketch follows the list).
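As a sketch of the weighted-loss idea (the labels are invented; the exact weighting scheme used in the thesis is not stated):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Dummy labels with the same kind of imbalance as the thesis data
y_train = np.array([0] * 700 + [1] * 250 + [2] * 50)

# "balanced" weights are inversely proportional to class frequency, so
# misclassifying a rare class costs more during training
classes = np.unique(y_train)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)
print(dict(zip(classes, weights)))  # e.g. {0: 0.48, 1: 1.33, 2: 6.67}

# In Keras, this mapping is passed to model.fit(..., class_weight=...)
```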

None of the methods tested worked for the classes in this thesis, and the model ended up performing significantly worse when they were applied. A reason for this could be that the classes were too similar on a feature level: when the model started to focus more on the smaller classes, the larger classes were misclassified instead.

5.2 Image classifier

The image classification model was trained on 60,764 images in 55 classes. The amount of training data is lower for the image classifier because not every reported aversion included an image, while all of them included text. The images were divided into training and validation data with a 90/10 split using stratified sampling, resulting in datasets of 54,705 images for training and 6,059 images for validation. In Figure 16 the training and validation accuracy is shown over the epochs of the training.

The image classifier got a top-1 accuracy of approximately 52%, a top-5 accuracy of almost 85% and a precision of 53%. This classifier was even more imbalanced in the top-1 accuracy of individual classes than the text classifier, with only the 5 largest classes getting an accuracy of over 30%. Attempts to solve this were made with upsampling, downsampling and weighted loss, but as for the text classifier, they did not improve the performance.

The training of this model was done on a Google Cloud VM with a Tesla K80 GPU and took approximately 2 hours.


Figure 16: Training and validation accuracy of the image classifier over the number of epochs. Both the top-1 and top-5 accuracies are shown.

5.3 Fused model

The fused model was built as described in the Method section. The input was classified by the image and text classifiers, their outputs were concatenated, and the resulting vector was used as input to the last classifier. When training the fused model, the image and text classifiers were frozen and their weights were not updated.

Training of this model was done with 79,304 sentences and 60,764 images in 55 classes. There were fewer images than sentences since not all of the reported aversions contained an image. When an aversion without an image was sent through the network, the probability vector of the image classifier was set to all zeroes.
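A small sketch of that missing-image convention (assumed helper, not thesis code):

```python
import numpy as np

NUM_CLASSES = 55

def fuse_inputs(text_probs, image_probs=None):
    # Reports without an image get an all-zero image probability vector,
    # so the fusion classifier must rely on the text alone in that case
    if image_probs is None:
        image_probs = np.zeros(NUM_CLASSES)
    return np.concatenate([text_probs, image_probs])
```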

In Figure 17 the training and validation accuracy of the fused model over the epochs is shown. The model converged fast, and since the network was relatively small, the training was very quick: around 10 minutes on a Tesla K80 GPU. The model achieved a 66.2% accuracy and 64.3% precision, which is an improvement on the separate text and image classifiers. The top-5 accuracy of the fused model reached 89.7%, which also improved on the separate classifiers' performance.


Figure 17: Training and validation accuracy of the fused classifier over the number of epochs.

Table 1 shows a comparison of the performance of the three classifiers. It shows that the fused model has higher performance than the text and image classifiers had separately.

Model                top-1 accuracy   top-5 accuracy   precision
Text                 62.5%            86.3%            61.7%
Image                52.2%            84.1%            53.1%
Text + Image fused   66.2%            89.7%            65.4%

Table 1: The performance of the separate image and text models as well as the fused model.

Similarly to how the individual classifiers performed, the fused model also had an imbalance in performance across the different classes. Figure 18 shows the accuracy of each class and Figure 19 shows the precision of each class. These bar graphs show how imbalanced the performance was between the classes, with the best classes performing well at around 80% accuracy and precision up to 80-100%, while the worst performing classes got accuracy and precision scores of 0%. Each bar in the figures represents one class, and the height of the bar is the performance of that class. Generally, the best performing classes were the ones with many reported aversions; to further illustrate this, Table 2 shows the performance of some classes and the number of reported aversions for each of them.

Figure 18: Bar graph of the accuracy for all classes, in percent. Each bar represents the accuracy of one class.


Figure 19: Bar graph of the precision for all classes, in percent. Each bar represents the precision of one class.

Class            # reports   accuracy   precision
General order    35372       82.2%      76.3%
Fences           10538       66.1%      66.1%
Work at height   7289        62%        54%
Lights           2384        74.2%      70.1%
Concealed area   150         25%        25%
Person fell      90          0%         0%

Table 2: The performance of some individual classes and the number of reported aversions for each class.

Experiments were also made using a threshold on how high the probability had to be in order for the model to predict a certain class. This means that the model only predicts a class when the probability vector has an entry over a certain threshold; otherwise the model does not predict any class. If the threshold is 0, the model works as usual, predicting the most probable class. But if the threshold is set to, for example, 0.7, a probability of 0.7 is needed for a prediction; if a class with over 0.7 probability exists, it is also the most probable one and is therefore predicted. This experiment was done using a few different thresholds, monitoring the performance of the model and how large a percentage of the aversions could be predicted using that threshold. In Table 3 the performance for five different thresholds is shown. It shows that the performance increases by a large margin when increasing the threshold, but with the downside of fewer aversions being classified.
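A sketch of the thresholded prediction rule (assumed helper, not thesis code):

```python
import numpy as np

def predict_with_threshold(probabilities, threshold=0.7):
    # Predict a class only when the model is sufficiently confident;
    # return None to fall back to manual classification otherwise
    best = int(np.argmax(probabilities))
    return best if probabilities[best] >= threshold else None
```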

threshold   accuracy   precision   recall   f1      % classified
0           65.2%      64.3%       62%      61.6%   100%
0.7         73.8%      71.2%       74%      72.3%   76%
0.8         77.2%      74.9%       76.8%    75.6%   67%
0.9         81.1%      79.3%       80.1%    78.4%   61%
0.99        91%        88%         89%      85%     31%

Table 3: The performance of the fused model when using different thresholds for predictions. The "% classified" column gives how much of the data could be classified when using that threshold.


6 Qualitative Results

In this section, more qualitative results are presented, with some examples of images and texts that were predicted with the classifiers. Examples of reports that were hard or easy to predict are presented to give an insight into when the model performed well and when it did not.

Figure 20: A correctly classified reported aversion.

The first example, shown in Figure 20, shows the model working well and predicting correctly in the image, text and fused model parts. This example is the one that has been used previously in the report when explaining how the model works.


Figure 21: Comparison of two reports from different classes that were predicted as the same class.

Figure 21 shows two examples that belong to different classes but were classified as the same class by the image, text and fused models. This could be because the reports are similar on a feature level, or it could be a sign that the classes "Dust" and "General Order" are very similar and could perhaps be combined into one.

Figure 22: An example report that did not contain an image and whose text sequence was misclassified.

Figure 22 shows an example of a report that the model misclassified. A reason for this could be that the report contained very little useful information, since it had no image and the text sequence mostly contained words which were not in the text model's vocabulary. The image could not give any prediction, and the text could not be represented by word embeddings.


7 Discussion

At the beginning of this report a few questions were asked; they will be discussed in this section. In addition to the questions, some other topics are brought up for discussion.

7.1 Research Questions

How well can a machine learning model perform the task of classifying construction site aversions?

This question can be answered by referring to the performance results given in the Experiments and Results section. Table 1 shows the results for the different parts of the model and how the final combined model performed. The fused model reached 66.2% top-1 accuracy and 89.7% top-5 accuracy. As shown in Table 2, the model performed best on classes with many reported aversions, reaching over 82% accuracy on the most reported class, while some small classes had 0% accuracy. This could mean that with more data for the smaller classes, the performance of the model could increase. So as more data is collected, it would be interesting to retrain the model and study how the performance might differ.

Does multimodality improve the performance in this case?

The results obtained from the fused model and the two single-modality models show that the fusion gave a slight increase over all measured metrics. The improvement gained from using multiple modalities was small: approximately a 4% increase in accuracy and precision over the single-modality text classifier. One theory on why the improvement was small is that both single-modality models performed well on the same classes. If the models had instead performed well on different sets of classes, the fused model could have learnt to listen more to one of the classifiers when it predicted a certain class.

Another reason might be the relatively low performance of the image classifier. As described in the Experiments and Results section, the image classifier only performed well on the largest classes and thus did not provide much value to the fused model on any of the smaller classes.

Can this model replace the manual classification that is being done today?

The model implemented and trained in this thesis would not be able to be used as a fully automated model in its current state. Since the model achieved a 66.2% accuracy, it would misclassify roughly 3 out of 10 reported aversions.

The model could be used to classify the aversion, but before the report is sent in, the person filling it in would have to confirm the class: if the automatic classification is correct, it gets confirmed and sent; otherwise the person doing the reporting can choose to manually select another class, perhaps from a list of recommended classes sorted by the probabilities obtained from the model.

Another way it could be used semi-automatically is as a recommendation system for categories. While this would not be a fully automatic system, it would still simplify aversion reporting significantly for the construction workers and for the unit responsible for handling the aversions. In this type of system, the person reporting would be presented with a list of suggested categories to choose from, based on the most probable classes according to the model. The model reached roughly 90% top-5 accuracy, so if 5 category suggestions are given, one of them would be correct roughly nine times out of ten.

A third way the model could be used is to classify only when it is sufficiently confident in the predicted class; otherwise classification would be done manually. In the Experiments and results section, experiments were done with different thresholds that had to be fulfilled for the model to make a prediction. For example, a threshold of 90% could be set, meaning that the model needs to be 90% confident in order to automatically classify the aversion; if it is not, manual classification would be used instead. Even if not all classification would be done automatically, the construction company would still save a lot of time and money if a portion of 60-70% were done automatically.
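
Such a confidence gate is straightforward to implement on top of the model's softmax output. The sketch below assumes a 90% threshold, which would have to be tuned on validation data.

```python
import numpy as np

CONFIDENCE_THRESHOLD = 0.90  # assumed operating point; tuned on validation data

def classify_or_defer(probs):
    """Auto-classify only when the model is sufficiently confident,
    otherwise defer to manual classification."""
    best = int(np.argmax(probs))
    if probs[best] >= CONFIDENCE_THRESHOLD:
        return best   # automatic classification
    return None       # manual classification

print(classify_or_defer(np.array([0.95, 0.03, 0.02])))  # -> 0 (automatic)
print(classify_or_defer(np.array([0.50, 0.30, 0.20])))  # -> None (manual)
```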

Either of these methods could be used initially while more data is collected. Since data is collected continuously, the data source will become steadily richer; the model could then be retrained after a while, and its new performance would determine whether it can be used fully automatically or whether the semi-automatic solutions should be tweaked to give a better user experience. The different semi-automatic modes could also work at the same time: for example, a prediction could be made automatically if the probability threshold is reached, and a list of other suggested classes could be provided otherwise.

7.2 Data discussion

The Experiments and results section showed that the model's performance was imbalanced across the different classes in the data. It also showed that some attempted methods for tackling this imbalance were ineffective. The reasons for, and solutions to, these problems would need to be studied further.

One possible reason, however, is that the data shares an overlap between classes on a feature level, meaning that the features of different classes are very similar. This could make the classes hard to distinguish from one another, and the model would naturally predict the most common classes more often, since the probability of an input belonging to one of the larger classes is magnitudes larger than the probability of it belonging to one of the smaller ones. An example of this is shown in Figure 21, where two reports belonging to different classes were very similar on a feature level, resulting in similar predictions.

This in turn could mean that more data is needed, especially for the smaller classes, in order to find more distinguishable features. Alternatively, classes that overlap heavily on a feature level could be combined into one, making the remaining classes more distinguishable. This could also be done by unsupervised clustering of the input data: classes that are clustered together are similar on a feature level and could possibly be merged, as sketched below. However, there might be cases where the features of some classes are similar but the classes should still remain separate. To make this clustering work well, more studies of the data's features would be needed.
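
As a rough sketch of the idea (not part of the implemented model), feature vectors extracted from the reports could be clustered with, for example, k-means. The feature matrix and the number of clusters below are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder feature matrix: one row per report, e.g. pooled BERT/VGG16
# activations. Real features would come from the trained encoders.
features = np.random.rand(500, 128)

# Cluster reports on feature similarity alone, ignoring their original labels.
kmeans = KMeans(n_clusters=20, random_state=0, n_init=10)
cluster_ids = kmeans.fit_predict(features)

# Original classes whose reports consistently land in the same cluster are
# candidates for being merged into one, more distinguishable class.
```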

Another problem with the data, found while doing the experiments, was that some reports contained non-valuable data or data that was hard for the model to extract features from. For example, the sentences used by the text classifier sometimes contained out-of-vocabulary words, i.e. words that do not exist in the vocabulary of the model, such as misspelled words or very area-specific words and abbreviations. An example of this was shown in Figure 22. The model cannot assign a word embedding to a word it has never seen or trained embeddings for, so such words contribute nothing when the model makes its prediction. To solve this, the vocabulary could be extended and new embeddings trained for these words, but doing so effectively requires a lot of data containing those words. An interesting solution would be if multiple Swedish construction companies worked together to build and train an extension, based on already existing Swedish models such as KB-BERT [14], that in addition to the normal vocabulary also contains area-specific construction phrases and abbreviations.
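
Out-of-vocabulary behaviour like this can be inspected with the published KB-BERT tokenizer. The sketch below flags words that tokenize to [UNK] or shatter into many wordpieces; the max_pieces heuristic and the example sentence are assumptions, not taken from the thesis.

```python
from transformers import BertTokenizer

# KB-BERT tokenizer published by the National Library of Sweden.
tokenizer = BertTokenizer.from_pretrained("KB/bert-base-swedish-cased")

def oov_like_words(sentence, max_pieces=4):
    """Flag words the tokenizer represents poorly: unknown tokens, or words
    shattered into many wordpieces (often misspellings or site-specific
    abbreviations)."""
    flagged = []
    for word in sentence.split():
        pieces = tokenizer.tokenize(word)
        if "[UNK]" in pieces or len(pieces) > max_pieces:
            flagged.append((word, pieces))
    return flagged

# Hypothetical report text; "B3kv" stands in for an area-specific abbreviation.
print(oov_like_words("armeringsjärn saknas vid hus B3kv"))
```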

Non-valuable data also occurred in some cases simply because reporting was not done properly: people wrote their name instead of a comment describing the aversion, or the description consisted of non-informative phrases like "fix this" or even single characters. The problem with non-informative data also existed in the image classification part, where some images were either all one color or too small to contain any valuable data; the smallest image in the reports was 60x40 pixels and almost entirely one color. To improve the quality of the data, some guides or rules on how to report could be given to the people reporting. The quality of the data could also improve naturally as people get more and more used to the system and how reporting is done.
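
Simple sanity checks could also filter out such reports before training or prediction. The thresholds in the sketch below are invented for illustration and would need tuning.

```python
from typing import Optional
from PIL import Image

MIN_TEXT_WORDS = 3    # assumed threshold: fewer words rarely describes an aversion
MIN_IMAGE_SIDE = 100  # assumed threshold in pixels (smallest report image was 60x40)

def is_informative(text: str, image_path: Optional[str]) -> bool:
    """Cheap sanity checks that weed out clearly non-informative reports."""
    if len(text.split()) < MIN_TEXT_WORDS:
        return False
    if image_path is not None:
        with Image.open(image_path) as img:
            if min(img.size) < MIN_IMAGE_SIDE:
                return False
            # A near-constant grayscale range means the image is almost one color.
            lo, hi = img.convert("L").getextrema()
            if hi - lo < 10:
                return False
    return True
```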

7.3 Early fusion

In this thesis a late fusion model was built, trained, and tested: the two modalities were fused late in the model, after each had first gone through an individual classifier. One interesting thing to try would be to implement an early fusion model instead and compare which fusion point gives the best performance for this task. The modalities would then be fused at the start of the model, on a feature level, as sketched below. To create an early fusion model for this task, studies would have to be done on how to fuse the data effectively at a feature level.
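
A minimal Keras sketch of such an early fusion architecture is shown below. The feature dimensions and layer sizes are assumptions chosen for illustration (e.g. a pooled BERT sentence vector and pooled VGG16 features), not a design evaluated in the thesis.

```python
import tensorflow as tf

NUM_CLASSES = 30   # assumed number of aversion categories
TEXT_DIM = 768     # e.g. a pooled BERT sentence embedding
IMAGE_DIM = 512    # e.g. pooled VGG16 convolutional features

# Early fusion: combine the raw modality features before any classification.
text_features = tf.keras.Input(shape=(TEXT_DIM,), name="text_features")
image_features = tf.keras.Input(shape=(IMAGE_DIM,), name="image_features")

fused = tf.keras.layers.Concatenate()([text_features, image_features])
x = tf.keras.layers.Dense(256, activation="relu")(fused)
x = tf.keras.layers.Dropout(0.5)(x)
output = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)

early_fusion_model = tf.keras.Model([text_features, image_features], output)
```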


8 Future Work

As future work for this project, there are some potential improvements that were not implemented due to time constraints or other limitations. In this section some potential future work is presented.

8.1 Dataset

One improvement would be to use a richer training source containing more reported aversions. As of now, all available data was used, but aversions are continually being reported on construction sites, which steadily builds up a richer data source. An interesting idea would be to train the model again in a couple of months, when new data is available, and see whether the performance increases.

Another potential improvement to the data is better preprocessing, so that the classes are more distinguishable from each other. This could be done manually, by merging classes that are similar, or potentially with an unsupervised clustering method. The clustering algorithm would group inputs with similar features together without looking at their initial labels, and these groups would become the new classes used for the classification task. This would make the reported aversions in a class more similar to the other aversions in that class, and more distinguishable from the aversions in other classes, since the classes would be formed from aversions that were already similar on a feature level. However, a risk in doing this is that classes that should be handled by different units of the construction company could mistakenly end up grouped together. To avoid this, a more thorough data analysis and supervised clustering would be needed.

8.2 Application to real life

For this model to be used in real life at construction sites, the currently used app would need to be updated or replaced. As of right now, the model is only usable on a computer with manually entered aversions. For the app to use the model and automatically classify aversions, the model would need to be loaded into the app, and each time an aversion is reported it would be sent through the model for classification.
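
On the serving side this could be as simple as loading the saved model once and running each incoming report through it. The model path and input format in the sketch below are hypothetical.

```python
import tensorflow as tf

# Hypothetical serving-side sketch: load the trained fusion model once,
# then run every incoming report through it. The path is an assumption.
model = tf.keras.models.load_model("aversion_classifier")

def classify_report(text_features, image_features):
    """Return the predicted class id and its probability for one report."""
    probs = model.predict([text_features, image_features])[0]
    best = int(probs.argmax())
    return best, float(probs[best])
```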

Since the accuracy of the model was 66%, a fully automatic model would misclassify aversions roughly three out of ten times. An alternative would be to have the model suggest the most probable classes to the worker sending in the report. This would not make the system fully automatic, but it would simplify the manual classification task, and the top-5 performance of 90% shows that such a system could give good recommendations for classes.

Another way to apply the model in real life is to classify aversions only when the model is sufficiently confident in its prediction. The experiments showed that the model achieved relatively high confidence on a large percentage of the reported aversions, and when a classification threshold was used a high performance was achieved. As more data is collected and the model is retrained, the number of aversions classified with high confidence should increase, and with it the percentage of automatically classified aversions.


References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015, software available from tensorflow.org. [Online]. Available: https://www.tensorflow.org/

[2] R. Al-Rfou, B. Perozzi, and S. Skiena, “Polyglot: Distributed Word Representations for Multilingual NLP,” arXiv e-prints, p. arXiv:1307.1662, Jul. 2013.

[3] I. A. Basheer and M. Hajmeer, “Artificial neural networks: fundamentals, computing, design, and application,” Journal of Microbiological Methods, vol. 43, no. 1, pp. 3–31, 2000.

[4] F. Chollet et al., “Keras,” https://keras.io, 2015.

[5] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, “Natural language processing (almost) from scratch,” Journal of Machine Learning Research, vol. 12, no. 76, pp. 2493–2537, 2011. [Online]. Available: http://jmlr.org/papers/v12/collobert11a.html

[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and F. F. Li, “Imagenet: a large-scale hierarchical image database,” Jun. 2009, pp. 248–255.

[7] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, Jun. 2019, pp. 4171–4186. [Online]. Available: https://www.aclweb.org/anthology/N19-1423

[8] M. Ferguson, R. Ak, Y.-T. Lee, and K. Law, “Automatic localization of casting defects with convolutional neural networks,” Dec. 2017, pp. 1726–1735.

[9] K. Fukushima and S. Miyake, “Neocognitron: A self-organizing neural network model for a mechanism of visual pattern recognition,” in Competition and Cooperation in Neural Nets. Springer, 1982.

[10] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-scale video classification with convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1725–1732.

[11] Y. Kim, “Convolutional neural networks for sentence classification,” arXiv preprint arXiv:1408.5882, 2014.

[12] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.

[13] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing sys-tems, 2012, pp. 1097–1105.

[14] M. Malmsten, L. Börjeson, and C. Haffenden, “Playing with words at the National Library of Sweden – Making a Swedish BERT,” 2020.

[15] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed repre-sentations of words and phrases and their compositionality,” in Advances in neural information processing systems, 2013, pp. 3111–3119.

[16] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, “Multimodal deep learning,” in ICML, 2011.

[17] K. O’Shea and R. Nash, “An introduction to convolutional neural networks,” arXiv preprint arXiv:1511.08458, 2015.

[18] J. Pennington, R. Socher, and C. D. Manning, “Glove: Global vectors for word representation,” in Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 2014, pp. 1532–1543.

[19] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving language understanding by generative pre-training,” 2018.

[20] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” nature, vol. 323, no. 6088, pp. 533–536, 1986.

[21] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.

[22] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” arXiv e-prints, p. arXiv:1409.1556, Sep. 2014.
