
On The Effectiveness of Multi-Task Learning

An evaluation of Multi-Task Learning techniques in deep learning models

Sofia Tovedal
Spring 2020

Degree Project in Interaction Technology and Design, 30 credits
Supervisor: Thomas Hellström
External Supervisor: Erik Westenius
Examiner: Ola Ringdahl

Abstract

Multi-Task Learning is today an interesting and promising field which many mention as a must for achieving the next level of advancement within machine learning. In practice, however, Multi-Task Learning is used in real-world implementations far more rarely than its more popular cousin Transfer Learning. The question is why that is, and whether Multi-Task Learning outperforms its Single-Task counterparts.

In this thesis, different Multi-Task Learning architectures were utilized to build a model that can label real technical issues within two categories. The model faces a challenging, imbalanced data set with many labels to choose from and short texts to base its predictions on. Can task-sharing be the answer to these problems?

This thesis investigated three Multi-Task Learning architectures and compared their performance to a Single-Task model. An authentic data set and two labeling tasks were used to train the models with supervised learning. The four model architectures (Single-Task, Multi-Task, Cross-Stitched and Shared-Private) first went through a hyper parameter tuning process using one of the two layer options, LSTM and GRU. They were then boosted by auxiliary tasks and finally evaluated against each other.

Keywords

Contents

1 Introduction
2 Objective
  2.1 Research Questions
3 Background
  3.1 Data
4 Theory
  4.1 Machine Learning
  4.2 Natural Language Processing
  4.3 Multi-Task Learning
    4.3.1 Aspects of Multi-Task Learning
    4.3.2 Contamination in Multi-Task Learning
    4.3.3 Approaches in Multi-Task Learning
    4.3.4 Multi-Task Learning Using Auxiliary Tasks
  4.4 Recurrent Neural Networks
    4.4.1 Long Short-Term Memory
    4.4.2 Gated Recurrent Unit
  4.5 Pre-Processing
5 Method
  5.1 Process
  5.2 Metrics
  5.3 Preparing Data
    5.3.1 Balancing Data
    5.3.2 Handling Missing Labels
    5.3.3 Preprocessing
    5.3.4 Cleaning
    5.3.5 Tokenization
    5.3.6 Padding
    5.3.7 Word Embedding
  5.4 Designing Models
    5.4.1 Model Architectures
    5.4.2 Single-Task Model
    5.4.3 Multi-Task Model
    5.4.4 Cross-Stitched Model
    5.4.5 Shared-Private Model
    5.4.6 Hyper Parameter Tuning
    5.4.7 Layer Types
    5.4.8 Boosting Performance Using Auxiliary Tasks
  5.5 Multi-Task Architecture Evaluation
6 Results
  6.1 Hyper Parameter Tuning
    6.1.1 Single-Task Model
    6.1.2 Multi-Task Model
    6.1.3 Cross-Stitched Model
    6.1.4 Shared-Private Model
  6.2 Layer Type Evaluation
  6.3 Auxiliary Tasks Evaluation
    6.3.1 Auxiliary Tasks Sharing the Whole Model
    6.3.2 Auxiliary Tasks Sharing Half of the Model
  6.4 Multi-Task Architectures Evaluation
    6.4.1 Ranged Accuracy
    6.4.2 Confusion Matrices
7 Discussion
  7.1 Data
  7.2 Layer Types
  7.3 Auxiliary Tasks Boosting
  7.4 Architectures
  8.1.1 Data Limitations
  8.1.2 Time Limitations
  8.2 Future Work
Bibliography
Appendix A The Most Common Words in the Swedish Language, 2020
Appendix B Enlarged Figures
  B.1 Hyper Parameter Tuning Results
    B.1.1 Single-Task Model
    B.1.2 Multi-Task Model
    B.1.3 Cross-Stitched Model
    B.1.4 Shared-Private Model
  B.2 Architecture Evaluation Results
    B.2.1 Single-Task Model
    B.2.2 Multi-Task Model
    B.2.3 Cross-Stitched Model

1 Introduction

Today most people carry a phone, both in the workplace and in their private lives, as it has become a necessity. This is due to the rapidly growing functionality a phone can provide and the tasks that can be completed with it. These tasks range from private life, where people complete payments, communicate and even identify themselves through apps, to the professional space, where companies are increasingly adopting mobile applications as a tool for the worker on the go. Especially in professions that require frequent reports and documentation out in the field, such as social workers and mechanics, the phone has become the main tool to report anomalies and issues.

However, the task of reporting issues itself has not been effectively adapted to this new format. A phone screen is smaller and harder to type on than the clipboards of the past, yet the reports still require just as much information. Often the worker has tens or even hundreds of categories to tag their report with, nested in tree structures that have to be clicked through without a proper overview.

In this thesis it is investigated how machine learning can be used for these kinds of applications.

In machine learning, models are constructed that specialize in labeling data. If this technology could be implemented in an app, a model could help the employee reporting the issue by suggesting labels based on the entered description of the issue, enhancing efficiency and eliminating tedious work.

Such a model needs to meet a few hard requirements. It needs to be able to output several labels of different types, as an issue is most often tagged in several ways. It also needs to handle a large quantity of labels of the same type, as these systems can have several hundred labels arranged in tree structures. The last requirement is performance: to help workers in the field, the model needs to be small enough to store locally on the device for offline work and efficient enough to give suggestions fast, so that the worker does not need to wait for them.

In addition to this, the data used for training the model may not be of the best quality. The data used in this study poses several challenges: it is domain-specific, written in Swedish, and each entry is short and often misspelled. This makes feature engineering troublesome and puts even more pressure on the deep learning algorithm itself to learn meaningful connections.

itself naturally to the paradigm of Multi-Task Learning as both tasks could contribute to one another. Multi-Task Learning is today an interesting and promising field which many mention as a must for achieving the next level of advancement within artificial intelligence. However, in reality, Multi-Task Learning is much more rarely used in real-world implementations, especially when compared to Transfer Learning. The question is why that is, and whether Multi-Task Learning outperforms its Single-Task counterparts.

This thesis aims not only to find models which are able to meet these requirements and label issues correctly, but also to evaluate different kinds of Multi-Task Learning architectures and methods. Three Multi-Task Learning models using different kinds of task-sharing are constructed, optimized and benchmarked against a Single-Task counterpart. We try to boost their performance using auxiliary tasks and also look deeper at the field of model design.


2 Objective

The objective of this study is to contribute to a better understanding of how and when different Multi-Task Learning techniques work, and why Multi-Task Learning is not as commonly applied as its cousin Transfer Learning. To achieve this objective, methods using three different kinds of knowledge sharing between tasks were compared, and the practice of applying auxiliary tasks to enhance performance was investigated.

The use case in this study is to predict labels from text describing a technical issue within two different categories: symptom and object. By constructing and optimizing models of each of the three task-sharing architectures, and evaluating them against each other on identical data sets, the best approach can be identified. The architectures will be evaluated for both task-specific and general accuracy. As the use case allows for taking more than the first guess into account, a range of the models' best guesses will be evaluated.

2.1 Research Questions

The first research question is how three multi-task models perform relative to each other and to a single-task model.

The second research question is whether using auxiliary tasks with a high knowledge correlation when training models improves performance. One part of this question is how auxiliary tasks are best chosen.


3 Background

Bontouch is a bureau which develops apps for many large companies in Sweden and abroad. One of their customers is Statens Järnvägar (SJ), which operates most of the train traffic in Sweden. Bontouch provides not only the customer app where tickets are purchased, but also the internal apps for employees, where they can validate passenger tickets as well as report anomalies and issues with the trains themselves. These issues include technical problems such as damages, need for maintenance, or disturbances.

When the employees of SJ report issues, they do so by filling out a form with free text and then tagging it with two labels: one describing the symptom and one describing the object. Each of these labels is multi-leveled, meaning you choose a group-level label and then a sub-level label. An example of this could be a group-label of ”Kitchen” with a sub-label of ”Sink”. This sounds like a fair system for reporting, but it turns into a daunting task when it becomes apparent that the sheer number of labels to choose from is in the hundreds, arranged in deep tree structures that are tricky to traverse or get an overview of.

However, this is the system which has been in use for an extended time, and this has led to a respectable amount of labeled data. From this data the idea was sparked that perhaps a model could be trained to predict said labels from the text entered in the free text field, thus minimizing the time and effort needed to make accurate reports.

3.1 Data

The data used for training and testing is authentic data derived from issues filed by employees working in public transport regarding mechanical errors or anomalies. Each report contains one text sequence and is tagged with a label and a group-label for the symptom, and a label and a group-label for the object.


Figure 1: Word counts in the current data set

There are two kinds of labels the employees tag the reports with: objects and symptoms. Table 1 displays the total number of group-labels and labels used in the study.

Table 1 Total number of labels

Category Group-label count Label count

symptoms 10 97

objects 113 195


4 Theory

Artificial intelligence and machine learning are paradigms that emerged in the middle of the 1900s. Both gained a lot of traction over the last few decades as major advancements were made within this field of science as well as within computer hardware. An incredible increase in the processing power of computers has enabled training of much larger and more powerful machine learning models. These advancements, along with the availability of the enormous data sets needed to train these models, have made a near exponential development possible. Their implementations, applications and consequences in the world are a hot topic in both the scientific and the commercial world.

4.1 Machine Learning

Machine learning is the practice of using algorithms to analyze and learn from data, and then using the result to make a decision or prediction about new data. There is a host of techniques and machine learning algorithms available to fit a model to a specific problem using preexisting data. Lately Deep Learning, which produces models that use multiple layers of decision-making nodes to allow high feature detection, has advanced beyond the more classic machine learning algorithms in both simplicity of use and performance. Deep Learning uses back-propagation to train the model, which is meant to take away the need to manually select features.

4.2 Natural Language Processing

Machine Learning is a vast field with many applications and sub-fields such as Knowledge Reasoning, Computer Vision etc. One of these is the field of understanding human language as people use it, known as Natural Language Processing. In Natural Language Processing, a model is to draw knowledge from text in order to complete a task such as translating, answering questions or, as in the case of this study, categorizing it [1].

4.3 Multi-Task Learning

leverage domain-specific information contained in the training signals of the different tasks, and allow for a larger number of features to be learnt. These signals also work as an inductive bias for the tasks, which is proven to enhance their ability to generalize [2][3][4].

The intuition behind Multi-Task Learning is that by learning several tasks at the same time, we can learn better by inducing knowledge from one task into the other. For example, if we were to learn how tomatoes look and only trained on classifying tomatoes, it might be the case that the tomatoes presented to us are always red. We might then think that a tomato is always red, and any green tomato in our way would instantly be classified as not a tomato. However, if we at the same time learned to identify apples, we might learn that apples can have many colors: green, red and yellow. This gives us the knowledge that fruit can have different colors. We then carry this knowledge into the tomato case as well, allowing us to see that the color of the fruit is not as important as we thought.

In a model, this means that the idea is to find a common representation in the early layers of a neural network, while allowing individual representations to form in later single-task branches of the network for each task.

The architecture described is however not the only way to share tasks in a model. There are endless possibilities for sharing, such as combining input/output, sharing weights, keeping both private and shared layers, etc.

4.3.1 Aspects of Multi-Task Learning

There are several different ways in which Multi-Task Learning can provide better learning opportunities for a model. Some are listed below, as described by Sebastian Ruder [2].

• Implicit Data Augmentation, where the extra task can provide the model with a larger sample size, which provides further knowledge. A larger sample size naturally leads to better performance, as the model gets more examples to train on.

• Attention Focusing, where multiple tasks can provide evidence for how relevant or irrelevant different features are. In general, a Multi-Task model will favor the features that are shared across the tasks, which makes it more likely that relevant features are kept.

• Eavesdropping, where the model can learn features from one task which can be useful for another task. Sometimes a feature is more prominent in one task, even though it could still be important in another. This adds to the probability that the model will choose to learn the important feature.

• Representation Bias, where the model is biased to prefer representations that other tasks prefer.

over-fitting is a huge problem in Machine Learning. The multiple tasks act as regularizers for each other, preventing the model from becoming too specialized in certain features.

All these aspects work together for the model to make better predictions.

4.3.2 Contamination in Multi-Task Learning

Training models using Multi-Task Learning is however not as simple as one might hope. When sharing tasks, the model runs the risk of task-specific features contaminating the results, lowering its performance. Simply put, Multi-Task Learning can limit a model's potential to specialize in its specific task, because the other task alters or even overwrites the features it has recorded in its connections. One way of handling this is to keep both shared and private layers in a model, with the intention that the private layers contain the task-specific features and the shared layers contain the general features present across tasks [5].

4.3.3 Approaches in Multi-Task Learning

The struggle between knowledge-sharing and contamination has produced several different approaches to solve the problem.

One is using Cross-Stitched Units, where the task-specific models are kept separate from each other but their activations are combined in between layers, allowing them to share activation without having to share all weights and connections [6]. Another is the Shared-Private practice of keeping both shared and private layers in parallel, with the intention that the private layers will keep task-specific features and the shared layers will keep shared features. The reality is, however, that a shared layer usually contains both shared and domain-specific features [5].

4.3.4 Multi-Task Learning Using Auxiliary Tasks


Choosing auxiliary tasks

Liebel and Körner found in 2018 that even seemingly unrelated tasks can improve the results of main tasks, but the auxiliary task has to be chosen carefully in order to aid the model. They state that the auxiliary task should be easy to learn and have labels that require low effort to obtain. Therefore, trying out several tasks to find the most helpful ones is beneficial [10]. In another paper, however, it is argued that the more mutual information the auxiliary task shares with the main task, the more effectively it helps. Mutual information in this case is defined as the amount of information obtained about one tag set, given the other tag set [11].

Supervising lower level auxiliary tasks

A lower level task is a task that could be considered a subtask of the main task the model aims to perform. For example, a subtask of language translation could be identifying verbs. Usually these tasks are left for the model to figure out on its own while it is training. However, supervising the model on such lower level tasks inside the network, and not only on the output layer, has proven beneficial for performance [12].

4.4 Recurrent Neural Networks

There are many different layers used in deep machine learning today, ranging from the simple multi-perceptron layer of the 80s to the much more complicated convolutional and recurrent layers that have emerged in the last few decades. When processing text, the most used layers are the recurrent ones, producing what is often referred to as Recurrent Neural Networks (RNN). This is due to their ability to consider time as a factor when computing activation, which is very valuable when learning from text-based input as the order of the words carries substantial meaning. The simplest Recurrent Neural Network neuron takes its former output into consideration when computing the current step's activation. This is effective, but also introduces the problem of exploding and vanishing error gradients, which makes deeper networks much harder to build and limits performance.

The problem occurs when the error gradient either grows uncontrollably while propagating backward through the network, causing oscillating weights and unstable learning, or vanishes completely due to the error shrinking during back-propagation, causing the network's learning to slow down or even stop.

This enables the cell to choose what information to remember and prevents exploding or vanishing gradients. Another unit that has gained more traction, and is shown to match or outperform the LSTM unit in some cases, is the Gated Recurrent Unit (GRU). It was presented in 2014 [14] and is a simpler version of the LSTM. It also utilizes a memory cell but has only two gates, which makes it computationally less expensive than the LSTM unit.

4.4.1 Long Short-Term Memory

Long Short-Term Memory (LSTM) is a representative variant of the RNN, presented in 1997 as a way to implement sufficient long short-term memory in back-propagation-trained neural networks, greatly reducing the problem of exploding and vanishing error gradients. The LSTM utilizes three gates and a memory cell in order to achieve this. The gates (the input gate, the output gate and the forget gate) enable each cell to decide on its own whether to update its value, forget its memory and/or use its input to produce its activation. This enables it to carry information about previous values through many time steps.

LSTM was first introduced as a solution to the problem of exploding or vanishing error gradients in the traditional RNN. The architecture was constructed to allow constant error flow through the unit by using its central part, the Constant Error Carousel (CEC) [13].

The input gate (4.1) of the LSTM decides whether the cell should be updated with the new value, the forget gate (4.2) decides if the memory cell should keep its old value, and the output gate (4.3) decides whether the cell outputs its activation.

i_t = σ(W_i x_t + U_i a_{t-1} + V_i c_{t-1})    (4.1)

f_t = σ(W_f x_t + U_f a_{t-1} + V_f c_{t-1})    (4.2)

o_t = σ(W_o x_t + U_o a_{t-1} + V_o c_t)    (4.3)

The candidate value is computed using the input as well as the previous step's activation (4.4).

c̃_t = tanh(W_c x_t + U_c a_{t-1})    (4.4)

The memory is updated with the candidate value if it is let through the input gate. At the same time, the previous value of the memory cell needs to pass through the forget gate to stay in the cell (4.5).

c_t = f_t c_{t-1} + i_t c̃_t    (4.5)

The activation is then finally computed by passing the updated memory cell's value through the output gate (4.6).

a_t = o_t tanh(c_t)    (4.6)
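To make the gate equations concrete, here is a minimal NumPy sketch of a single LSTM time step following equations 4.1-4.6. It is only an illustration of the equations above, not the implementation used in this thesis; the weight matrices in the dictionary p and their shapes are assumed for the example.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, a_prev, c_prev, p):
    # One LSTM time step following equations 4.1-4.6.
    # p holds the (assumed) weight matrices W*, U*, V* as NumPy arrays.
    i_t = sigmoid(p["Wi"] @ x_t + p["Ui"] @ a_prev + p["Vi"] @ c_prev)   # input gate (4.1)
    f_t = sigmoid(p["Wf"] @ x_t + p["Uf"] @ a_prev + p["Vf"] @ c_prev)   # forget gate (4.2)
    c_cand = np.tanh(p["Wc"] @ x_t + p["Uc"] @ a_prev)                   # candidate value (4.4)
    c_t = f_t * c_prev + i_t * c_cand                                    # memory cell update (4.5)
    o_t = sigmoid(p["Wo"] @ x_t + p["Uo"] @ a_prev + p["Vo"] @ c_t)      # output gate (4.3)
    a_t = o_t * np.tanh(c_t)                                             # activation (4.6)
    return a_t, c_t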

For better semantic understanding, tree-structured recurrent neural networks can be used [15].

4.4.2 Gated Recurrent Unit

The Gated Recurrent Unit (GRU), another variant of the RNN and in some ways a simpler version of the LSTM, was proposed in 2014. There is no consensus on which architecture is better, and although LSTM has been more widely used and for a longer time, GRU has been shown to outperform LSTM in certain tasks [16]. In a GRU neural network, each neuron or unit in the network keeps a memory cell used to produce its output. The memory cell enables the network to take previous time steps into consideration when computing its output, which greatly enhances performance when processing time-dependent information [14].

The memory cell in the GRU is guarded by two gates, the update gate z_t (4.7) and the reset gate r_t (4.8). They get their gate-like function from the sigmoid function, which ensures that their values are very close to either 0 or 1, thereby dictating how much of the value is let through the gate.

z_t = σ(W_z x_t + U_z a_{t-1} + b_z)    (4.7)

r_t = σ(W_r x_t + U_r a_{t-1} + b_r)    (4.8)

The reset gate is used to decide whether the memory cell value should be used when computing the new candidate value (4.9).

c̃_t = tanh(W x_t + U (r_t · a_{t-1}))    (4.9)

The update gate decides whether the memory cell should be updated with the new candidate value computed from the new input, or whether it should keep its memory from earlier time steps (4.10).

c_t = (1 − z_t) a_{t-1} + z_t c̃_t    (4.10)

The computed value c_t is the new memory cell value as well as the activation of the neuron (4.11).

a_t = c_t    (4.11)
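Analogously, a single GRU time step following equations 4.7-4.11 can be sketched as below; again this is only an illustration with assumed weights, not the code used in the study.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, a_prev, p):
    # One GRU time step following equations 4.7-4.11, with assumed weights in p.
    z_t = sigmoid(p["Wz"] @ x_t + p["Uz"] @ a_prev + p["bz"])   # update gate (4.7)
    r_t = sigmoid(p["Wr"] @ x_t + p["Ur"] @ a_prev + p["br"])   # reset gate (4.8)
    c_cand = np.tanh(p["W"] @ x_t + p["U"] @ (r_t * a_prev))    # candidate value (4.9)
    c_t = (1.0 - z_t) * a_prev + z_t * c_cand                   # memory cell update (4.10)
    return c_t                                                  # activation a_t = c_t (4.11)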

4.5 Pre-Processing


Tokenization

This process starts with tokenization, which is the process of taking text and segmenting it into meaningful units, tokens, which are easier for the machine to understand. The tokens do not have to be just words, but should represent each component of the text as accurately as possible. This process can range from the simple, where you remove punctuation and split by word, to the more advanced, where you also look at combinations of words, keep the special characters and turn them into tokens of their own, as well as remove very common or very uncommon words [18].


5 Method

To answer the research questions and fulfill the main aim of this study, one Single-Task model and three different Multi-Task models were constructed. They were optimized using hyper parameter tuning, and auxiliary tasks were used in an attempt to boost performance. Their performance was then computed using the data set described in Section 3.1 and the metrics described in Section 5.2. In addition, a smaller comparison of the layer types GRU and LSTM was performed to answer the question of which layer type performs better. It was executed by optimizing two models per architecture and comparing their performance as part of the larger hyper parameter tuning process.

5.1 Process

A pre-processing pipeline was designed and kept identical for each evaluation to make sure that the models' performance was tested and not the method of pre-processing. The training and validation data was split before each evaluation. The same data was used for the layer type evaluation and the final evaluation of architectures, but slightly re-sampled for the auxiliary tasks evaluations for reasons described in Section 5.4.8.

The layer design focused on the effect of the most popular layer types used in Natural Language Processing today: GRU and LSTM (see Sections 4.4.2 and 4.4.1). These are both recurrent layer types, but have shown different levels of performance in previous studies. The more complicated and older type, LSTM, is used in most studies but has also been outperformed by the newcomer GRU, which makes the question interesting. This comparison has less to do with Multi-Task Learning, but is still performed to contribute knowledge to the subject of recurrent neural networks. It is also relevant in general for use cases where models are trained to be used on device, as using the GRU type over LSTM produces much smaller models in terms of bytes stored.


performance.

The area of Multi-Task Learning architecture was focused on the different ways a model can be constructed to enable tasks to utilize and share information. Several kinds of Multi-Task Learning architectures are tested today, but it is not completely clear when, how and why they are useful. To investigate this, four architectures were evaluated. The first is a Single-Task model, which shares no tasks and was evaluated to act as a benchmark for the Multi-Task models. The second is a fully shared Multi-Task network, where all data and weights are shared and only the output layer is separated. The third is a Cross-Stitched model, where data is shared but no weights are shared. The fourth is the Shared-Private model, which utilizes both shared weights and data but also keeps private layers, where nothing is shared, in parallel. These are described in detail in Section 5.4.

Now, there are endless possibilities for how one could design a model, and the structures chosen are of great importance for the final performance of each model. This is also a great complication when researching Machine Learning, as there is often too little time to explore every possibility. To tackle this, we chose to keep the structures simple and base each one on a proposed architecture found in the research phase. There was some experimentation with the number of layers, and it was found that in general around 2-3 layers worked best in this case. The reason the designs were kept simple and quite similar to each other in dimensions is to ensure that the differences in performance recorded were indeed due to architectural differences and not the complexity of the model.

5.2 Metrics

First of all, the metrics had to be defined. Each evaluation measured two metrics: the exact accuracy and the ranged accuracy. The reason behind the ranged accuracy is that for a supervised suggestion system like the one in this study, the model could add a lot of value even if its first guess is incorrect, provided its second or third guess is correct, as these guesses can be displayed and accessed on the screen next to the first guess.

The exact accuracy of the model was used in training as well as evaluation. It produced three measurements per model: two being the times the model guessed each label correctly, and one being the times it guessed both of the labels correctly at the same time.

The ranged accuracy of the model is similar to the exact accuracy, but instead of only considering the model's best guess, it considers the model's best guesses within a range. For example, for the range three, the model is considered to have guessed correctly if the correct label is among the three labels it gives the highest probability. Each of these metrics was applied to the answers of the models both when evaluating the individual accuracy, where the accuracy is calculated per label, and the simultaneous accuracy, where the model is only considered to be correct if it predicts both labels correctly at the same time.
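As an illustration of these metrics, the sketch below computes ranged (top-k) and simultaneous accuracy from matrices of predicted class probabilities, one row per issue. The helper names and input format are assumptions made for the example; exact accuracy corresponds to k = 1.

import numpy as np

def ranged_accuracy(probs, true_labels, k=1):
    # Share of issues whose true label is among the k guesses given the highest probability.
    top_k = np.argsort(probs, axis=1)[:, -k:]
    hits = [true_labels[i] in top_k[i] for i in range(len(true_labels))]
    return float(np.mean(hits))

def simultaneous_accuracy(probs_a, labels_a, probs_b, labels_b, k=1):
    # Correct only if both labels (e.g. symptom and object) are within the range k.
    top_a = np.argsort(probs_a, axis=1)[:, -k:]
    top_b = np.argsort(probs_b, axis=1)[:, -k:]
    hits = [(labels_a[i] in top_a[i]) and (labels_b[i] in top_b[i])
            for i in range(len(labels_a))]
    return float(np.mean(hits))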


5.3 Preparing data

The data at hand in this study comprised 4503 issues filed by train employees while on duty, addressing different problems such as leakages etc. Each issue had been tagged with either one or both of the categories symptom and object. For example, the issue shown in Table 2 contains the description, which is the input the models will base their predictions on, and four labels. The symptom code and the object code are the labels the models will try to predict, while predicting the symptom group and the object group are the tasks we will use as auxiliary tasks. As these groups are the groups the codes belong to, they should hold a high information correlation with the codes.

Table 2 Example of an issue and its labels

Description: Resenär som spillt kaffe, stolsenhet måste rengöras. Säte: V2 PL67
Symptom Group: C-VISU
Symptom Code: VS09
Object Group: 844-INF
Object Code: 844E

The description in Table 2 explains in Swedish that a passenger has spilled some coffee on a chair which needs to be cleaned; the identification of the chair is also added. To this, the symptom code ”Food Spillage” (VS09) and the symptom group ”Visual” (C-VISU) have been attached, along with the object code ”Chair” (844E) which is, interestingly, a part of the group ”Infotainment” (844-INF). Since these two tasks, predicting the symptom and the object codes, are to some degree correlated, they should serve as a good pair of tasks for Multi-Task Learning.

However, as described in the background of this study, the data was severely imbalanced: it contained many labels which were only used once or twice and a few that were used more than a thousand times. The data set was also far too small to apply any of the more straightforward data balancing techniques, such as deleting random issues from the popular labels or removing labels with too few issues; every issue that could be spared was needed. To add to the problems, there were a lot of missing labels which also needed to be dealt with.

5.3.1 Balancing data

evened out a bit, as illustrated in Figure 2. The method had the side effect of some oversampling, which increased the issue volume to 6231 issues, split into a training data set of 5307 and a validation data set of 924. It was ensured that no issue occurred in both the training and the validation data set.

Figure 2: Label distribution in validation and training data sets, displayed by plotting the labels on the x axis and their respective number of occurrences on the y axis.

However, it is obvious that the data is still hugely imbalanced, especially in the training portion of the data. Since no issues could be removed, class weights were instead utilized to fight overfitting. Class weighting is when you weight the loss function for each label in proportion to how many times it occurs in the training data. This means that when adjusting to an issue labeled with a popular label, the error calculated will be much smaller than when adjusting to an issue labeled with a rare label. The equation used to calculate each class weight is displayed in (5.1), where the class weight is denoted w, the number of issues in the data set labeled with this class is denoted n, and the total number of issues is denoted N.

w = (1/n) · (N/2)    (5.1)

It should be mentioned that the weights were calculated using only the training data and not the validation data.
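A minimal sketch of how class weights following equation 5.1 could be computed from the training labels is shown below, assuming a Keras-style class_weight dictionary mapping each label index to its weight; the function name and input format are assumptions for illustration.

from collections import Counter

def compute_class_weights(train_labels):
    # Class weight per label according to w = (1/n) * (N/2), computed on training data only.
    counts = Counter(train_labels)               # n per label
    total = len(train_labels)                    # N
    return {label: (1.0 / n) * (total / 2.0) for label, n in counts.items()}

# Example (hypothetical variable names):
# weights = compute_class_weights(symptom_train_labels)
# model.fit(..., class_weight=weights)          # Keras-style usage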

5.3.2 Handling Missing Labels

themselves can distort the knowledge of the model and prohibit learning, as issues falling under the missing label probably have little correlation with each other.

5.3.3 Preprocessing

Preprocessing is an important step when implementing natural language processing. Data in the form of text written by users must be turned into data which a neural network can learn from. Neural networks are basically a sequence of matrix multiplications, so the input they take must be numerical vectors. In this case, each word is represented as a numerical vector, which turns a sentence into a two-dimensional matrix. This is called an embedding. It was produced by feeding a dense vector to an Embedding layer, which maps each unique number to its vector and returns the corresponding two-dimensional matrix.

Before being fed to the Embedding layer, to produce the dense vector, the text was converted through a three-step pre-processing pipeline which consisted of cleaning, processing, and tokenization. Each step is described in detail in the following sections.

This process was kept exactly the same for all evaluated models, to make sure that the models' performance was tested and not the pre-processing method.

5.3.4 Cleaning

To reduce the noise in the data, the text was first cleaned by removing information with low relevance such as punctuation, special characters, capitalization, and digits. While such information does carry meaning, it is too general and ambiguous to actually add value for most neural networks. Removing it also makes it easier for the network to recognize that a word followed by a special character has the same meaning as the same word without the special character. As an example, the description from the actual training data shown earlier in Table 2 is displayed below in italics. The data is written in Swedish and describes a technical issue with a train.

Resenär som spillt kaffe, stolsenhet måste rengöras. Säte: V2 PL67

When considering the raw data, we see that the word ”Säte” is followed by a colon, a special character. Even though special characters do carry meaning, our network has more to gain from associating this noun with other texts where the word appears without it. Therefore, special characters were removed. The same reasoning applies to capitalization: the network should not distinguish words from their capitalized versions. Digits are also too ambiguous and were removed, which finally produces the cleaned text below.

resenär som spillt kaffe stolsenhet måste rengöras säte v pl
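A cleaning step of this kind could look roughly as follows; the regular expressions are assumptions made for the example and not necessarily the exact rules used in the thesis.

import re

def clean_text(text):
    # Lowercase and strip punctuation, special characters and digits.
    text = text.lower()
    text = re.sub(r"[^a-zåäö ]", " ", text)      # keep only letters (incl. Swedish åäö) and spaces
    return re.sub(r"\s+", " ", text).strip()     # collapse repeated whitespace

# clean_text("Resenär som spillt kaffe, stolsenhet måste rengöras. Säte: V2 PL67")
# -> "resenär som spillt kaffe stolsenhet måste rengöras säte v pl"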


5.3.5 Tokenization

After the data was cleaned, it was converted into numerical vectors through a process called tokenization. This process simply replaces the words in each text with a unique index, which the embedding layer can later use to look up each word's unique vector. In order to perform this, a vocabulary was built from the training data: all training data was cleaned, and then each unique word was extracted and given a unique index in the vocabulary. However, a word was only kept in the vocabulary if it appeared more than five times in the training data, to avoid filling the vocabulary with very rare words.

This vocabulary was then used to encode each text sequence before feeding it to the Embedding layer. The tokenization was performed by first splitting the string into a list of words.

[’resenär’, ’som’, ’spillt’, ’kaffe’, ’stolsenhet’, ’måste’, ’rengöras’, ’säte’, ’v’,’pl’]

Then each word that was not in the vocabulary was replaced with the tag for unknown words, <UNK>, as such words were not present in the embedding layer and therefore cannot be interpreted. Replacing them with a tag made sure the structure of the sentence was not too malformed if it contained several unknown words. By the same argument as for removing special characters in the cleaning process, very common words were also replaced by a tag, as their meaning is too ambiguous and creates noise. These kinds of words are called stop words and were hence replaced by the tag <SW>. The stop words removed in this study consisted of the 20 most common words in the Swedish language. Below is the text after adding tags.

[’resenär’, ’<SW>’, ’spillt’, ’kaffe’, ’stolsenhet’, ’måste’, ’rengöras’, ’säte’, ’<UNK>’,’<UNK>’]

The special tags were already in the vocabulary, added in the same way as all other words by giving them a unique index, in our case 1 for unknown words and 2 for stop words. Afterwards, the text was converted to a numerical vector by replacing each word with its corresponding unique number:

[ 87, 2, 118, 163, 164, 23, 154, 212, 1, 1]

This is what is called the dense vector and it is fed into the model via the embedding layer.
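The vocabulary building and encoding described above could be sketched as follows, with the reserved indices 1 (<UNK>) and 2 (<SW>); the function names and the exact cut-off handling are assumptions for illustration.

UNK, SW = 1, 2                                  # reserved indices for unknown words and stop words

def build_vocab(cleaned_texts, min_count=5, first_index=3):
    # Give an index to every word appearing more than min_count times in the training data.
    counts = {}
    for text in cleaned_texts:
        for word in text.split():
            counts[word] = counts.get(word, 0) + 1
    vocab = {}
    for word, n in counts.items():
        if n > min_count:
            vocab[word] = first_index + len(vocab)
    return vocab

def encode(text, vocab, stop_words):
    # Replace stop words with <SW> (2) and out-of-vocabulary words with <UNK> (1).
    return [SW if w in stop_words else vocab.get(w, UNK) for w in text.split()]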

5.3.6 Padding

excluding the longest 5% of the data, in this case 100. Data longer than this maximum value was simply truncated, and data shorter than it was padded from the start of the sequence to full length, giving us our final dense vector.

[0,..., 0, 1, 87, 1, 118, 163, 1, 2, 164, 2, 118, 1, 2, 1, 87, 1, 1]

We padded at the start of the sequence simply due to the default behavior of our padding function. The padding zeros were later masked in the Embedding layer, which removes their significance.
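A sketch of the padding and truncation step, assuming pre-padding with zeros to the maximum length of 100; Keras provides an equivalent pad_sequences utility.

def pad_sequence(indices, max_len=100, pad_value=0):
    # Truncate sequences longer than max_len and pre-pad shorter ones with zeros.
    if len(indices) >= max_len:
        return indices[:max_len]
    return [pad_value] * (max_len - len(indices)) + indices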

5.3.7 Word Embedding

After pre-processing, we fed the data to the Embedding layer. The Embedding layer is, simply put, a lookup table which maps each given unique number to a vector. It outputs the whole sequence as a matrix for the neural layers to interpret. It is trained along with the neural layers and acts as an interface between pre-processing and the neural layers.
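Assuming a Keras-style framework (the thesis does not spell out its framework here), such a trainable embedding with masked padding zeros could be declared as below; the vocabulary size is a placeholder.

from tensorflow.keras.layers import Embedding

vocab_size = 2000                                # hypothetical vocabulary size

# mask_zero=True lets downstream recurrent layers ignore the padded 0 positions.
embedding = Embedding(input_dim=vocab_size + 3,  # +3 reserves indices for padding, <UNK> and <SW>
                      output_dim=64,             # embedding size, one of the tuned alternatives
                      mask_zero=True)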

5.4 Designing Models

Multi-Task Learning can be achieved in several ways. In this study, four alternatives were investigated, each using a different style of sharing information between tasks. This section describes the method of constructing and designing the four models.

5.4.1 Model Architectures

The model architectures were designed to use task sharing in different styles, ranging from no task sharing at all in the Single-Task model to a fully shared model such as the Multi-Task model. In between, we have the Cross-Stitched model, which only shares information, and the Shared-Private model, which has both shared and private layers. All models were lined with batch normalization layers to reduce overfitting, a big problem when working with this little data.

5.4.2 Single-Task Model


were only different in the type of layers they use.

Figure 3: Single-Task Models using different Layer Types

5.4.3 Multi-Task Model

The Multi-Task model shared all of its layers between the two tasks, except for the last layer, which was kept task-specific. This means both data and weights are shared when solving the two tasks. The purpose of evaluating this is to see whether the argument that this allows for better generalization and a more knowledgeable shared representation holds. This fully shared model is at a higher risk of contamination between the tasks, but has a good chance of benefiting from the tasks' combined knowledge and of generalizing.

The two models which went through the Random Search hyper-parameter tuning and which were then evaluated against each other are displayed in Figure 4, and were only different in the type of layers they use.

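To illustrate this kind of hard parameter sharing, the sketch below builds a small fully shared model with two task-specific output heads using the Keras functional API. Layer sizes, layer types and names are assumptions for the example, not the exact design reported in Section 6; the label counts are taken from Table 1.

from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Embedding, GRU, BatchNormalization, Dense

vocab_size = 2000                                # hypothetical vocabulary size
num_symptom_labels, num_object_labels = 97, 195  # label counts from Table 1

inputs = Input(shape=(100,))                     # padded index sequence
x = Embedding(vocab_size + 3, 64)(inputs)        # shared embedding
x = GRU(128, activation="tanh", return_sequences=True)(x)   # shared recurrent layers
x = BatchNormalization()(x)
x = GRU(128, activation="tanh")(x)

# Only the output layers are task-specific; everything above is shared between the tasks.
symptom_out = Dense(num_symptom_labels, activation="softmax", name="symptom")(x)
object_out = Dense(num_object_labels, activation="softmax", name="object")(x)

model = Model(inputs=inputs, outputs=[symptom_out, object_out])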

5.4.4 Cross-Stitched Model

The Cross-Stitched Model does not share any weights between the task-specific layers. However, by concatenating the output of all task-specific layers and using it as input for the next, the model constantly shares data between the tasks. This way, the tasks benefit from each other's layers while keeping their weights private. This approach comes from an article described in Section 4.3.3, but this study uses a much simpler variant where the activations are simply concatenated and not weighted individually.

The two models which went through the Random Search hyper-parameter tuning and which were then evaluated against each other are displayed in Figure 5, and were only different in the type of layers they use.

Figure 5: Cross-Stitched Models using different layer types
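The simplified cross-stitch style of sharing described above, where activations are concatenated between otherwise private layers, could be sketched as follows; again, sizes and layer choices are assumptions for illustration.

from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Embedding, GRU, Concatenate, Dense

vocab_size, num_symptom_labels, num_object_labels = 2000, 97, 195   # as in the sketch above

inputs = Input(shape=(100,))
x = Embedding(vocab_size + 3, 64)(inputs)

# Each task keeps its own private recurrent layer...
symptom_h = GRU(64, return_sequences=True)(x)
object_h = GRU(64, return_sequences=True)(x)

# ...but the concatenated activations are fed to the next task-specific layers,
# so information flows between the tasks while the weights stay private.
shared_h = Concatenate()([symptom_h, object_h])
symptom_out = Dense(num_symptom_labels, activation="softmax", name="symptom")(GRU(64)(shared_h))
object_out = Dense(num_object_labels, activation="softmax", name="object")(GRU(64)(shared_h))

model = Model(inputs, [symptom_out, object_out])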

5.4.5 Shared-Private Model

The Shared-Private Model has private layers for both tasks, but utilizes a third, shared layer in order to share information. This approach comes from an article described in Section 4.3.3 and is meant to keep contamination out of the shared layers. However, it does not utilize adversarial training.


Figure 6: Shared-Private Models using different layer types

5.4.6 Hyper Parameter Tuning

The models’ hyper parameters were tuned individually in order to compare them properly. Each model architecture was tuned using Random Search, an algorithm which does not cover all possible combinations of parameters. The process was executed once using the LSTM layer type and once using the GRU layer type, to enable evaluation of which kind of layer performs best on this task and to contribute further knowledge to the open question of whether LSTM still stands as the better type.

The first hyper parameter tuning process was performed using the Random Search algorithm, which was chosen for this step as it was preferred not to sacrifice the number of different parameter values tested for the exhaustiveness of a Grid Search.

Each model went through the exact same tuning process, tuning five different parameter types: Dropout Rate, Embedding Size, Activation Function, Units per Layer and Learning Rate. The approach was to run as many tests as possible, which was achieved by shortening the number of epochs trained and increasing the batch size.

Table 3 Parameter Space of the Hyper Parameter Tuning Process

Hyper Parameter Alternatives
Dropout Rate 0.1, 0.3, 0.5
Embedding Size 32, 64, 128
Activation Function tanh, relu
Units per Layer 16, 32, 64, 128
Learning Rate 0.01, 0.001


This ought to give a fair enough view of which parameters are good enough for our further testing.

Testing Algorithm

For each test, a new set of parameters was generated randomly from the parameter space described in Table 3. Before proceeding, the list of previously tested parameter sets was checked and the set was skipped if it had been tested before. If not, all models were trained and measured using the parameter set.

This was repeated 75 times using the exact same training and testing data for all models and tests. Each training run lasted for 10 epochs and used a batch size of 64. Out of the sets that produced the highest accuracy scores, the parameter values which were most common were chosen for the final model.
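A minimal sketch of this random search loop is shown below; train_and_evaluate is a hypothetical helper standing in for building, training and scoring a model with a given parameter set.

import random

param_space = {
    "dropout": [0.1, 0.3, 0.5],
    "embedding_size": [32, 64, 128],
    "activation": ["tanh", "relu"],
    "units": [16, 32, 64, 128],
    "learning_rate": [0.01, 0.001],
}

tested, results = set(), []
for _ in range(75):
    params = {name: random.choice(options) for name, options in param_space.items()}
    key = tuple(sorted(params.items()))
    if key in tested:                      # skip parameter sets that were already tested
        continue
    tested.add(key)
    accuracy = train_and_evaluate(params, epochs=10, batch_size=64)   # hypothetical helper
    results.append((accuracy, params))

results.sort(key=lambda r: r[0], reverse=True)   # the top 20% of sets guide the final choice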

5.4.7 Layer Types

Finally, to finish the design of the models, each layer type was to be evaluated. At this point there were two versions of each model type, each tuned individually, using one of the two layer types. These were tested for their simultaneous accuracy against each other to evaluate which layer type performed best and which layer type the final version of the model should use.

Testing Algorithm

To evaluate them fairly, each model was trained and tested on the same training and test data for 30 epochs. Their simultaneous accuracy was then measured and compared for the ranges 1 to 5 for each model. Their confusion matrices were also plotted, in order to compare them and to detect any obvious overfitting. The confusion matrices showed, for each label, how many times the model guessed it out of the times it was predicted.

5.4.8 Boosting Performance using Auxiliary Tasks


Data Re-sampling

In order to use the group labels as auxiliary tasks, the data was slightly re-sampled. The data set which was sampled from remained the same. The motivation behind the re-sampling is that when training on auxiliary tasks it is, for learning purposes, beneficial to have a good distribution among the auxiliary task labels as well, and to make sure there are no labels missing from either the validation data set or the training data set. Therefore, the same sampling technique described in Section 5.3.1 was used on both the main labels and the auxiliary task labels. This produced a data set with the distribution of main labels displayed in Figure 7 and auxiliary labels displayed in Figure 8. This also increased the data set size to 10186 labeled issues due to oversampling, split into 8790 training issues and 1396 validation issues. It was validated that no issue appeared in both the training data and the validation data.


Figure 8: Distribution of auxiliary task labels in the data set used for evaluating auxiliary tasks. Labels on the x axis and number of occurrences on the y axis.

Architectures


Figure 9: All architectures with auxiliary tasks applied which share the whole model


Figure 10: All architectures with auxiliary tasks applied which share half of the model

Loss Weights

Choosing the loss weights is an important step when training a model using auxiliary tasks. The loss weights are, as their name implies, the weights put on the loss of each of the model's outputs, thereby adjusting how much the model adapts to each output's loss function. This is usually described as how important each output is considered. The loss weights used in this study for the auxiliary tasks were 0, 0.3, 0.5, 0.7 and 1, while the loss weights for the main tasks were always 1. The model using 0 as a loss weight considers the auxiliary outputs not important at all, which makes it equal to a model not using auxiliary outputs, and its performance acts as the baseline the other models are compared to. The models using 0.3, 0.5, 0.7 and 1 consider the auxiliary outputs increasingly more important, where the model using 1 as the loss weight for its auxiliary outputs considers them just as important as its main tasks.
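In a Keras-style API, per-output loss weights are set at compile time. The sketch below assumes a model with two main outputs and two auxiliary outputs named as shown (an assumption), here with the auxiliary weight 0.5, one of the evaluated settings; the optimizer choice is also an assumption.

model.compile(
    optimizer="adam",                                   # optimizer choice assumed for the example
    loss={"symptom": "categorical_crossentropy",
          "object": "categorical_crossentropy",
          "symptom_group": "categorical_crossentropy",
          "object_group": "categorical_crossentropy"},
    loss_weights={"symptom": 1.0, "object": 1.0,        # main tasks always weighted 1
                  "symptom_group": 0.5, "object_group": 0.5},   # auxiliary tasks, here w = 0.5
)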

Testing Algorithm

and evaluated on both the simultaneous accuracy and the task-specific accuracy. However, they were evaluated only on their best guess, so no ranged accuracy was used.

5.5 Multi-Task Architecture Evaluation

When evaluating the different kinds of Multi-Task architectures, the models were trained using their optimal designs derived from the hyper parameter tuning process and the layer type evaluation. They were all trained using the same data, the same batch size and the same number of epochs. To ensure that the quality of the evaluation was not decreased due to fluctuations in performance between training runs, five models of each architecture were trained and their performance averaged to produce the result that was compared.

Since one of the aims of this study is to train models for suggestion services, the model is still valuable for the user even if its second or third guess is correct, as long as there is room on the screen for these guesses. Therefore, it was investigated how the accuracy increases the more of the model's top guesses you take into account. The most beneficial range was identified as the smallest one where the accuracy still increases rapidly for each increment.


6 Results

The study produced a large evaluation of the different techniques of Multi-Task Learning and their efficiency, as well as a smaller evaluation of the layer types LSTM and GRU. In this chapter, the results from each step of the evaluation are presented.

6.1 Hyper Parameter Tuning

From each of the tuning processes, the final parameters were chosen from the top 20% of the sets which produced the highest accuracy, in this case the best 15 sets. If parameters were tied, the one reaching the highest average accuracy was chosen. This approach is forgiving of low results, as they are ignored by choosing only the top results. This can be both good and bad: it is good that it ignores some disastrous results where the loss function has gone to NaN, which might not be the parameters' fault, but it can be bad in that it might favor a set with a large spread of results over a set which produces consistent but lower results.

In general, activation function and learning rate had clear winners and clear losers, where the winning parameter values often held around 87% of the top results. This indicates that there is a big difference between the alternatives and that these parameters carry great significance for the performance of the model. There was much more ambiguity when it came to units per layer and embedding size, where the alternatives were sometimes tied; those results should therefore not be taken as gospel for which value is best.

While there was less difference between the number of units per layer and the embedding size, the dropout rate was fairly conclusive and leaned towards the higher options.

6.1.1 Single-Task Model

Figure 11: Hyper parameter tuning result for the Single-Task Model. (See enlarged version in Appendix B.1.1)

Parameter evaluation

The best parameters for the Single-Task Model, displayed in Table 4, show that there are some minor differences in which parameters suit which layer type best. The model using the GRU layer opted for a unit count of 32 per layer, much lower than the model using the LSTM layer, which performed best with 128 units, the largest number of units per layer. They also differed slightly in dropout rate, embedding size and learning rate, though they both performed best using tanh as activation function.

Table 4 Top Scored Parameter alternatives for the Single-Task Model

Parameter Best alternative (GRU) Best alternative (LSTM)

Units per Layer 64 (40%) 64 (33%)
Dropout Rate 0.3 (47%) 0.5 (40%)
Embedding Size 64 (40%) 128 (40%)
Activation Function tanh (73%) tanh (67%)
Learning Rate 0.001 (87%) 0.01 (67%)

Individual Layer Evaluation

Figure 12: Confusion Matrices displaying the performance of Single-Task Models using GRU and LSTM layer types (See enlarged version in Appendix B.1.1)

When measuring the simultaneous accuracy at different ranges, as in Table 5, it is quite clear that the model using the GRU layers performs better than the one using the LSTM layers, by around 5 percentage points in all ranges.

Table 5 Evaluation of accuracy at different ranges for Single-Task Model using GRU and LSTM layers

Range Accuracy using GRU Accuracy using LSTM Difference

1 24.56% 19.58% 4.98%

2 29.43% 23.91% 5.52%

3 32.56% 27.27% 5.30%

4 34.30% 29.32% 4.98%

5 36.14% 31.92% 4.22%

Final Model Design

the hyper parameter tuning process for the Single-Task Model and leaves us with the final design displayed in detail in Figure 13.

Figure 13: Final design of the Single-Task Model

6.1.2 Multi-Task Model

In Figure 14, the results of all tests on the Multi-Task Model are displayed per parameter. It performs better using a large number of units per layer, as the runs are more concentrated in the higher percentiles the more units it has. The difference between the activation function alternatives is quite clear, and although it performs well on a dropout rate of both 0.3 and 0.5, the latter alternative has fewer bad runs. Using the metric of the best 20% rules in favor of 0.5 as dropout rate, even though the best single run used 0.3.

Figure 14: Hyper parameter tuning result for the Multi-Task Model using GRU and LSTM layers (See enlarged version in Appendix B.1.2).

There were some similarities between the model using the LSTM layers and the model using the GRU layers in terms of which parameters they performed best on. In Table 6 it is apparent that both models perform best with large layers with a unit count of 128. However, the model using the LSTM layers performs better using a slightly smaller embedding size of 64 and a higher learning rate of 0.01.

Table 6 Top Scored Parameters for Multi-Task Model

Parameter Best alternative (GRU) Best alternative (LSTM)

Units per Layer 128 (40%) 128 (40%)

Dropout Rate 0.5 (47%) 0.3 (40%)

Embedding Size 128 (40%) 64 (40%)

Activation Function tanh (66%) tanh (73%)
Learning Rate 0.001 (60%) 0.01 (53%)

Individual Layer Evaluation

Figure 15: Confusion Matrices displaying the performance of Multi-Task Models using GRU and LSTM layer types (See enlarged version in Appendix B.1.2)

This is confirmed when looking at the ranged accuracies of the evaluation shown in Table 7. The model using GRU layers performs better than the one using LSTM layers in every range, by up to around 5 percentage points.

Table 7 Evaluation of accuracy at different ranges for Multi-Task Model using GRU and LSTM layers

Range Accuracy using GRU Accuracy using LSTM Difference

1 24.17% 22.44% 1.73%

2 33.66% 29.54% 4.12%

3 38.45% 33.42% 6.60%

4 40.59% 35.64% 6.71%

5 43.32% 38.45% 4.12%


Figure 16: Final design of the Multi-Task Model

6.1.3 Cross-Stitched Model

The results of hyper parameter tuning the Cross-Stitched Model are displayed in Figure 17. The difference between the two layer types is not too great. In some cases, however, the model using LSTM layers produced some zero results where the loss function produced NaN values, probably due to exploding or vanishing gradient errors. This happened only once for the model using GRU layers.

Figure 17: Hyper parameter tuning result for the Cross-Stitched Model (See enlarged version in Appendix B.1.3).

Parameter evaluation


the other models of utilizing tanh as activation function, a low learning rate and a dropout rate on the higher end.

Table 8 Top Scored Parameters for Cross-Stitched Model

Parameter Best alternative (GRU) Best alternative (LSTM)

Units per Layer 32 (33%) 128 (33%)

Dropout Rate 0.3 (47%) 0.5 (50%)

Embedding Size 128 (60%) 128 (47%)

Activation Function tanh (60%) tanh (87%)
Learning Rate 0.001 (87%) 0.001 (53%)

Individual Layer Evaluation

When observing the confusion matrices of the models in Figure 18, there are some slight signs of over-fitting in the model using LSTM layers. The diagonal of the model using LSTM layers in the symptoms matrix is slightly more defined, but otherwise the matrices are quite similar.

The results from the layer evaluation displayed in Table 9 show a strong advantage for the model using the LSTM layers, with a difference that grows from around 3 percentage points to almost 11 percentage points when the range is increased to 5. This was the only model type for which the one using LSTM layers outperformed the one using GRU layers.

Table 9 Evaluation of accuracy at different ranges for Cross-Stitched Model using GRU and LSTM layers

Range Accuracy using GRU Accuracy using LSTM Difference

1 24.56% 27.05% 2.49%

2 27.59% 34.74% 7.15%

3 31.16% 39.39% 8.23%

4 33.11% 42.85% 9.74%

5 33.76% 44.69% 10.93%

Final Model Design

Finally, it is concluded that the model using the LSTM layers is in this case superior to the model using GRU layers, as it consistently outperforms it at all ranges. Therefore it is chosen for the final design of the model, displayed in Figure 19.


6.1.4 Shared-Private Model

The results from the hyper parameter tuning of the Shared-Private Model are plotted in Figure 20 and show similarities between the layer types, foremost in learning rate and in the pattern of units per layer, where both models lean towards a low learning rate and a high number of units per layer.

Figure 20: Hyper parameter tuning result for the Shared-Private Model (See enlarged version in Appendix B.1.4).

Parameter evaluation

The best parameters for the Shared-Private Model, displayed in Table 10, were mostly similar to those of the other models. However, it had the anomaly of performing better using relu as its activation function, which produced 53% of its best results. This goes against the trend of the other models, which performed much better using tanh, although it was a very close call between the two activation functions, with relu occurring in only 3% more of the top cases.

Table 10 Top Scored Parameters for Shared-Private Model

Parameter Best alternative (GRU) Best alternative (LSTM)

Units per Layer 64 (47%) 128 (40%)

Dropout Rate 0.3 (40%) 0.5 (33%)

Embedding Size 128 (33%) 128 (40%)

Activation Function relu (53%) tanh (53%)
Learning Rate 0.001 (87%) 0.001 (93%)

Individual Layer Evaluation

Figure 21: Confusion Matrices displaying the performance of Shared-Private Models using GRU and LSTM layer types (See enlarged version in Appendix B.1.4).

This is explained when we see that in the first few ranges there is only about a 2 percentage point difference in performance, with the model using GRU layers outperforming the model using LSTM layers, as displayed in Table 11. The difference increases at larger ranges, to almost 5 percentage points for the range of 5.

Table 11 Evaluation of accuracy at different ranges for Shared-Private Model using GRU and LSTM layers

Range Accuracy using GRU Accuracy using LSTM Difference

1 21.96% 19.26% 2.7%

2 26.73% 24.13% 2.6%

3 29.76% 27.59% 2.17%

4 32.35% 28.78% 3.57%

5 34.74% 29.76% 4.98%


Although the difference in performance is lower than in the other model architectures, the model using GRU layers still outperforms the one using LSTM layers. This leads to the final design of the model, which is displayed in figure 22.

Figure 22: Final Model Design of the Shared-Private Model

6.2 Layer Type Evaluation

One aim of this study was to further investigate which of the two most popular layer types, GRU and LSTM, performs best. Each model was tuned separately for each layer type, and their accuracies, averaged over ranges 1 to 5, are compared in Table 12.

Table 12 Average accuracy over a range of 5 for each model type, with the best accuracy for each model highlighted in green.

Model Average accuracy (GRU) Average accuracy (LSTM) Diff.

Single-Task 31.40% 26.40% 5.00%

Multi-Task 33.94% 28.00% 5.94%

Cross-Stitched 30.09% 37.74% 7.65%


The results reveal that neither layer type is unanimously the best. The models using GRU layers outperform the models using LSTM layers in three of the model types, by around 3-6 percent on average. However, the Cross-Stitched model using LSTM layers outperforms its GRU counterpart by as much as 7.65 percent on average, and scores highest among all the model types. In addition, some of the models using GRU layers have shown minor signs of overfitting. Overall, however, GRU seems to perform better in most cases. For the further evaluations in this study, each model used the layer type which suited it best.

6.3 Auxiliary Tasks Evaluation

The results of the auxiliary task evaluation were condensed into tables where the simultaneous accuracy was measured for each model and each loss weight of the auxiliary output. This was performed twice: once with the auxiliary task outputs placed at the end of the model, allowing them to share the whole model with the main tasks, and once with the outputs placed at the middle of the model, sharing only the first half.
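
To make this setup concrete, the sketch below shows one way such an auxiliary output could be wired up and weighted in Keras. It is a minimal illustration rather than the exact thesis implementation; the layer sizes, label counts and the aux_weight variable are assumptions chosen for readability.

```python
# Minimal sketch (not the exact thesis code): a recurrent model with a main
# output and an auxiliary output whose loss contribution is scaled by a weight.
# Attaching the auxiliary head after the last recurrent layer lets it share the
# whole model; attaching it after the first recurrent layer (via e.g. pooling)
# would share only the first half.
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Embedding, GRU, Dense

vocab_size, seq_len, aux_weight = 10_000, 50, 0.5   # illustrative values

inputs = Input(shape=(seq_len,))
embedded = Embedding(vocab_size, 128)(inputs)
first_half = GRU(64, return_sequences=True)(embedded)   # shared first half
second_half = GRU(64)(first_half)                        # shared second half

main_output = Dense(200, activation="softmax", name="symptom")(second_half)
aux_output = Dense(20, activation="softmax", name="aux")(second_half)

model = Model(inputs, [main_output, aux_output])
model.compile(
    optimizer="adam",
    loss={"symptom": "categorical_crossentropy", "aux": "categorical_crossentropy"},
    loss_weights={"symptom": 1.0, "aux": aux_weight},   # w = 0 disables the auxiliary loss
)
```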

It is important to note that for this evaluation, the data set was re-sampled to include the auxiliary labels in both training and validation sets. The method results in some oversampling from duplicating already existing issues, but also in an increase of "None" labeled issues hitch-hiking on the auxiliary labels. This is reflected in slightly different results compared to the layer evaluation and the architecture evaluation that follows in the next section.

6.3.1 Auxiliary Tasks Sharing the Whole Model


Table 13 Simultaneous accuracy per model and auxiliary task weight (w) using auxiliary outputs sharing the complete model.

Model type            w = 0    w = 0.3   w = 0.5   w = 0.7   w = 1
Single-Task Model     28.72%   27.51%    27.15%    28.87%    27.58%
Multi-Task Model      26.72%   29.08%    26.07%    26.79%    27.01%
Cross-Stitched Model  28.65%   27.93%    29.36%    29.15%    29.01%
Shared-Private Model  25.50%   27.29%    27.65%    27.14%    25.93%

When considering the task-specific accuracy in Table 14, it is observed that, similarly to the simultaneous results, the differences are small. The symptom accuracy shares the same pattern as the simultaneous accuracy, improving the Shared-Private Model and, when using larger weights, the Cross-Stitched Model. The object accuracy, however, displays almost the opposite effect, instead improving the Single-Task Model.

Table 14 Task specific accuracy per model and auxiliary task weight (w) using auxiliary outputs sharing the whole model.

Symptom Accuracy
Model type            w = 0    w = 0.3   w = 0.5   w = 0.7   w = 1
Single-Task Model     38.83%   42.34%    39.83%    43.05%    40.69%
Multi-Task Model      39.18%   42.98%    38.40%    40.04%    36.46%
Cross-Stitched Model  43.12%   41.83%    44.13%    42.69%    43.62%
Shared-Private Model  37.60%   41.12%    38.54%    40.40%    37.46%

Object Accuracy
Model type            w = 0    w = 0.3   w = 0.5   w = 0.7   w = 1
Single-Task Model     38.87%   38.18%    38.11%    38.90%    39.76%
Multi-Task Model      39.76%   39.90%    41.48%    41.26%    41.12%
Cross-Stitched Model  41.62%   40.40%    41.19%    42.26%    42.26%
Shared-Private Model  38.83%   39.25%    40.83%    39.47%    39.68%

6.3.2 Auxiliary Tasks Sharing Half of the Model


layers. Yet again, the models using auxiliary tasks scored slightly higher than the one not using auxiliary tasks, but by so little that it is hardly significant. However, similarly to the first results, the Shared-Private Model saw a relatively larger increase in performance when applying the auxiliary tasks, this time by 1.29-3.08 percent.

Table 15 Simultaneous accuracy per model and auxiliary task weight (w) using auxiliary outputs sharing the first half of the models.

Model type            w = 0    w = 0.3   w = 0.5   w = 0.7   w = 1
Single-Task Model     28.72%   29.29%    27.93%    27.87%    28.44%
Multi-Task Model      26.72%   27.51%    25.78%    27.36%    27.65%
Cross-Stitched Model  28.65%   28.08%    29.72%    29.87%    28.87%
Shared-Private Model  25.50%   27.36%    26.79%    26.86%    28.58%

When considering the task-specific accuracies displayed in Table 16, it is apparent that the differences are still small but slightly more pronounced than for the simultaneous accuracy. While small, all models saw an increase in performance when using auxiliary tasks with full weight, even though the Single-Task Model had a decreased simultaneous accuracy for the same weight.

Table 16 Task specific accuracy per model and auxiliary task weight (w) using auxiliary outputs sharing the first half of the models.


6.4 Multi-Task Architectures Evaluation

Finally, the architectures were evaluated. This has of course been ongoing throughout the study, but this time all metrics were more closely considered. One of the aims of this study has from the start been to build a model which can give suggestions, so it is interesting to see how many suggestions one would need to allow to maximize the chance for the correct answer to be displayed. To investigate this, the ranged accuracies have been plotted for all models in figure 23, the ranged accuracy being the range within which the model is considered correct if its best guesses contain the correct answer.

Figure 23: Task-specific ranged accuracies of all architectures.
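
As an illustration of how such a metric can be computed, the sketch below shows a minimal numpy version of ranged accuracy for the two tasks. The function names and the assumption that each model returns one probability vector per task per issue are illustrative, not taken from the thesis code.

```python
# Minimal sketch of ranged (top-k) accuracy. A prediction counts as correct if
# the true label is among the k highest-scoring labels; simultaneous accuracy
# requires this to hold for both tasks on the same issue.
import numpy as np

def ranged_hits(probs, true_labels, k):
    """Boolean array: True where the true label is in the top-k predictions."""
    top_k = np.argsort(probs, axis=1)[:, -k:]
    return np.array([true_labels[i] in top_k[i] for i in range(len(true_labels))])

def ranged_accuracies(symptom_probs, symptom_true, object_probs, object_true, k):
    symptom = ranged_hits(symptom_probs, symptom_true, k)
    obj = ranged_hits(object_probs, object_true, k)
    return {
        "symptom": symptom.mean(),
        "object": obj.mean(),
        "simultaneous": (symptom & obj).mean(),
    }
```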

The accuracy of all the models increases drastically up to range 5, and then keeps increasing at a lower but still significant rate until range 10. After range 10 it keeps increasing, but at a much lower rate.

6.4.1 Ranged Accuracy


The Multi-Task Model scores lower than the Single-Task Model on exact accuracy (range 1), but surpasses it at larger ranges.

Table 17 Simultaneous accuracy for the different architectures, shaded using the Single-Task model as a baseline.

Simultaneous Accuracy
Range   Single-Task   Multi-Task   Cross-Stitched   Shared-Private
1       25.58%        24.07%       27.57%           23.14%
2       30.78%        30.67%       34.36%           28.68%
3       33.70%        34.31%       38.28%           31.06%
4       35.76%        36.71%       40.64%           34.22%
5       37.38%        38.81%       42.37%           36.17%
6       38.35%        40.43%       44.40%           37.38%
7       39.09%        42.29%       45.86%           38.96%
8       39.89%        43.46%       46.97%           40.24%
9       40.63%        44.81%       48.27%           41.41%
10      41.47%        45.95%       49.19%           42.47%


Table 18 Task specific accuracy (symptom) for the different architectures, shaded using the Single-Task model as a baseline.

Symptom Accuracy
Range   Single-Task   Multi-Task   Cross-Stitched   Shared-Private
1       39.31%        36.43%       41.61%           33.79%
2       46.88%        46.15%       50.95%           42.40%
3       50.82%        51.67%       56.76%           47.06%
4       54.13%        55.04%       61.01%           51.54%
5       56.45%        57.97%       63.61%           55.09%
6       57.90%        60.76%       66.18%           57.45%
7       59.11%        63.03%       67.51%           59.89%
8       60.48%        65.04%       68.97%           62.16%
9       61.58%        66.73%       70.29%           64.16%
10      62.73%        68.23%       71.43%           65.32%

Table 19 Task specific accuracy (object) for the different architectures, shaded using the Single-Task model as a baseline.

Object Accuracy
Range   Single-Task   Multi-Task   Cross-Stitched   Shared-Private
1       36.26%        38.31%       41.48%           37.38%
2       42.42%        44.57%       47.89%           42.66%
3       45.26%        47.40%       51.03%           45.26%
4       47.40%        49.65%       53.22%           47.90%
5       49.13%        51.28%       55.01%           49.44%
6       50.26%        52.73%       56.85%           50.39%
7       51.34%        54.42%       58.36%           51.58%
8       52.47%        55.19%       59.23%           52.60%
9       53.20%        56.30%       59.98%           53.38%
10      54.29%        57.38%       60.69%           54.16%


6.4.2 Confusion Matrices

Finally, the exact performance of the models was plotted in confusion matrices. Since a large label-space with few samples for many labels is plotted, the matrices are hard to differentiate from each other, but they are a good way of checking for signs of overfitting. In figure 24 the confusion matrices for the symptom task are displayed. There are few signs of overfitting in any of the matrices.

Figure 24: Symptom Confusion Matrices for all Model architectures (see enlarged versions in Appendix B.2).
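
For reference, the sketch below shows one common way such a matrix can be computed and displayed from a model's class probabilities, using scikit-learn and matplotlib. It is an illustrative example, not the exact plotting code behind figures 24 and 25.

```python
# Illustrative sketch (not the thesis plotting code): building a confusion
# matrix from class probabilities with scikit-learn and showing it as an image.
# A clearly defined diagonal suggests the model is not collapsing onto a few
# majority labels.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

def plot_confusion(probs, true_labels, title):
    predicted = np.argmax(probs, axis=1)
    cm = confusion_matrix(true_labels, predicted)
    plt.imshow(cm, cmap="Blues")
    plt.title(title)
    plt.xlabel("Predicted label")
    plt.ylabel("True label")
    plt.show()
```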


Figure 25: Object Confusion Matrices for all Model architectures (see enlarged versions in Appendix B.2).


Figure 26: Average simultaneous accuracy up to a range of 50 of all models.

It is evident that while the Single-Task model keeps up with, and even beats, some of the Multi-Task architectures on the smaller ranges, it starts to fall behind as early as around range 3. The gap then becomes larger as the range increases, and after range 7 all Multi-Task architectures perform better than the Single-Task Model. Looking closer at figure 27, where the task-specific accuracies have been plotted against each other, this is particularly evident in the symptom task, where the Multi-Task models outperform the Single-Task Model by a larger margin than in the object task. Even in the object task, however, the Multi-Task architecture models outperform the Single-Task Model in the higher ranges and continue to do so as the range increases. It is also observed that the Multi-Task Models outperform the Single-Task Model by more on the symptom task than on the object task, even though at short ranges they seem to perform more consistently better on the object task.


7 Discussion

The study evaluated three different architectures that used some level of task sharing and benchmarked them against a fourth model that did not share tasks at all. The intention when constructing these models was to keep them simple and to have each use a distinct kind of task-sharing, so that the techniques could be measured against each other. For example, the models only used one kind of one-directional layer throughout each architecture, something that might not be ideal if one is trying to reach the best possible performance. This simplicity could mean that the architectures' potential performance is not justly displayed. However, it is beneficial in this study, which aims not to find the very best performance a model can reach in this case, but to compare a chosen set of architectures. Each of these models could probably be improved by further tinkering, but such tinkering opens up so many combinations and parameters to consider that there would not have been time to complete the study.

7.1 Data

Insufficient data is often a problem in machine learning studies, and this study is no exception. Due to the COVID-19 outbreak of 2020 and the crisis it caused for businesses, the data in this study was limited to just a small portion of the data that was originally expected to be utilized. To mitigate this, a few techniques such as class-weighting and selective splitting of training and testing data were implemented.

Class-weighting was a very successful and useful technique in this case, as the data set suffered from major imbalance. The beauty of this technique is that the data did not need to be limited by removing entries, something that would have decreased the already very small data size, while it also did not sacrifice the integrity of the study's results by allowing major overfitting.
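
As a rough illustration of the technique, the sketch below computes balanced class weights with scikit-learn. The label array is made up for the example, and how the weights are fed into training (for instance through Keras' class_weight argument or a weighted loss) depends on the model setup.

```python
# Illustrative sketch of balanced class weights with scikit-learn; the label
# array here is made up. The resulting dictionary lets rare labels contribute
# more to the loss instead of being removed or duplicated.
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

labels = np.array([0, 0, 0, 1, 2, 2])            # heavily imbalanced example
classes = np.unique(labels)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)
class_weight = dict(zip(classes, weights))        # {0: 0.67, 1: 2.0, 2: 1.0}
```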

Selectively splitting the data set into training and testing data, by ensuring that at least one entry of each label existed in both sets, was also a useful method, as the result is more representative of the models' actual potential than when labels are missing from either the training or the testing set.
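
The selective split can be sketched as follows. This is a simplified, assumed version of the idea (guarantee every label in both sets, then distribute the rest randomly), not the exact procedure used in the study.

```python
# Assumed, simplified version of the selective split: guarantee that every
# label appears at least once in both the training and the testing set, then
# distribute the remaining issues randomly.
import random
from collections import defaultdict

def selective_split(labels, test_fraction=0.2, seed=0):
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train, test = set(), set()
    for indices in by_label.values():
        rng.shuffle(indices)
        train.add(indices[0])              # label guaranteed in training
        if len(indices) > 1:
            test.add(indices[1])           # and, when possible, in testing

    remaining = [i for i in range(len(labels)) if i not in train | test]
    rng.shuffle(remaining)
    cut = int(len(remaining) * test_fraction)
    test.update(remaining[:cut])
    train.update(remaining[cut:])
    return sorted(train), sorted(test)
```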
