DEGREE PROJECT IN THE FIELD OF TECHNOLOGY MEDIA TECHNOLOGY
AND THE MAIN FIELD OF STUDY COMPUTER SCIENCE AND ENGINEERING,
SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2018
An Investigation of Low-Rank Decomposition for Increasing Inference Speed in Deep Neural Networks With Limited Training Data
VICTOR WIKÉN
Victor Wikén
Supervisor: Stefano Markidis
Examiner: Erwin Laure
September 22, 2018
Abstract
In this study, the optimization technique low-rank tensor decomposition is implemented and applied to AlexNet in order to increase the inference speed of convolutional neural networks. The network had been trained to classify dog breeds and, because only a small training set was available, transfer learning was used for the task. The purpose of the study is to investigate how effective low-rank tensor decomposition is when the training set is limited. Compared to a previous study, the results obtained here indicate a strong relationship between the effect of the tensor decomposition and the amount of available training data. A significant speed-up can be obtained in the different convolutional layers using tensor decomposition. However, since the network needs to be retrained after the decomposition, and because the dataset is limited, there is a slight decrease in accuracy.
Sammanfattning
För att öka inferenshastigheten hos faltningsnätverk har i denna studie optimeringstekniken low-rank tensor decomposition implementerats och applicerats på AlexNet, som har tränats för att klassificera hundraser. På grund av en begränsad mängd träningsdata användes transfer learning för uppgiften. Syftet med studien är att undersöka hur effektiv low-rank tensor decomposition är när träningsdatan är begränsad. Jämfört med resultaten från en tidigare studie visar resultaten från denna studie att det finns ett starkt samband mellan effekterna av low-rank tensor decomposition och hur mycket tillgänglig träningsdata som finns. En signifikant hastighetsökning kan uppnås i de olika faltningslagren med hjälp av low-rank tensor decomposition. Eftersom det finns ett behov av att träna om nätverket efter dekompositionen och på grund av den begränsade mängden data så uppnås hastighetsökningen dock på bekostnad av en viss minskning i precisionen för modellen.
Contents

1 Introduction
  1.1 Motivation
    1.1.1 Problem domain
  1.2 Research question
  1.3 Societal interest
  1.4 Ethics and sustainability
  1.5 Report outline
2 Background
  2.1 Machine learning
  2.2 Supervised learning
  2.3 Feed forward neural network
    2.3.1 Activation function
    2.3.2 Forward propagation
    2.3.3 Loss function
    2.3.4 Back-propagation
    2.3.5 Gradient Descent
    2.3.6 Stochastic Gradient Descent
    2.3.7 Adam Gradient Descent
    2.3.8 Regularization
    2.3.9 Batch normalization
    2.3.10 Overfitting
    2.3.11 Dropout
    2.3.12 Inference and prediction
    2.3.13 Top-1 and top-5 accuracy
  2.4 Convolutional neural networks
    2.4.1 Properties of convolutional neural networks
    2.4.2 Pooling
    2.4.3 Stride
    2.4.4 Padding
    2.4.5 Local response normalization
    2.4.6 Grouped convolution
  2.5 Network architectures
  2.6 AlexNet
  2.7 Transfer learning
  2.8 Data augmentation
  2.9 Rank decomposition
  2.10 Low-rank tensor decomposition
3 Previous work
  3.1 Structured sparsity
  3.2 Deep compression
  3.3 Pruning
  3.4 Hashing Trick
  3.5 ShuffleNet
4 Experimental Set-up
  4.1 Hardware
  4.2 Dataset
    4.2.1 Training
    4.2.2 Validation
    4.2.3 Test
  4.3 TensorFlow
    4.3.2 TFRecords
  4.4 Data augmentation
5 Methodology for Low-rank decomposition
  5.1 Motivation
  5.2 Overview of how to decompose the network
  5.3 Method for low-rank decomposition
    5.3.1 Algorithm
  5.4 Implementation
  5.5 Original network architecture
    5.5.1 Training
    5.5.2 Validation and testing
    5.5.3 Decomposition of learned weights
  5.6 Decomposed network architecture
    5.6.1 Training
6 Experiments
  6.1 Different values for K
  6.2 Profiling
7 Result
  7.1 Inference time
  7.2 Impacts of various values for Ks
    7.2.1 Time
    7.2.2 Accuracy
    7.2.3 Time vs accuracy
  7.3 Best network configurations
8 Discussion and conclusion
  8.1 Summary
    8.1.1 Answer to the research question
  8.2 Future work
  8.3 Conclusion
9 References
Glossary

activation function  A nonlinear function
back-propagation  Efficient method to compute gradients
epoch  One iteration through the training set
feature map  The output from a convolutional operation
fine-grained classification problem  Classification problem where the classes share common parts
inference  The evaluation of new data after the model has been trained on the training data
kernel  The weights used for the convolution applied to the input
learning rate  Positive scalar which determines the step size used to update the learnable parameter θ
MLP  Multilayer perceptron: a composition of functions
stride  The step size used for moving the kernel over the input
tensor  A multidimensional array
training  Utilizing the gradients computed by back-propagation to minimize the loss
training set  The set of data used to train the model
transfer learning  Utilizing pretrained weights
1 Introduction
In this section, the motivation, problem domain and the research question for this study are introduced.
1.1 Motivation
Fast inference is crucial when deploying neural networks in real-time applications.
A significant amount of research has studied how to increase the accuracy of convolutional neural networks. However, this research has often focused on increasing accuracy while neglecting the size of the network as well as the inference speed [1, 2].
Recent studies suggest that redundancies exist in convolutional neural networks and that this property can be used to increase the inference speed [3, 4].
In a convolutional neural network, the majority of the computation takes place in the convolutional layers [4], whereas the majority of the model's size comes from the fully connected layers [5, 4]. To decrease the size of a convolutional network it is therefore important to reduce the size of both the fully connected layers and the convolutional layers [4].
1.1.1 Problem domain
The chosen problem domain for this study is dog breed classification. It is a fine-grained classification problem with 120 relatively similar classes, where different classes share common parts [6]. The dataset is relatively small compared to other datasets, consisting of 10K images in total; previous studies have often focused on increasing inference speed on the ImageNet dataset, whose training set consists of 1.2 million images [7, 4]. The purpose of using a limited dataset, and evaluating the effects of the decomposition on it, is that many real-world machine learning problems have a limited amount of data.
1.2 Research question
The purpose of this study is to answer the following research question:
• How is the effectiveness of the decomposition affected by having a limited training set?
In order to answer the research question, the following questions will be investigated:
• How does the decomposition affect:
1. inference speed?
2. accuracy of the model?
1.3 Societal interest
Increasing inference speed is important for real-time machine learning applications, and having a limited training set is a common problem. The results of this study could therefore be of interest to those who develop real-time machine learning applications. Artificial intelligence is increasingly becoming a part of everyday life, with applications such as Siri, custom recommendations of media content and, potentially in the near future, self-driving cars for commercial use.
1.4 Ethics and sustainability
Artificial intelligence is a field that shows great potential but also one that requires great responsibility. Artificial intelligence can help improve society by enabling smarter solutions and optimizations for real-world problems, such as sustainability problems. However, it could also be misused, with catastrophic consequences. The purpose of this study is to investigate how inference speed can be increased, while maintaining accuracy, when the training data is limited. Increasing inference speed is a step towards more powerful artificial intelligence.
The dataset consists of dog images, so no sensitive information was used in this study.
1.5 Report outline
The outline of this report is as follows:
1. The necessary theory to understand the thesis is presented in section 2 Background.
2. A literature survey of work related to this study is presented in section 3 Previous work.
3. The experimental set-up is presented in section 4 Experimental Set-up, providing the information necessary to set up the experiments of this study.
4. The methodology of this study is presented in section 5 Methodology for Low-rank decomposition, explaining in detail how the low-rank decomposition was applied.
5. The experiments conducted, and how they were evaluated, are presented in section 6 Experiments.
6. The results of the study are presented in section 7 Result.
7. The discussion and the conclusion of the study are presented in section 8 Discussion and conclusion.
2 Background
This section describes the necessary theory in order to understand the methodology of this study.
2.1 Machine learning
The term machine learning, as we know it today, was coined in 1959 by Samuel in Ref. [8], where he described it as follows:
”Programming computers to learn from experience should eventually eliminate the need for much of this detailed programming effort”. Commonly, the following definition is used:
”the field of study that gives computers the ability to learn without being explicitly programmed”
2.2 Supervised learning
Supervised learning is a method of giving computers the ability to learn by providing the learning algorithm with data together with the correct target outputs. An example of a model trained with supervised learning is a feed forward neural network [9].
2.3 Feed forward neural network
The purpose of a feed forward network is to approximate some function f* that, for example, maps an input x to a category y, y = f*(x). The feed forward network learns the value of a parameter θ for the mapping function y = f(x; θ) which best approximates the function f*.
The term neural in ”feed forward neural network” has its origin in the fact that the idea behind the networks was loosely inspired by neuroscience. The name ”network” originates from the fact that feed forward neural networks are composed of different functions. The composition of these functions is called a multilayer perceptron (MLP), and how these functions are composed can be described with a directed acyclic graph. An example of a multilayer perceptron can be seen in Figure 1.
For three functions the following chain can be formed: f(x) = f^(3)(f^(2)(f^(1)(x))).
With l functions, f(x) = f^(l)(f^(l−1)(...(f^(1)(x))...)), each function constitutes a layer: the first function is the first layer, the second function the second layer and the lth function the lth layer. The lth layer is called the output layer, while the layers between the input and the output layer are called hidden layers.
The name ”hidden” comes from the fact that the training algorithm must decide how to use these layers to get the desired output, without the training data specifying how to do so. This is different from the output layer, whose desired output is directly specified by each sample of the training data, which includes the correct y. The hidden layers are not provided with a desired output from the training data and are therefore called hidden. The number of composed functions, or layers, is referred to as the depth of the network; if the layers are many, the network is called deep, hence deep neural networks are networks with many hidden layers. Each element in a hidden layer is referred to as a unit, and the number of units in each layer is referred to as the width of the network [9].
Figure 1: An example of a Multilayer perceptron (MLP) (figure adapted from [9]).
2.3.1 Activation function
In order to learn a nonlinear function, a nonlinear activation function is applied after each linear function f [10].
Common nonlinear activation functions include:
• Softmax, defined as $a(z)_i = \exp(z_i) / \sum_j \exp(z_j)$
• Rectified linear unit (ReLU), defined as $a(z) = \max(0, z)$
ReLU is the recommended default activation function in modern neural networks. Softmax is mostly used for the output layer to represent the probability distribution over n classes [9].
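As a concrete illustration, both activation functions can be written in a few lines of NumPy (a minimal sketch, not code from the thesis):

    import numpy as np

    def relu(z):
        # Rectified linear unit: element-wise max(0, z)
        return np.maximum(0.0, z)

    def softmax(z):
        # Shifting by the maximum before exponentiating improves
        # numerical stability without changing the result
        e = np.exp(z - np.max(z))
        return e / np.sum(e)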
2.3.2 Forward propagation
Forward propagation allows information to flow forward through the network. A feed forward network is a composition of non-linear functions on linear combinations of the input, where the coefficients in the linear combinations are learnable adaptive parameters [10].
A single layer is defined as:

$$z = a(Wx + b) \quad (1)$$

where a() is the activation function, x is the initial input to the network, W are the adaptive weights and b is the adaptive bias. If the input $x \in \mathbb{R}^n$, then the weights must satisfy $W \in \mathbb{R}^{k \times n}$, where k is the number of outputs. A network consisting of L layers is defined as:

$$z_1 = a(W_1 x + b_1) \quad (2)$$

$$z_l = a(W_l z_{l-1} + b_l), \quad l = 2, \dots, L \quad (3)$$

where the weights $W_l \in \mathbb{R}^{|z_l| \times |z_{l-1}|}$ [10].
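Written out, forward propagation through equations (2) and (3) amounts to a simple loop (a minimal NumPy sketch, assuming an activation function such as relu from section 2.3.1; not code from the thesis):

    import numpy as np

    def forward(x, weights, biases, activation):
        # weights[l] has shape (|z_l|, |z_{l-1}|), matching equation (3)
        z = x
        for W, b in zip(weights, biases):
            z = activation(W @ z + b)
        return z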
2.3.3 Loss function
The loss function $L(\hat{y}, y)$, or cost function, is the measurement of the loss between the predicted output $\hat{y}$ and the target output y. Commonly, neural networks are trained using maximum likelihood, with the negative log-likelihood used as the loss function [9].
2.3.4 Back-propagation
Back-propagation is a computationally inexpensive method to compute the gradients. The information obtained from the cost is passed backwards through the network in order to compute the gradient, hence the name back-propagation. The chain rule is used to compute the gradients of functions composed of other functions with known gradients [9].
2.3.5 Gradient Descent
Gradient descent is an algorithm for solving an optimization problem of either minimizing or maximizing some function f(x) by altering x. Gradient descent works by shifting x in small steps with the opposite sign of its derivative; maximization can be achieved by instead minimizing −f(x). For neural networks it is the loss function that we want to minimize. Minimizing the loss function, by utilizing the gradients computed by back-propagation, is referred to as training the network [9].
2.3.6 Stochastic Gradient Descent
Stochastic gradient descent is an extension of gradient descent and is used by most deep learning architectures to train the network with the gradients obtained from back-propagation. Since deep networks usually require a large training set, it is computationally infeasible to calculate the loss over the entire training set $T = \{x^{(1)}, \dots, x^{(m)}\}$:

$$\nabla_\theta J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta L(x^{(i)}, y^{(i)}, \theta) \quad (4)$$

The computational complexity of this operation is O(m), so the computation time scales linearly with the size m of the training set. To accommodate this restriction, stochastic gradient descent can be used. Stochastic gradient descent utilizes the fact that the gradient is an expected value that can be approximately estimated using a small subset of the training set, referred to as a minibatch of examples $B = \{x^{(1)}, \dots, x^{(m')}\}$, where $m' \ll m$. A new minibatch is drawn uniformly from the training set at each step of the algorithm:

$$g = \frac{1}{m'} \nabla_\theta \sum_{i=1}^{m'} L(x^{(i)}, y^{(i)}, \theta) \quad (5)$$

The computational complexity of this operation is O(m′), which stays fixed as the size of the training set increases. The stochastic gradient descent algorithm updates the learnable parameter θ as:

$$\theta \leftarrow \theta - \epsilon g \quad (6)$$

where $\epsilon$ is a positive scalar which determines the step size, referred to as the learning rate. The size of the minibatch is referred to as the batch size. One iteration over the entire training set is referred to as an epoch [9].
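A single step of the algorithm, corresponding to equations (5) and (6), can be sketched as follows (grad_fn is a hypothetical function returning the per-example gradient computed by back-propagation; not code from the thesis):

    import numpy as np

    def sgd_step(theta, grad_fn, minibatch, learning_rate):
        # Average the per-example gradients over the minibatch (equation (5))
        g = np.mean([grad_fn(x, y, theta) for x, y in minibatch], axis=0)
        # Step against the gradient with step size epsilon (equation (6))
        return theta - learning_rate * g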
2.3.7 Adam Gradient Descent
Adam gradient descent is another method for efficient stochastic optimization, introduced in Ref. [11]. Adam computes individual adaptive learning rates for each parameter, storing both an exponentially decaying average of past squared gradients and an exponentially decaying average of past gradients [11, 12]. Kingma and Ba presented empirical results demonstrating that Adam works well in practice and compares favorably to other stochastic optimization methods [11].
2.3.8 Regularization
Regularization is any method that alters the learning algorithm with the intent of reducing its error on the test data but not on the training data [9].
2.3.9 Batch normalization
Batch normalization was introduced by Ioffe and Szegedy in Ref. [13]. Deep networks are known to be difficult to train. As described in section 2.3.4 Back-propagation, each parameter is updated by its gradient under the assumption that the other layers do not change, but in practice all layers are updated simultaneously. When many composed functions are changed simultaneously, using values computed under the assumption that the other functions remain constant, unexpected results can occur [14].
This forces the training process to use a sufficiently low learning rate, which results in longer training periods, and also requires careful parameter initialization [13].
Batch normalization addresses these problems by making normalization part of the network architecture, performing normalization for each training minibatch.
The benefits are not only that deeper networks become trainable: batch normalization also allows for higher learning rates, makes networks less dependent on good initial values and acts as a regularizer [13].
2.3.10 Overfitting
Overfitting is a problem in machine learning where there is a large gap between the training and test accuracy, the training accuracy being significantly higher than the test accuracy [9].
Overfitting occurs when the model is too complex for the amount of training data available; increasing the amount of training data can reduce the problem of overfitting [10].
2.3.11 Dropout
Because deep neural networks can have a large number of parameters, they are powerful machine learning models. However, the large number of parameters also makes overfitting a significant problem. Dropout is a method to prevent overfitting and has been shown to be effective in improving the performance of neural networks. Dropout regularizes the network by preventing units from co-adapting too much, which is achieved by randomly dropping units, along with their connections, during training [15].
Figure 2: Dropout illustration, figure taken from [15].
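The mechanism can be sketched in a few lines of NumPy (this is the commonly used ”inverted” variant, which rescales during training so that no rescaling is needed at inference time; not code from the thesis or from [15]):

    import numpy as np

    def dropout(z, keep_prob, training):
        # During training, zero each unit with probability 1 - keep_prob
        # and scale the survivors by 1/keep_prob to keep the expected
        # activation unchanged; at inference time the layer is the identity.
        if not training:
            return z
        mask = np.random.rand(*z.shape) < keep_prob
        return z * mask / keep_prob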
2.3.12 Inference and prediction
Inference is the evaluation of new data after the model has been trained on the training data [10].
As described in section 2.3.1 Activation function, the Softmax operation is used to represent the probability distribution over n classes. The class with the highest probability is then chosen as the model's prediction for the input data [16].
2.3.13 Top-1 and top-5 accuracy
The top-1 and top-5 accuracy are measurements of the classification accuracy of the model. The top-1 accuracy is the accuracy with which the model assigns an input its correct class.
The top-5 accuracy is the accuracy of having the correct class among the 5 highest probabilities in the output of the model [9].
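Both measurements are instances of top-k accuracy, which can be sketched as follows (a minimal NumPy version, not code from the thesis):

    import numpy as np

    def top_k_accuracy(probs, labels, k):
        # probs: (num_samples, num_classes) model outputs, e.g. from softmax
        # labels: (num_samples,) integer class indices
        top_k = np.argsort(probs, axis=1)[:, -k:]  # k highest-probability classes
        return np.mean([y in row for y, row in zip(labels, top_k)])

Calling top_k_accuracy(probs, labels, 1) gives the top-1 accuracy, and top_k_accuracy(probs, labels, 5) the top-5 accuracy.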
2.4 Convolutional neural networks
A convolutional neural network is a feed forward network that utilizes convolution in at least one of the layers.
A convolution is a linear operation defined as:

$$s(t) = (x * w)(t) = \int_{-\infty}^{\infty} x(a)\, w(t - a)\, da \quad (7)$$

and for discrete convolution:

$$s(t) = (x * w)(t) = \sum_{a=-\infty}^{\infty} x(a)\, w(t - a) \quad (8)$$
For convolutional neural networks, x and w in equation (7) are often referred to as the input and the kernel, respectively. The output of the operation can be referred to as a feature map.
Convolution is not restricted to one-dimensional inputs; for many machine learning problems the input is a multidimensional array, referred to as a tensor [9].
The order of a tensor determines the dimensions of the array: a first-order tensor is referred to as a vector, a second-order tensor as a matrix, and a tensor of order three or higher as a high-order tensor. An N-order tensor X has the form $X \in \mathbb{R}^{I_1 \times I_2 \times \dots \times I_N}$.
Flattening a tensor is the transformation from, for example, a tensor of dimensions 8x3x2 to a matrix of dimensions 8x6 [17].
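The 8x3x2 example corresponds to a simple reshape (a NumPy sketch; the exact order in which the trailing modes are merged is an implementation choice):

    import numpy as np

    t = np.arange(8 * 3 * 2).reshape(8, 3, 2)  # third-order tensor, 8x3x2
    m = t.reshape(8, 3 * 2)                    # flattened to an 8x6 matrix
    print(m.shape)                             # (8, 6)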
If the input is a two-dimensional image I, the kernel K is usually two-dimensional as well, and the convolution is defined as follows:

$$S(i, j) = (I * K)(i, j) = \sum_{m} \sum_{n} I(m, n)\, K(i - m, j - n) \quad (9)$$

The convolution operation is commutative, which means that we can equivalently define the convolution as:

$$S(i, j) = (K * I)(i, j) = \sum_{m} \sum_{n} I(i - m, j - n)\, K(m, n) \quad (10)$$

The benefit of the latter form is that there is less variation in the valid range of the values m and n, which can make it more straightforward to implement.
In practice, many machine learning libraries implement cross-correlation but call it convolution, the reason being that the commutative property is rarely useful in practice but useful when writing proofs. Cross-correlation is the same as convolution but without flipping the kernel [9]:

$$S(i, j) = (I * K)(i, j) = \sum_{m} \sum_{n} I(i + m, j + n)\, K(m, n) \quad (11)$$
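Equation (11) translates directly into a deliberately naive double loop (a NumPy sketch with ”valid” boundaries; real libraries use far more efficient implementations):

    import numpy as np

    def cross_correlate2d(I, K):
        # Slide the un-flipped kernel K over the image I and sum the
        # element-wise products at each position (equation (11))
        h, w = K.shape
        H, W = I.shape
        S = np.zeros((H - h + 1, W - w + 1))
        for i in range(S.shape[0]):
            for j in range(S.shape[1]):
                S[i, j] = np.sum(I[i:i + h, j:j + w] * K)
        return S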
2.4.1 Properties of convolutional neural networks
Convolutional neural networks have three important properties:
• sparse interactions
• shared weights
• equivariant representation.
Convolutional neural networks have sparse interactions, meaning that each input unit does not interact with every output unit, as is the case in traditional neural networks. Sparse interactions are achieved by using a kernel smaller than the input, which reduces the memory requirements and improves the statistical efficiency of the model.
In contrast to traditional neural networks, convolutional neural networks use parameter sharing, or tied weights. Instead of each element of the weight matrix being used exactly once when computing the output of a layer, each parameter is used for more than one function in the model: each member of the kernel is used at every position of the input.
Convolutional networks are equivariant to translation, which means that when the input is translated, the output is translated in the same way. If two functions are commutative they are equivariant to each other: the order in which two equivariant functions are applied to the input does not affect the result of the output [9].
2.4.2 Pooling
The pooling operation helps make the representation approximately invariant to small changes in the input, meaning that a small change in the input will not change the values of most of the pooled outputs.
The pooling operation achieves this by reducing the dimensions of the feature maps of the previous layer, using a statistical summary to retain the most important information. A common pooling operation is max pooling, which takes the maximum value of a group of nearby values [9].
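Max pooling over a two-dimensional feature map can be sketched as follows (a naive NumPy version; the stride parameter is described in the next subsection):

    import numpy as np

    def max_pool2d(x, size, stride):
        # Take the maximum over each size x size window, moved by `stride`
        out_h = (x.shape[0] - size) // stride + 1
        out_w = (x.shape[1] - size) // stride + 1
        out = np.zeros((out_h, out_w))
        for i in range(out_h):
            for j in range(out_w):
                out[i, j] = x[i * stride:i * stride + size,
                              j * stride:j * stride + size].max()
        return out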
2.4.3 Stride
The step size used for moving the kernel over the input is referred to as stride. Stride can both be applied for the convolution and the pooling operation [9].
2.4.4 Padding
With different configurations of kernel size and stride, the kernel does not always fit the input size perfectly [9]. An example of a non-perfect match can be seen in Figure 3.
In this study, two methods are applied when this scenario occurs:
1. Append zeros to the input to keep the operation from reducing the output size.
2. Discard the non-matching values in the input, so that the size is reduced after the operation has been applied.
In this study the first method will be referred to as SAME padding and the second as VALID padding.
Figure 3: Example of a non-perfect match.
In Figure 3 the stride is 3 and the kernel has a width of 4 and a height of 2. Since the input only has a width of 6, the kernel fits only once with stride 3; the last two values in each row are left over, giving a non-perfect match. One of the two padding options can then be applied: either drop the values, thereby reducing the size of the output, or append zeros to the input to keep the same size.
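The resulting output sizes can be computed as follows (a sketch following the convention that TensorFlow, used later in this study, applies to SAME and VALID padding):

    import math

    def output_size(input_size, kernel_size, stride, padding):
        # SAME appends zeros so that no input values are dropped;
        # VALID only places the kernel where it fits completely.
        if padding == "SAME":
            return math.ceil(input_size / stride)
        return (input_size - kernel_size) // stride + 1  # VALID

    # The width-6 input of Figure 3 with kernel width 4 and stride 3:
    # SAME  -> ceil(6 / 3)      = 2 kernel positions (zeros appended)
    # VALID -> (6 - 4) // 3 + 1 = 1 kernel position (last columns dropped)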
2.4.5 Local response normalization
Local response normalization creates competition for big activities amongst the outputs of neurons computed from different kernels [18]. Local response normalization was used in the AlexNet architecture; Krizhevsky et al. verified that the normalization improved accuracy on the CIFAR-10 dataset by 2%, using a four-layer convolutional neural network [19].
Local response normalization is defined as:

$$b^i_{x,y} = a^i_{x,y} \Big/ \left(1 + \alpha \sum_{j=i-N/2}^{i+N/2} (a^j_{x,y})^2 \right)^{\beta} \quad (12)$$

where $a^i_{x,y}$ is the activation computed by applying kernel i at position (x, y) and N is the total number of kernels in that layer [19].
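Equation (12) can be sketched in NumPy as follows (the defaults for N, alpha and beta are illustrative values in the spirit of [19], not values from the thesis; the window is clipped at the channel borders):

    import numpy as np

    def local_response_norm(a, N=5, alpha=1e-4, beta=0.75):
        # a: activations of shape (kernels, height, width); each channel i
        # is divided by a power of the summed squares of its neighbors
        C = a.shape[0]
        b = np.empty_like(a)
        for i in range(C):
            lo, hi = max(0, i - N // 2), min(C, i + N // 2 + 1)
            b[i] = a[i] / (1.0 + alpha * np.sum(a[lo:hi] ** 2, axis=0)) ** beta
        return b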
2.4.6 Grouped convolution
Grouped convolution separates the filters into filter groups, where each filter group only operates on a subset of the input feature maps. Some methods, such as low-rank tensor decomposition, reduce complexity by approximating the convolution filters in the spatial domain; grouped convolution instead reduces the complexity of the convolutional filters in the channel domain. This reduces the computational complexity and the number of model parameters without affecting the dimensions of the input and output feature maps [14]. Grouped convolution was introduced in Ref. [19] in order to distribute the training over several GPUs [19, 14]. A sketch of the idea follows below.
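The idea can be sketched by reusing the naive cross_correlate2d from section 2.4 (a minimal illustration, not code from the thesis):

    import numpy as np

    def grouped_conv(x, kernels, groups):
        # x: input of shape (C_in, H, W); kernels: array of shape
        # (C_out, C_in // groups, h, w). Each filter only sees the channels
        # of its own group, so computation and parameter count shrink by a
        # factor of `groups` compared to a full convolution.
        C_in, C_out = x.shape[0], kernels.shape[0]
        in_per_group, out_per_group = C_in // groups, C_out // groups
        outputs = []
        for g in range(groups):
            xg = x[g * in_per_group:(g + 1) * in_per_group]
            for K in kernels[g * out_per_group:(g + 1) * out_per_group]:
                outputs.append(sum(cross_correlate2d(xg[c], K[c])
                                   for c in range(in_per_group)))
        return np.stack(outputs)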
2.5 Network architectures
A convolutional network can be constructed in a number of ways, including varying the number of layers, the type of layers and the placement of pooling operations [9]. A common network scheme is to repeatedly stack a few convolutional layers followed by a pooling operation, and to use feed forward layers as the final layers of the network [16]. Example network architectures include [16, 20, 4]:
• AlexNet
• VGG
• R3DCNN
• GoogLeNet
2.6 AlexNet
AlexNet is a famous architecture, first introduced in ”Imagenet classification with deep convolutional neural networks” by Krizhevsky et al., which at the time achieved significantly better results on the ImageNet test set than the previous state of the art [19].
The AlexNet architecture consists of 8 layers: 5 convolutional layers and 3 fully connected layers.
Additionally, AlexNet has 3 pooling operations: the first is located between the first two convolutional layers, the second after the second convolutional layer and the third between the last convolutional layer and the first fully connected layer. Local response normalization is added after the first and second convolutional layers [16]. Convolutional layers 2, 4 and 5 utilize grouped convolutions, with 2 groups in each of these layers. The AlexNet architecture can be seen in Figure 4.
Figure 4: AlexNet network architecture, figure taken from [19].
A more detailed description can be seen in Table 1; the table is adapted from [16].
Table 1: Original AlexNet architecture (cells containing ”-” indicate that the parameter is not applicable to that layer).

Layer   Type                           Maps  Size               Groups  Kernel size  Stride  Padding  Activation
Input   Image (RGB)                    3     224x224            -       -            -       -        -
Conv 1  Convolution                    96    55x55              1       11x11        4x4     SAME     ReLU
lrn 1   Local response normalization   96    55x55              -       -            -       -        -
pool 1  Max pool                       96    27x27              -       3x3          2x2     VALID    -
Conv 2  Convolution                    256   27x27              2       5x5          1x1     SAME     ReLU
lrn 2   Local response normalization   256   27x27              -       -            -       -        -
pool 2  Max pool                       256   13x13              -       3x3          2x2     VALID    -
Conv 3  Convolution                    384   13x13              1       3x3          1x1     SAME     ReLU
Conv 4  Convolution                    384   13x13              2       3x3          1x1     SAME     ReLU
Conv 5  Convolution                    256   13x13              2       3x3          1x1     SAME     ReLU
pool 3  Max pool                       256   6x6                -       3x3          2x2     SAME     -
Fully   Fully connected layer          -     4096               -       -            -       -        ReLU
Fully   Fully connected layer          -     4096               -       -            -       -        ReLU
Fully   Fully connected layer          -     number of classes  -       -            -       -        Softmax
2.7 Transfer learning
Transfer learning is a method of utilizing pretrained weights for a different problem domain.
It exploits the property that the early layers of a network learn generic features, while the later layers learn features specific to the problem domain. Transfer learning is an essential aspect of this study; it is commonly used when there is a limited amount of data for the specific problem or when the time available for training is limited. There are two methods for training the network on the new problem domain [21]:
1. Retrain all the layers of the network.
2. Let the layers with transferred parameters remain constant during training, leaving them so-called frozen.
The first method can be used if the dataset for the new problem domain is large or the number of parameters in the model is small, while the second method is best suited when the dataset for the new problem domain is small and the number of parameters in the model is large; the second method prevents overfitting of the model [21]. A simplified illustration of transfer learning can be seen in Figure 5, and a code sketch of the second approach follows after the figure.
Figure 5: Simplified illustration of transfer learning. The visible layer (input pixels) feeds hidden layers that learn increasingly specific features: edges (1st hidden layer), corners and contours (2nd), dog shapes (3rd) and breed-specific features (4th), ending in an output layer over dog breeds such as Dingo, Pug and Collie.
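To make the second approach concrete, freezing transferred layers can be sketched in TensorFlow's Keras API (an illustration only: Keras ships no AlexNet, so the pretrained MobileNet stands in for the transferred layers; this is not the set-up used in this study, which is described in section 5):

    import tensorflow as tf

    # Load pretrained convolutional layers and freeze them
    base = tf.keras.applications.MobileNet(include_top=False, pooling="avg")
    base.trainable = False  # the transferred parameters stay constant

    # Only the new classification head is trained
    model = tf.keras.Sequential([
        base,
        tf.keras.layers.Dense(120, activation="softmax"),  # 120 dog breeds
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy")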