
Degree Project in Electrical Engineering, Second Cycle, 30 Credits
Stockholm, Sweden 2019

Deep Learning Model Compression for Edge Deployment

ASHUTOSH VAISHNAV

Abstract

Powerful deep learning algorithms today allow us to solve many difficult classification and regression tasks. However, running them on memory constrained and low power devices for efficient inference at the edge is a challenge. Our goal is to develop a highly generalizable, low complexity compression algorithm for deep neural networks. In this thesis, we propose two novel approaches to this end. The first approach learns a new network with L1-norm regularized parameters from the original trained model; this new model is trained with only a fraction of the original dataset. The second approach uses information about the second order derivative of the loss to find solutions that are robust to quantization. Combining these approaches allows us to achieve significant compression of the trained model, with only marginal loss in performance, measured using test set classification accuracy.

Keywords


Abstract (Swedish)

Powerful deep learning algorithms today make it possible to solve many difficult classification and regression problems. Running these algorithms on memory constrained and low power devices for efficient inference is a major challenge. The goal is to develop generalizable, low complexity compression algorithms that can compress deep neural networks. In this thesis we propose two new approaches to reach this goal. The first method trains a new network with L1-norm regularized parameters from the original model; the new model can be trained with only a fraction of the original data. The second approach uses information about the second derivative of the loss function to find solutions that are robust to quantization. By combining these methods we achieve substantial compression of the trained model, with only a marginal loss in performance, measured by classification accuracy on held-out test data.

Keywords


Acknowledgements

I would like to thank Hossein for his guidance and several fruitful discussions that were of great help in shaping the ideas developed in this thesis. I would also like to thank Carlo for suggesting this project and providing valuable feedback on my work. I am also thankful to my colleagues and friends for the time spent together, which made this experience even more enjoyable.


Author

Ashutosh Vaishnav

M.Sc. in Information and Network Engineering
KTH Royal Institute of Technology

Place for Project

Stockholm, Sweden

Examiner

Carlo Fischione, Full Professor
Division of Network and Systems Engineering
KTH Royal Institute of Technology

Supervisor

Hossein Shokri Ghadikolaei, Postdoctoral Fellow


Contents

1 Introduction
  1.1 Problem Statement

2 Background
  2.1 Optimization
  2.2 Artificial Neural Networks (ANNs)
    2.2.1 Training ANNs
    2.2.2 Convolutional neural networks (CNNs)

3 Related Work
  3.1 Parameter set cardinality reduction approaches
    3.1.1 Smaller architectures
    3.1.2 SVD and Tucker decomposition
    3.1.3 Pruning
  3.2 Quantization approaches
    3.2.1 Uniform affine quantizer
    3.2.2 K means clustering based quantization
    3.2.3 Quantization aware training
  3.3 Deep Compression

4 Methods
  4.1 L1 regularization before pruning (L1BP)
  4.2 Hessian Aware Quantization (HAQ)

5 Experiments and Results
  5.1 Experimental setup
    5.1.1 Datasets
    5.1.2 Models
    5.1.3 Setup
  5.2 L1BP results
  5.3 HAQ results
  5.4 L1BP + HAQ results

6 Conclusions and Future Work


List of Algorithms


List of Figures

1.1 AlexNet architecture [6]
2.1 A fully connected deep neural network architecture
2.2 Visualising CNN architecture
3.1 Uniform affine quantizer
4.1 L0 norm and L1 norm
4.2 Simple figure illustrating two solutions with similar minimum value but different sensitivity to noise
5.1 Some sample images from the MNIST dataset
5.2 10 random images from the 10 classes in the CIFAR-10 dataset [33]
5.3 Architecture of the modified AlexNet model used in setup 3
5.4 Inference accuracy vs sparsity after pruning and retraining for original and regularized 5 layer model
5.5 Inference accuracy vs sparsity after pruning and retraining for original and regularized AlexNet model
5.6 Largest eigenvalues of Hessian matrix of weights for different weight initializations
5.7 Inference accuracy vs bits per weight after quantization for different initializations after 1000 epoch training
5.8 Inference accuracy vs bits per weight after quantization for different initializations after 2000 epoch training


Chapter 1

Introduction

Machine learning approaches have revolutionized a large number of classification and regression application domains. These approaches involve learning models that infer based on past experiences (training data). Rather than focusing on model design for every unique problem, which requires a deep understanding of the problem domain, machine learning approaches rely on more generic models that can learn the patterns in the training dataset and infer based on those learnings. Such approaches become particularly useful when the problem is too complex to model properly by hand. Some examples of such problems are recognizing objects (vision), processing language, and driving a car.


unlocked the great potential of this technology.

Today, state of the art deep learning algorithms show outstanding results in multiple application domains. They are being widely used for computer vision applications, like object detection and localization. Deep recurrent neural networks, a particular class of DNNs, have been incredibly successful in speech recognition. They have enabled technologies like Google Duplex [5] which is an artificial assistant capable of making natural sounding conversations with humans. The technology combines speech recognition, language understanding, and speech synthesis, all of which use deep learning in some form.

A recent breakthrough DNN in image classification was AlexNet [6], which beat all other image classification networks by a huge margin. The competition was the ILSVRC-2010 challenge [8] of classifying ImageNet [7] images, with 1.2 million images from 1000 different classes. Figure 1.1 shows the AlexNet architecture; it has 60 million parameters. The model achieved a top-5 error rate of 17%, which means that the correct class was among the top 5 guesses of the model 83% of the time.

Figure 1.1: AlexNet architecture [6].


Training methods use algorithms that are robust to overfitting, e.g., via L1 or L2 regularization [9]. Still, the final model will have many redundancies, and there is potential to remove these redundancies once the model has been trained. This problem is of great practical interest, as a smaller model occupies less space and requires less computing power to execute, allowing us to draw inference using cheap and low power devices at the edge. Mobile phones with many apps storing multiple trained deep learning models for different uses, like face recognition and speech recognition, are a good example that illustrates the need for compressing deep learning models.

1.1

Problem Statement

Our goal is to compress trained deep learning models with a marginal performance loss, measured using test set inference accuracy. We want our compression algorithm to have the following characteristics:

• High generalizability: The algorithm should be relatively easily applicable to all deep learning models.

• Minimal access to the original dataset: Requiring access to the complete original dataset raises privacy concerns and can also be inconvenient for models with huge training datasets.

• Low complexity: Low complexity means less time to compute a compressed model from the original model, which saves resources and is particularly useful for applications where the original model is updated frequently (e.g., an online learning setting [10]).

One approach is to manually design smaller architectures: the authors of SqueezeNet [11], for example, come up with a model that has 50x fewer parameters than AlexNet and achieves similar accuracy as AlexNet. The problem with such approaches is that they require a lot of manual effort and experimentation to design a model for each application, and thus the approach is not highly generalizable. A simple but highly generalizable approach is model pruning, initially shown to be useful for network compression in 1989 by LeCun et al. [12]. The approach involves simply deleting the connections with small values. Pruning resurfaced recently in 2015 when Han et al. [13] used it to compress some state of the art deep learning models without any loss of accuracy. Another popular generalizable approach is quantization of model parameters. A simple uniform affine quantizer can be used to compress most deep learning models by 4x without any significant loss of accuracy [14]. To further improve the resulting compression gain, quantization aware training [14] is used, which uses some heuristics to simulate the effect of quantization during training of the model. This approach may lead to a learned model that is more robust to the quantization that will be applied in later stages of the compression.

In this thesis, we build on top of the existing pruning and quantization approaches. With the three stated design goals in mind, we propose two different approaches that can be stacked to get significant compression gains. The first approach trains a student network layer by layer with an L1 regularized cost; the idea is that L1 regularization promotes sparsity, so the resulting student network can be pruned to a greater extent. The second approach makes use of Hessian information to obtain a solution after training that is more robust to quantization. Reaffirming the insights suggested by Han et al. [15], we observe that these two approaches can be stacked without interfering with each other, resulting in great compression gains. We believe that the proposed approaches can find many interesting applications in the promising area of machine learning over wireless networks.


Chapter 2

Background

In this chapter, we briefly describe the fundamentals needed to follow this thesis. We start with the basics of optimization theory, followed by algorithmic solution approaches used for solving the training optimization problem of DNNs. In the next section, we cover the architecture of artificial neural networks, their training, and some relevant challenges in training along with existing solutions.

2.1

Optimization

Optimization problems are encountered in all engineering disciplines. It is common to find some trade-off between different decision variables that need to be adjusted to maximize some utility function or equivalently minimize some cost function. An optimization problem can be expressed in the form

$$\begin{aligned} \underset{x}{\text{minimize}} \quad & p(x) \\ \text{subject to} \quad & q_i(x) \le 0, \quad i = 1, \dots, n. \end{aligned} \tag{2.1}$$

Depending on the nature of the functions $p(x)$ and $q_i(x)$, there can be different ways to approach solving such problems. The optimization objective could have multiple minima, in which case we may need to examine all of them to find the optimal solution.

One particular set of problems that is relatively easy to solve, and hence widely used, is convex optimization problems. The convex nature of the objective function and constraints also makes it possible to use gradient based methods to arrive at a globally optimal solution.

Mathematically, a problem of the form (2.1) is convex if and only if the functions $p(x)$ and $q_i(x): \mathbb{R}^d \mapsto \mathbb{R}$ are convex. A function $h(x)$ is said to be convex if it satisfies
$$h(\alpha x + (1-\alpha)y) \le \alpha h(x) + (1-\alpha)h(y)$$
for all $x, y \in \mathbb{R}^d$ and all $0 \le \alpha \le 1$. This inequality essentially means that any line segment joining two points on the curve of $h(x)$ never lies below the curve itself.

The motivation for studying convex optimization problems is not just that they are easily solvable, but also that they provide insights that help solve nonconvex problems. One intuitive way is to approximate a nonconvex function with a convex function and then solve the convex problem. The solution obtained can then be used as an initialization point for solving the original nonconvex problem. Simple gradient based methods that guarantee finding the optimal solution for convex problems also form the basis for popular algorithms like Adam, which are widely used for solving nonconvex problems in neural network training.

dimension $o$. This can be formally stated as
$$\underset{w}{\text{minimize}} \quad \frac{1}{n}\sum_{i=1}^{n} L(f(x_i, w), y_i) \tag{2.2}$$
where $L(\alpha, \beta)$ refers to the loss function that measures the loss of predicting output $\alpha$ when the desired output is $\beta$, and $n$ is the number of training samples in the dataset.

In all gradient based methods for solving optimization problems, the idea is to start from some initial point and then move in the direction of the negative gradient until one gets close to an optimum. In steepest descent, or batch gradient descent, all the training samples are used for computing the gradient in each iteration. Considering the machine learning objective stated in (2.2), the gradient descent weight update rule becomes

$$w_{k+1} = w_k - \frac{\alpha_k}{n}\sum_{i=1}^{n} \nabla L(f(x_i, w_k), y_i) \tag{2.3}$$
Here, $\nabla$ denotes the gradient and $\alpha_k$ is the learning rate for the $k$-th step; it controls the step size of the update in each iteration.

In steepest descent, a very good weight update direction is computed in each step since all samples are considered; however, the computations in each step also become increasingly expensive as the size of the dataset grows. This method also allows parallelization, as gradients on different parts of the dataset can be computed independently and then aggregated in each iteration.

Another gradient based algorithm is stochastic gradient descent, where a sample is chosen randomly from the training set in each iteration for gradient computation. The weight update rule is given by

$$w_{k+1} = w_k - \alpha_k \nabla L(f(x_{i_k}, w_k), y_{i_k}) \tag{2.4}$$

where $i_k$ corresponds to the randomly chosen index of the training sample for the $k$-th iteration. Each iteration of stochastic gradient descent is thus very cheap compared to steepest descent. However, the direction of the weight update in each step is not guaranteed to be a descent direction from $w_k$; rather, it is a descent direction in expectation, as shown in [16].

Thus, we have a trade-off between the per-iteration computation cost and the expected per-iteration reduction in training loss. It can be shown that the total computation required to obtain an $\epsilon$-optimal solution for a strongly convex problem using the batch gradient method with a fixed learning rate $\alpha$ is proportional to $n \log(1/\epsilon)$, while for the stochastic gradient method the same quantity is proportional to $1/\epsilon$ [16]. Thus, batch gradient descent can be faster for small datasets, but in most practical deep learning scenarios with huge datasets, stochastic gradient descent performs better. To also exploit the parallelization capabilities of batch gradient descent, a hybrid of both methods called mini batch gradient descent is popularly used.

In mini batch gradient descent, $m$ samples (where $m < n$) are randomly chosen from the training set and used for gradient computation in each step. The weight update rule is given by
$$w_{k+1} = w_k - \frac{\alpha_k}{m}\sum_{i=1}^{m} \nabla L(f(x_{i_k}, w_k), y_{i_k}) \tag{2.5}$$

The choice of m allows us to select a good point in this trade-off between batch and stochastic gradient descent.

In all of the discussion so far, we did not pay much attention to the learning rate $\alpha_k$. A high learning rate can cause training to overshoot and even diverge from the optimum, while a very small learning rate slows down training, increasing the number of iterations required to converge.

A common remedy is to use a learning rate schedule that lets training move faster in the beginning, when the solution is far from the optimum, and then slow down near the optimum. One such learning rate schedule is exponential decay, where the learning rate is updated as follows.

$$\alpha_k = \alpha_0 e^{-\beta k} \tag{2.6}$$

Another idea is to use second order gradient information to decide the learning rate, for example using Newton's method. Evaluating the Hessian of the loss tells us how fast the gradient of the loss surface changes in different directions. In Newton's method, the learning rate is selected to be inversely proportional to the Hessian of the loss. However, the problem with this method is that computing the Hessian requires a number of second derivative evaluations that grows with the square of the number of weights in the model. These computations are expensive and not feasible for large networks with millions of parameters.

Thus, there are heuristics that emulate similar behaviour by carrying over knowledge of the gradients or gradient updates from past iterations. One popular technique is using momentum [17]. The weight update rule is given by

$$w_{k+1} = w_k - \alpha_k \nabla L(w_k) + \beta_k (w_k - w_{k-1}) \tag{2.7}$$

$\alpha_k$ and $\beta_k$ are usually constant values that suit the optimization problem. The third term in this equation is the momentum, which accumulates recursively as the weights keep getting updated in the same direction.

Another popular technique is RMSProp [18], which is short for root mean square propagation. In RMSProp, an exponentially weighted average of the squares of the gradients is used to compute the learning rate in each step. The weight update rule is given by

$$v_k = \rho v_{k-1} + (1-\rho)\,\nabla L(w_k)^2 \tag{2.8}$$
$$w_{k+1} = w_k - \frac{\alpha}{\sqrt{v_k} + \epsilon}\,\nabla L(w_k) \tag{2.9}$$

One key difference from the momentum optimizer is that the learning rate is different for the different parameters that form the vector $w_k$ in each iteration. The exponentially averaged term in the denominator means that parameters that were updated with large values in the past few steps get a smaller learning rate in the next step. This helps prevent the optimizer from overshooting the optimum. Combining the ideas of momentum and RMSProp, we get Adam [19], perhaps the most popular optimization algorithm used for training DNNs. First, estimates of the first and second order moments are calculated iteratively:

$$m_k = \beta_1 m_{k-1} + (1-\beta_1)\nabla L(w_k) \tag{2.10}$$
$$v_k = \beta_2 v_{k-1} + (1-\beta_2)\nabla L(w_k)^2 \tag{2.11}$$

To overcome the bias towards 0 in the initial steps, caused by $m_0$ and $v_0$ being initialized to 0, bias-corrected estimates are calculated:
$$\hat{m}_k = \frac{m_k}{1-\beta_1^k} \tag{2.12}$$
$$\hat{v}_k = \frac{v_k}{1-\beta_2^k} \tag{2.13}$$

Finally, these estimates are used to update the parameters in each iteration.

$$w_{k+1} = w_k - \frac{\eta}{\sqrt{\hat{v}_k} + \epsilon}\,\hat{m}_k \tag{2.14}$$

The resulting algorithm is presented as algorithm 1.

2.2

Artificial Neural Networks (ANNs)


Algorithm 1 Adam algorithm for stochastic optimization
Require: α : step size
    β1, β2 ∈ [0, 1) : exponential decay rates for the moment estimates
    L(f(x, w), y) : objective function with parameters w
    w0 : initialization value of w
Initialize: m0 ← 0, v0 ← 0, k ← 0
while wk not converged do
    k ← k + 1
    gk ← ∇w L(f(x, wk−1), y)
    mk ← β1 mk−1 + (1 − β1) gk
    vk ← β2 vk−1 + (1 − β2) gk²
    m̂k ← mk / (1 − β1^k)
    v̂k ← vk / (1 − β2^k)
    wk ← wk−1 − α m̂k / (√v̂k + ε)
end while
return wk (trained parameters)
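A minimal NumPy sketch of Algorithm 1 is given below. The function name, the default hyperparameter values, and the toy objective in the usage lines are illustrative assumptions rather than details from the thesis; grad_fn stands for any routine returning a stochastic gradient of the loss at the current weights.

import numpy as np

def adam(grad_fn, w0, alpha=1e-3, beta1=0.9, beta2=0.999,
         eps=1e-8, num_steps=1000):
    # m, v: first and second moment estimates, initialized to zero
    w = w0.astype(float).copy()
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for k in range(1, num_steps + 1):
        g = grad_fn(w)                      # stochastic gradient at w
        m = beta1 * m + (1 - beta1) * g     # eq. (2.10)
        v = beta2 * v + (1 - beta2) * g**2  # eq. (2.11)
        m_hat = m / (1 - beta1**k)          # bias correction, eq. (2.12)
        v_hat = v / (1 - beta2**k)          # bias correction, eq. (2.13)
        w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)  # eq. (2.14)
    return w

# Usage: minimize f(w) = ||w - 3||^2 with a noisy gradient
grad = lambda w: 2 * (w - 3.0) + 0.1 * np.random.randn(*w.shape)
print(adam(grad, np.zeros(5), num_steps=5000))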

An artificial neural network models an unknown target function $f^*$ with a parametric function $f(x, W)$, parameterized by some weights $W$, that tries to best approximate $f^*$. The weights $W$ are learned by solving the optimization problem stated in (2.2).

To understand the design of the neural network function $f(x, W)$, let us start with a simple two class classification problem. The input samples $x$ belong to two categories represented by $y \in \{0, 1\}$. One way to solve this problem is to use linear regression to fit a line to the given data samples.

$$\hat{y} = w^T x + w_0$$

This model seems very elegant and is easily solvable, but it has problems: for instance, our estimates $\hat{y}$ can take values larger than 1 or smaller than 0. Thus, a much better idea is to squash the output of the linear model to the interval (0, 1) with a nonlinear sigmoid function.

$$\hat{y} = g(w^T x + w_0)$$

where $g(z) = 1/(1 + e^{-z})$. This function represents what we call a single layer neural network model. The layer consists of weights $w$ that scale the different components of the input $x$, followed by the nonlinear activation function $g(z)$.

The limitation of the previous approach is that it is still incapable of modelling simple functions like the binary XOR function. This can be solved by adding another layer to the single layer model. The effective function modelled by such an architecture is given by,

$$\hat{y} = g_1(W_1^T(g_0(W_0^T x + w_0^0)) + w_0^1) \tag{2.15}$$

When we compose many such functions together, stacking multiple layers, it is called a deep neural network. The same architecture works for multi-class classification as well by having multiple nodes in the final layer. The functions modelled by a deep neural network can be expressed as follows.

$$\begin{aligned} f_1(x) &= g_1(W_0^T x + w_0^0) \\ f_2(x) &= g_2(W_1^T f_1(x) + w_0^1) \\ &\;\;\vdots \\ f_K(x) &= g_K(W_{K-1}^T f_{K-1}(x) + w_0^{K-1}) \end{aligned} \tag{2.16}$$

Here $f_k(x)$ represents the output of the $k$-th layer of the model, and $K$ is the total number of layers in the model.

2.2.1

Training ANNs

Figure 2.1: A fully connected deep neural network architecture

The computation of the gradient with respect to a weight in the first layer requires knowledge of some gradients from all later layers. This computation can be done efficiently using an algorithm called backpropagation.

Backpropagation. The basic principle used in backpropagation is the chain rule of derivatives. Consider functions $f$ and $g$ mapping real numbers to real numbers. For a real number $x$, let $y = g(x)$ and $z = f(g(x)) = f(y)$. The chain rule says that
$$\frac{dz}{dx} = \frac{dz}{dy}\frac{dy}{dx} \tag{2.17}$$

In case of a DNN, the chain rule can be applied as shown in algorithm 2 to compute gradients starting from the last layer of the model. For more details, readers may refer to [20].

Exploding and vanishing gradients. This is a challenge when training deep networks. To understand it, consider some computational graph that involves repeated multiplication by some matrix $W$. After $K$ steps, this is equivalent to multiplying by $W^K$. Say $W$ has an eigendecomposition $W = V\,\mathrm{diag}(\lambda)\,V^{-1}$; then

$$W^K = V\,\mathrm{diag}(\lambda)^K\,V^{-1} \tag{2.18}$$

Algorithm 2 Backpropagation algorithm for training neural networks
Require: (x, y) : pair of input and output
Do a forward pass to compute the activation values {gk}, k = 1, …, L, then compute the gradient on the output layer:
    grad ← ∇ŷ J = ∇f(x,W) L(f(x, W), y)
for k = L, L − 1, …, 1 do
    Multiply the evaluated gradient by the partial derivative of the activation function:
        grad ← ∇gk L(f(x, W), y) = grad · f′(gk)
    Evaluate the gradients of the pre-activation w.r.t. the biases and weights:
        ∇bk J = grad
        ∇Wk J = grad · gk−1
end for
return ∇Wk J, ∇bk J, k = 1, …, L (gradients w.r.t. all parameters)
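To make the flow of Algorithm 2 concrete, the following NumPy sketch performs one forward and backward pass through a small two-layer network with sigmoid activations and a squared error loss. The layer sizes, activation, and loss are illustrative assumptions, not the models used later in the thesis.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_backward(x, y, W1, b1, W2, b2):
    # Forward pass: store pre-activations and activations
    z1 = W1 @ x + b1
    a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2
    a2 = sigmoid(z2)
    loss = 0.5 * np.sum((a2 - y) ** 2)

    # Backward pass: apply the chain rule (eq. 2.17) layer by layer
    grad = (a2 - y) * a2 * (1 - a2)       # dL/dz2
    dW2 = np.outer(grad, a1)              # gradient w.r.t. W2
    db2 = grad
    grad = (W2.T @ grad) * a1 * (1 - a1)  # dL/dz1, propagated back
    dW1 = np.outer(grad, x)
    db1 = grad
    return loss, (dW1, db1, dW2, db2)

# Usage: one gradient evaluation on random data
rng = np.random.default_rng(0)
x, y = rng.normal(size=4), np.array([0.0, 1.0])
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)
loss, grads = forward_backward(x, y, W1, b1, W2, b2)
print("loss:", loss)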

The problem of vanishing and exploding gradients is that the gradients also scale according to $\mathrm{diag}(\lambda)^K$. Thus, apart from the eigenvalues with magnitude close to 1, all other eigenvalues cause the corresponding gradient components to either explode, if they are greater than 1, or vanish, if they are less than 1. Vanishing gradients make it difficult to know in which direction the parameters should be updated to reduce the loss. Exploding gradients make the learning unstable.

Activation function. In equation (2.16), the function $g_k$ is known as the activation function of layer $k$. The choice of this function obviously affects the shape and curvature of the final function modelled by our network. We already defined the sigmoid activation unit earlier in this section. To tackle the vanishing gradients problem in DNNs, the rectified linear unit (ReLU) activation was proposed [6], which gives much better performance than sigmoid or hyperbolic tangent functions in DNN training. The function is given by

$$g_k(z) = \max\{0, z\} \tag{2.19}$$

As we can see, the gradient never saturates for z > 0 which helps with gradient based training algorithms.

At the output layer, the sigmoid and softmax activations are used for two class classification and multiclass classification, respectively. The softmax function is given by

$$\mathrm{softmax}(z)_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)} \tag{2.20}$$

where $i, j$ belong to the set of class indices. Thus, the output $\mathrm{softmax}(z)_i$ can be seen as an estimate of the probability that a sample belongs to the $i$-th class. The negative log likelihood loss function can undo the saturation behaviour of these activations.

Loss function. The mean square error between the predicted output and the actual output is one choice of loss function. However, the negative log likelihood loss function works well with sigmoid or softmax, as it can undo their saturation behaviour and makes the loss minimization problem convex for a single layer network. As we just discussed, considering the output of softmax as the probability of a sample belonging to the $i$-th class, the maximum likelihood estimate can be obtained by minimizing the negative log likelihood. The resulting loss function is given by the expression

$$L(W) = -\mathbb{E}_{x,y\sim \hat{p}_{\mathrm{data}}} \log p_W(y|x) \tag{2.21}$$

where $\hat{p}_{\mathrm{data}}$ refers to the distribution of the training data, while $p_W$ refers to the distribution of the model parameterized by $W$. This loss function is also known as the cross entropy loss function.
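A short NumPy sketch of the softmax output (2.20) and the resulting cross entropy loss (2.21) over a batch of logits; subtracting the row maximum before exponentiation is a standard numerical-stability trick assumed here, not something stated in the text.

import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; does not change the result
    e = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

def cross_entropy(logits, labels):
    # Negative log likelihood of the correct class, averaged over the batch
    p = softmax(logits)
    n = logits.shape[0]
    return -np.mean(np.log(p[np.arange(n), labels] + 1e-12))

# Usage
logits = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
labels = np.array([0, 2])
print(cross_entropy(logits, labels))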

2.2.2

Convolutional neural networks (CNNs)

To understand CNNs, let us first define the convolution operation. Convolution can be seen as multiplying a window of weights with an input signal, which is generally longer than the window, to compute a weighted average. As we move this window along the input signal, we get different weighted averages corresponding to different parts of the signal.

$$s(t) = (x * w)(t) \tag{2.22}$$
$$s(t) = \int x(a)\, w(t-a)\, da \tag{2.23}$$

If the input signal is, say, a grayscale image, we can construct a 2D kernel of weights that moves along the two spatial directions during convolution. Thus, the resulting discrete convolution operation can be expressed as

$$S(i,j) = (I * K)(i,j) = \sum_m \sum_n I(m,n)\, K(i-m,\, j-n) \tag{2.24}$$
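The following naive NumPy sketch makes the index arithmetic of (2.24) concrete. It computes the closely related cross-correlation (no kernel flip), which is what CNN layers typically implement; single channel, unit stride, and valid padding are simplifying assumptions.

import numpy as np

def conv2d(image, kernel):
    # Valid cross-correlation of a 2D image with a 2D kernel.
    # Flipping the kernel first would give the convolution of eq. (2.24).
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Usage: a 3x3 vertical-edge kernel applied to a random "image"
img = np.random.rand(8, 8)
k = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]], dtype=float)
print(conv2d(img, k).shape)   # (6, 6)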

Figure 2.2: Visualising CNN architecture

Just like in fully connected networks, the weighted sum of inputs is followed by some nonlinear activation function, such as sigmoid or ReLU. CNNs usually also have a third operation, called the detector stage, that uses a pooling function to reduce the dimensionality of the output of a convolution layer.


Chapter 3

Related Work

Multiple methods have been proposed and implemented in the literature to compress trained neural network models, without significantly affecting inference accuracy. We can classify most of those methods into two categories.

The first category can be defined as the set of algorithms that reduce the cardinality of the parameter set, i.e., the number of parameters in the model. The result obtained from using such algorithms is a modified model topology with fewer parameters and connections. Note that the model topology itself occupies far less storage space than the parameter values. Thus, if all parameters occupy the same storage space, the compression gain is very close to the ratio of the number of parameters in the original model to that of the new model.

The second category consists of quantization approaches, which reduce the number of bits used to represent each parameter by mapping parameters to a smaller set of representative values. Some additional storage is then required to store the mapping of bit sequences to representative values.

In this chapter, we will discuss popular compression methods that have already been proposed in the literature. The chapter is structured so that we discuss the methods from the aforementioned categories in order. Towards the end of this chapter, we discuss the benchmark frameworks for model compression that consist of a combination of approaches from both these categories.

3.1 Parameter set cardinality reduction approaches

3.1.1

Smaller architectures

This is the first approach for reducing cardinality of the parameter set. When designing large DNNs, micro-architecture refers to the design of building blocks or modules and macro architecture refers to how those modules are organized in the network. Smaller architecture can be considered as an umbrella term that covers all micro-architecture and macro-architecture design space exploration strategies for designing architectures with fewer parameters.

One of the biggest recent breakthroughs in this area was SqueezeNet [11]. SqueezeNet proposed a new architecture, with 50x fewer parameters than AlexNet, that achieves AlexNet-level or higher accuracy on ImageNet. Some of the core ideas from that work are as follows.

• The authors propose ‘fire modules’ as the building blocks of a CNN architecture. Each fire module involves one squeeze layer followed by one expand layer.

• The squeeze layer consists of 1x1 convolution filters. The idea behind the squeeze layer is to reduce the number of input channels to the next layer containing 3x3 filters, thereby reducing the total number of parameters in a convolution kernel.

• The expand layer consists of 1x1 and 3x3 convolution filters.

• Another strategy used is to delay downsampling: increasing the stride or adding pooling layers is only done late in the network. The intuition is that a large number of activation values early in the network helps achieve higher classification accuracy.

• By systematically exploring the hyperparameter space for each layer using fire modules, as well as the ordering of layers, the authors arrive at the final architecture of SqueezeNet.

• The resulting model, with 50x fewer parameters than AlexNet, can be further compressed using Deep Compression [15] (discussed later in this chapter) to get a trained model 510x smaller than AlexNet.

The great compression gain achievable by combining model architecture design strategies with generic model compression methods like Deep Compression demonstrates the level of redundancy in state of the art DNN architectures. However, it must also be noted that the model design approaches proposed for SqueezeNet are not easily applicable to new models and require a lot of manual effort to design such a model for a very specific application.

3.1.2

SVD and Tucker decomposition

This is the second approach for reducing the cardinality of the parameter set. Singular value decomposition (SVD) is a widely studied method to decompose a linear transformation described by a matrix into a product of orthogonal and diagonal factors [21], as described in the following expression.
$$A = U \Sigma V^T \tag{3.1}$$

Here $U$ and $V$ are orthogonal matrices composed of the singular vectors $u_i$ and $v_i$ respectively, and $\Sigma$ is a diagonal matrix containing the singular values $\sigma_i$. The same expression can be restated as a sum of outer products of singular vectors as described below.

$$A = \sigma_1 u_1 v_1^T + \cdots + \sigma_r u_r v_r^T \tag{3.2}$$

By discarding the components with the smallest singular values, we can reconstruct the linear transformation achieved by the original network with fewer parameters and computations, at the price of a small error. The authors of [22] extended this approach to deep convolutional networks. Their result is a one-shot compression algorithm for the whole network; however, the network needs to be fine-tuned (retrained) to recover the lost performance. The authors report 5.46x and 2.67x reductions in total weights and FLOPs. The core ideas of their algorithm can be described as follows.

• The first step is selecting a lower rank value using variational Bayesian matrix factorization.

• The layers from the second convolutional layer to the first fully connected layer are decomposed using Tucker decomposition. The result is a set of 3 expressions that can be used to approximate the output of the original layers with a smaller number of operations.

• All remaining fully connected layers are decomposed using SVD.

• Finally, the model is fine-tuned (retrained on the whole dataset for 10 epochs) to recover close to the original inference accuracy.
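As a sketch of the factorization in (3.1)-(3.2), the snippet below approximates a fully connected layer's weight matrix by a rank-r product of two smaller matrices. The layer size and the fixed rank are arbitrary illustrative choices; the work in [22] instead selects the rank with variational Bayesian matrix factorization.

import numpy as np

def truncated_svd(W, r):
    # Return factors A (m x r) and B (r x n) with W ≈ A @ B
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * S[:r]          # absorb singular values into U
    B = Vt[:r, :]
    return A, B

# Usage: compress a 512x1024 layer to rank 64
W = np.random.randn(512, 1024)
A, B = truncated_svd(W, r=64)
orig_params = W.size
new_params = A.size + B.size
err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
print(f"params: {orig_params} -> {new_params}, relative error: {err:.3f}")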


3.1.3

Pruning

Pruning is perhaps the most popular of the algorithms that fall in the first category. Network pruning describes the removal of weights that take very small values after training a neural network. Such connections contribute very little to the output of the layer, and hence their removal does not have a big impact on the performance of the neural network. The approach was demonstrated to avoid over-fitting and reduce network complexity in 1989 by LeCun et al. [12]. In 2015, pruning was used by Han et al. [13] to compress state of the art CNN models to about a tenth of their original size without causing any loss of accuracy.

3.2

Quantization approaches

This section covers three common methods for quantizing weights in a neural network.

3.2.1

Uniform affine quantizer

Consider a floating point variable with values within some range $(x_{min}, x_{max})$. The variable needs to be quantized to $N$ bits. We define the quantized variable to take integer values in the range $(0, 2^N - 1)$. Then, we need two parameters, the scale $\Delta$ and the zero point $z$, to map between the floating point and quantized representations.

The quantization operation can be formally stated as
$$x_Q = \begin{cases} 0 & x \le a \\ \mathrm{round}(x/\Delta) + z & a \le x \le b \\ 2^N - 1 & x \ge b \end{cases} \tag{3.3}$$
where $a = -\Delta z$ and $b = (2^N - 1 - z)\Delta$ are the minimum and maximum values representable without error after quantization, respectively.

The de-quantization operation can be described as

$$x_{deQ} = (x_Q - z)\Delta \tag{3.4}$$

A special case of the affine quantizer is the uniform symmetric quantizer, which restricts the zero point to 0 for an unsigned variable or to $2^N/2$ for a signed variable.

When storing the model after quantization, we only need to store $\Delta$ and $z$ in addition to the quantized values. Thus, the compression gain of the uniform affine quantizer is almost equal to $32/N$ for a model with a large number of parameters.
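A minimal NumPy sketch of the uniform affine quantizer of (3.3)-(3.4). Deriving the scale and zero point from the observed minimum and maximum of the tensor is a common convention assumed here.

import numpy as np

def affine_quantize(x, num_bits=8):
    qmax = 2 ** num_bits - 1
    x_min, x_max = float(x.min()), float(x.max())
    delta = (x_max - x_min) / qmax            # scale (step size)
    z = int(round(-x_min / delta))            # zero point
    x_q = np.clip(np.round(x / delta) + z, 0, qmax).astype(np.int64)  # eq. (3.3)
    return x_q, delta, z

def affine_dequantize(x_q, delta, z):
    return (x_q - z) * delta                  # eq. (3.4)

# Usage: quantize a weight tensor to 4 bits and measure the error
w = np.random.randn(1000).astype(np.float32)
w_q, delta, z = affine_quantize(w, num_bits=4)
w_hat = affine_dequantize(w_q, delta, z)
print("max abs error:", np.max(np.abs(w - w_hat)))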

3.2.2

K means clustering based quantization

To understand this quantization method, let us first look at the k-means clustering algorithm. Given some sample data points, the k-means clustering algorithm aims to partition the given dataset into k clusters such that the weighted average of cluster variance is minimized. If we consider $\{x_1, \dots, x_n\}$ to be the samples in the dataset, and $C = \{C_1, \dots, C_k\}$ denotes the clusters after partitioning, we want to solve
$$\underset{C}{\arg\min} \sum_{i=1}^{k} \sum_{x \in C_i} \|x - \mu_i\|^2 \tag{3.5}$$
where $\mu_i$ is the mean of the points in cluster $C_i$.

The k-means clustering algorithm, also known as Lloyd's algorithm, uses a two-step iterative method. In the beginning, k centroids for k clusters are randomly initialized. In the first step of each iteration, all data samples are assigned to the cluster with the nearest centroid. In the second step, new centroid locations are calculated by taking the mean of all the points in each cluster. This two-step process is repeated until the algorithm converges. Convergence is guaranteed as the number of iterations approaches infinity [23]. However, the obtained solution is not necessarily optimal and can vary for different initializations of the centroids.

The objective function stated above can also be seen as minimizing the quantization error variance, if the k centroids are treated as representative quantization levels. In k-means clustering based quantization, k representative levels are determined using the k-means algorithm. After quantization, all the samples can thus be expressed using symbols that occupy $\log_2 k$ bits each. It is worth noting that the mapping of representative levels to stored symbols also needs to be stored, requiring some extra storage. Directly using the compressed model for inference also requires looking up this symbol-to-value translation for each parameter, adding $O(kn)$ extra computation time for a compressed model with n parameters and k centroids.
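The sketch below quantizes a weight vector with k-means clustering, writing out Lloyd's two-step iteration directly in NumPy. The random initialization, the number of iterations, and the choice of k are illustrative assumptions.

import numpy as np

def kmeans_quantize(w, k, iters=50, seed=0):
    # Quantize a 1-D weight array to k shared values using Lloyd's algorithm
    rng = np.random.default_rng(seed)
    centroids = rng.choice(w, size=k, replace=False)          # random init
    for _ in range(iters):
        # Assignment step: nearest centroid for every weight
        labels = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
        # Update step: recompute each centroid as the mean of its cluster
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = w[labels == j].mean()
    return labels, centroids       # store labels (log2 k bits each) plus the codebook

# Usage: quantize weights to 16 shared values (4 bits per weight)
w = np.random.randn(10000)
labels, codebook = kmeans_quantize(w, k=16)
w_hat = codebook[labels]
print("quantization MSE:", np.mean((w - w_hat) ** 2))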

3.2.3

Quantization aware training

In quantization aware training, the effect of quantization is simulated during training: the value used in the forward pass is updated as follows in each step of training.
$$x_{out} = \mathrm{SimQuant}(x) = \Delta(x_Q - z) \tag{3.6}$$

Figure 3.1: Uniform affine quantizer

The derivative of this simulated quantizer is zero almost everywhere, as can be seen from figure 3.1. This causes a problem when updating weights using the backpropagation algorithm, forcing the updates to be zero. To solve this, the simulated quantizer is approximated with the following function.

$$x_{out} = \begin{cases} x_{min} & x \le x_{min} \\ x & x_{min} \le x \le x_{max} \\ x_{max} & x \ge x_{max} \end{cases} \tag{3.7}$$

Thus, the weight update rule for the gradient based methods discussed in section 2.1, combined with quantization aware training, is
$$w_{float} = w_{float} - \eta \frac{\partial L}{\partial w_{out}} \cdot I_{w_{float} \in (w_{min}, w_{max})} \tag{3.8}$$
where $I_{x \in (a,b)}$ refers to the indicator function, which is 1 when $x \in (a,b)$ and 0 otherwise. Readers may refer to [14] for more details.
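A NumPy sketch of the mechanism described above: the forward pass uses the quantize-dequantize mapping of (3.6), while the backward pass uses the approximation of (3.7)-(3.8), passing the gradient through unchanged for weights inside the representable range and zeroing it outside. The fixed quantization range and the toy gradient in the usage lines are assumptions for illustration.

import numpy as np

def sim_quant(w, w_min, w_max, num_bits=8):
    # Forward pass of the simulated quantizer (eq. 3.6): quantize, then de-quantize
    qmax = 2 ** num_bits - 1
    delta = (w_max - w_min) / qmax
    z = round(-w_min / delta)
    w_q = np.clip(np.round(w / delta) + z, 0, qmax)
    return (w_q - z) * delta

def ste_grad(grad_wrt_out, w_float, w_min, w_max):
    # Approximation of (3.7)-(3.8): pass the gradient through unchanged
    # for weights inside (w_min, w_max), zero it outside
    inside = (w_float > w_min) & (w_float < w_max)
    return grad_wrt_out * inside

# Usage: one quantization aware update of a toy weight vector
w = np.random.randn(100)                          # float "shadow" weights
w_min, w_max = -2.0, 2.0                          # assumed quantization range
w_out = sim_quant(w, w_min, w_max, num_bits=4)    # the forward pass sees these values
grad = 2 * w_out                                  # toy gradient dL/dw_out for L = ||w_out||^2
w = w - 0.01 * ste_grad(grad, w, w_min, w_max)    # eq. (3.8): update the float weights
print("updated weights:", w[:3])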

3.3

Deep Compression

Deep Compression [15] was proposed in 2015 by Song Han et al. and achieves 35x compression on AlexNet without any loss of accuracy. It can be considered a modern benchmark for generic model compression approaches. It combines network pruning, k-means clustering based quantization, and Huffman coding to compress any deep neural network. The main motivation behind the development of Deep Compression was to reduce storage, and thereby also the energy consumed in fetching weights for computations, for battery constrained mobile applications. The steps involved in Deep Compression are described below.

• The first step is pruning of weights, which involves removing all connections with weights below some threshold value.

• Then the network is retrained, and the resulting sparse structure is stored in a way that further reduces the storage space used.

• The next step involves using k-means clustering to determine the centroids of the bins for quantization (the authors also analyze different centroid initialization approaches for k-means, suggesting that linear initialization works best). This way, we are left with only a few shared weights for each layer.


• Finally, the quantized weights are further compressed using Huffman coding.

Using methods like k-means clustering based quantization and Huffman coding allows us to greatly reduce the storage required to store the model. However, conventional GPU implementations cannot exploit the reduced number and precision of the representative weights to reduce computation time and energy. Towards this end, the authors have also developed the EIE hardware accelerator [24] in other work, which operates directly on the compressed model to reap the full benefits of the Deep Compression algorithm in terms of computation speedup.


Chapter 4

Methods

4.1

L1 regularization before pruning (L1BP)

Our first proposal is a better algorithm for parameter set cardinality reduction. We discussed some popular methods that fall in this category in the previous chapter. Methods that involve manually designing a new architecture with fewer parameters (e.g., SqueezeNet [11]) can achieve great compression but are limited to very specific applications. Methods using matrix decomposition to reduce the number of parameters and computations perform relatively well in reducing model execution time, but their compression gains are limited as they do not take the nonlinearity of the activation function into consideration. Of all the methods discussed earlier, it is only the pruning approach that meets our design requirements of a compression algorithm that is highly generalizable, requires minimal access to the original dataset, and is of low complexity. However, it is a very simple approach whose performance could be significantly improved if we were to prune a solution that is already sparse.

Our goal is therefore to learn a sparse solution with the same architecture as the original model. We use the same architecture as it keeps our approach readily applicable to any new DNN model. Systematically exploring new architectures for the sparse solution, by, say, merging layers together or introducing residual connections (Szegedy et al. [25]), is an avenue which was not explored given the time frame of this project.

With the stated goal in mind, we now formulate our approach. Consider a model that has been trained, resulting in the solution set $\{W^*\}$. We wish to find a new solution set $\{\bar{W}^*\}$ that is sparse but manages to approximate the output of the original solution set $\{W^*\}$. To obtain this new solution, we can define the following optimization problem.

$$\underset{\bar{W}}{\text{minimize}} \quad L(y, f(x, \bar{W})) + \lambda \|\bar{W}\|_0 \tag{4.1}$$
where: $L$ = loss function, $f$ = equivalent function for the whole model, $x$ = input to the model, $y$ = desired output of the model.

Here, $\|\bar{W}\|_0$ counts the number of non-zero values in $\bar{W}$. Minimizing the L0 norm directly enforces sparsity; however, the resulting problem is non-convex, as can be seen from figure 4.1. Therefore, we relax it to the L1 norm to make the problem easier to solve. The modified problem is given as
$$\underset{\bar{W}}{\text{minimize}} \quad L(y, f(x, \bar{W})) + \lambda \|\bar{W}\|_1 \tag{4.2}$$

This problem is the same as the objective of the original model, just with added L1 regularization. However, retraining the complete model adds significant computational complexity and requires full access to the original dataset. These issues can be circumvented by training the new model layer by layer, using the output of the trained model as the desired output of our new sparse model. Thus, we define the following optimization problem for each layer $i$ in the model.

$$\underset{\bar{w}_i}{\text{minimize}} \quad L(f_i(x_i, w_i), f_i(x_i, \bar{w}_i)) + \lambda_i \|\bar{w}_i\|_1 \tag{4.3}$$
where: $L$ = loss function, $f_i$ = function modelled by the $i$-th layer, $\bar{w}_i$ = new model weights in the $i$-th layer, $w_i$ = trained model weights in the $i$-th layer, $x_i$ = input to the $i$-th layer.

The optimization problems are solved using the Adam optimizer, starting with the first layer problem, followed by the second layer problem, and so on. The values of $\lambda_i$ are selected manually such that the regularization cost is about 10% of the total cost. Training each layer to match the outputs of the already trained model also greatly reduces the amount of training data required to train the new sparse model.
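A sketch of the per-layer objective (4.3) for a single fully connected ReLU layer: the student weights are trained to match the frozen teacher layer's outputs on a small batch of layer inputs, with an L1 penalty, using plain (sub)gradient descent. The closed-form gradient, the layer sizes, and the hyperparameter values are illustrative assumptions.

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def layer_objective_and_grad(Ws, Wt, X, lam):
    # Objective (4.3) for one ReLU layer and its (sub)gradient w.r.t. the student weights Ws.
    # Wt: frozen teacher weights (out x in), X: batch of layer inputs (in x batch).
    teacher = relu(Wt @ X)                      # f_i(x_i, w_i)
    zs = Ws @ X
    student = relu(zs)                          # f_i(x_i, w_bar_i)
    diff = student - teacher
    loss = np.mean(diff ** 2) + lam * np.sum(np.abs(Ws))
    grad = (2.0 / diff.size) * ((diff * (zs > 0)) @ X.T) + lam * np.sign(Ws)
    return loss, grad

# Usage: distill one 32 -> 16 teacher layer into a sparser student layer
rng = np.random.default_rng(0)
Wt = rng.normal(size=(16, 32))
Ws = Wt.copy()                                  # initialize the student from the teacher
X = rng.normal(size=(32, 256))                  # a small batch of layer inputs
for _ in range(1000):                           # plain (sub)gradient descent
    loss, grad = layer_objective_and_grad(Ws, Wt, X, lam=1e-2)
    Ws -= 0.05 * grad
print("final loss:", loss, "- fraction of small weights:", np.mean(np.abs(Ws) < 1e-2))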

Once a solution $\bar{w}_i$ is obtained for all layers $i$, we proceed to prune this new model. The intuition is that pruning the regularized model should lead to a smaller loss in performance compared with pruning the original model, as the L1-regularized model already tends to be sparse. The remaining weights $\bar{w}_i$ after pruning are then retrained to recover any loss in accuracy, as mentioned in [15].

We observed that, compared to one-shot pruning of the model to the target sparsity followed by retraining, a gradual pruning procedure as proposed by Zhu and Gupta [26] gives significantly better performance. The model is first pruned to some initial target sparsity, retrained, pruned to a higher target sparsity and retrained, and so on. The target sparsity in each of the gradual pruning steps is described by the following equation.
$$s_t = s_f + (s_i - s_f)\left(1 - \frac{t - t_0}{n\Delta t}\right)^3 \quad \text{for } t \in \{t_0,\, t_0 + \Delta t,\, \dots,\, t_0 + n\Delta t\} \tag{4.4}$$
where: $s_i$ = initial sparsity, $s_t$ = sparsity at step $t$, $s_f$ = final sparsity, $\Delta t$ = pruning frequency, $n$ = total pruning steps.
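The schedule (4.4) and simple magnitude pruning to the scheduled sparsity can be sketched as follows; the retraining between pruning steps is only indicated by a comment, and the values of t0, Δt and n are illustrative.

import numpy as np

def target_sparsity(t, s_i, s_f, t0, dt, n):
    # Sparsity schedule of eq. (4.4) for t in {t0, t0+dt, ..., t0+n*dt}
    frac = (t - t0) / (n * dt)
    return s_f + (s_i - s_f) * (1.0 - frac) ** 3

def magnitude_prune(w, sparsity):
    # Zero out the smallest-magnitude fraction `sparsity` of the weights
    k = int(sparsity * w.size)
    if k == 0:
        return w.copy()
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.where(np.abs(w) <= thresh, 0.0, w)

# Usage: prune a weight tensor from 0% to 90% sparsity over 10 steps
w = np.random.randn(100, 100)
t0, dt, n = 0, 100, 10
for t in range(t0, t0 + n * dt + 1, dt):
    s_t = target_sparsity(t, s_i=0.0, s_f=0.9, t0=t0, dt=dt, n=n)
    w = magnitude_prune(w, s_t)
    # (in the actual algorithm the model would be retrained for dt steps here)
print("final sparsity:", np.mean(w == 0.0))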


Algorithm 3 L1BP algorithm for model compression
Require: L : number of layers in the model
    wi : trained model weights for all layers
    λi : regularization coefficients for all layers
    sf : final sparsity
    Δt : pruning frequency
    n : total pruning steps
Initialize: w̄i ← w̄i0, i ← 1, si ← 0, t ← t0
while i ≤ L do
    Train new layer-i weights w̄i using the Adam optimizer (Algorithm 1) with the objective stated in equation (4.3)
    i ← i + 1
end while
while t ≤ t0 + nΔt do
    Prune the complete model to target sparsity st (equation (4.4)) and retrain for Δt steps
    t ← t + Δt
end while


4.2

Hessian Aware Quantization (HAQ)

Our second proposal is for improving the performance of quantization methods used for compressing DNNs. A lot of the research in this area has been focused on developing better schemes for quantizing neural network weights, such as the k-means clustering proposed by Han et al. [13] and a lossy compression algorithm proposed by Jin et al. [27]. Only a few works consider modifying DNN training so as to obtain solutions that are robust to quantization. Using such an approach could give us significant compression gains. Moreover, all quantization schemes are likely to perform better for such robust solutions, and they can be combined with existing state of the art quantization algorithms to achieve even better performance.

One such algorithm, called quantization aware training, was proposed by Krishnamoorthi [14] and discussed in section 3.2.3. Their proposal involves introducing artificial quantization error during training of the model, in order to ensure that the quantized solution has similar performance as the original solution. However, the gradient approximation used for quantized weights during training is only a simple heuristic without any strong analytical motivation. Our idea is to find a robust solution by examining the second order derivative information of the loss surface. To motivate this approach, consider the scenario presented in figure 4.2.


Figure 4.2: Simple figure illustrating two solutions with similar minimum value but different sensitivity to noise.

To quantify this sensitivity, we examine the Hessian matrix of the loss function with respect to the weights.

$$H_{i,j} = \frac{\partial^2 L}{\partial w_i \partial w_j} \quad \forall\, i, j \in \{1, \dots, N\} \tag{4.5}$$

Here $N$ denotes the total number of weights in the model. For our purpose, it is more informative to look at the largest eigenvalues of this Hessian matrix, rather than the Hessian matrix itself. The largest eigenvalues of the Hessian convey curvature information about the directions with the fastest rate of change of the gradient of the loss function. This allows us to understand the curvature by looking at just a few eigenvalues, when the complete Hessian matrix would have $N^2$ elements. There are also numerical iterative methods for estimating these largest eigenvalues directly (Demmel [28]), without explicitly computing all eigenvectors and eigenvalues.

Thus, to analyze the quality of a solution, we look at the largest eigenvalues of the Hessian of the loss with respect to the weights. All eigenvalues tend to be non-negative near a local minimum, unless the optimizer is stuck at a saddle point. Therefore, we prefer solutions having small 'largest eigenvalues'.
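The largest eigenvalue of a layer's Hessian can be estimated without ever forming the matrix, using only Hessian-vector products. The sketch below uses power iteration with a finite-difference Hessian-vector product built from a gradient function; the thesis uses the Arnoldi algorithm [30] instead, so both the method and the quadratic test problem here are simpler stand-ins stated as assumptions.

import numpy as np

def hvp(grad_fn, w, v, eps=1e-4):
    # Finite-difference Hessian-vector product: H v ≈ (g(w + eps*v) - g(w - eps*v)) / (2*eps)
    return (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2.0 * eps)

def largest_hessian_eigenvalue(grad_fn, w, iters=100, seed=0):
    # Estimate the largest-magnitude eigenvalue of the Hessian at w by power iteration
    rng = np.random.default_rng(seed)
    v = rng.normal(size=w.shape)
    v /= np.linalg.norm(v)
    eig = 0.0
    for _ in range(iters):
        Hv = hvp(grad_fn, w, v)
        eig = float(v @ Hv)                   # Rayleigh quotient estimate
        v = Hv / (np.linalg.norm(Hv) + 1e-12)
    return eig

# Usage: quadratic loss L(w) = 0.5 * w^T A w, whose Hessian is A
A = np.diag([5.0, 2.0, 1.0, 0.1])
grad = lambda w: A @ w
print(largest_hessian_eigenvalue(grad, np.ones(4)))   # ~5.0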

One way to use this knowledge during training is to stop early and reinitialize if we see large 'largest eigenvalues'. This knowledge can also be used to quantize different weights to different numbers of bits depending on each weight's sensitivity to quantization. Such adaptive quantization can also lead to significant improvement in compression, which was recently demonstrated by Dong et al. [29].

Once we obtain a solution which is robust to quantization using the stated methods, we can proceed to quantize it using any quantization scheme. In this project, we use the uniform affine quantizer described in section 3.2.1, since we first wanted to analyze the improvement in compression from our scheme compared with direct quantization. It should be possible to further improve the compression gains by using a more sophisticated scheme like k-means clustering based quantization, as discussed in section 3.2.2. The algorithm used in our experiments is presented as algorithm 4.

Algorithm 4 HAQ algorithm for model compression
Require: DNN model architecture
    K : number of random initialization iterations
    Tk, k = 1, …, K : number of training epochs
    J : number of largest eigenvalues to be computed
Initialize: k ← 0
while k < K do
    Select random seed θk
    Initialize layer weights wik with seed θk, i = 1, …, L
    Train all wik using the Adam optimizer (Algorithm 1) for Tk epochs
    Estimate the J largest eigenvalues ei,j of the Hessian of the loss w.r.t. each layer's weights using the Arnoldi algorithm [30]
    Ek ← Σ(i=1..L) Σ(j=1..J) ei,j²
    k ← k + 1
end while
k* ← arg mink Ek
Quantize wik*, i = 1, …, L using the uniform affine quantizer
return wik*, i = 1, …, L (quantized weights)


Chapter 5

Experiments and Results

5.1

Experimental setup

5.1.1

Datasets

We test our approaches on models trained with two image datasets.

1. MNIST: MNIST is a database of grayscale handwritten digits containing 60,000 training images and 10,000 test images. Each image has 28x28 pixels with the digit positioned close to the center. Some sample images from the MNIST database are shown in figure 5.1. State of the art DNNs achieve a <0.5% error rate on the MNIST dataset [31].


Figure 5.1: Some sample images from the MNIST dataset

5.1.2

Models

We designed two 5 layer fully connected networks to analyze the performance of our approaches. We also tested the first approach (L1BP) on a modified version of AlexNet. The architectures of the three models used are described below.

1. FC1: 5 fully connected layer architecture with layer sizes: {200, 100, 60, 30, 10}.

2. FC2: 5 fully connected layer architecture with layer sizes: {20, 18, 15, 12, 10}.


Figure 5.2: 10 random images from the 10 classes in the CIFAR-10 dataset [33]

5.1.3

Setup

We came up with three combinations of datasets and models to test our approaches. Each of these setups was implemented using Tensorflow (Abadi et al. [34]). The setups are described below.

1. Setup 1: Each image from the MNIST database is flattened to a 784 (28x28) length vector. The FC1 architecture is used. All weights are initialized with random values between -0.2 and +0.2. The rectified linear unit (ReLU) is used as the activation function because its non-saturating nature helps overcome the vanishing gradients problem for deep neural networks. Softmax is used at the final layer to obtain the confidence in selecting a particular label as output. The cross entropy loss function is used for training the model. The accuracy of the model after training is 97%.

Figure 5.3: Architecture of the modified AlexNet model used in setup 3

2. Setup 2: The MNIST dataset is used with the smaller FC2 architecture. The main motivation for designing this setup was the feasibility of full Hessian computations with the available hardware.

3. Setup 3: The modified AlexNet architecture is used to train on the CIFAR-10 dataset. CIFAR-10 images are resized from 32x32 to 70x70 using interpolation. This was done mainly to preserve as much as possible of the original AlexNet architecture, which was designed for ImageNet images of size 224x224 pixels. After training, the model achieved 74% classification accuracy.

5.2

L1BP results

Figure 5.4 shows the inference accuracy after pruning and retraining for the original and the regularized 5 layer model. The graph is plotted against the new model size ratio after compression. We can see that both graphs start close to 97% accuracy, which is the accuracy of the original model. As we keep increasing the target sparsity after compression, both graphs once again converge in the rightmost part of the graph, where they approach 10% test accuracy, which is the accuracy of a random classifier for 10 class classification.

In most applications, we do not want to compromise much on accuracy to improve compression. This is why the region of interest in this graph involves points with high test accuracy. We see that the regularized model can be pruned to a higher sparsity of 0.95 for <1% accuracy loss. On the other hand, the original model can only be pruned to a target sparsity of 0.9 for <1% accuracy loss. Thus, while the original model can be compressed by 10x with pruning, the regularized model can be compressed by 20x.


Figure 5.4: Inference accuracy vs sparsity after pruning and retraining for original and regularized 5 layer model.

The modified AlexNet model takes much longer to train, thus it is highly time consuming to try many hyperparameters with cross validation. Since each layer is trained separately, the choice of regularization coefficient for each layer is one of the important design considerations. The choice of initialization of the weights for the regularized model is another important consideration. Because of such a large hyperparameter space, the regularized model could not be completely trained with random initialization. So, we initialized the regularized model with the trained parameters from the original model. After training on one tenth of the original dataset for 2000 epochs, the obtained results are plotted in figure 5.5.


Figure 5.5: Inference accuracy vs sparsity after pruning and retraining for original and regularized AlexNet model.


5.3

HAQ results

Our first objective is to analyse the feasibility of using Hessian information to improve quantization performance. Evaluating the Hessian of a deep neural network is both computationally and memory intensive. The number of computations and storage required grow with the square of the number of parameters in the model. For this reason, we designed setup 2 for this analysis.

Evaluating the complete Hessian matrix with all weights was not possible within the available memory constraints. So we evaluate the Hessian matrix for each layer; this can be considered as approximating the complete Hessian with a block diagonal matrix. The Hessian was evaluated at the trained model parameters using a batch of 500 random samples from the training set. Figure 5.6 shows the plots of the 10 largest eigenvalues of the Hessian of each layer, for two different random initializations. We select these initializations to highlight the amount of variation in the curvature of the valleys the optimizer converges to. Let us call the two seeds seed A and seed B, respectively. It can be seen that the energy of the Hessians is generally larger for seed B compared to seed A. This effect is more pronounced in the 4th and 5th layer graphs. Thus, one would expect the seed A initialization to be less sensitive to quantization noise.


(a) Eigenvalues of Hessian for seed A
(b) Eigenvalues of Hessian for seed B
Figure 5.6: Largest eigenvalues of Hessian matrix of weights for different weight initializations.


Figure 5.7: Inference accuracy vs bits per weight after quantization for different initializations after 1000 epoch training

This shows that the Hessian can vary significantly for different initializations, which can be used to find a quantization robust solution. Another aspect to consider is whether the Hessian values also vary significantly with training. To answer this, we train the same models for 2000 epochs instead of the 1000 epochs used for the first plot, and then measure their performance for different quantization levels. The results are shown in figure 5.8. We see that the seed 3 solution actually got worse and the seed 4 solution got better in terms of quantization performance when compared with the solutions obtained after 1000 epochs of training. This shows that the curvature near the solution can also change drastically with training.


Figure 5.8: Inference accuracy vs bits per weight after quantization for different initializations after 2000 epoch training

We found little benefit from assigning different numbers of bits to different layers with per layer uniform quantization for small models like these, as much information is lost on quantizing a less sensitive layer to, say, (k-1) bits, which compensates for the accuracy gained on quantizing a more sensitive layer to (k+1) bits. The compression levels obtained using such a scheme are likely to increase for bigger models, as quantization gives worse performance for smaller models [14]. Very recent research in this area [29] also indicates the same.

5.4

L1BP + HAQ results

Finally, we stack both approaches, compressing the model with L1BP followed by HAQ using the uniform affine quantizer. The obtained results are presented along with the results for pruning followed by quantization, which is fundamentally the algorithm proposed in Deep Compression.

Pruning: 3.33x              L1BP: 5x                Improvement: 1.5x
UA Quant: 4x                HAQ: 5.33x              Improvement: 1.33x
Prune + UA Quant: 13.33x    L1BP + HAQ: 26.66x      Improvement: 2x


Chapter 6

Conclusions and Future Work

In this thesis, we undertook the problem of compressing trained deep learning models without causing any significant loss in performance measured using test set inference accuracy. Our goal was to design a compression algorithm that is generalizable, requires minimal access to the original dataset and has lower complexity compared to the training of the original model. By leveraging our understanding of optimization theory, we developed and tested two new approaches that achieve 2x further improvement in compression over state of the art in our experiments. We also provided the basic idea of a third approach that could in principle perform even better than the first two approaches.

The first approach exploits the idea that L1 regularization promotes sparsity. We use this idea to develop an algorithm that trains a new model using only 1/10th of the original dataset and about 1/12th of the training time of the original model.


The second approach is based on the idea that Hessian information can be used to estimate the sensitivity of a solution to quantization noise. To this end, we evaluate and analyse the largest eigenvalues of the Hessian of the weights of each layer. We find that the largest eigenvalue is a good indicator of sensitivity to quantization noise, and that it can vary significantly for different initializations and with the number of training epochs. By compressing the different solutions obtained for different initializations, we observe a 1.33x further gain in compression even for a small model. Once again, this approach is readily applicable to any DNN. To make the approach more complete, we will need to design a method to explore and select a quantization robust solution with minimal hyperparameter tuning. After that, we can hope to extend the obtained results to larger benchmark architectures.

There is another research direction that combines the ideas of both categories of approaches. The core idea encompassing both could be stated as minimizing the number of unique representative values in the model. One function that captures this idea is the information entropy, a measure of the randomness of a random variable. This approach is based on the minimum description length principle from information theory, which can be crudely stated as follows: the best model for explaining the pattern inherent in the feature-label pairs of a given dataset is the smallest model that can do so. One way to utilize this idea is to regularize the DNN training objective with the information entropy of the parameters of the model. The resulting problem is

\[
\underset{W}{\text{minimize}} \quad \sum_{j=1}^{N} L\big(f(x_j, W), y_j\big) + \lambda H(W), \tag{6.1}
\]

where N is the number of training samples, L is the task loss, and H(W) denotes the entropy of the model parameters.
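A rough sketch of how the entropy term in (6.1) could be made differentiable, assuming PyTorch, is to soft-assign every weight to a small set of centers and penalize the entropy of the resulting soft histogram; everything below (names, the softmax-based assignment, the temperature) is an illustrative assumption, not a method developed in this thesis.

```python
# Illustrative differentiable surrogate for H(W) in objective (6.1), assuming
# PyTorch: soft-assign weights to K centers and take the entropy of the soft
# histogram. All names and design choices here are placeholders.
import torch

def soft_entropy(weights: torch.Tensor, centers: torch.Tensor, temperature=0.01):
    d = (weights.reshape(-1, 1) - centers.reshape(1, -1)) ** 2  # squared distances
    assign = torch.softmax(-d / temperature, dim=1)             # (num_weights, K)
    p = assign.mean(dim=0)                                      # soft histogram over centers
    return -(p * torch.log(p + 1e-12)).sum()

# Inside a training step the penalty is simply added to the task loss, e.g.:
# loss = F.cross_entropy(model(x), y)
# for w, c in zip(model.parameters(), per_layer_centers):
#     loss = loss + lam * soft_entropy(w.reshape(-1), c)
# loss.backward()
```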


Bibliography

[1] McCulloch, Warren S and Pitts, Walter. “A logical calculus of the ideas immanent in nervous activity”. In: The bulletin of mathematical biophysics 5.4 (1943), pp. 115–133.

[2] Hebb, Donald Olding. The organization of behavior: A neuropsychological theory. Psychology Press, 2005.

[3] Cybenko, G. “Approximation by superpositions of a sigmoidal function”. In: Mathematics of Control, Signals and Systems 2.4 (Dec. 1989), pp. 303–314. ISSN: 1435-568X. DOI:10 . 1007 / BF02551274. URL:https://doi.org/10.1007/BF02551274.

[4] Hornik, Kurt, Stinchcombe, Maxwell, and White, Halbert. “Multilayer feedforward networks are universal approximators”. In: Neural networks 2.5 (1989), pp. 359–366.

[5] Google Duplex. https://ai.googleblog.com/2018/05/duplex-ai-system-for-natural-conversation.html. Accessed: 2019-09-01.

[6] Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. “ImageNet Classification with Deep Convolutional Neural Networks”. In: Commun. ACM 60.6 (May 2017), pp. 84–90. ISSN: 0001-0782. DOI: 10.1145/3065386. URL: http://doi.acm.org/10.1145/3065386.

[7] Deng,

[8] Russakovsky, Olga et al. “ImageNet Large Scale Visual Recognition Challenge”. In: International Journal of Computer Vision (IJCV) 115.3 (2015), pp. 211–252. DOI:10.1007/s11263-015-0816-y.

[9] Ng, Andrew Y. “Feature selection, L1 vs. L2 regularization, and rotational invariance”. In: Proceedings of the twenty-first international conference on Machine learning. ACM, 2004, p. 78.

[10] Shalev-Shwartz, Shai et al. “Online learning and online convex optimization”. In: Foundations and Trends® in Machine Learning 4.2 (2012), pp. 107–194.

[11] Iandola, Forrest N. et al. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. 2016. arXiv:1602.07360 [cs.CV].

[12] LeCun, Yann, Denker, John S, and Solla, Sara A. “Optimal brain damage”. In: Advances in neural information processing systems. 1990, pp. 598–605.

[13] Han, Song et al. “Learning both weights and connections for efficient neural network”. In: Advances in neural information processing systems. 2015, pp. 1135–1143.

[14] Krishnamoorthi, Raghuraman. “Quantizing deep convolutional networks for efficient inference: A whitepaper”. In: arXiv preprint arXiv:1806.08342 (2018).

[15] Han, Song, Mao, Huizi, and Dally, William J. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. 2015. arXiv:1510.00149 [cs.CV].


[17] Nesterov, Yurii E. “A method for solving the convex programming problem with convergence rate O(1/k^2)”. In: Dokl. Akad. Nauk SSSR. Vol. 269. 1983, pp. 543–547.

[18] Hinton, Geoffrey. Lecture on Neural Networks for Machine Learning. http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf. Accessed: 2019-09-01.

[19] Kingma, Diederik P. and Ba, Jimmy. Adam: A Method for Stochastic Optimization. 2014. arXiv:1412.6980 [cs.LG].

[20] LeCun, Yann, Bengio, Yoshua, and Hinton, Geoffrey. “Deep learning”. In: nature 521.7553 (2015), p. 436.

[21] De Lathauwer, Lieven, De Moor, Bart, and Vandewalle, Joos. “A multilinear singular value decomposition”. In: SIAM Journal on Matrix Analysis and Applications 21.4 (2000), pp. 1253–1278.

[22] Kim, Yong-Deok et al. Compression of Deep Convolutional Neural Networks for Fast and Low Power Mobile Applications. 2015. arXiv:1511.06530 [cs.CV].

[23] Hartigan, John A and Wong, Manchek A. “Algorithm AS 136: A k-means clustering algorithm”. In: Journal of the Royal Statistical Society. Series C (Applied Statistics) 28.1 (1979), pp. 100–108.

[24] Han, Song et al. “EIE: Efficient inference engine on compressed deep neural network”. In: 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA). IEEE, 2016, pp. 243–254.

[25] Szegedy, Christian et al. “Inception-v4, inception-resnet and the impact of residual connections on learning”. In: Thirty-First AAAI Conference on Artificial Intelligence. 2017.
