
Investigating minimal Convolution Neural Networks (CNNs) for realtime embedded eye feature detection

WEIHONG SUNG

KTH

SKOLAN FÖR ELEKTROTEKNIK OCH DATAVETENSKAP


Master in Machine Learning
Date: 2020-06-06

Supervisor: John Folkesson

Examiner: Hedvig Kjellström <hedvig@kth.se>

School of Electrical Engineering and Computer Science
Host company: Tobii AB


Abstract

With the rapid rise of neural networks, many tasks that used to be difficult to solve with traditional methods can now be handled well, especially in the computer vision field. However, as the tasks we have to solve have become more and more complex, the neural networks we use are becoming deeper and larger. Therefore, although some embedded systems are powerful nowadays, most embedded systems still suffer from memory and computation limitations, which means it is hard to deploy our large neural networks on these embedded devices. This project aims to explore different methods to compress the original large model. That is, we first train a baseline model, YOLOv3[1], which is a famous object detection network, and then we use two methods to compress the baseline model. The first method is pruning by using sparsity training, and we do channel pruning according to the scaling factor values after sparsity training. Based on the idea of this method, we have made three explorations.

Firstly, we take the union mask strategy to solve the dimension problem of the shortcut-related layers in YOLOv3[1]. Secondly, we try to absorb the shifting factor information into subsequent layers. Finally, we implement layer pruning and combine it with channel pruning. The second method is pruning by using Neural Architecture Search (NAS), which uses a deep reinforcement learning framework to automatically find the best compression ratio for each layer. At the end of this report, we analyze the key findings and conclusions of our experiments and propose future work which could potentially improve our project.

(5)

Sammanfattning

With the rapid rise of neural networks, many tasks that used to be difficult to solve with traditional methods can now be handled well, especially in the computer vision field. However, since the tasks we have to solve have become more and more complex, the neural networks we use are becoming deeper and larger. Therefore, although some embedded systems are powerful nowadays, most embedded systems still suffer from memory and computation limitations, which means it is hard to deploy our large neural networks on these embedded devices. The project aims to explore different methods to compress the original large model. That is, we first train a baseline model, YOLOv3[1], which is a famous object detection network, and then we use two methods to compress the baseline model. The first method is pruning by using sparsity training, and we perform channel pruning according to the scaling factor values after sparsity training. Based on the idea of this method, we have made three explorations. Firstly, we take the union mask strategy to solve the dimension problem of the shortcut-related layers in YOLOv3[1]. Secondly, we try to absorb the shifting factor information into subsequent layers. Finally, we implement layer pruning and combine it with channel pruning. The second method is pruning by using NAS, which uses a deep reinforcement learning framework to automatically find the best compression ratio for each layer. At the end of this report, we analyze the key findings and conclusions of our experiments and propose future work which could potentially improve our project.

(6)

1 Introduction
  1.1 Research Question
  1.2 Contributions & limitations
  1.3 Societal impact and sustainability
  1.4 Ethical considerations
  1.5 Outline

2 Background
  2.1 Deep Neural Networks
    2.1.1 Fully Connected Layers
    2.1.2 Convolutional Layers
  2.2 Object Detection
    2.2.1 Early works
    2.2.2 Detection algorithm based on Object Proposals
    2.2.3 Detection algorithm based on Integrated CNNs
  2.3 Model Compression in Deep Learning
    2.3.1 Pruning in Deep Learning
    2.3.2 Quantization
    2.3.3 Distillation

3 Methods
  3.1 Dataset
  3.2 Data pre-processing
  3.3 Baseline Network
  3.4 Pruning by using sparsity training
    3.4.1 Pruning channels of shortcut-related layers
    3.4.2 Absorbing the shifting factor information into subsequent layers
    3.4.3 Layer pruning
  3.5 Pruning by using NAS
  3.6 Evaluation methods

4 Results
  4.1 Training and implementation details
  4.2 Baseline model
  4.3 Pruning by using sparsity training
    4.3.1 Sparsity training with detector Batch Normalization (BN) layers
    4.3.2 Sparsity training with feature extractor BN layers
    4.3.3 Sparsity training with all BN layers
    4.3.4 Absorbing the shifting factor information into subsequent layers
    4.3.5 Layer pruning
  4.4 Pruning by using NAS

5 Discussion

6 Conclusions

Bibliography

A YOLOv3 structure

B Acronym List


1 Introduction

In recent years, the development of deep learning has reached an unprecedented level and it has solved many hard problems in different industry fields. With the significant increase in the computing power of graphics cards, we are able to train deeper neural networks for more complex tasks like segmentation and object detection. Object detection has long been a key challenge in computer vision, involving both object classification and localization. CNNs have proven promising in recent years for tackling object detection. As a company focused on eye-tracker development, Tobii has been committed to eye-tracking tasks for years, which in practice means performing gaze tracking or gaze estimation. There are three main benefits of doing gaze tracking. Firstly, studying eye movements can help us better understand human behavior because it represents what we focus on and what information we process. Secondly, by using the eyes as a “pointer” on the screen, hands-free interaction with devices becomes possible, especially for those who are disabled. Finally, by combining eye tracking with other input devices like the keyboard, new user interfaces can be created and our users will have a brand new experience with these innovative interfaces. In this project, we focus on eye-feature detection: eyelids, pupil, iris, etc. The reason why we want to detect eye features is that, with some methods and calculation, these features can serve as a base system or a helper system for gaze estimation. Moreover, eye features can also be used for individual detection systems combined with other features.

Although some embedded systems are powerful nowadays due to hardware optimization or other reasons, most embedded systems still suffer from memory and computation limitations. Therefore, large CNNs are challenging to deploy on most embedded devices. Considerable recent attention has been given to this problem by attempts to reduce the size of existing networks. This project involves using cutting-edge techniques to obtain the smallest possible Convolution Neural Network (CNN) for eye feature detection. In recent years, there have been three main methods used in the model compression field: pruning, quantization and distillation. Since there are many existing libraries and functions in deep learning frameworks like PyTorch[2] which already implement quantization very well, and distillation mainly aims to improve the training accuracy of small models, we believe that pruning is the most promising technique to reach our goal, because we want to find the optimal structure for our dataset.

1.1 Research Question

In this project, the research question is defined as: How can cutting-edge techniques be applied to compress a model and obtain the smallest CNN for eye-tracking tasks? Specifically, there are two directions for examination:

• The accuracy of the neural network

• The complexity of the neural network

The problem requires us to find the smallest CNN for eye feature detection. This is a multi-object detection and classification task combined with model compression approaches. For the large network, we plan to use YOLOv3[1] for object detection, and the main task is to compress this large network. That is to say, we want to achieve a small-sized model that detects eye features without losing too much accuracy compared to the large model. There are mainly two risks we may encounter. Firstly, we have not trained an object detection model before and we do not know the performance on our own dataset. Secondly, training an object detection model takes time and there may not be enough time for multiple experiments. Moreover, if there are not enough effective methods (from previous research) for compressing object detection models, we may consider designing new methods.

1.2 Contributions & limitations

This project is for those who want to deploy deep neural networks on embedded or mobile devices, especially for those who plan to detect or use eye features in an eye-tracking or related project. Although there are many research papers in this field, most of them address classification problems and are tested on classical datasets like ImageNet[3]. Hence, our work will show how effective different model compression methods are in an object detection problem. That is to say, we will try to validate whether the previous methods[4, 5] are useful and effective in an object detection problem (e.g. YOLOv3[1]).

Furthermore, we will explore different strategies so that these methods can be better integrated into object detection tasks, and we will also analyze the results to reason about why a certain method or a certain strategy works or not.

Since our project mainly focuses on the model compression part, which means our goal is to find the CNN with the highest accuracy and lowest memory usage, one limitation is that we do not try to reach state-of-the-art results with a large CNN. Due to computational and time constraints, we will not tune the hyper-parameters or optimize the original network structure to achieve the best baseline model accuracy. The other limitation is that all of our experiments are based on the internal dataset from Tobii, which means that we do not validate the methods used in this project on different datasets. Therefore, all conclusions obtained through the experiments are limited to Tobii's internal dataset, and the generalization of these conclusions and results is also restricted in this case.

1.3 Societal impact and sustainability

The results of this project could potentially be used for developing a better eye-tracking system which improves people's quality of life. For example, if we can reduce the model size and inference time successfully, it is possible to develop an advanced eye-tracking system on embedded devices, which will make it more convenient for disabled people to interact with these devices. With light models, the cost of producing these devices can be largely reduced so that more people can afford them. Moreover, it has a positive effect on the virtual reality field. Tobii has its own virtual reality devices (e.g. Tobii Pro), and using small models may help reduce their cost and improve the experience of researchers while conducting eye-tracking studies.

In terms of sustainability, this project contributes to the Sustainable Development Goals (SDG) proposed by the United Nations in 2015: Firstly, it can improve people's quality of life, which supports ensuring healthy lives and promoting well-being for all at all ages. Secondly, saving production cost saves energy; we should always develop and use energy-efficient applications and devices.

1.4 Ethical considerations

Although this project has many positive contributions and impacts as mentioned in Sections 1.2 and 1.3, the results may be used in unethical ways. As the production cost of eye-tracking devices becomes cheaper, the use of these devices will become more widespread. Some parties could easily collect data from the users of these devices and analyze the users' habits. In fact, this can be regarded as a data breach, and it seriously violates the users' privacy.

1.5 Outline

The report has the following structure:

• Chapter 1: Introduce the purpose and the research question of this project, and briefly analyze the contributions, sustainability and ethical problems.

• Chapter 2: Introduce the background knowledge needed for the following chapters. It consists of a brief introduction to neural networks, the development of object detection and some commonly used model compression methods.

• Chapter 3: Elaborate on the two methods used in this project in detail. The first method is pruning by using sparsity training; the second is pruning by using NAS. The evaluation method is explained at the end of this chapter.

• Chapter 4: All the experiment results and the analysis are included in this chapter.

• Chapter 5: Further analyze the experiment results and summarize previous statements and conclusions.

• Chapter 6: Summarize the whole project and discuss future work.


2 Background

In 1949, Hebb took the first step of machine learning based on the learning mechanisms of neuropsychology. The result of his unsupervised learning rule is to enable the network to extract the statistical characteristics of the training set, thereby dividing the input information into several categories according to their similarity. After that, IBM scientist Arthur Samuel developed a checkers program that learned a hidden model to guide the players by observing the current states, and this created a new field, machine learning, which was defined as giving computers the ability to learn without being explicitly programmed. Later, in the 1980s and 1990s, many famous machine learning algorithms were developed, such as Support Vector Machines (SVM)[6] and decision trees[7]. These algorithms were classified as traditional machine learning methods since they were mainly based on statistics and only had few or even no hidden layers (e.g. linear regression). Nowadays, with the rapid development of deep learning, we can train very deep neural networks with very high accuracy, and deep learning is widely applied to complex tasks in different fields like computer vision and natural language processing. In this chapter, we will start by introducing the basic knowledge of deep neural networks (Section 2.1) and the development of object detection (Section 2.2), which are closely related to our project. Finally, based on this knowledge, we will introduce the three main methods used in model compression together with their related works.

2.1 Deep Neural Networks

Since large networks were hard to train, people used traditional machine learning algorithms for a long time. However, in 1998, LeCun et al.[8] proposed the LeNet network, which was the first CNN and was later widely applied in ATMs. Before that, neural networks were considered as stacks of fully connected layers and were always hard to train. In contrast, LeNet used convolutional layers to replace the fully connected layers, so it had fewer parameters and was easier to train. In this section, we will introduce the fully connected layer and the convolutional layer individually, which are the prerequisite knowledge for Section 2.2 and Section 2.3.

2.1.1 Fully Connected Layers

As mentioned at the beginning of this chapter, Hebb[9] designed and created the first learning rule of machine learning, and this rule laid the foundation for neural network learning algorithms. The Hebb rule held that the learning process ultimately occurs at the synapse between neurons, and that the strength of a synaptic connection changes with the activity of the neurons before and after the synapse. Based on this theory, the mathematical formula is:

h_j = \sigma\Big(\sum_i w_{ij} x_i\Big) \qquad (2.1)

where, as shown in Figure 2.1, h_j is a single node of the hidden layer, x_i is an input node, w_{ij} is the weight value which represents the strength of the connection between the two nodes, and σ is the activation function.

Figure 2.1: A simple fully connected network

Generally, we also add a bias term β_j to each hidden node, and the activation function applies a non-linear transformation to this weighted sum. The reason why we always use a non-linear transformation like sigmoid or ReLU[10] is that if we stack multiple fully connected layers together and use a linear activation function, the result will be the same as using only one fully connected layer.
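As a small illustrative sketch (not code from this project), the forward pass of Equation 2.1 with a bias term can be written in a few lines of Python/NumPy; the array shapes and values below are purely for demonstration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fully_connected_forward(x, W, b):
    """One fully connected layer: h = sigma(W x + b), cf. Equation 2.1.

    x: input vector of shape (n_in,)
    W: weight matrix of shape (n_out, n_in), where W[j, i] plays the role of w_ij
    b: bias vector of shape (n_out,)
    """
    return sigmoid(W @ x + b)

# Example: 4 input nodes, 3 hidden nodes (random values for illustration)
rng = np.random.default_rng(0)
h = fully_connected_forward(rng.normal(size=4), rng.normal(size=(3, 4)), np.zeros(3))
print(h.shape)  # (3,)
```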


2.1.2 Convolutional Layers

One of the problems of using fully connected layers is the number of parameters, because there must be a weight to describe the strength of the connection between any two nodes of neighboring layers. One effective approach to solve this problem is weight sharing. In most computer vision tasks, we can assume that the statistical properties of a certain part of the image are the same as those of the other parts, which means that the features we learn in one part can be used in another part, so we can use the same learned features for all positions on the image. For example, as shown in Figure 2.2, we select a 3 × 3 kernel and let it slide over our input image data from top-left to bottom-right, and each time we take the dot product of the kernel matrix and the small chunk of the data, which can be described by:

y = \sigma\Big(\sum_c \sum_i \sum_j w_{cij} \, x_{cij}\Big) \qquad (2.2)

where y is the output value (the blue box in the output feature map in Figure 2.2), c is the input channel index (the original number of channels is 3: red, green and blue), and σ is the activation function. We can have multiple kernels of the same size to control the number of channels of the output feature map.

Figure 2.2: Illustration of the operation of a convolution layer

Each kernel is a feature extraction method, and it is similar to the filters used in solving traditional computer vision problems (e.g. the Gaussian filter). The filters can filter out the parts of the image that meet certain conditions, but this time, the weights of these filters are learned by the neural network. In practice, we always stack multiple convolutional layers to construct a large CNN to solve more complex tasks, because the features learned by just one convolutional layer are relatively local. The higher the number of layers, the more global the learned features.
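The sliding-window computation of Equation 2.2 can be sketched naively as below (illustrative Python/NumPy only; in practice one would use an optimized library layer such as a framework's 2D convolution):

```python
import numpy as np

def conv2d_single_kernel(image, kernel, activation=np.tanh):
    """Naive valid convolution of a multi-channel image with one kernel.

    image:  array of shape (C, H, W)
    kernel: array of shape (C, k, k); the same weights are reused at every position
    Returns one output feature map of shape (H - k + 1, W - k + 1).
    """
    C, H, W = image.shape
    _, k, _ = kernel.shape
    out = np.zeros((H - k + 1, W - k + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            patch = image[:, r:r + k, c:c + k]   # small chunk of the input
            out[r, c] = np.sum(kernel * patch)   # Equation 2.2 before the activation
    return activation(out)

# Example: a 3-channel 8x8 input and a 3x3 kernel (random values for illustration)
rng = np.random.default_rng(0)
fmap = conv2d_single_kernel(rng.normal(size=(3, 8, 8)), rng.normal(size=(3, 3, 3)))
print(fmap.shape)  # (6, 6)
```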


2.2 Object Detection

Object detection is one of the largest subjects in the computer vision field. The whole development history of object detection can be divided into two phases: techniques before the rise in popularity of deep learning, which occurred around 2013, and current developments in deep learning-based object detection. In Section 2.2.1, we will introduce the development of object detection during the first phase, and in Section 2.2.2 and Section 2.2.3 we will introduce some recent algorithms for object detection based on deep learning.

2.2.1 Early works

During the first phase, object detection algorithms were mainly based on classical manual features. In 2001, Viola and Jones [11, 12] proposed a real-time face detection algorithm, which was called the VJ detector. The VJ detector used the traditional sliding-window method for detection, but its inference time was much faster than that of other algorithms at that time. Inspired by previous work[13], they combined multi-scale Haar-like features with Adaboost [14] to reduce the computation cost.

Then in 2005, using Histograms of Oriented Gradients (HOG) [15] for pedestrian detection achieved tremendous results. The HOG feature was regarded as the foundation of all gradient-based object detectors. For detecting objects at different scales, the authors applied a multi-scale image pyramid with sliding windows, which meant they kept the detector size unchanged while downscaling the image to different scales. Moreover, to keep the features invariant (scale-invariant, shift-invariant, etc.) and discriminative (a strong ability to express features), images were divided into small cells, and oriented gradient information was calculated in each cell. This drew on the experience of the SIFT features proposed by Lowe et al.[16].

Finally, in 2008, Felzenszwalb et al. [17] proposed the Deformable Part based Model (DPM), which reached the pinnacle of the development of detection algorithms based on classic manual features. The main idea of DPM can be understood as splitting the detection problem of the entire target in the traditional detection algorithm into detection problems for each small part, and then aggregating the detection results of each part to obtain the final detection result[17]. For example, the problem of detecting face targets can be decomposed into the detection of eyes, nose, and mouth under the idea of DPM, and the detection of pedestrians can similarly be decomposed into the detection of heads, limbs, and trunks. In terms of the model structure, DPM was an extension of the HOG detector[15], and it consisted of root-filters and a series of part-filters, which is called the Star-model[17]. Later, in [18, 19, 20, 21], the structure was further extended to the Mixture model, which aimed to solve the detection problem of three-dimensional objects under different perspectives.

During the whole first phase, people focused on building complex models on low-level feature expressions and complex ensemble systems to improve detection accuracy[22]. However, with the rapid rise of neural networks, especially CNNs, the abstraction ability, shift-invariance and scale-invariance of the networks were getting stronger and stronger. Another important advantage of using CNNs was that CNN features were generated based on the training data rather than being hand-crafted, which meant that they did not require prior expert knowledge of the problem we were trying to solve. Therefore, around 2013, research on object detection entered the second phase, that is, researchers developed detection algorithms based on deep learning models.

2.2.2 Detection algorithm based on Object Proposals

In an object detection task, our algorithm should not only predict the class of the objects we are interested in given an image, but also predict the location of these objects (bounding boxes). In previous algorithms[11, 12, 15, 17], the authors used sliding windows to enumerate all possible positions at different scales, which resulted in huge computation time. Therefore, object proposal methods were introduced to solve this problem. Compared to the sliding window method, object proposal methods try to find some potential objects in the images instead of enumerating everything. Selective Search[23] and BING [24] are two examples of object proposal algorithms. In this section, we will study the development of the most popular detection algorithms based on object proposals.

R-CNN[22] was the first detection algorithm based on object proposals and it used a simple detection strategy. Firstly, object proposals were extracted from the image through selective search and each proposal was scaled to the same size. Secondly, the AlexNet network[25] trained on ImageNet[3] was used to extract features. Finally, an SVM classifier[6] was used to classify the objects. R-CNN achieved amazing results on the VOC2007 dataset[26]; the mean Average Precision (mAP) increased to 58.5%. However, although R-CNN had made great progress, its disadvantages were also very obvious: its training was multi-stage, which was tedious and time-consuming. Moreover, due to repeated feature extraction on high-density proposed regions, its detection was extremely slow.

SPPNet[27] was proposed by Kaiming He et al. in 2014. Compared with traditional CNN models, they added a spatial pyramid pooling layer between the convolutional layers and the fully connected layers, which meant the input of the whole network no longer had to be warped to a fixed size. Thus, SPPNet could extract features from regions of any size and any aspect ratio without scaling the proposed regions. But the problem was that its training was still multi-stage.

Fast-RCNN[28] was based on R-CNN[22] and SPPNet[27]. The main feature of Fast-RCNN was that it implemented a multi-task learning method. It achieved simultaneous training of target classification and bounding box regression. The training speed was 9 times that of R-CNN and the detection speed was 200 times that of R-CNN. On the VOC2007 dataset[26], Fast-RCNN increased the mAP from 58.5%[22] to 70.0%. The disadvantage of Fast-RCNN was that it still needed an external algorithm to extract the prior proposed regions, so it could not achieve end-to-end processing.

Faster-RCNN[29] was the first end-to-end object detection algorithm based on deep learning models. It improved the mAP from 70.0%[28] to 78.8% on the same dataset. The Region Proposal Network (RPN) was designed to solve the disadvantage of Fast-RCNN, as it merged the external object proposal detection algorithms into one large deep neural network. Thus, object proposal, feature extraction, classification and bounding box regression were combined into a single deep neural network framework.

FPN[30] was proposed in 2017. The previous object detection algorithms[22, 27, 28, 29] usually used the top-level features for detection because the semantic information of the top-level features of the network was rich. However, the position information of the targets in these features was coarse, which was not conducive to locating the bounding boxes.


Figure 2.3: Comparison of a previous detection method (top) with FPN (bottom)[30]

On the contrary, although the bottom-level features had less semantic information, their position information was accurate. Therefore, the core idea of FPN was to feed back the top-level feature maps layer by layer after the forward pass of the network and fuse them with the feature maps of the previous layer. Figure 2.3 shows the difference between the previous methods[22, 27, 28, 29] and FPN. On this basis, multiple detection heads were attached at different depths in the network to detect targets of different scales.

2.2.3 Detection algorithm based on Integrated CNNs

Inspired by Faster-RCNN[29], we would like to merge the functions of algorithms in different stages into one integrated framework, that is, we want to directly predict the class and the location of our targets. Unlike the two-stage algorithms mentioned in the previous section[22, 27, 28, 29], which implemented object proposal and classification separately, in this section we focus on one-stage networks, which view the region proposal problem as a regression task and directly output the regression and classification results together.

YOLO[31] was the first one-stage detection algorithm. The main feature of YOLO was its fast inference speed compared to the two-stage models[22, 27, 28, 29], and it treated the object detection task directly as a regression problem, combining the two stages of region proposals and detection into one framework. When doing inference, the input image was divided into 49 small cells, and for each cell, two bounding boxes were predicted. For the model structure, GoogLeNet[32] was used as a reference, but the inception modules[33] were replaced by 3×3 and 1×1 convolutional layers, and the size of the final output was 7×7×[2 (bounding boxes) × (x_center, y_center, width, height, confidence level) + number of classes]. Hence, the whole structure was simple, and due to its high inference speed, YOLO can be applied in many real-time detection tasks. However, the disadvantage was obvious: since the number of cells was small and only two bounding boxes were predicted per cell, the mAP of YOLO was not as good as that of Faster-RCNN[29], especially when detecting small and dense targets.

Figure 2.4: The Architecture of YOLO[31]

One year later, Redmon and Farhadi proposed the second version of YOLO[34]. They wanted to solve the problem that YOLO was inaccurate in locating targets and had a relatively low recall rate, so they took inspiration from the Single Shot MultiBox Detector (SSD)[35] and used multi-scale feature maps for detection. Moreover, YOLOv2 enlarged the input image size to 416×416 and divided the images into 13×13 grid cells (5 predicted boxes per cell). As a result, the mAP and recall rates were both improved. Based on the second version, in 2018, Redmon and Farhadi improved their model again[1], which became YOLOv3. There were two main contributions in the third version: firstly, inspired by [36], they applied residual modules in the backbone network, which made the whole network deeper; secondly, they used the FPN structure[30] to realize multi-scale detection. We can see that, while maintaining its speed advantage, YOLO continuously improved its network structure during this development, and at the same time drew various tricks from other excellent object detection algorithms to improve accuracy.

SSD[35] was proposed by Liu et al. in 2015. SSD absorbed the advantages of YOLO[31] (fast inference time) and RPN [29] (accurate target location), and it referenced the anchor box mechanism applied in RPN[29] while detecting on feature maps of different scales. SSD achieved similar performance to Faster-RCNN[29] while maintaining an extremely fast detection speed. There were two major differences between SSD and Faster-RCNN: SSD set several anchors on the feature maps of different scales for detecting the bounding boxes while Faster-RCNN[29] only dealt with one scale, and SSD directly performed multi-scale detection and bounding box regression on the feature maps of the different scales respectively.

Retina-Net[37] used the focal loss as the loss function, which was a revision of the traditional cross-entropy loss. Lin et al.[37] stated that the unbalanced distribution of target and background data resulted in the relatively low accuracy of algorithms based on integrated CNNs. Therefore, the focal loss was able to reduce the learning weights of the simple background samples during training so that Retina-Net could focus on the hard samples.

2.3 Model Compression in Deep Learning

In recent years, CNNs have developed rapidly and achieved unprecedented results for different kinds of tasks, for example, image classification[25, 36] and object detection[1]. Supported by the growing computational ability of modern GPUs, we can design deeper neural networks like [25, 36, 38, 32]. However, no matter how good these models can be, there are three main constraints on deploying CNNs in real-world applications, especially for devices with low computational power: model size, run-time memory, and the number of computing operations[4]. Therefore, we would like to find a method to compress our model without losing too much accuracy. Since our deep neural networks are often over-parametrized, because we often use complex and large models at first to avoid the under-fitting problem, removing redundant parameters will not affect the predictive power[39]. In this section, we will focus on three main methods for model compression which have been frequently implemented and studied recently.

2.3.1 Pruning in Deep Learning

Pruning is an important part of model compression, and it can be roughly divided into structured pruning and unstructured pruning. Structured pruning means that we prune whole filters or layers to reduce the parameter count, but at the same time, we modify the network structure. On the other hand, unstructured pruning only removes some connections between nodes. Since unstructured pruning is very unfriendly to hardware, which means it often cannot accelerate and compress the network well in the hardware implementation, people have gradually focused on structured pruning; but the main idea of both is to eliminate the unimportant weights in the network and then fine-tune it to recover the accuracy.

The first pruning method can be traced back to 1990 and was proposed by LeCun, Denker, and Solla[40]. They defined the contribution of each weight as the change of the loss value after deleting this weight. Specifically, they used a second-order Taylor expansion to calculate the contribution of each weight and then pruned the low-contribution parameters. Pruning was also regarded as a regularizer to improve generalization ability[40, 41].

Pruning neural networks was then not widely studied for a long time, until the rapid rise of deep neural networks. As mentioned above, researchers were more likely to explore structured pruning, which was also called coarse-grained sparsity[42], because coarse-grained pruning can achieve similar or even better results compared with fine-grained pruning[42], and fine-grained pruning has higher requirements on the hardware. Therefore, in 2017, Hao Li et al.[43] proposed a method for pruning channels (filters) of CNNs. The main idea was to use the L1 norm to calculate the sum of the absolute values of the weights in each kernel matrix and then discard those whose values were small. When calculating the sum, they tried two strategies: one was independent pruning, which meant they did not consider the pruning operations applied to the previous layer, so all the weights of the current channels were summed up; the other was greedier and did not count the weights already pruned in the previous layer. Moreover, they also considered the pruning condition for residual modules, because shortcut layers[36] required the same dimension for the two inputs. As a result, they stated that the pruning of the second convolutional layer should be decided by the pruning outcome of the shortcut layers. This was an important conclusion because it enlightened us on how to deal with some complex modules when pruning our models.
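The L1-norm criterion of Li et al.[43] can be sketched roughly as follows (PyTorch-style Python; the pruning ratio and tensor shapes are illustrative assumptions, not values from the paper):

```python
import torch

def l1_filter_ranking(conv_weight: torch.Tensor, prune_ratio: float = 0.25):
    """Rank filters of a conv layer by their L1 norm and mark the smallest for removal.

    conv_weight: tensor of shape (out_channels, in_channels, k, k)
    Returns indices of filters to keep and indices of filters to prune.
    """
    # Sum of absolute weight values per output filter (the L1-norm criterion)
    l1_per_filter = conv_weight.abs().sum(dim=(1, 2, 3))
    n_prune = int(prune_ratio * conv_weight.shape[0])
    order = torch.argsort(l1_per_filter)          # smallest L1 norm first
    prune_idx = order[:n_prune]
    keep_idx = order[n_prune:].sort().values
    return keep_idx, prune_idx

# Example on a random 64-filter layer (illustrative values only)
w = torch.randn(64, 32, 3, 3)
keep, prune = l1_filter_ranking(w)
print(len(keep), len(prune))  # 48 16
```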

At the same time, Zhuang Liu et al.[4] suggested that channel-level sparsity (pruning) was better. Fine-grained pruning such as weight-level pruning[44, 45] often gave the best compression rate and the highest accuracy, but the inference time was slow and it required special hardware devices to help accelerate it. On the other hand, although layer-level sparsity[46] did not depend on special hardware devices, it was less flexible, which meant this method was only effective when the whole network was deep enough, like ResNet[36]. Therefore, channel-level sparsity offered a good trade-off between them[4].

The way to sparsify the channels was sophisticated. They also added a regularization term to the loss function, but it was not the sum of the filter weights. Instead, they summed up the scaling factors α of the BN layers[47]. The whole pruning procedure can be summarized as follows: firstly, after obtaining a large trained model, we resume training with the regularization term included (sparsity training); secondly, we prune the channels whose scaling factor α is close to zero; finally, we fine-tune the model to recover the accuracy. We can iteratively repeat these three steps to achieve a better result. Although this method achieved considerable results on several models[38, 36], there were mainly two problems. One was the choice of the hyper-parameter λ, which controls the balance between the regularization term and the original loss term; it might take time to find a proper value for our dataset. The other one was more important: when pruning channels based on the scaling factor, the bias (shifting factor) values are not considered, and this might result in losing much accuracy.
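A minimal sketch of the sparsity-training step (PyTorch-style, assuming the prunable layers are standard BatchNorm2d layers; the value of λ is illustrative, not taken from [4]):

```python
import torch.nn as nn

def bn_l1_penalty(model: nn.Module, lam: float = 1e-4):
    """L1 penalty on the scaling factors (gamma) of all BN layers in the model."""
    penalty = 0.0
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            penalty = penalty + m.weight.abs().sum()  # m.weight is the scaling factor alpha
    return lam * penalty

# During sparsity training, the penalty is simply added to the task loss:
#   loss = task_loss + bn_l1_penalty(model)
#   loss.backward(); optimizer.step()
```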

Different from Zhuang Liu et al.[4], Franco Manessi et al.[48] built an end-to-end method for pruning. Specifically, this method did not require us to iteratively repeat the training and pruning process to seek the optimal pruning threshold. In contrast, it allowed us to perform pruning during the back-propagation of the training phase, for different layers, and found a proper threshold for each layer. They introduced sibling networks which were able to reduce the weights of the original network. For the contribution of the weights, a regularization term was also used, which consisted of the weights and bias of the affine map[48]. Moreover, there was another way to construct the regularization term. Christos Louizos et al. proposed using L0 regularization to prune neural networks[49]. The idea was also to prune networks by encouraging weights to go close to zero during the training phase. But introducing L0 regularization made the loss function non-differentiable, so they applied a series of mathematical techniques like variational optimization to convert the non-differentiable L0 norm into a differentiable form, thereby solving the problem.

The aforementioned methods[43, 49, 48, 4] all viewed the magnitude of the absolute values of the weights as the contribution, and greedily pruned those with less contribution.


Figure 2.5: Illustration of variables’ relationship[50]

However, JH Luo et al.[50] had another perspective on pruning, which was based on reconstruction error. This type of method determined which filters to prune by minimizing the reconstruction error of the feature output, that is, finding information in the current layer which had little effect on the later layers. Specifically, they implemented the pruning process layer by layer. In Figure 2.5, we can see the relationship between neighboring convolutional layers: for the i-th layer, we first sample enough data to obtain the input value x_{i+1} and the output value Y of the next layer.

Since the reduced sum of the convolution results of the input value x_{i+1} and the filters is Y (each kernel produces one feature map), we can greedily choose a subset of these feature maps such that their sum stays close to Y, and in this way find out which kernels are unimportant. After the i-th layer is pruned, we should fine-tune for a few epochs before moving to the next layer. Similarly, He et al.[51] also considered the effect on the later layers after pruning the current layer, but they used LASSO regression as the criterion for deciding the contribution. R Yu et al.[52] minimized the reconstruction error of the penultimate layer of the network and took into account the accumulation of back-propagated error to determine which of the previous filters needed to be pruned.

There is another track of pruning that has become more and more popular recently. Inspired by NAS[53], Yihui He et al.[5] first introduced Reinforcement Learning (RL) into pruning models. They stated that it was time-consuming to compress models manually, and that light models[54, 55, 56] were usually designed empirically, so they proposed a method to automatically find the compression strategies. They designed a Deep Deterministic Policy Gradient (DDPG)[57] RL agent to receive the state information, which consisted of 11 features of the current layer t, and to output the compression ratio of this layer. After the current layer was compressed, it moved to the next layer. When all layers were compressed, the DDPG[57] agent received a reward R based on the accuracy and the number of FLoating-point OPerations (FLOPs). Recently, [58] proposed to apply NAS[53] directly to a network with flexible channels and layers. Minimizing the loss of the pruned network was beneficial for learning the number of channels. The feature map of the pruned network was composed of K feature segments sampled based on a probability distribution, and the loss was propagated to the network weights and the parameterized distribution through back-propagation.

2.3.2 Quantization

Generally, the parameters of neural network models are represented by 32-bit floating-point numbers. It is usually not necessary to retain such high precision. For instance, we can use 0–255 (8 bits) to represent what was originally 32-bit precision, which means we sacrifice accuracy to reduce the space occupied by each weight. Besides, Stochastic Gradient Descent (SGD)[59] requires only 6–8 bits of precision, so a reasonably quantized network can reduce the model's storage size while ensuring accuracy. To quantize networks, three basic problems have to be solved: how to quantize weights, how to calculate the gradient of quantized weights, and how to ensure accuracy. According to the quantization method used, approaches can be roughly divided into binary quantization, ternary quantization, and multi-bit quantization. We will discuss these methods in this section.

The main idea of binary quantization was to represent the floating-point numbers in the weight matrix with two values (normally +1 and -1). Generally, a sign function or a linear sign function was used for the approximation. In 2016, Mohammad Rastegari et al.[60] introduced two kinds of networks: Binary Weight Networks (BWN) and XNOR-Networks. In BWN, only the network parameters were approximated with binary values, which brought about 32x storage compression and a 2x speed increase, while in the latter network, both the network inputs and the parameters were binary, which achieved a 58x speed increase and 32x storage compression[60]. Taking BWN as an example, both the forward propagation and the back-propagation were implemented with the binary weights, so before doing the forward pass, we should binarize the filter weights. By replacing the original weights with the product of a scaling parameter α = ‖W‖_{ℓ1}/n and a binary matrix B = sign(W), where ‖·‖_{ℓ1} denotes the ℓ1 norm, n is the number of weights and W represents the weights of the current filter, we could approximate the original weights. Xundong Wu et al.[61] analyzed the weight adaptation behavior when training these two networks. Later, in 2017, X Lin et al.[62] used multiple binary weight bases in order not to lose too much accuracy.
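A compact sketch of the BWN-style approximation for one filter (PyTorch-style; illustrative only, not the implementation from [60]):

```python
import torch

def binarize_filter(w: torch.Tensor):
    """Approximate a filter W by alpha * B, with B = sign(W) and alpha = ||W||_1 / n."""
    alpha = w.abs().sum() / w.numel()   # scaling parameter
    b = torch.sign(w)                   # binary matrix of +1 / -1 entries (exact zeros stay 0)
    return alpha, b

w = torch.randn(16, 3, 3)               # example filter weights, illustrative values
alpha, b = binarize_filter(w)
w_approx = alpha * b                     # binary-weight approximation of w
print(float(alpha), w_approx.shape)
```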


Ternary quantization was proposed to improve on binary quantization. As stated in [63], instead of using only 1 and -1 to represent weights, we should add 0 explicitly to the quantized values, because when training neural networks, many weights are close to 0 or even exactly 0. Besides, they applied ternary connect to eliminate the multiplications during the forward pass by stochastically sampling weights to be -1, 1 or 0. Chenzhuo Zhu et al.[64] proposed Trained Ternary Quantization (TTQ), which was also a method that could reduce the precision of weights to ternary values. They modified the quantized values from {-1, 0, 1} to {-W_n^l, 0, W_p^l}, where l is the index of the layer and W_n^l, W_p^l are two trainable parameters. This method could even improve the accuracy on CIFAR-10 compared with using the original weights. In 2016, Venkatesh, Nurvitadhi and Marr[65] proposed two methods to improve the computational efficiency of deep CNNs without losing accuracy. They built on the ternary network framework[66] and replaced the full-precision weights with 2-bit weights during the training phase.

For multi-bit quantization, S Gupta et al.[67] proposed a new strategy to replace floating-point numbers with 16-bit fixed-point numbers while training a deep learning model. Traditionally, the round-to-nearest strategy was applied, rounding a floating-point number to the nearest fixed-point number. The new strategy, called stochastic rounding, instead rounds the floating-point number to one of the two nearest fixed-point numbers, where the probability of choosing a point decreases with the distance to that point. In 2017, Micikevicius et al.[68] approached this problem differently. They stated that if only the half-precision format (16 bits) was used during the training phase, some models would not be able to reach their original accuracy. Therefore, mixed-precision training was designed to solve this problem. Mixed-precision training contained two main ideas: Firstly, they maintained a master copy of the weights in 32-bit format during training. That is, they used 16-bit data for training, and at the same time kept a 32-bit parameter copy for the parameter updates. Secondly, they introduced loss scaling to scale up the loss, which could recover some small but essential gradient values. This method[68] achieved almost the same accuracy as the original DeepSpeech 2 model.
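The stochastic-rounding idea can be sketched for a simple fixed-point grid as below (plain Python/NumPy; the grid step is an assumed value for illustration, not one from [67]):

```python
import numpy as np

def stochastic_round(x: np.ndarray, step: float = 1 / 256) -> np.ndarray:
    """Round values to a fixed-point grid with spacing `step`.

    Each value is rounded to one of its two nearest grid points; the closer
    point is chosen with higher probability, so the expected value equals x.
    """
    scaled = x / step
    lower = np.floor(scaled)
    frac = scaled - lower                        # distance to the lower grid point, in [0, 1)
    round_up = np.random.random(x.shape) < frac  # round up with probability equal to frac
    return (lower + round_up) * step

x = np.array([0.01234, -0.5, 0.99999])
print(stochastic_round(x))                       # values snapped to multiples of 1/256
```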


Figure 2.6: The basic framework of Knowledge Distillation which is derived from [71]

2.3.3 Distillation

The distillation method uses transfer learning, and the main idea is to train a simple network under the supervision of both the output of a pre-trained complex model and the ground-truth labels. The complex model is called the teacher model while the simple model is called the student model. Back in 2014, Ba and Caruana[69] trained a shallow neural network that had a similar number of parameters to a deep neural network and achieved similar performance on the TIMIT dataset[70]. Their work indicated that the performance of deep neural networks might not always be better than that of shallow networks. Inspired by Caruana et al.'s other work, the first systematic approach was introduced by Hinton et al.[71]. In this section, we will start with Hinton et al.'s work[71] and introduce the development of model distillation.

Hinton et al.[71] thought it was possible to transfer the knowledge of a deep network to a simple network. They first trained a large teacher model to generate the soft targets. The soft target was the output vector of the large network (representing the probability of each class in a classification task), but in a soft target, a sample could be assigned more than one class, which was reflected by smaller differences between the class probabilities than in the normal output. The normal output was called the hard target, which was approximately a one-hot encoding. They modified the softmax layer by adding a temperature T to generate the soft-target label:

p_i = \frac{\exp(z_i/T)}{\sum_j \exp(z_j/T)} \qquad (2.3)

where T is the temperature, p_i is the probability of class i, and z_i is the logit, the same as in the normal softmax function. They believed that the soft target contained more information than the hard-target label. After that, they trained the simple student model with both the soft-target labels (the teacher's knowledge) and the hard-target labels (the ground truth), which meant the loss function was composed of these two terms. Figure 2.6 shows the whole process of distillation. This method could also be used for training on unlabeled datasets, and the experimental results were positive. On the MNIST dataset[72], even though the transferred training set contained unlabeled data or some categories of data were missing, the model could still perform well. Moreover, the output of the large model contained more information than the ground-truth labels, such as class spacing and intra-class variance, so it could also be used as a means of data augmentation.
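A compact sketch of such a two-term distillation loss (PyTorch-style; the temperature T and the weighting factor are illustrative assumptions, not values from [71]):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Weighted sum of a soft-target term (teacher knowledge) and a hard-target term.

    student_logits, teacher_logits: tensors of shape (batch, num_classes)
    labels: tensor of shape (batch,) with ground-truth class indices
    """
    # Soft targets: softened teacher and student distributions at temperature T
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)                                  # rescale gradients as suggested in [71]
    # Hard targets: standard cross-entropy with the ground truth
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss

# Example with random logits (illustrative only)
s = torch.randn(8, 10)
t = torch.randn(8, 10)
y = torch.randint(0, 10, (8,))
print(float(distillation_loss(s, t, y)))
```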

Based on this seminal work[71], some researchers focused on improving the supervision information given by the teacher models. Sau et al.[73] added noise to the soft targets of the teacher model and calculated an L2 loss between these noisy soft targets and the student's soft targets. They also tried adding noise to the student outputs, but the experimental results showed that adding noise to the teacher outperformed adding noise to the student. This method can be regarded as a way to improve regularization. Zhang et al.[74] proposed the Knowledge Projection Network (KPN) to train the student model with new data when there was not enough label information. The KPN layer was designed to guide the learning of the student model. Specifically, they defined the KPN loss as the difference between the teacher layers and the corresponding student layers, and they applied 1×1 convolutional kernels to project the teacher's feature maps so that the channel numbers matched. Xu et al.[75] even applied conditional GANs[76] to learn a better loss.

Besides the improvements on the teacher side, there were some notable works on the student side. Wang et al.[77] designed a new structure for the student model and improved its accuracy through distillation. Lu et al.[78] applied the classical distillation method[71] to the speech recognition task. They also proposed a new student model structure which was different from conventional CNNs or RNNs. Crowley et al.[79] modified the convolutional blocks in the teacher model to obtain the student model.

All the aforementioned approaches[71, 73, 74, 75, 77, 78, 79] mainly focused on classification tasks. However, these approaches achieved poor results when applied to complex tasks like object detection[80]. Therefore, Li et al.[80] proposed the mimic algorithm to solve this problem. Similar to the traditional method[71], they also used ground-truth labels to train the student, but instead of directly using the output of the teacher, they reinforced the supervision between the feature maps of the teacher and the student, which meant they also tried to minimize the difference between the feature maps of the teacher and the student. At the same time, they pointed out that the student should not directly learn all the feature maps from the teacher due to the high dimensions, and some feature maps without objects would be less useful than others, so they sampled some proposals through RPN[29] and learned these sampled features. Like Zhang et al.[74], they introduced a deconvolution layer to ensure the same dimensions. Another work related to object detection was proposed by Chen et al.[81], and it was the first end-to-end distillation method for detection tasks. The total loss consisted of three terms. The first one was based on hint-based learning[82], which learned the feature maps from the teacher model. The second one was to learn the features of the proposals selected by RPN[29], and it included both a classification loss and a regression loss. The third term was to learn the soft targets of both the classification and the regression predictions generated by the detector part of Fast-RCNN[28]. By applying the new loss with different strategies, they improved the distillation performance when the classification classes were imbalanced and when the regression output of the student was far from the teacher's output.


3 Methods

In this chapter, we describe the methods that we used to achieve model compression with the pruning technique. We start by describing the dataset that we used and the structure of our baseline model. Then, we introduce the two model compression methods that were applied in this project: pruning by using sparsity training and pruning by using NAS. Finally, we provide implementation details and evaluation methods to make our work reproducible.

3.1 Dataset

In this project, we chose to use the internal dataset from Tobii AB. The original dataset was called g6_eye_features and it consisted of both Region Of Interest (ROI) samples and Down-sampled Full scale (DFS) samples. The DFS samples contain the whole face, from the forehead down to the neck, while ROI samples only focus on the region near the eyes. The ROI images can also be split into two types according to the relative placement of the illuminator and the camera: Bright pupil (BP) samples and Dark pupil (DP) samples. Figure 3.1 shows an example of a BP ROI sample.

Figure 3.1: Example of a BP ROI sample


Therefore, our original dataset consisted of 72304 samples with annotations in total. However, although YOLOv3[1] is good at detecting small objects, the annotation boxes of the DFS samples were still too small compared with the image size. Moreover, the object features differed between BP and DP samples. Therefore, to improve the accuracy of our baseline model, we created a sub-dataset from the original dataset which excluded the DFS samples and only contained BP samples. This sub-dataset had 30175 images with annotations, and we will choose the model with the better performance as the baseline after training on these two datasets separately. For training and validation purposes, we divided both datasets into two parts: 50% for training and 50% for validation. Table 3.1 shows the split.

Table 3.1: The dataset split. The reason why the BP dataset without DFS contains twice as many samples as mentioned above is explained in Section 3.2.

                         Training set samples   Validation set samples
Original Dataset                36152                   36152
BP Dataset without DFS          30175                   30175

3.2 Data pre-processing

To fit the input size of our object detection network, YOLOv3[1], we must adjust the size of the original data. The original size of the BP and DP samples was 1160 × 300, and the DFS images had no fixed size. We first applied the letterbox rescale method to all images, which means that we scaled down the original image to fit a 416 × 416 canvas while keeping its aspect ratio, and the remaining blank parts were filled with gray pixels. The annotations were also scaled down accordingly. Figure 3.2 shows one of the scaled-down samples; the blue boxes are the ground-truth annotations while the red boxes are the predicted bounding boxes.
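A rough sketch of such a letterbox rescale (OpenCV-style Python; the gray value and function layout are assumptions for illustration, not necessarily the exact preprocessing used in this project):

```python
import cv2
import numpy as np

def letterbox(image, target=416, pad_value=128):
    """Resize a 3-channel image to fit a square canvas while keeping its aspect ratio.

    Returns the padded image plus the scale and offsets needed to map the
    annotation boxes into the new coordinate system.
    """
    h, w = image.shape[:2]
    scale = target / max(h, w)
    new_w, new_h = int(round(w * scale)), int(round(h * scale))
    resized = cv2.resize(image, (new_w, new_h))
    canvas = np.full((target, target, 3), pad_value, dtype=np.uint8)
    dx, dy = (target - new_w) // 2, (target - new_h) // 2
    canvas[dy:dy + new_h, dx:dx + new_w] = resized
    return canvas, scale, dx, dy

# An annotation box (x1, y1, x2, y2) is then mapped with: new_coord = old_coord * scale + (dx, dy)
```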

Since the BP dataset without DFS had only 30175 samples, we decided to use data augmentation to enlarge the dataset. Therefore, we applied a horizontal flip to all images so that the size of this dataset was doubled, which is why Table 3.1 lists a total of 60350 samples for this dataset.


Figure 3.2: Example of a downscaled sample

3.3 Baseline Network

In this project, we chose our baseline to be one of the most famous object detection models, YOLOv3[1]. Hence, in this section, we will briefly describe the structure of YOLOv3[1], which is shown in Figure 3.3.

Figure 3.3: YOLOv3 structure, derived from [1]

We will first introduce some block units, shown at the bottom of Figure 3.3. A single Conv block is composed of a 2D convolution layer, a BN layer[47] and a leaky ReLU layer[10]. Conv block×N indicates that a single Conv block is repeated N times. A Res Unit is similar to the basic block unit used in ResNet[36], and it consists of two single Conv blocks plus a shortcut path. The shortcut path keeps the input of the Res Unit, and the output of the unit is the addition of the shortcut data and the convolution output, so it requires the number of output channels of the last Conv block to be the same as that of the input data. Finally, Res Block×N contains a single Conv block followed by N Res Units.

With these basic block structures, we can construct the whole framework of YOLOv3[1], which can be divided into two parts. The first part, the backbone network, is constructed from Conv blocks and Res Blocks and aims to extract features from an input image, so it is called the feature extractor. Since pre-trained weights for the feature extractor, trained on ImageNet[3], can be downloaded and are good for general-purpose classification or detection tasks, we directly use the pre-trained weights instead of training from scratch, to save time.

The remaining part is the detector, which performs the detection task based on the extracted features. Generally, the deeper the network, the better the feature expression. For example, consider the 16-fold downsampling detection (Out2 in Figure 3.3, where the output dimension is 26×26×M; it is called 16-fold downsampling detection because the final output width and height are one-sixteenth of the initial size, i.e. 416/16 = 26). If we directly used the fourth-stage downsampling features (in Figure 3.3, the output feature maps of the backbone network, i.e. the output of Res Block4, which is also 16-fold downsampled and has the same dimension) for detection, then only shallow features would be used, which is not good for the final performance. However, if we want to use the 32-fold downsampled features (the output feature maps of the Conv Block5 in the first row of Figure 3.3), the dimension of these deep features is too small; this is why YOLOv3[1] uses upsample layers with a stride of 2 to double the size of the feature maps obtained by the 32-fold downsampling. Similarly, for the 8-fold downsampling detection, the 16-fold downsampled features are upsampled so that deep features can be used for detection. Since the feature maps then have the same spatial size, we can directly concatenate them along the channel dimension with the concat layers.
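The upsample-and-concatenate step can be sketched in a few lines (PyTorch-style; the channel counts are illustrative, not the exact widths in YOLOv3):

```python
import torch
import torch.nn.functional as F

# Deep feature map from the 32-fold downsampling path and a shallower 16-fold map
deep = torch.randn(1, 256, 13, 13)      # 416 / 32 = 13
shallow = torch.randn(1, 512, 26, 26)   # 416 / 16 = 26

# Upsample the deep features by a factor of 2 so the spatial sizes match,
# then concatenate along the channel dimension as the concat layer does.
deep_up = F.interpolate(deep, scale_factor=2, mode="nearest")   # -> (1, 256, 26, 26)
fused = torch.cat([deep_up, shallow], dim=1)                    # -> (1, 768, 26, 26)
print(fused.shape)
```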

Then we can focus on the outputs. As Figure 3.3 shows, YOLOv3[1] has 3 final output feature maps; this design improves the detection accuracy for small objects. YOLOv3[1] adopts FPN-like[30] upsampling and fusion and performs detection on feature maps of multiple scales (13 × 13, 26 × 26, 52 × 52). These three scales divide the original input image into 13 × 13, 26 × 26 and 52 × 52 grid cells respectively, and each cell predicts 3 bounding boxes. Thus, YOLOv3[1] predicts (13 × 13 + 26 × 26 + 52 × 52) × 3 = 10647 bounding boxes in total. In Figure 3.4, the green box is a grid cell, and it predicts three bounding boxes. For each bounding box, the model predicts three things: first, the location of the box (center coordinates x and y, width and height, i.e. four numbers); second, the probability of each class; third, the confidence score.

In our project, since we only have one class (eyelid) and need to predict exactly two bounding boxes per image, the inference method is different. When choosing the final result, we first select the top 20 bounding boxes with the best confidence scores and take the top one as one of the final results. We then remove duplicate boxes by searching from the second-best box onward until we find a box that has no common area with the first box. In this way, we obtain the two best predicted bounding boxes for each image.
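A minimal sketch of this two-box selection is shown below (NumPy); the (x1, y1, x2, y2) box format and the function name are our own assumptions.

```python
import numpy as np

def pick_two_eyelid_boxes(boxes, scores, top_k=20):
    """Select the two final eyelid boxes: take the top_k boxes by
    confidence, keep the best one, then walk down the list until a box
    shares no area with the first one."""
    order = np.argsort(scores)[::-1][:top_k]
    boxes = boxes[order]

    first = boxes[0]
    second = None
    for candidate in boxes[1:]:
        # Intersection area between `first` and `candidate`
        ix = max(0.0, min(first[2], candidate[2]) - max(first[0], candidate[0]))
        iy = max(0.0, min(first[3], candidate[3]) - max(first[1], candidate[1]))
        if ix * iy == 0.0:          # no common area -> second eyelid box
            second = candidate
            break
    return first, second
```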

Figure 3.4: The output of YOLOv3[1]

3.4 Pruning by using sparsity training

After we had our baseline model, the next step was to explore methods to compress this large model. The first method we tried was pruning the network by using sparsity training. Since weight pruning requires special hardware or libraries to achieve an actual speedup[4], we decided to start with channel pruning, which means that we prune whole channels of convolution layers instead of pruning individual weights. Thus, the problem became how to measure the importance of each channel.


Inspired by Liu's work[4], we focused on the scaling factor of the BN layer[47]. From Figure 3.3 we can see that all the convolution blocks which can be pruned contain BN layers, and there are no prunable convolution layers without one, so this method fits our problem well. The output of a BN layer is calculated by:

y = α x̂ + β    (3.1)

where α is the scaling factor, β is the shifting factor and x̂ is the normalized input. From this equation, we can observe that when α is 0, the output y is a constant no matter what value the normalized input x̂ takes, so no input information is conveyed to the next layer. Based on this, we assume that channels with lower scaling factor values are less important. Figure 3.5 shows how we implement channel pruning according to the scaling factor values: we manually define a pruning threshold to decide whether each channel is important or not.

Figure 3.5: Schematic diagram of channel pruning. The orange channels in the left part have smaller scaling factor values, so they are less important. After pruning the orange channels, the related input channels of the next layer must also be removed, as shown in the right part[4]
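Under the assumption that the model is implemented in PyTorch, the per-channel importance decision can be sketched as follows; the manually chosen threshold is the only hyper-parameter, and the function name is our own.

```python
import torch.nn as nn

def channel_masks(model, threshold):
    """Return a keep/prune mask per BN layer: channels whose |gamma|
    (scaling factor) falls below the manual threshold are marked for pruning."""
    masks = {}
    for name, module in model.named_modules():
        if isinstance(module, nn.BatchNorm2d):
            masks[name] = module.weight.detach().abs() > threshold
    return masks
```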

Then an essential problem was how to make these unimportant channels appear. If only a few scaling factor values are close to 0, the compressed model will still be very large. Thus, as in Liu's work[4], we used the L1 norm to impose sparsity on the scaling factors. In detail, assuming the original loss function of our baseline model is Lyolo(y, t), where y is the prediction and t is the target, our new loss function becomes:

L_sparsity = Σ_i L_yolo(y_i, t_i) + λ Σ_n |α_n|    (3.2)

where λ is the regularization factor and the second sum runs over all scaling factors in the network. With the L1 norm, we obtain a sparse solution for the scaling factors, so after sparsity training we can easily implement the channel pruning shown in Figure 3.5.
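In practice, the L1 term is commonly implemented by adding its subgradient directly to the BN scaling-factor gradients after the backward pass (as in the released network-slimming code); the sketch below assumes a PyTorch model and is called between `loss.backward()` and `optimizer.step()`.

```python
import torch.nn as nn

def add_l1_sparsity_grad(model, lam=1e-3):
    """Add the (sub)gradient of lam * sum(|gamma|) to every BN scaling
    factor; call after loss.backward() and before optimizer.step()."""
    for module in model.modules():
        if isinstance(module, nn.BatchNorm2d):
            module.weight.grad.add_(lam * module.weight.detach().sign())
```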


Figure 3.6: Procedure of channel pruning by using sparsity training[4]

In conclusion, the whole procedure of pruning channels by using sparsity training is shown in Figure 3.6. After pruning the channels, we need to fine-tune the network, because there is always some accuracy loss after pruning. The dotted line indicates that if the model size is still too large after pruning, we can apply the pruning method iteratively. The following three sub-sections introduce the different tracks we explored based on this channel pruning framework.

3.4.1 Pruning channels of shortcut-related layers

Although in YOLOv3[1] all the convolution layers which can be pruned have BN layers, a large part of them are shortcut-related layers. Shortcut-related layers are layers whose output dimensions are restricted by each other because of the shortcut path, as in the Res Unit in Figure 3.3. Figure 3.7 shows a short part of the backbone network; both the blue and orange blocks represent a single Conv Block from Figure 3.3. The blue blocks are the shortcut-related layers because their outputs are directly added to the outputs of other layers, so the output dimensions of these blue blocks must all be the same. That is to say, when pruning, we should prune all shortcut-related layers in the same group with the same pruning ratio in order to avoid dimension mismatches.

Figure 3.7: The blue blocks are shortcut-related layers; the output dimensions of these layers should remain the same after pruning

Since there are 28 shortcut-related layers in YOLOv3[1] in total, it was necessary to design a pruning strategy for these layers. In this project, we applied the union mask strategy to solve this problem. With this strategy, we keep the union set of the remaining channel indexes of the shortcut-related layers in the same group. For example, assume we have two shortcut-related layers, each with 3 output channels, and the two pruning masks are [1,0,0] and [0,0,1] (0 indicates that a channel should be pruned, 1 that it should be kept). The final result will be [1,0,1], which means the first and the third channels of both layers are kept after pruning, even though some of their scaling factor values were below the pruning threshold. This is a relatively conservative strategy, because we try to keep as many channels as possible when pruning.
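A sketch of the union mask computation, reusing the boolean keep-masks from the previous sketch (PyTorch tensors; names are illustrative):

```python
import torch

def union_mask(group_masks):
    """Combine the keep-masks of all shortcut-related layers in one group:
    a channel is kept if it survives the threshold in ANY layer of the group."""
    union = group_masks[0].clone()
    for mask in group_masks[1:]:
        union |= mask
    return union

# Example with the masks from the text: [1,0,0] and [0,0,1] -> [1,0,1]
a = torch.tensor([True, False, False])
b = torch.tensor([False, False, True])
print(union_mask([a, b]))  # tensor([ True, False,  True])
```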

3.4.2 Absorbing the shifting factor information into subsequent layers

In the previous section, we only focused on the scaling factors of the BN layers (Equation 3.1). However, we were not sure whether it was reasonable to directly prune channels whose shifting factor values (β) were high. Ye et al.[83] proposed a method that absorbs the shifting factors and bias terms into the subsequent layers. In this project, we also explored whether this method was effective for our model.

As mentioned before, if the scaling factor of a channel is 0, that output channel is a constant whose value equals the shifting factor. In this case, there are mainly two conditions. First, if the subsequent layer does not have a BN layer, we absorb the shifting factor information into its bias terms by:

b_new^(l+1) = b^(l+1) + I(α = 0) · ReLU(β)^T Σ_{a,b} W_{a,b,·,·}^(l+1)    (3.3)

where l is the current layer index, b^(l+1) is the bias term of the next layer, W^(l+1) are the weights of the next layer, and α and β are the scaling and shifting factors of the current layer.

Second, if the subsequent layer has a BN layer, we absorb the shifting factor information into its moving average by:

µ_new^(l+1) = µ^(l+1) − I(α = 0) · ReLU(β)^T Σ_{a,b} W_{a,b,·,·}^(l+1)    (3.4)
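Equations 3.3 and 3.4 can be sketched in PyTorch as below: the constant ReLU(β) of zero-scaled channels is pushed through the next convolution and folded into either the next bias or the next BN running mean. The interface and the handling of the spatial kernel sum are our own assumptions.

```python
import torch

def absorb_constant_channels(alpha, beta, next_weight,
                             next_bias=None, next_bn_mean=None):
    """For channels of the current layer whose scaling factor is zero, the
    channel output is the constant ReLU(beta). Fold its contribution through
    layer l+1 into the bias (Eq. 3.3) or BN running mean (Eq. 3.4) before
    the channel is removed.

    next_weight: (out_ch, in_ch, k, k) weights of layer l+1."""
    const_in = torch.relu(beta) * (alpha == 0).float()   # constant per pruned channel
    # Contribution of the constant input to every output channel of layer l+1
    offset = torch.einsum("oikl,i->o", next_weight, const_in)
    if next_bias is not None:        # layer l+1 has no BN: add to bias
        next_bias += offset
    if next_bn_mean is not None:     # layer l+1 has a BN: subtract from the moving average
        next_bn_mean -= offset
    return next_bias, next_bn_mean
```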


3.4.3 Layer pruning

Besides channel pruning, we also tried layer pruning in our project, because YOLOv3[1] is a very deep network. While channel pruning reduces the width of the network to make it thinner, layer pruning reduces the depth of the network to make it shallower.

Figure 3.8: Two conditions of pruning a Res Unit. Top: pruning an individual Res Unit; Bottom: pruning the Res Unit after a Conv Block.

Based on the same assumption that a smaller scaling factor value carries less information, we evaluated the importance of each layer by calculating the average value of its scaling factors, and we again set a pruning threshold manually to prune layers whose average scaling factor value was below the threshold. However, the dimension problem of Section 3.4.1 still exists when pruning the shortcut-related layers, and pruning individual layers easily damages the original structure of the network. Thus, we designed a strategy to deal with this problem: instead of pruning individual layers, we pruned whole Res Units. The evaluation method is similar: we calculate the average scaling factor value of each Res Unit. Figure 3.8 shows the two conditions we met when pruning Res Units. The advantage of this strategy is that it keeps the overall structure of YOLOv3[1] while reducing complexity.
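A sketch of this Res Unit scoring, assuming each unit is an nn.Module whose Conv blocks contain BatchNorm2d layers; the function name and interface are illustrative.

```python
import torch.nn as nn

def units_to_prune(res_units, threshold):
    """Rank each Res Unit by the mean absolute BN scaling factor of its
    convolution blocks and mark units below the manual threshold."""
    prune = []
    for idx, unit in enumerate(res_units):          # res_units: list of nn.Module
        gammas = [m.weight.detach().abs().mean()
                  for m in unit.modules() if isinstance(m, nn.BatchNorm2d)]
        score = sum(gammas) / len(gammas)
        if score < threshold:
            prune.append(idx)
    return prune
```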


3.5 Pruning by using NAS

In the previous section, we mainly discussed how to prune our network by using sparsity training. One disadvantage of this method is that we must manually set a threshold to decide the compression ratio, and we need to try different compression ratios and fine-tune the pruned models in order to find the optimal one. In addition, we may need to spend a lot of time tuning the sparsity rate, i.e. the coefficient of the L1-norm regularization term used during sparsity training. Therefore, in this section, we introduce a recent method called AutoML for Model Compression (AMC)[5]. AMC uses deep RL and can automatically find the best compression ratio for each layer, which means we do not need to set a global compression threshold ourselves.

Figure 3.9: Overview of the whole framework of AMC[5]

Figure 3.9 shows the framework of AMC. AMC mainly consists of two parts: a DDPG agent, which can learn competitive policies for our task[57], and a channel pruning environment. AMC aims to find the best channel pruning strategy through the interaction of these two parts. Before introducing the two parts and the algorithm, it is necessary to describe how we constructed the state and action space, since this is an RL problem.

State space The original paper[5] defined 11 features to embed the state of each layer. Similarly, we used 9 features:

(t, n, c, k, stride, FLOP(t), reduced, rest, a_{t−1})

where t is the layer index, n is the number of input channels, c is the number of output channels, k is the kernel size, FLOP(t) is the FLOPs of the current layer t, reduced is the number of FLOPs already reduced in the previous layers, rest is the number of FLOPs remaining in the following layers, and a_{t−1} is the action (sparsity ratio) taken for layer t−1. We also normalized the embedded state information to [0, 1] when initializing the environment.
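A possible way to build and normalize this 9-feature state embedding is sketched below; the dictionary keys and the per-feature min-max normalization are our own assumptions.

```python
import numpy as np

def build_layer_states(layers):
    """layers: list of dicts with keys
    't', 'n', 'c', 'k', 'stride', 'flops', 'reduced', 'rest', 'a_prev'.
    Returns an (L, 9) array, each feature min-max normalized to [0, 1]."""
    keys = ["t", "n", "c", "k", "stride", "flops", "reduced", "rest", "a_prev"]
    states = np.array([[layer[k] for k in keys] for layer in layers],
                      dtype=np.float32)

    lo, hi = states.min(axis=0), states.max(axis=0)
    span = np.where(hi - lo > 0, hi - lo, 1.0)   # avoid division by zero
    return (states - lo) / span
```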

Action space We used the same continuous action space (0, 1] as in [5]. The action in our project is the sparsity ratio of the current layer. For example, 1 indicates that no channels of the current layer are pruned, while 0.5 means that only 50% of the channels are kept after pruning.

Reward The reward function in our project is −Error · log(FLOPs), which considers both the accuracy and the model complexity.
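In code this reward is a one-liner; the function name is illustrative.

```python
import math

def reward(error, flops):
    """Penalize both the validation error of the pruned model and its
    remaining FLOPs."""
    return -error * math.log(flops)
```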

Figure 3.10: AMC framework in detail, derived from [5]

With these basic settings in place, we can now introduce the framework in detail. As illustrated in Figure 3.10, there are four neural networks in the DDPG agent[57], two for the actor and two for the critic, and there is an experience replay memory used to store observations for training. At each step, the channel pruning environment receives the noisy action generated by the online policy network µ, applies it, and returns the transition (current state, next state and current reward) to the DDPG agent[57]. The DDPG agent[57] saves the transition into the memory and randomly samples a batch of transitions. These samples are then used to update the online Q network and the online policy network. After
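The interaction described above can be summarized by the loop sketched below; this is only one possible arrangement, and the env, agent and buffer objects with their reset/step, act/update and push/sample interfaces are assumptions of ours, not AMC's exact API.

```python
def search(env, agent, buffer, episodes=400, batch_size=64):
    """One possible interaction loop between the DDPG agent and the
    channel pruning environment."""
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            action = agent.act(state, add_noise=True)    # noisy sparsity ratio from the online actor
            next_state, rwd, done = env.step(action)     # prune the current layer, move to the next
            buffer.push(state, action, rwd, next_state, done)
            state = next_state
            if len(buffer) >= batch_size:
                agent.update(buffer.sample(batch_size))  # update online critic/actor, then targets
```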
