Degree Project in Electrical Engineering, Second Cycle, 30 credits
Stockholm, Sweden 2019

An Embedded System for Classification and Dirt Detection on Surgical Instruments

GUÐMUNDUR HALLGRÍMSSON


Abstract

The need for automation in healthcare has been rising steadily in recent years, both to increase efficiency and to free educated workers from repetitive, menial, or even dangerous tasks. This thesis investigates the implementation of two pre-determined and pre-trained convolutional neural networks on an FPGA for the classification and dirt detection of surgical instruments in a robotics application. A background on the inner workings and history of artificial neural networks is given and expanded on in the context of convolutional neural networks. The Winograd algorithm for computing convolutional operations is presented as a method for increasing the computational performance of convolutional neural networks. A development platform and toolchain are then selected. A high-level design of the overall system is explained, before details of the high-level synthesis implementation of the dirt detection convolutional neural network are shown. Measurements are then made on the performance of the high-level synthesis implementations of the various blocks needed for convolutional neural networks. The main convolutional kernel is implemented both using the Winograd algorithm and using the naive convolution algorithm, and the two are compared. Finally, measurements on the overall performance of the end-to-end system are made and conclusions are drawn. The final product of the project gives a good basis for further work on implementing a complete system to handle this functionality in a manner that is both efficient in power and low in latency. Such a system would utilize the different strengths of general-purpose sequential processing and the parallelism of an FPGA and tie those together in a single system.

Keywords: Neural Network, CNN, FPGA, PetaLinux, Winograd, High-level Synthesis.


Sammanfattning

The need for automation in healthcare has grown considerably in recent years, both in terms of efficiency and to free trained workers from repetitive, simple, or even dangerous tasks. This report investigates the implementation of two pre-defined and pre-trained convolutional neural networks on an FPGA, in order to classify surgical instruments and detect contamination on them. A background on how neural networks work, and their history, is presented in the context of convolutional neural networks. The Winograd algorithm, which is used to compute convolutions, is described as a method for increasing computational performance. A development platform and tools are then chosen. The system is described at a high level, before details of the high-level synthesis implementation of the dirt detection network are shown. Measurements are then made of the performance of the various building blocks. The core convolution code is implemented both with the Winograd algorithm and with the traditional, naive, method, and the outcomes of the two methods are compared. Finally, measurements are made of the performance of the whole system, and conclusions are drawn. The end product of the project can be used as a good basis for further development of a complete system that is both power-efficient and performant, by tying together the strengths of traditional sequential processors with the parallelism of an FPGA in a single system.

Keywords: Neural network, Convolutional neural networks, FPGA, PetaLinux, Winograd, High-level synthesis.


Acknowledgements

I would like to thank Professor Christian Smith at the RPL for suggesting the project and driving it as a part of the Automed project. I would also like to acknowledge the work of Vladimir Li in developing and training the neural networks used in the project and Amirinnisa Atrisandi for collecting the data for their training. For their valuable input on the thesis I would like to thank Professor Johnny Öberg, as well as my opponent Peter Agoston. Finally, I would like to thank the good people of the RPL lab for providing a productive working environment as well as technical expertise, and my beloved wife for her unending love and patience.


List of Figures

1 System overview
2 A neuron as part of an Artificial Neural Network (ANN)
3 A fully connected ANN
4 Activation functions
5 Example of a 2D convolution, where the right image shows the left image after convolving with the Sobel operator
6 Pooling layers. Max pooling on the left side, Average pooling to the right
7 Convolutional layer
8 VGG-16 topology
9 Detection network topology
10 FSM with rolled loop logic. Note that resets are ignored, but should lead back to C0.
11 FSM with unrolled loop logic. Note that resets are ignored, but should lead back to C0.
12 FSM with pipelined logic
13 Software loop
14 Kernel module loop
15 System overview
16 Winograd convolution element design
17 Naive convolution element design
18 Latency of convolutional elements by the needed number of convolutional operations, exhibiting a linear relationship, with the naive implementation underperforming slightly at larger dimensions
19 Convolution resource usage, showing how the Winograd implementation uses up to double the resources of the naive implementation. Actual resource counts in the upper row and their ratios in the lower row
20 Pooling latencies
21 Pooling resource usage
22 FPGA block design
23 Power summary
24 Winograd end-to-end performance
25 Naive end-to-end performance
26 Implemented design


List of Tables

2 VGG-16 Operation count
3 Detection network operation count
4 VGG-16 Memory requirements, assuming 4-byte values
5 Detection network memory requirements, assuming 4-byte values
6 FPGA SoC platforms
7 Loop unrolling comparison
8 Array partitioning comparison
9 Loop pipelining comparison
10 OpCodes
11 Convolution HLS statistics, Winograd implementation
12 Convolution HLS statistics, naive implementation
13 Pooling HLS statistics
14 Multiplication element performance
15 FPGA resource utilization, detection network, full system
16 End-to-end performance
17 Tensorflow performance, detection network


Contents

Acronyms

1 Introduction
1.1 Background
1.2 Problem
1.3 Goal
1.4 Benefits, Ethics and Sustainability
1.5 Delimitations
1.6 Outline

2 Background
2.1 Automed
2.2 Artificial Neural Networks
2.2.1 Fully connected Neural Networks
2.2.2 Convolutional Neural Networks
2.2.3 VGG-16
2.2.4 Detection network
2.2.5 Winograd
2.2.6 Datatypes
2.2.7 Compression
2.2.8 Memory transfers

3 Platform and Tools
3.1 Platforms
3.2 HLS
3.2.1 Program flow control
3.2.2 Vivado HLS

4 System design
4.1 Software design
4.1.1 User-space application
4.1.2 Kernel-space application
4.2 HLS design

6 Results
6.1 FPGA block design
6.1.1 Synthesis results
6.2 Linux

7 Conclusions
7.1 Future Work
7.1.1 Neural networks
7.1.2 HLS
7.1.3 FPGA
7.1.4 Application
7.1.5 Further

A Code
A.1 HLS Code
A.2 PetaLinux code


Acronyms

ANN Artificial Neural Network
ARM Advanced RISC machine
AXI4 Advanced eXtensible Interface 4
BRAM Block RAM
CNN Convolutional Neural Network
CPU Central Processing Unit
DDR Double Data Rate
DDR3 Double Data Rate (Three)
DMA Direct Memory Access
DRAM Dynamic RAM
DSP Digital Signal Processing
FC Fully Connected
FIR Finite Impulse Response
FPGA Field Programmable Gate Array
FSM Finite State Machine
GPIO General-Purpose Input/Output
HDL Hardware Description Language
HLS High Level Synthesis
I/O Input/Output
ILSVRC ImageNet Large-Scale Visual Recognition Challenge
IP Intellectual Property
IRQ Interrupt Request
KTH Kungliga Tekniska Högskolan
RAM Random Access Memory
ReLU Rectified Linear Unit
RGB Red, Green and Blue
ROM Read-Only Memory
RPL Department of Robotics, Perception and Learning
RTL Register Transfer Level
SD Secure Digital
SoC System on Chip
SRAM Static RAM
TCP Transmission Control Protocol
USB Universal Serial Bus
VGG Visual Geometry Group
VGG-16 VGG 16-layer CNN


1 Introduction

This thesis is done as a tangent to the Automed project. The Automed project, further described in Section 2.1, includes a number of sub-projects that aim to further automation in healthcare.

One of those projects has a need for classifying surgical instruments (classification network) and detecting any dirt that may be found on them (detection network), to further improve efficiency during sterilization. A convolutional neural network was created and trained for the purpose, showing very good results, but the need for implementing this functionality in an embedded system was noted at that time. A Field Programmable Gate Array (FPGA) was identified as a fitting solution to process the networks in a timely manner in an embedded system, and this thesis project was conceived to implement that system.

This thesis will explain the concepts involved in neural networks in general, and convolutional neural networks in particular, as well as some concepts for working with them that will be useful in later sections. The thesis then goes over some of the design tools and methods used and explains the result of the design work, including how the different parts of the system were implemented. Finally some results will be shown and conclusions will be drawn.

1.1 Background

The trend of implementing neural networks on FPGAs has been rising in recent years, so it is no coincidence that an FPGA was chosen for this project. The project also has certain requirements that are much better suited to a general-purpose operating system, which drove the platform selection to a System on Chip (SoC) solution, which has both an Advanced RISC machine (ARM) processor and an FPGA integrated into a single chip.

In simple terms, the overall system consists of a high-level application, a kernel module device driver, and a neural network accelerator. The high-level application runs on an embedded Linux operating system on the ARM processor system, and takes care of listening for requests on the network. It is then responsible for taking pictures with the Universal Serial Bus (USB) camera connected to the system, as well as talking with the lower-level device driver and returning any results back over the network. Below this application lies the kernel module device driver, which takes care of relaying communications between the high-level application and the neural network accelerator. Finally, the most substantial component of the system is the neural network accelerator itself, which is synthesized using Vivado High Level Synthesis (HLS), running on the FPGA. A conceptual diagram for the system can be seen in Figure 1.


Figure 1: System overview

1.2 Problem

The neural networks as they stand work very well, but they are far from optimized for running on an embedded system. The classification network for example requires several hundred megabytes of configuration data to be able to run, which can get cumbersome when implemented in an embedded system. Furthermore, the topologies used are not very well suited for processing on an FPGA. Some specific topologies have been identified and/or constructed specifically for running on an FPGA, utilizing its strengths to the fullest, but the topologies used in this application are much better suited for running on, for example, a graphical processing unit.

All the same, the project attempts to construct a generic framework for running these networks in both an accurate and timely manner. The question this thesis attempts to answer thus becomes: Can these two networks be integrated in an embedded system that can fit within a larger system and is able to give results within an acceptable timeframe?

1.3 Goal

The goal is to create an application which will satisfy the requirements of the larger system by returning classification and detection results in a timely manner. This should include an FPGA accelerator for inferring the neural network results and an application that will be able to grab pictures and process on request.


1.4 Benefits, Ethics and Sustainability

This project should benefit society as a whole, as a part of the Automed project, which aims to reduce the cost of the health-care system by a substantial amount. Finally, ethics is not a core focus of this project since it has no direct impact on individuals.

1.5 Delimitations

This project will focus on constructing a working overall system, emphasizing the end-to-end functionality with the viewpoint that further optimizations and implementations will take place at a later date. This will involve only implementing one of the two neural networks which are needed in the system, and while attempts will be made to implement optimized convolution algorithms, a naive algorithm will be developed in parallel to ensure proper functionality.

1.6 Outline

We will start by reviewing the background of the project, its origin and purpose, as well as the theory behind the key concepts of the project. After this we will review the platforms and tools used for the project, and go over a conceptual design of the entire system. Thereafter a concrete implementation of that design will be reviewed and a look will be taken at some key performance metrics, before drawing some conclusions and outlining the next steps to be taken in the development of the system.


2 Background

This chapter provides basic background information about the Automed project, artificial neural networks, the neural network topologies used in this project and high-level synthesis. The chapter also describes some related work in implementing neural networks on FPGA platforms.

2.1 Automed

This thesis work was done as a part of a larger project named AutoMed [1], which is sponsored by Vinnova1 and coordinated by ABB robotics2. The project is intended to increase the effectiveness of healthcare by introducing automation and robotic solutions to automate repetitive tasks and the expected result is an increase in effectiveness of specialized healthcare by up to 20%.

The AutoMed project has been split into several different subprojects at the Karolinska3 hospital in Stockholm, ranging from improving processes in the microbiology laboratory to improving metrics for patient satisfaction. One of these subprojects, for example, aims to show the benefits of utilizing collaborative robots as support in a microbiological laboratory, where a robot is used to take over certain tasks from staff, while letting the task itself remain unchanged as much as possible, to allow for human intervention [2]. Another subproject, and the focus of this thesis, concerns the identification and classification of surgical instruments to streamline the sterilization processes. This subproject is intended to be used in the process of automating the cleaning and sterilization of surgical instruments after use. This is not only a repetitive task but also poses a potential danger for employees when handling sharp objects that may contain infectious or toxic substances. The task could instead be handled by robotic workers, which would pick up the surgical instruments after they have been sterilized, inspect them, and detect whether the sterilization was satisfactory or not.

For this purpose, two convolutional neural networks have been developed. One is intended to identify the type of surgical instrument that the robot is holding, and the other to classify it as either sufficiently sterilized or not (clean or not clean). These neural networks are quite large and require at least a fairly powerful computer to be able to produce timely results. This is not ideal for a system that is intended to work in such an environment, since it both hampers transportation of the robot and consumes more power than necessary. Thus, the need for an embedded solution was identified. Such a system would reduce the physical size of the setup, not to mention remove the inconvenience of the various cords a desktop computer needs, as well as drastically reduce the power usage.

2.2 Artificial Neural Networks

ANNs are by now well established as the most common and flexible method for Machine Learning. The method builds on a very simple basic principle and extends this concept to construct a generic framework for many applications. In their simplest form, ANNs can be used to map an input domain to an output domain by using a set of known inputs and outputs to train the network, and then applying it to new inputs to infer the outputs that those inputs would most likely correspond to. This is especially useful for functions that cannot easily be represented with traditional linear algebra.

1 https://www.vinnova.se/
2 https://new.abb.com/products/robotics/sv
3 https://www.karolinska.se/?splitoption=splitdecision

Figure 2: A neuron as part of an ANN

An example of the basic building block of a Neural Network (the neuron) can be seen in Figure 2. This neuron maps a set of three inputs to a single output, where every input $I_i$ is multiplied by a weight $w_i$, and the total is added to a bias $b$. This is then passed to the output, giving a total result of $O = b + \sum_{i=1}^{n} I_i w_i$.

2.2.1 Fully connected Neural Networks

The simplest and most common way of expanding on the concept of a neuron is to construct a so-called Fully Connected (FC) network. The basic idea with these is to construct a number of layers, each a row of neurons, where every neuron of a given row is connected to every neuron of the preceding row. Generally, each neuron output is then passed through a so-called activation function. Classically this would most often be the sigmoid function: $f(x) = \frac{1}{1 + e^{-x}}$. A more common modern alternative is the Rectified Linear Unit (ReLU), where $f(x) = \max(0, x)$. An example of a fully connected network can be seen in Figure 3 and example activation functions in Figure 4.


Figure 3: A fully connected ANN

Figure 4: Activation functions

This topology lends itself well to problems with simple inputs and outputs, but it does not generally scale very well to larger inputs, such as images, which can lead to very high computational load both when inferring a result from the network as well as when training. A good overview of these classical neural networks can be found for example in Neural networks and deep learning by Michael Nielsen [3].
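To make the neuron and activation definitions above concrete, the following minimal C sketch computes one fully connected layer with a selectable ReLU or sigmoid activation. The function name, dimensions and float datatype are illustrative assumptions only, not the networks or datatypes used later in this project.

#include <math.h>

/* One fully connected layer: out[j] = f(b[j] + sum_i in[i] * w[j][i]).
 * Weights are stored row-major, one row of n_in weights per output neuron. */
void fc_layer(const float *in, int n_in,
              const float *w,   /* n_out x n_in weights */
              const float *b,   /* n_out biases */
              float *out, int n_out,
              int use_relu)
{
    for (int j = 0; j < n_out; j++) {
        float acc = b[j];
        for (int i = 0; i < n_in; i++)
            acc += in[i] * w[j * n_in + i];
        /* Activation: ReLU f(x) = max(0, x), or the classical sigmoid 1/(1+exp(-x)). */
        out[j] = use_relu ? (acc > 0.0f ? acc : 0.0f)
                          : 1.0f / (1.0f + expf(-acc));
    }
}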

2.2.2 Convolutional Neural Networks

The traditional fully connected neural network, while very powerful for simple inputs, does not lend itself very well to very large inputs, such as images. It is easy to see, for example, how an image that has 1024 pixels on each side and 3 color channels would lead to about 3 million weights for every single neuron in just the first layer, which is not only unnecessary but wasteful as well.

A common solution to this problem is to use so-called Convolutional Neural Networks (CNNs) for image-based inputs. These take advantage of certain properties of images, and emphasize localized changes in the image itself, such as edges and gradients.

Most CNNs consist of a few different types of layers, the output of each layer being the input to the next layer in the network. The most common layer types are:

• Convolutional layer

• ReLU

• SoftMax

• MaxPool

• AvgPool

• Fully connected

The convolutional layer makes up the bulk of the computation within a CNN. These layers compute a dot product for a certain learnable filter (also known as a kernel) for every location on the input map, adding up the results in the input channels, creating an output layer channel for every kernel used in the current layer. For example, a layer that has an input of 1024x1024 pixels in 3 channels and 10 kernels would generally create an output layer consisting of 1024x1024 pixels in 10 channels. The exact size can be variable however, depending on how the edges are handled and the size and stride of the kernel. For the scope of this thesis, all layers have kernels of size 3x3, a stride of 1 (meaning that the dot product is computed at every location in the input layer), and padded with zeroes, making the output layers generally the same width and height as the input layer, with varying depths according to the number of kernels in the layer.
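As a concrete illustration of the layer just described, the following minimal C sketch performs the naive convolution with 3x3 kernels, stride 1 and zero padding. The array layout, function name and float datatype are assumptions for illustration, not the HLS implementation described later in the thesis.

/* Naive convolution: one output channel per kernel, summed over all input channels. */
void conv3x3_naive(const float *in,  /* c_in  x h x w, channel-major */
                   const float *k,   /* c_out x c_in x 3 x 3 */
                   float *out,       /* c_out x h x w */
                   int c_in, int c_out, int h, int w)
{
    for (int co = 0; co < c_out; co++)
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++) {
                float acc = 0.0f;
                for (int ci = 0; ci < c_in; ci++)
                    for (int ky = -1; ky <= 1; ky++)
                        for (int kx = -1; kx <= 1; kx++) {
                            int iy = y + ky, ix = x + kx;
                            if (iy < 0 || iy >= h || ix < 0 || ix >= w)
                                continue;  /* zero padding at the image edges */
                            acc += in[(ci * h + iy) * w + ix]
                                 * k[((co * c_in + ci) * 3 + (ky + 1)) * 3 + (kx + 1)];
                        }
                out[(co * h + y) * w + x] = acc;  /* bias and ReLU omitted for brevity */
            }
}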

An example of the result of a convolution with a 2D image can be seen in Figure 5, where the image on the left has been convolved with the so-called Sobel operator [4], which is often used for edge detection in images. This operator is of the form

$$k = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix} \qquad (1)$$


Figure 5: Example of a 2D convolution, where the right image shows the left image after convolving with the Sobel operator.

The ReLU layer simply brings all negative values in the output to 0 (y = max(0, x)). This is a very common activation function to apply to convolutional layers and is applied in every instance used in this project.

The SoftMax function is commonly used on the outputs of CNNs, and is used to emphasize the higher values in an output vector and suppress the lower ones. It is calculated as:

$$\sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}} \qquad (2)$$

The MaxPool and AvgPool layers are part of a wider group of pooling layers, and are used to bring the size of the current layer down. Like kernels, these have a certain size and stride. The image is tiled according to those parameters and each tile takes on a single value. That value is simply the largest value in the tile for MaxPool and the average of the values for AvgPool.

A pool layer of size 3x3 and stride 2 would reduce the input in size by half, with each pixel having a value pooled from the 3x3 pixel area surrounding it in the input layer. For a graphical representation see Figure 6.


Figure 6: Pooling layers. Max pooling on the left side, Average pooling to the right
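A minimal C sketch of the pooling operation described above is given below, for a single channel and with the window anchored at the top-left corner of each tile for brevity (rather than centred as in the figure). Names, layout and the float datatype are illustrative assumptions only.

/* Pooling: a size x size window slides with step `stride`; each output pixel is
 * either the maximum (MaxPool) or the average (AvgPool) of the window. */
void pool2d(const float *in, int h, int w,
            float *out, int size, int stride, int use_max)
{
    int oh = h / stride, ow = w / stride;          /* output dimensions */
    for (int oy = 0; oy < oh; oy++)
        for (int ox = 0; ox < ow; ox++) {
            float best = -3.4e38f, sum = 0.0f;
            int count = 0;
            for (int ky = 0; ky < size; ky++)
                for (int kx = 0; kx < size; kx++) {
                    int iy = oy * stride + ky, ix = ox * stride + kx;
                    if (iy >= h || ix >= w)
                        continue;                  /* window clipped at the border */
                    float v = in[iy * w + ix];
                    if (v > best) best = v;
                    sum += v;
                    count++;
                }
            out[oy * ow + ox] = use_max ? best : sum / count;
        }
}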


Figure 7: Convolutional layer

Finally, CNNs commonly have one or more simple fully connected layers at the end.

These are simply traditional neural network layers where the neurons are connected to every pixel/value in the input image.

In addition to this, there is a multiplication layer in the detection network. This is used to combine two branches into one and is done via simple elementwise multiplication of the two inputs.

For further information regarding convolutional neural networks, see for example the notes for the Stanford CS class CS231n [5], or Tensorflow [6], which is the framework used for designing and training the networks used in this project.

2.2.3 VGG-16

The application consists of two convolutional neural networks, one for classifying the type of surgical instrument in question, and one for detecting whether or not the instrument is dirty and needs further cleaning. The classification network is an adapted form of the VGG-16 neural network developed by Simonyan and Zisserman at the Visual Geometry Group (VGG), University of Oxford [7] for the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) [8] in 2014. The ILSVRC is generally considered to showcase the most advanced image recognition neural networks each year, and in 2014 the VGG networks came out with the best results by localization error (~0.25) and came second by classification error (~0.07).

The flow through this network is shown in Figure 8, and a description of each layer is found in Table 2. There, the layers are listed along with the most important descriptors:

• Type, which can be convolutional, pooling, FC, etc.

• Number of channels being processed, the 3rd dimension in the output dimensions

• Output dimension, which becomes the input dimension of the next layer

• Number of Multiply and Accumulate (MACC) operations needed to process that layer

• Number of comparisons needed for processing that layer

The channels and dimension columns indicate the memory footprint, while the MACC and comparison columns indicate the processing overhead of the layer. Looking at these, we can see quite clearly how the network iterates through convolving the image with the kernel values 2-3 times for each segment, before MaxPool-ing the image in such a way that the dimensions of the image are halved. The next convolution layer after such a pooling layer will then double the channel count of the image, effectively causing the data to maintain the same total size through most of the network, leading to mostly static memory and processing requirements.

Of note regarding this network is that the final stages are FC layers. These layers need extremely large amounts of data, especially the first one, since it connects some 25 thousand input values to 4096 output values, which comes out to nearly 103 million weight values for just that single layer.
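As a quick check of that figure, the first FC layer takes the 7x7x512 output of Pool 5 (see Table 2) as its input:

$$7 \cdot 7 \cdot 512 \cdot 4096 = 25088 \cdot 4096 = 102\,760\,448 \approx 1.03 \cdot 10^{8}\ \text{weights}$$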


Figure 8: VGG-16 topology


Table 2: VGG-16 Operation count

Layer Type Kernel channels Dimension out MACCs Comparisons
Input Image - 224x224x3 0 0
Conv3-64 1 Convolution/ReLu 64 224x224x64 8.67 · 10^7 3.21 · 10^6
Conv3-64 2 Convolution/ReLu 64 224x224x64 1.85 · 10^9 3.21 · 10^6
Pool 1 Max pool - 112x112x64 0 3.21 · 10^6
Conv3-128 1 Convolution/ReLu 128 112x112x128 9.25 · 10^8 1.61 · 10^6
Conv3-128 2 Convolution/ReLu 128 112x112x128 1.85 · 10^9 1.61 · 10^6
Pool 2 Max pool - 56x56x128 0 1.61 · 10^6
Conv3-256 1 Convolution/ReLu 256 56x56x256 9.25 · 10^8 8.03 · 10^5
Conv3-256 2 Convolution/ReLu 256 56x56x256 1.85 · 10^9 8.03 · 10^5
Conv3-256 3 Convolution/ReLu 256 56x56x256 1.85 · 10^9 8.03 · 10^5
Pool 3 Max pool - 28x28x256 0 8.03 · 10^5
Conv3-512 1 Convolution/ReLu 512 28x28x512 9.25 · 10^8 4.01 · 10^5
Conv3-512 2 Convolution/ReLu 512 28x28x512 1.85 · 10^9 4.01 · 10^5
Conv3-512 3 Convolution/ReLu 512 28x28x512 1.85 · 10^9 4.01 · 10^5
Pool 4 Max pool - 14x14x512 0 4.01 · 10^5
Conv3-512 4 Convolution/ReLu 512 14x14x512 4.62 · 10^8 1.00 · 10^5
Conv3-512 5 Convolution/ReLu 512 14x14x512 4.62 · 10^8 1.00 · 10^5
Conv3-512 6 Convolution/ReLu 512 14x14x512 4.62 · 10^8 1.00 · 10^5
Pool 5 Max pool - 7x7x512 0 1.00 · 10^5
FC-4096 1 Fully connected - 4096 1.03 · 10^8 4096
FC-4096 2 Fully connected - 4096 1.68 · 10^7 4096
FC-12 Fully connected - 12 49200 12
Total - - - 15.5 · 10^9 19.7 · 10^6

2.2.4 Detection network

The detection network is a purpose-built CNN, designed by Vladimir Li et al at the KTH Department of Robotics, Perception and Learning (RPL) laboratory. This network was designed with the sole intention of detecting dirt on the surgical instruments under inspection. The structure of the network is composed of three main parts, two of which take the original input image and detect where on that image dirt can be found. This means that multiplying the outputs of these branches elementwise yields an image which highlights the visible dirt on the instrument.


Figure 9: Detection network topology


Table 3: Detection network operation count

Layer Type Kernel channels Dimension out MACCs Comparisons
Input Image - 336x336x3 0 0
Conv3-16 1-1 Convolution/ReLu 16 336x336x16 48.8 · 10^6 1.81 · 10^6
Conv3-16 1-2 Convolution/ReLu 16 336x336x16 260.0 · 10^6 1.81 · 10^6
Conv3-16 1-3 Convolution/ReLu 16 336x336x16 260.0 · 10^6 1.81 · 10^6
Pool 1-1 Max pool - 168x168x16 0 4.06 · 10^6
Conv3-32 1-1 Convolution/ReLu 32 168x168x32 130.0 · 10^6 903.0 · 10^3
Conv3-32 1-2 Convolution/ReLu 32 168x168x32 260.0 · 10^6 903.0 · 10^3
Conv3-32 1-3 Convolution/ReLu 32 168x168x32 260.0 · 10^6 903.0 · 10^3
Pool 1-2 Max pool - 162x162x32 0 840.0 · 10^3
Conv3-64 1-1 Convolution/ReLu 64 162x162x64 484.0 · 10^6 1.68 · 10^6
Conv3-64 1-2 Convolution/ReLu 64 162x162x64 967.0 · 10^6 1.68 · 10^6
Conv3-64 1-3 Convolution/ReLu 64 162x162x64 967.0 · 10^6 1.68 · 10^6
Pool 1-3 Avg pool - 157x157x64 56.8 · 10^6 0
Conv3-2 1-4 Convolution/ReLu 2 157x157x2 28.4 · 10^6 49.3 · 10^3
Conv3-32 2-1 Convolution/ReLu 32 336x336x32 97.5 · 10^6 3.61 · 10^6
Conv3-32 2-2 Convolution/ReLu 32 336x336x32 1.04 · 10^9 3.61 · 10^6
Conv3-32 2-3 Convolution/ReLu 32 336x336x32 1.04 · 10^9 3.61 · 10^6
Pool 2-1 Max pool - 168x168x32 0 8.13 · 10^6
Conv3-64 2-1 Convolution/ReLu 64 168x168x64 520.0 · 10^6 1.81 · 10^6
Conv3-64 2-2 Convolution/ReLu 64 168x168x64 1.04 · 10^9 1.81 · 10^6
Conv3-64 2-3 Convolution/ReLu 64 168x168x64 1.04 · 10^9 1.81 · 10^6
Pool 2-2 Max pool - 162x162x64 0 1.68 · 10^6
Conv3-128 2-1 Convolution/ReLu 128 162x162x128 1.93 · 10^9 3.36 · 10^6
Conv3-128 2-2 Convolution/ReLu 128 162x162x128 3.87 · 10^9 3.36 · 10^6
Conv3-128 2-3 Convolution/ReLu 128 162x162x128 3.87 · 10^9 3.36 · 10^6
Pool 2-3 Avg pool - 157x157x128 114.0 · 10^6 0
Conv3-2 2-4 Convolution/ReLu 2 157x157x2 56.8 · 10^6 49.3 · 10^3
Mul Multiplication - 157x157x2 4.93 · 10^3 0
Conv3-8 3-1 Convolution/ReLu 8 157x157x8 3.55 · 10^6 197.0 · 10^3
Pool 3-1 Avg pool - 28x28x2 3.92 · 10^6 0
Conv3-4 3-2 Convolution/ReLu 4 28x28x4 226.0 · 10^3 3140
Pool 3-2 Avg pool - 3x3x4 144.0 · 10^3 0


2.2.5 Winograd

Looking at Tables 2 and 3 we can see that a great deal of processing power is needed to implement the two topologies. The totals given are 18.4 · 10^9 MACC operations for the detection network and 15.5 · 10^9 for the VGG-16 classification network for a single inference pass using naive convolution operations. Looking into ways to reduce the arithmetic complexity of the convolution operations (which dominate the operation count of the networks), one candidate is the method named after Winograd, used for example by Lu et al. [9] and Lavin et al. [10]. This method is based on breaking the inputs to a convolutional layer into tiles and using a minimal filtering algorithm first described by Winograd [11]. The basic theory is built on a minimal algorithm for computing the outputs of Finite Impulse Response (FIR) filters, where calculating m outputs of an r-tap FIR filter, referred to as F(m, r), requires m + r − 1 multiplications. Extending this to two dimensions as shown in [10], we end up with an algorithm where convolving a tile of 2x2 elements with a 3x3 kernel only requires 16 multiplications, or only about 44% of the 36 that the naive implementation would require. The algorithm can be summarized using some linear algebra as:

$$Y = A^T\left[(GkG^T) \circ (B^T d B)\right]A \qquad (3)$$

Since we only have 3x3 kernels we can simply use the form of this equation for 3x3 kernels and 2x2 tiles. For these sizes k is the 3x3 kernel, d is a 4x4 tile and the rest are:

$$B^T = \begin{bmatrix} 1 & 0 & -1 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & -1 & 1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix} \qquad (4)$$

$$G = \begin{bmatrix} 1 & 0 & 0 \\ \tfrac{1}{2} & \tfrac{1}{2} & \tfrac{1}{2} \\ \tfrac{1}{2} & -\tfrac{1}{2} & \tfrac{1}{2} \\ 0 & 0 & 1 \end{bmatrix} \qquad (5)$$

$$A^T = \begin{bmatrix} 1 & 1 & 1 & 0 \\ 0 & 1 & -1 & -1 \end{bmatrix} \qquad (6)$$

To our advantage, much of this algorithm can be pre-processed, namely all the parts that only involve the kernel. The kernels themselves only change if the network is retrained, which does not happen often, so this pre-processing only has to happen once, when the kernels are ready for use. It does however mean that more data needs to be streamed to the FPGA: 16 values for each kernel, rather than 9 for the naive convolution algorithm.

This means that the $GkG^T$ part of the algorithm is calculated offline, leaving the on-chip algorithm as $Y = A^T[k_w \circ (B^T d B)]A$, where $k_w$ is the pre-processed kernel.
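To make the dataflow concrete, the following C sketch computes one 2x2 output tile from a 4x4 input tile using the matrices above, assuming the kernel transform $U = GkG^T$ (the $k_w$ of the text) has already been computed offline. The function name and float datatype are illustrative assumptions, not the HLS implementation described later.

/* Winograd F(2x2, 3x3): d is a 4x4 input tile, u the 4x4 transformed kernel,
 * y the resulting 2x2 output tile. */
void winograd_tile_2x2_3x3(const float d[4][4], const float u[4][4], float y[2][2])
{
    float t[4][4], v[4][4], m[4][4], s[2][4];

    /* V = B^T d B, written out with the rows of B^T from Eq. (4). */
    for (int c = 0; c < 4; c++) {          /* t = B^T d (combine rows) */
        t[0][c] = d[0][c] - d[2][c];
        t[1][c] = d[1][c] + d[2][c];
        t[2][c] = d[2][c] - d[1][c];
        t[3][c] = d[1][c] - d[3][c];
    }
    for (int r = 0; r < 4; r++) {          /* v = t B (combine columns) */
        v[r][0] = t[r][0] - t[r][2];
        v[r][1] = t[r][1] + t[r][2];
        v[r][2] = t[r][2] - t[r][1];
        v[r][3] = t[r][1] - t[r][3];
    }

    /* Elementwise product with the transformed kernel: 16 multiplications per tile,
     * versus the 36 needed by the naive convolution for a 2x2 output. */
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            m[r][c] = u[r][c] * v[r][c];

    /* Y = A^T m A, with the rows of A^T from Eq. (6). */
    for (int c = 0; c < 4; c++) {
        s[0][c] = m[0][c] + m[1][c] + m[2][c];
        s[1][c] = m[1][c] - m[2][c] - m[3][c];
    }
    for (int r = 0; r < 2; r++) {
        y[r][0] = s[r][0] + s[r][1] + s[r][2];
        y[r][1] = s[r][1] - s[r][2] - s[r][3];
    }
}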

2.2.6 Datatypes

The standard method of calculating convolutions using the Tensorflow framework uses a 32-bit floating point format for storing all values. This is a reasonable choice for an application that runs on a large general purpose computer, but an FPGA implementation calls for a more optimized approach.


In practice we can define three distinct domains of the application that strongly affect the choice of datatypes: the memory transfer domain, the processing domain and the storage domain. The memory transfer and storage domains are mostly concerned with the size of the data to be transferred/stored, while the processing domain is more concerned with the complexity of arithmetic operations. Since the bulk of the data to be transferred will be the images and convolution kernels, it is vital that the datatypes representing that data are kept as small as possible.

Looking at the image data it is clear that the image only consists of regular Red, Green and Blue (RGB) pixel data, each of which is represented by 8-bit values. This means that the 8-bit char datatype is minimal for representing this data with full precision. As for the kernel data, we need to do some analysis on the input data range and decide on an acceptable range and precision for this data.

For the processing part of the application, the main thing to keep in mind is the arithmetic complexity that the datatypes imply. For fractional numbers the choice is usually between floating and fixed point representations, where floating point usually has a wider range and more precision, while fixed point has much simpler arithmetic. These advantages of fixed point numbers often allow for up to a 50% reduction in area and power, according to a white paper by Xilinx [12]. Since further analysis is needed on the input kernel data to be able to properly determine the needed precision, this project will utilize 32-bit fixed point values for all processing values as well as the input kernel values. The size of at least some values (such as kernel values or the matrix data itself) could probably be reduced, but that is left for future consideration. The choice of fixed point values is only made for the sake of reducing arithmetic complexity, rather than size, since a 32-bit fixed point value takes up the same amount of memory as a 32-bit single precision floating point value.
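The idea of 32-bit fixed point arithmetic can be sketched in plain C as below. The Q16.16 split (16 integer bits, 16 fractional bits) is an assumption for illustration, as the text does not fix the exact integer/fraction split; in the Vivado HLS environment the same idea would typically be expressed with an ap_fixed<> type instead.

#include <stdint.h>

typedef int32_t fix32;          /* stored value = real value * 2^16 */
#define FIX_FRAC_BITS 16

static inline fix32 fix_from_float(float f) { return (fix32)(f * (1 << FIX_FRAC_BITS)); }
static inline float fix_to_float(fix32 a)   { return (float)a / (1 << FIX_FRAC_BITS); }

/* Multiply-accumulate: the 64-bit intermediate keeps the full product before the
 * shift back down, which maps onto a hardware multiplier plus a shift on the FPGA. */
static inline fix32 fix_mac(fix32 acc, fix32 a, fix32 b)
{
    return acc + (fix32)(((int64_t)a * (int64_t)b) >> FIX_FRAC_BITS);
}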

2.2.7 Compression

A very common problem with implementing CNNs in FPGAs is the amount of memory needed to store intermediate results. Looking at Tables 4 and 5 we can see that the VGG-16 implementation requires at least 12.25 MB of memory for caching between layers, and the detection network requires at least 13.781 MB.

The platform used for this project (see Section 3.1) contains 1510 on-chip Block RAM (BRAM) cells capable of 18kb storage each, giving a total of around 26.5 Mb storage, or 3.3125 MB. This is several times lower than the requirements of the networks to be implemented.

The most obvious way to deal with this limitation is utilizing caching strategies, combined with the fact that the platform also has 1GB of Double Data Rate (Three) (DDR3) Random Access Memory (RAM). While this is in fact the method used in this project, there are other


Table 4: VGG-16 Memory requirements, assuming 4-byte values

Layer Type Dimension out Req. memory size
Input Image 224x224x3 588 kB
Conv3-64 1 Convolution/ReLu 224x224x64 12.25 MB
Conv3-64 2 Convolution/ReLu 224x224x64 12.25 MB
Pool 1 Max pool 112x112x64 3.06 MB
Conv3-128 1 Convolution/ReLu 112x112x128 6.125 MB
Conv3-128 2 Convolution/ReLu 112x112x128 6.125 MB
Pool 2 Max pool 56x56x128 1.53 MB
Conv3-256 1 Convolution/ReLu 56x56x256 3.06 MB
Conv3-256 2 Convolution/ReLu 56x56x256 3.06 MB
Conv3-256 3 Convolution/ReLu 56x56x256 3.06 MB
Pool 3 Max pool 28x28x256 784 kB
Conv3-512 1 Convolution/ReLu 28x28x512 1.53 MB
Conv3-512 2 Convolution/ReLu 28x28x512 1.53 MB
Conv3-512 3 Convolution/ReLu 28x28x512 1.53 MB
Pool 4 Max pool 14x14x512 392 kB
Conv3-512 4 Convolution/ReLu 14x14x512 392 kB
Conv3-512 5 Convolution/ReLu 14x14x512 392 kB
Conv3-512 6 Convolution/ReLu 14x14x512 392 kB
Pool 5 Max pool 7x7x512 98 kB
FC-4096 1 Fully connected 4096 16 kB
FC-4096 2 Fully connected 4096 16 kB
FC-12 Fully connected 12 48 B


Table 5: Detection network memory requirements, assuming 4-byte values

Layer Type Dimension out Req. memory size
Input Image 336x336x3 1.292 MB
Conv3-16 1-1 Convolution/ReLu 336x336x16 6.891 MB
Conv3-16 1-2 Convolution/ReLu 336x336x16 6.891 MB
Conv3-16 1-3 Convolution/ReLu 336x336x16 6.891 MB
Pool 1-1 Max pool 168x168x16 1.723 MB
Conv3-32 1-1 Convolution/ReLu 168x168x32 3.445 MB
Conv3-32 1-2 Convolution/ReLu 168x168x32 3.445 MB
Conv3-32 1-3 Convolution/ReLu 168x168x32 3.445 MB
Pool 1-2 Max pool 162x162x32 3.204 MB
Conv3-64 1-1 Convolution/ReLu 162x162x64 6.407 MB
Conv3-64 1-2 Convolution/ReLu 162x162x64 6.407 MB
Conv3-64 1-3 Convolution/ReLu 162x162x64 6.407 MB
Pool 1-3 Avg pool 157x157x64 6.018 MB
Conv3-2 1-4 Convolution/ReLu 157x157x2 192.57 kB
Conv3-32 2-1 Convolution/ReLu 336x336x32 13.781 MB
Conv3-32 2-2 Convolution/ReLu 336x336x32 13.781 MB
Conv3-32 2-3 Convolution/ReLu 336x336x32 13.781 MB
Pool 2-1 Max pool 168x168x32 3.445 MB
Conv3-64 2-1 Convolution/ReLu 168x168x64 6.891 MB
Conv3-64 2-2 Convolution/ReLu 168x168x64 6.891 MB
Conv3-64 2-3 Convolution/ReLu 168x168x64 6.891 MB
Pool 2-2 Max pool 162x162x64 6.407 MB
Conv3-128 2-1 Convolution/ReLu 162x162x128 12.814 MB
Conv3-128 2-2 Convolution/ReLu 162x162x128 12.814 MB
Conv3-128 2-3 Convolution/ReLu 162x162x128 12.814 MB
Pool 2-3 Avg pool 157x157x128 12.036 MB
Conv3-2 2-4 Convolution/ReLu 157x157x2 192.57 kB
Mul Multiplication 157x157x2 192.57 kB
Conv3-8 3-1 Convolution/ReLu 157x157x8 770.28 kB
Pool 3-1 Avg pool 28x28x8 24.5 kB
Conv3-4 3-2 Convolution/ReLu 28x28x4 12.25 kB
Pool 3-2 Avg pool 3x3x4 144 B
Conv3-2 3-3 Convolution/ReLu 3x3x2 72 B


2.2.8 Memory transfers

As seen in Tables 4 and 5, the memory requirements of most of the different stages of the two networks are larger than the available memory on-chip. This means that we must offload most of the memory to the on-board DDR3 storage and stream the values between the FPGA and the DDR3 memory as needed.

This can get quite costly due to the fact that the convolution operation requires scanning through the entire input space several times, or once for each output channel. The most popular way of dealing with this in similar implementations is by using line-buffering, which involves loading the input image one line at a time as needed, leading to at least a semi-balanced process where a line of data is loaded into memory, and while that line is being processed the next line of input data gets loaded simultaneously. This means that however fast the actual processing of data or the loading of memory can get, the overall process will never be faster than the slower of the two.
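The line-buffering idea can be sketched in C as below: while one row is being processed, the next row is fetched from DDR3 into the other half of a small double buffer. The row length, function names and the DDR access helpers are placeholders for illustration, not the interfaces used in this project; written sequentially here, the fetch and the processing would actually be made to overlap in the HLS implementation (for example with pipelining).

#define ROW_WORDS 1024   /* illustrative row length */

static void ddr_read_row(int y, float *dst)
{
    (void)y; (void)dst;   /* placeholder for a DMA fetch of one image row from DDR3 */
}

static void process_row(const float *row, int y)
{
    (void)row; (void)y;   /* placeholder for the per-row convolution work */
}

void stream_rows(int height)
{
    static float line_buf[2][ROW_WORDS];    /* double buffer held in on-chip BRAM */
    ddr_read_row(0, line_buf[0]);            /* prefetch the first row */

    for (int y = 0; y < height; y++) {
        if (y + 1 < height)
            ddr_read_row(y + 1, line_buf[(y + 1) & 1]);  /* fetch the next row... */
        process_row(line_buf[y & 1], y);                 /* ...while this row is processed */
    }
}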


3 Platform and Tools

Here the choice of platform and development tools will be reviewed.

3.1 Platforms

The choice between FPGA vendors is not very wide and for a project of this size the selection really comes down to either Intel/Altera or Xilinx. Since the project requires functionality traditionally taken care of by generic operating systems, choosing a SoC solution comes naturally. Both vendors have SoC solutions with a hard ARM processor system on the same die as the FPGA itself, able to run Linux distributions. The choices available for FPGA SoC systems from Intel/Altera and Xilinx can be seen in Table 6.

Table 6: FPGA SoC platforms

Vendor Product line Max LE Max mult. Example price (a) Price per kLE
Altera Cyclone V 110k 224 €124.28 €1.1662
Altera Arria V 462k 2180 €1900.33 €4.1133
Altera Arria 10 660k 3376 €2268.94 €3.4378
Altera Stratix 10 5510k (b) 3960 €14138.44 €5.0494
Xilinx Zynq-7000 Artix 85k 220 €98.46 €1.1584
Xilinx Zynq-7000 Kintex 444k 2020 €1927.38 €4.3409

(a) Prices sourced from Mouser for Altera and from Farnell for Xilinx. Lowest single item price with max LE count. Current as of 14.4.2018.
(b) Price given for the 2800 kLE unit.

These numbers show that the price per Logic Element (LE) is quite comparable between the two manufacturers. Unfortunately the lower-end product lines do not meet the requirements of this project, and the higher-performance Altera product lines (Arria 10 and Stratix 10) have no development kits within the budget of the project. Thus, the selection was narrowed to the Arria V and Zynq-7000 Kintex platforms and, due to the development kit prices, the Zynq-7000 was chosen.

The selected development kit was the AES-MINI-ITX-7Z100-BAS-G [15] which includes the Xilinx Zynq-7000 Z-7100 (Part number XC7Z100 [16]) on a MiniITX form factor Printed Circuit Board (PCB) with 2GB DDR3 RAM (1GB on the Processor System (PS) side and 1GB on the programmable logic side).


3.2 HLS

High Level Synthesis tools take an algorithm described in C/C++ and convert it into an RTL description (in either VHDL or Verilog), that can then be synthesized to hardware structures (RAM, Look-up Tables (LUTs) and so on for an FPGA) using the same logic synthesizer as in the regular RTL design flow. Extending the comparison between the HLS tool and a C/C++ compiler, the logic synthesizer can be considered an analogue to the assembler which converts the assembly code to machine code.

An important distinction for the HLS tool, though, is that C/C++ code inherently describes a sequential algorithm, but one of the main strengths of the FPGA is its ability to run logic in parallel, often on a massive scale. The algorithmic analysis that the tool performs on the C/C++ input code cannot definitively determine the extent of parallelism that is achievable or desired by the designer. For this reason the tool provides several methods for indicating the desired structure of the synthesized logic.

3.2.1 Program flow control

The most common method for controlling the resulting logic in this project is the use of various #pragma statements, which can be used to control pipelining of logic, loop unrolling, array partitioning, and more. These are used extensively in this project, and to demonstrate their application a few examples follow.

3.2.1.1 Loop unrolling The default interpretation of a loop in an HLS environment is to leave it rolled, meaning that the sequential structure of the C/C++ program is adhered to.

This is a rather resource-light structure, but also very slow. Usually, an RTL designer will want the loop to be unrolled at least to some degree, to take advantage of parallelism, but this requires that each pass of the loop has no dependency on the pass before it.

void calculate(int x[20], int y[20], int z[20]) {
    for (int i = 0; i < 20; i++) {
#pragma HLS UNROLL
        z[i] = x[i] * y[i];
    }
}

Considering this snippet of C code, we can see that it is meant simply to iterate over two arrays and place the product at each index into a third array. The synthesis tool has no knowledge of the nature of the data and the default behaviour is to leave loops rolled. The designer, however, knows that there are plenty of free resources to unroll this loop, the order of loop passes is not important, and there is no dependency between loop passes, meaning that there is nothing to stop a full unroll of this loop. The only bottleneck will then be the accesses to the RAM storage of the input parameters. As shown in Table 7, unrolling the loop decreases the latency by almost a factor of 7, but roughly doubles the resource usage in the worst case, that of the Digital Signal Processing (DSP) blocks. An example of Verilog code that might be generated without the Unroll pragma is shown below:


reg [31:0] x [0:19];
reg [31:0] y [0:19];
reg [31:0] z [0:19];
integer idx = 0;

always @ (posedge clk) begin: COUNTER
    if (idx < 20) begin
        z[idx] <= x[idx] * y[idx];
        idx <= idx + 1;
    end
end

And with the pragma:

reg [31:0] x [0:19];
reg [31:0] y [0:19];
reg [31:0] z [0:19];
integer idx = 0;

always @ (posedge clk) begin: COUNTER
    for (idx = 0; idx < 20; idx = idx + 1) begin
        z[idx] <= x[idx] * y[idx];
    end
end


A simplified Finite State Machine (FSM) diagram of the rolled loop code is shown in Figure 10, and of the unrolled code in Figure 11.

Latency [ccs] BRAM DSP48E FF LUT

Rolled 81 0 3 0 47

Unrolled 12 0 6 0 42

Table 7: Loop unrolling comparison

This method for unrolling loops is mostly used for parallelizing the pooling functions as well as the core operation in the naive convolution operation, where the sums for a single kernel multiplication are done at once.

Figure 10: FSM with rolled loop logic. Note that resets are ignored, but should lead back to C0. The states are:

• C0: ap_idle = 1. Indicates that the block has not started, waiting for input.
• C1: ap_ready = 1 (if mem_addr == 20); x_ce = 1 (enable X memory); y_ce = 1 (enable Y memory). Start of processing, starts buffering X and Y values.
• C2: Second cycle of loading X and Y values into the buffer.
• C3: Calculating the product.
• C4: z_ce = 1 (enable Z memory); z_we = 1 (write enable of Z memory). Output of the calculated product, increment memory address.

Figure 11: FSM with unrolled loop logic. Note that resets are ignored, but should lead back to C0. The states are:

• C0: ap_idle = 1. Indicates that the block has not started, waiting for input.
• Cn<11: All states except C0, C11, and C12 will increment two addresses in the X and Y memories (both are two-port memories), cache the values read from those memories, and initiate a product calculation. All memory indices are constant for their respective states.
• Cn>2: All states except C0, C1, and C2 will increment two addresses in the Z memory (a two-port memory) and insert the most recently calculated product. All memory indices are constant for their respective states.

3.2.1.2 Array partition The default behaviour when defining arrays in HLS is to interpret them as a single BRAM unit (or a LUT when appropriate, as in the preceding code snippet).

This behaviour can be as intended, especially when working with rolled loops, which will access a single element in the array at a time and in sequence. When working with unrolled loops however, the limited number of accesses possible within a clock cycle may become a bottleneck. This is the case in the code snippet we looked at earlier, which can be optimized by splitting the array to be more compatible with the unrolled loop:

void calculate(int x[20], int y[20], int z[20]) {
#pragma HLS ARRAY_PARTITION variable=x complete dim=1
#pragma HLS ARRAY_PARTITION variable=y complete dim=1
#pragma HLS ARRAY_PARTITION variable=z complete dim=1
    for (int i = 0; i < 20; i++) {
#pragma HLS UNROLL
        z[i] = x[i] * y[i];
    }
}


This means that there is no specific latency figure for the module, and no FSM is generated, but the resources will increase much more than in the previous case, or by about a factor of 10.

Latency [ccs] BRAM DSP48E FF LUT

Unpartitioned 12 0 6 0 42

Partitioned 0 0 60 0 420

Table 8: Array partitioning comparison

This can also be used to segment specific dimensions of multidimensional arrays, which is very useful in the context of this project, where it is used for all buffers, fitting the partitioned arrays to the loop geometry. A closely related directive is #pragma HLS ARRAY_RESHAPE, which allows reordering the physical layout of the elements in an array, to better fit them to the loop geometry.

3.2.1.3 Pipeline An extremely important concept in digital design is that of pipelining. This allows for parallelizing long logic paths that act on a sequence of data, by minimizing the initiation interval of that path. Also of interest is the Dataflow directive, a Vivado-specific pipelining method which works on a larger scale than the standard pipeline directive and aims to optimize the flow of data (so that separate data flows can run in parallel) rather than just simple loops. This is not illustrated here, nor is it used to any extent in the project, but might possibly be of interest for further development.

void calculate(int x[20], int y[20], int z[20]) {
    for (int i = 0; i < 20; i++) {
#pragma HLS PIPELINE
        z[i] = x[i] * y[i];
    }
}

This can be illustrated in our example as a sort of compromise between the fully unrolled loop and the rolled one. It will, as shown in Table 9, deliver a design that exhibits a latency in between the two options, but with a more conservative use of DSP blocks. A simple FSM is generated, as shown in Figure 12, but this is mostly to contain the actual pipeline. The pipeline generated here is much like the rolled loop shown in Figure 10, but on a conceptual level can be considered as having four copies running in parallel to saturate the memory accesses of the 2-port BRAM blocks.

Latency [ccs] BRAM DSP48E FF LUT

Non-pipelined 81 0 3 0 47

Pipelined 24 0 3 0 51

Table 9: Loop pipelining comparison

In this project, the pipelining directive is mostly used when a long logic path is found within a loop, such as in the Winograd implementation, where the core kernel operation has been pipelined.


Figure 12: FSM with pipelined logic. The states are:

• C0: ap_idle = 1. Indicates that the block has not started, waiting for input.
• C1: Indicates that the pipeline is currently running. This state is exited only after the pipeline is finished.
• C2: Pipeline has finished.

3.2.1.4 Inline The default behaviour of the synthesis tool is to treat individual C/C++ functions as independent modules in the RTL output. This may or may not be the intended result, as often it makes sense to split functionality in C/C++ into separate functions while intending that function to be an integral part of the resulting RTL module. Inlining will insert the code into the calling function, allowing for greater optimization and reducing interface overhead. However, a function that is used in many different modules will then be instantiated for each of those modules. This limits resource re-use that might be desirable if the inlined function is resource-heavy. The inline directive is used in this project mostly for specific segments of code that have been split out into their own functions, are not used in many places, and thus benefit from inlining.

3.2.2 Vivado HLS

The Xilinx Vivado HLS tool [17] is capable of generating an Intellectual Property (IP) block from C/C++/SystemC code, bundling that with a testing framework which can be used to verify the code before starting a time-consuming synthesis and implementation process. The testing framework offers two distinct methods for verifying code. The first is by linking it with a testing file, compiling it as regular C code and running that through testbench code written in C/C++. The second is as a C/RTL co-simulation, which generates testing waveforms from the testbench, applies them to an RTL simulation, and returns the response from that simulation back to the testbench. There are many further strategies and optimization methods available through Vivado HLS, all of which are explained in the Vivado Design Suite User Guide, High-Level Synthesis.


4 System design

At the highest level, the system will be receiving commands to either identify a surgical instrument or detect any dirt that may be on it. Since the system will not have any meaningful user interface, these commands should either be received via a network interface or by a General-Purpose Input/Output (GPIO) signal. The system is intended for integration with an automatic system, so a network interface is presumably more relevant and will thus be implemented, rather than a physical GPIO interface.

After receiving this network message the system will have to take a picture via a camera which is connected via USB. Both USB and Ethernet interfaces are substantially easier to work with through a proper operating system, and the chosen platform supports running embedded Linux alongside the FPGA. Thus, the easiest and most straight-forward way to implement these features is by writing a small user-space application for Linux that handles both network and USB communications. This program can then also handle pre-processing the image before sending it to the FPGA, as well as control the streaming of data.

Figure 13: Software loop

Unfortunately, a user-space application cannot normally access physically contiguous memory locations, meaning it is unsuitable for streaming data over to the FPGA. This means that a kernel module device driver also needs to be implemented for handling any low-level communication with the FPGA.

Finally, the FPGA application should be able to process the neural networks in their entirety and stream any results back up to the device driver, and from there to the user-space application and finally up to the requesting entity over the network interface.

4.1 Software design

The system demands a fair bit of software design, which must take care of listening and responding to network messages as well as requesting and preprocessing images from the USB camera. This process must then be able to stream both the pre-processed image data and kernel data to the underlying FPGA application, and send the result back over the network. The software is written in C and runs on an embedded Linux system, defined and compiled using the Xilinx PetaLinux toolchain. An overview can be seen in Figure 13. Note that since the communication with the FPGA requires accessing absolute memory addresses, splitting the application between kernel- and user-space is required.


4.1.1 User-space application

The user-space application takes care of all higher-level operations. This includes communicating with the network and the USB camera device, as well as streaming data to and from the kernel-space device driver application.

The network communication is achieved using simple Transmission Control Protocol (TCP) socket communication, with valid messages from the client being either DETECT or CLASSIFY. If either message is received, the application will request an image from the USB camera and process it into a fixed-point binary bitmap file. It will then activate the device driver, telling it to prepare to run either the detection or the classification network, depending on the message received. Then the binary image data is streamed to the device driver, followed by the appropriate kernel data. When the network has finished, the application will stream the returned data into a binary file and then back over the network to the client, which will use the results as appropriate for its function in the wider application.
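The request loop of the user-space application could be sketched in C as below: listen on a TCP socket, read a command, and dispatch to the camera/driver pipeline. The port number, the run_network() helper and the message handling details are assumptions for illustration, not the project's actual code.

#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

static int run_network(const char *mode)
{
    (void)mode;   /* placeholder: grab a camera frame, drive the device driver, return results */
    return 0;
}

int serve(void)
{
    int srv = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = INADDR_ANY;
    addr.sin_port = htons(5000);                 /* assumed port number */
    if (bind(srv, (struct sockaddr *)&addr, sizeof(addr)) < 0 || listen(srv, 1) < 0)
        return -1;

    for (;;) {
        int cli = accept(srv, NULL, NULL);
        char cmd[16] = {0};
        read(cli, cmd, sizeof(cmd) - 1);
        if (strncmp(cmd, "DETECT", 6) == 0)          /* valid client messages per the text */
            run_network("detect");
        else if (strncmp(cmd, "CLASSIFY", 8) == 0)
            run_network("classify");
        close(cli);
    }
}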


Figure 14: Kernel module loop

4.1.2 Kernel-space application

The kernel-space application is responsible for managing and communicating with the FPGA device. The kernel module is set up as a basic character device driver and can be written to/read from like a normal file on the local file system. It also accepts a number of ioctl messages (a simple mechanism for general purpose communication between user- and kernel-space) for controlling the flow of the program. The intended flow of the module starts with an ioctl message which signals which CNN should be activated on the peripheral. When this message is received, the driver starts the peripheral with the appropriate mode parameter as well as the Double Data Rate (DDR) offset allocated to the device. This is achieved with memory-mapped I/O over an Advanced eXtensible Interface 4 (AXI4)-Lite interface (a simple bus connecting the PS to the FPGA), using hardcoded memory addresses defined in Vivado during the FPGA block design.

After this, the module should receive a second ioctl message, signaling the start of image data. This will be either a 3x336x336 image for the detection network or 3x224x224 for classification network, and will be streamed in line by line. This is streamed directly to the peripheral via memory-mapped Direct Memory Access (DMA) access using an AXI4-Stream interface and hardcoded memory addresses defined in Vivado during FPGA block design.

After transferring all the image data the driver should receive the third ioctl message, which signals the start of kernel data. This data is streamed through the device driver in the same manner as the image data, but the peripheral will only accept kernel data when needed, to save memory space. Due to this, the system may hang while the kernel data is being used, essentially for the duration of the processing in the peripheral. When the peripheral has used up all the data offered by the driver, the driver will request data of the intended output size from the peripheral. When this data becomes available the device driver will save it to its internal buffers. These buffers may be accessed via ioctl, but while processing has not finished they will not return any data, allowing the user-space application to poll until the data from the peripheral is ready. At this point the data will be streamed from the driver to the application, after which the device should be closed. This process is shown in Figure 14.
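From the user-space side, the ioctl sequence just described could look roughly like the following C sketch. The device node path, the ioctl command numbers and the polling scheme are assumptions for illustration; the real values are defined by the kernel module and are not given in the text.

#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>
#include <sys/ioctl.h>

#define CNN_IOC_SET_MODE 0x1000  /* assumed command number: argument 0 = detect, 1 = classify */
#define CNN_IOC_IMAGE    0x1001  /* assumed: image data follows on write() */
#define CNN_IOC_KERNELS  0x1002  /* assumed: kernel data follows on write() */
#define CNN_IOC_RESULT   0x1003  /* assumed: results may now be read() */

int run_detection(const void *img, size_t img_len,
                  const void *kern, size_t kern_len,
                  void *result, size_t res_len)
{
    int fd = open("/dev/cnn_accel", O_RDWR);    /* assumed device node name */
    if (fd < 0)
        return -1;

    ioctl(fd, CNN_IOC_SET_MODE, 0);             /* start the detection network */
    ioctl(fd, CNN_IOC_IMAGE);
    write(fd, img, img_len);                    /* streamed line by line in the real flow */
    ioctl(fd, CNN_IOC_KERNELS);
    write(fd, kern, kern_len);                  /* the driver hands kernels to the peripheral as needed */
    ioctl(fd, CNN_IOC_RESULT);
    while (read(fd, result, res_len) == 0)      /* poll until the peripheral has produced results */
        usleep(1000);

    close(fd);
    return 0;
}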

4.2 HLS design

Much of the design of this project lies in the FPGA peripheral itself, which is intended to run the pre-trained neural networks in an accurate and timely manner. The design can be split into a few blocks, each of which takes care of a specific function in the processing of a neural network. These are the different layer types of the networks in the application, as well as some higher-level blocks for control.

4.2.1 Top-level block

The top-level block takes care of selecting the correct series of CNN layer operations from Read-Only Memory (ROM) and executing them sequentially on a section of the DDR3 RAM.

It has 4 ports: 2 AXI4-Stream ports, 1 AXI4-Lite port, and 1 simple interrupt pin. The two

Table 10: OpCodes

Name            Type   Description                                         Valid values
Operation type  int32  Defines the appropriate action                      1 - 7
Channels_in     int32  The number of channels at the input                 1 - max(int32)
Channels_out    int32  The number of channels this operation will return   1 - max(int32)
Image_width     int32  The width of the input image                        1 - max(int32)
Image_height    int32  The height of the input image                       1 - max(int32)
Kernel_size     int32  The kernel size, assumes a square kernel            1 - max(int32)
Kernel_stride   int32  The stride of the kernel operation                  1 - max(int32)
Output_size     int32  The byte-size of the output of this operation       1 - max(int32)

The operation type parameter can take 7 different values, corresponding to the following operations:

Convolve   Execute a convolution operation on an image with the specified dimensions, using kernels of the specified dimensions.

MaxPool    Execute a max-pooling operation on an image with the specified dimensions. The kernel dimensions here correspond to the area of the max operation.

AvgPool    Execute an average-pooling operation on an image with the specified dimensions. The kernel dimensions here correspond to the area of the averaging operation.

Store      Stores the output of the last operation to DDR RAM and returns a pointer to the start. Indicates that the result will be used for further processing in the network.

Stream     Streams the results of the last operation through the AXI4-Stream interface. This signals the end of the program.

For a graphic explanation of this block, see Figure 15.
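To make the role of these parameters concrete, the following is a minimal sketch of how the top-level block could decode and dispatch one instruction. The struct layout follows Table 10, but the field names, the opcode numbering, the data type, and the layer function signatures are assumptions for illustration rather than the actual HLS source.

// Sketch of the top-level instruction dispatch (illustrative only; field names
// follow Table 10, opcode numbering and signatures are assumptions).
#include <cstdint>

using data_t = float;               // the real design uses a fixed-point type

struct OpCode {
    int32_t operation;              // 1-7, selects the action to perform
    int32_t channels_in;            // number of channels at the input
    int32_t channels_out;           // number of channels this operation returns
    int32_t image_width;            // width of the input image
    int32_t image_height;           // height of the input image
    int32_t kernel_size;            // kernel size (square kernel assumed)
    int32_t kernel_stride;          // stride of the kernel operation
    int32_t output_size;            // byte-size of the output of this operation
};

enum Operation : int32_t { CONVOLVE = 1, MAXPOOL, AVGPOOL, STORE, STREAM };

// Hypothetical layer blocks with placeholder bodies; the convolution and
// pooling blocks are sketched in the following subsections.
void convolve(const data_t*, data_t*, const OpCode&) {}
void max_pool(const data_t*, data_t*, const OpCode&) {}
void avg_pool(const data_t*, data_t*, const OpCode&) {}
void store_pointer(data_t*) {}
void stream_out(const data_t*, int32_t) {}

// One step of the sequential execution loop over the instruction ROM.
void execute(const OpCode& op, const data_t* ddr_in, data_t* ddr_out) {
    switch (op.operation) {
    case CONVOLVE: convolve(ddr_in, ddr_out, op);        break;
    case MAXPOOL:  max_pool(ddr_in, ddr_out, op);        break;
    case AVGPOOL:  avg_pool(ddr_in, ddr_out, op);        break;
    case STORE:    store_pointer(ddr_out);               break; // cache branch output
    case STREAM:   stream_out(ddr_out, op.output_size);  break; // end of program
    }
}

In the actual peripheral, each opcode read from ROM selects one of these operations, with the pointers referring to offsets in the DDR3 region allocated to the device.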

Figure 15: System overview

4.2.2 Convolution block

inside the buffer.

4.2.3 Pooling blocks

The pooling blocks are relatively simple compared to the convolution block. There are two pooling functions implemented in the application: MaxPool and AvgPool.

The MaxPool block is intended to tile the image with pre-specified sizes and strides, choosing the maximum value in each tile and passing that to the output image. The AvgPool is a similar block, but instead of passing the tile’s maximum value to the output image, it will pass the average of all values within the tile.
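A straightforward, non-optimized sketch of the two pooling operations is given below; the function signatures and the floating-point data type are simplifications made for illustration, whereas the actual HLS blocks use fixed-point data, stream interfaces, and pipelining pragmas.

// Naive sketch of the MaxPool and AvgPool blocks (illustrative only).
#include <algorithm>
#include <limits>

using data_t = float;               // the real design uses a fixed-point type

// Pools one channel of width x height input with a size x size window and the
// given stride, writing the tiled result to out (row-major).
void max_pool(const data_t* in, data_t* out, int width, int height,
              int size, int stride) {
    int out_w = (width - size) / stride + 1;
    int out_h = (height - size) / stride + 1;
    for (int oy = 0; oy < out_h; ++oy)
        for (int ox = 0; ox < out_w; ++ox) {
            data_t best = std::numeric_limits<data_t>::lowest();
            for (int ky = 0; ky < size; ++ky)
                for (int kx = 0; kx < size; ++kx)
                    best = std::max(best,
                        in[(oy * stride + ky) * width + ox * stride + kx]);
            out[oy * out_w + ox] = best;          // maximum value of the tile
        }
}

void avg_pool(const data_t* in, data_t* out, int width, int height,
              int size, int stride) {
    int out_w = (width - size) / stride + 1;
    int out_h = (height - size) / stride + 1;
    for (int oy = 0; oy < out_h; ++oy)
        for (int ox = 0; ox < out_w; ++ox) {
            data_t sum = 0;
            for (int ky = 0; ky < size; ++ky)
                for (int kx = 0; kx < size; ++kx)
                    sum += in[(oy * stride + ky) * width + ox * stride + kx];
            out[oy * out_w + ox] = sum / (size * size); // average of the tile
        }
}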

4.2.4 Other blocks

Other blocks that need to be considered here are the Elementwise multiplication, FC and SoftMax blocks. The SoftMax block would only be needed at the output of the detection network, and would only calculate one value from 4 input values before passing it up to the PS, meaning that this is a very light computation. It will therefore not be implemented in the FPGA, but rather be calculated on the Central Processing Unit (CPU) once all the values from the layer above have been streamed back. The FC blocks are part of the VGG network, which will not be implemented in this version of the system, and they are thus ignored in this design.

The elementwise multiplication block takes in three pointers to DDR RAM locations, two of which point to input images and the third to the intended output location. The block then iterates over the length of the input images (which should be equal), multiplies each element in the first input image with the corresponding element in the second input image, and writes the result to the corresponding element in the output image.
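A sketch of this block is shown below; the signature, the explicit length parameter, and the floating-point data type are assumptions for illustration (the actual block operates on fixed-point data in DDR RAM).

// Sketch of the elementwise multiplication block (illustrative only).
using data_t = float;               // the real design uses a fixed-point type

// in_a, in_b and out point to three equally sized regions in DDR RAM.
void elementwise_mul(const data_t* in_a, const data_t* in_b,
                     data_t* out, int length) {
    for (int i = 0; i < length; ++i)
        out[i] = in_a[i] * in_b[i]; // multiply corresponding elements
}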

There are two more defined operations, Store and Stream, that do not have any blocks associated with them. These operations are instead executed in the top-level block, where the Stream command simply streams every value of the last operation to the output AXI4-Stream interface, and the Store command returns the position of the output of the last operation, which can then be cached for further processing. This last command is only useful for the detection network, which requires two branches to be combined at a later point.

5 Implementation

The actual implemented system discussed in this section is composed of the HLS blocks of the detection network. The block design of the overall system and the Linux software will be detailed in Section 6, including its connection with the PS side of the fabric and the software implementation on the CPU side.

We will review each of the different processing elements in this section and detail the latency and resources required for each one, as used in the detection network. The latency is reported as the number of clock cycles needed from start to end for that particular block, as given by the HLS tool. Since everything will be synthesised with a 10 ns cycle time, a latency of 10^8 cycles corresponds to 1 second (10^8 cycles x 10 ns/cycle = 1 s). The resource usage in terms of BRAM, DSP blocks, flip-flops, and LUTs is detailed, and the total utilization percentage of the chosen platform, for implementing the detection network, will be shown as well. Scatter plots showing how the latency and resource usage change with the increasing size of the blocks will be drawn.

5.1 Convolution

The convolution section of the design can be considered to contain the bulk of all the processing in the system, and is thus the most important to have properly optimized.

We will be designing and reviewing two different implementations, one which will utilize the Winograd method mentioned in Section 2.2.5, and one which will use a naive implementation of the 2D convolution algorithm. Both will iterate over the output and input channels, convolving in a width-first manner, and ending with a bias and ReLU stage. Both use a linebuffer, with the Winograd implementation loading two lines at a time, while the naive implementation shifts in a single line at a time.
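As a reminder of the method from Section 2.2.5, the commonly used F(2x2, 3x3) form of the Winograd algorithm computes a 2x2 output tile Y from a 4x4 input tile d and a 3x3 kernel g. Whether these exact transform matrices match the implemented variant is an assumption made here for illustration; one standard formulation is:

\[
  Y = A^{T}\left[\,(G\,g\,G^{T}) \odot (B^{T}\,d\,B)\,\right] A
\]
\[
  B^{T} =
  \begin{bmatrix}
    1 & 0 & -1 & 0 \\
    0 & 1 &  1 & 0 \\
    0 & -1 & 1 & 0 \\
    0 & 1 &  0 & -1
  \end{bmatrix},
  \qquad
  G =
  \begin{bmatrix}
    1 & 0 & 0 \\
    \tfrac{1}{2} & \tfrac{1}{2} & \tfrac{1}{2} \\
    \tfrac{1}{2} & -\tfrac{1}{2} & \tfrac{1}{2} \\
    0 & 0 & 1
  \end{bmatrix},
  \qquad
  A^{T} =
  \begin{bmatrix}
    1 & 1 & 1 & 0 \\
    0 & 1 & -1 & -1
  \end{bmatrix}
\]

Here ⊙ denotes elementwise multiplication; each 2x2 output tile then requires 16 multiplications instead of the 36 needed by the direct method, which is consistent with the Winograd implementation loading two input lines at a time as described above.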

Both methods will be coded in C++ and will use templates to force different implementations according to the input line widths. This gives the HLS tool more to work with than a single generic implementation would, since it allows pipelining to a greater degree as well as specifying proper buffer sizes, rather than forcing all buffers to the maximum size used in the CNN.
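As an illustration of this template approach, a condensed sketch of the naive, width-first convolution over one channel pair is given below. The floating-point data type, the simplified line-buffer refresh, and the function signature are assumptions for illustration and not the exact HLS source, which additionally contains interface and pipelining pragmas.

// Condensed sketch of the naive, width-first convolution for one output
// channel (illustrative only; the real HLS code adds pragmas, stream
// interfaces, the line-buffer shift logic and a fixed-point data type).
#include <algorithm>

using data_t = float;

template <int WIDTH, int HEIGHT, int K>
void convolve_channel(const data_t* img,          // one input channel, row-major
                      const data_t kernel[K][K],
                      data_t bias,
                      data_t* out,                 // (WIDTH-K+1) x (HEIGHT-K+1)
                      bool last_input_channel) {
    data_t linebuf[K][WIDTH];                      // fixed size per instantiation
    for (int y = 0; y + K <= HEIGHT; ++y) {
        // Refresh the line buffer with the K rows needed for this output row
        // (the real design shifts one new line in per step instead).
        for (int ky = 0; ky < K; ++ky)
            for (int x = 0; x < WIDTH; ++x)
                linebuf[ky][x] = img[(y + ky) * WIDTH + x];

        for (int x = 0; x + K <= WIDTH; ++x) {     // width-first over the row
            data_t acc = 0;
            for (int ky = 0; ky < K; ++ky)
                for (int kx = 0; kx < K; ++kx)
                    acc += linebuf[ky][x + kx] * kernel[ky][kx];
            data_t& o = out[y * (WIDTH - K + 1) + x];
            o += acc;                              // accumulate over input channels
            if (last_input_channel)                // bias + ReLU after the last one
                o = std::max<data_t>(o + bias, 0);
        }
    }
}

Because WIDTH and K are template parameters, every instantiation presents the HLS tool with a fixed-size line buffer and fixed loop bounds, which is what enables the tighter pipelining and buffer sizing mentioned above.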

Note that the statistics gathered here will only pertain to the detection network, since the classification network belongs to a well-researched class of networks that has been implemented many times with good results, e.g. in [18], [19], and [20].

5.1.0.1 Winograd implementation The Winograd implementation differs from the naive

Figure 16: Winograd convolution element design

Table 11: Convolution HLS statistics, Winograd implementation

Name            Latency [ccs]  BRAM   DSP48E     FF      LUT
Conv3-16 1-1         15823957    23       66   7817     8976
Conv3-16 1-2         76565573    23       66   8361     9116
Conv3-16 1-3         76565573    23       66   8361     9116
Conv3-32 1-1         39504005    23       64   7964     9093
Conv3-32 1-2         78104197    23       64   7968     9097
Conv3-32 1-3         78104197    23       64   7968     9097
Conv3-64 1-1        145593861    23       64   7976     9147
Conv3-64 1-2        289506821    23       64   7980     9151
Conv3-64 1-3        289506821    23       64   7980     9151
Conv3-2 1-4           8431935    23       64   7964    10053
Branch 1           1097706940    92      258  32273    37417
Conv3-32 2-1         31647909    23       66   8377     9147
Conv3-32 2-2        302648965    23       66   8373     9170
Conv3-32 2-3        302648965    23       66   8373     9170
Conv3-64 2-1        156208389    23       64   7976     9147
Conv3-64 2-2        310609157    23       64   7980     9151
Conv3-64 2-3        310609157    23       64   7980     9151
Conv3-128 2-1       579013637    23       64   7988     9205
Conv3-128 2-2      1154665477    23       64   7992     9211
Conv3-128 2-3      1154665477    23       64   7992     9211
Conv3-2 2-4          16814527    23       64   7968    10059
Branch 2           4319531660    92      258  32313    37591
Conv3-8 3-1           1245181    23       64   7973     9430
Conv3-4 3-2             91061     1       64   8723     9093
Conv3-2 3-3               895     5       64   5676     8214
Branch 3              1337137    29      192  22372    26737
Total              5418575737   257      731  90139   100543
Utilization %               -  17.02    36.19  16.25    36.25

References
