
Institutionen för systemteknik
Department of Electrical Engineering

Master's thesis (Examensarbete)

Real-time stereoscopic object tracking on FPGA using neural networks

Thesis work carried out in Electronics Systems at the Institute of Technology, Linköping University

by

Fredrik Svensson and Lukas Vik

LiTH-ISY-EX--14/4789--SE

Linköping 2014

Department of Electrical Engineering
Linköpings tekniska högskola
Linköpings universitet


Supervisor: Joakim Alvbrant
ISY, Linköpings universitet

Emil Hjalmarson
AnaCatum Design AB

Examiner: J Jacob Wikner
ISY, Linköpings universitet


Division, Department: Electronics Systems, Department of Electrical Engineering, Linköpings universitet, SE-581 83 Linköping
Date: 2014-06-26
Language: English
Report category: Examensarbete (Master's thesis)
URL for electronic version: http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-110374
ISRN: LiTH-ISY-EX--14/4789--SE
Title (Swedish): Stereoskopisk objektigenkänning med neurala nätverk på FPGA
Title: Real-time stereoscopic object tracking on FPGA using neural networks
Authors: Fredrik Svensson and Lukas Vik


Abstract

Real-time tracking and object recognition is a large field with many possible applications. In this thesis we present a technical demo of a stereoscopic tracking system using artificial neural networks (ANN), and also an overview of the entire system and its core functions.

We have implemented a system capable of tracking an object in real time at 60 frames per second. Using stereo matching we can extract the object coordinates in each camera, and calculate a distance estimate from the cameras to the object.

The system is developed around the Xilinx ZC-706 evaluation board featuring a Zynq XC7Z045 SoC. Performance critical functions are implemented in the FPGA fabric. A dual-core ARM processor, integrated on the chip, is used for support and communication with an external PC. The system runs at moderate clock speeds to decrease power consumption and provide headroom for higher resolutions.

A toolbox has been developed for prototyping, and the aim has been to run the system with a one-push-button approach. The system can be taught to track any kind of object using an eight bit 32 × 16 pixel pattern generated by the user. The system is controlled over Ethernet from a regular workstation PC, which makes it very user-friendly.


Acknowledgments

First of all we would like to thank J Jacob Wikner, not only for being the examiner of this thesis work, but also for helping us get in contact with AnaCatum. We are also grateful for the many things he taught us in courses at the university. Our thanks also go out to Joakim Alvbrant, our supervisor at the university, for helping us find literature and for proofreading this report.

We are also thankful for the help and support of Emil Hjalmarson, our supervisor at AnaCatum. This whole project was his idea, and without the fruitful discussions and the many tips along the way, it would never have gone as well as it did.

The colleagues at AnaCatum were very helpful. Specifically we would like to thank Björn Ärleskog for helping us with hardware and video related issues. Claes Hallström helped us very much with neural network related questions as well as FPGA issues, and deserves a big thank you.

We would also like to thank Niklas Swartz at Licera AB. He helped us with soldering some circuit boards, and also provided some good advice.

Last but not least we would like to thank everyone at AnaCatum for making us feel welcome and making our time there very enjoyable.

Linköping, June 2014 Fredrik Svensson and Lukas Vik


Contents

List of Figures
List of Tables
Notation

1 Introduction
1.1 Background
1.2 Problem approach
1.3 Related work
1.4 Thesis outline

2 System hardware setup
2.1 Xilinx Zynq-XC7Z045
2.1.1 AXI4-Lite bus
2.1.2 Configurable logic block (CLB)
2.1.3 Slice
2.2 Aptina MT9V022
2.3 Analog Devices ADV7511
2.4 Camera PCB
2.5 Summary

3 Frame preprocessing
3.1 Bayer to RGB interpolation
3.1.1 Comparison of interpolation algorithms
3.1.2 Bilinear interpolation
3.2 Averaging
3.3 Summary

4 Neural networks
4.1 Neuron distance
4.2 Influence field and decision spaces
4.3 Neural network
4.4 Radial basis function
4.5 An artificial neuron
4.6 Training of ANN
4.7 Importance of good training data
4.8 Summary

5 Tracking Algorithm
5.1 Zoom function
5.2 Graphics overlay
5.3 Worst case scenario
5.4 Summary

6 Hardware aspects
6.1 Frame buffers
6.2 Number of neurons
6.3 Clock domains
6.4 Summary

7 Results
7.1 Software
7.1.1 Scripts
7.1.2 C program
7.2 Resource utilization

8 Discussion
8.1 Comparison to related work
8.2 Possible improvements
8.2.1 Better tools
8.2.2 Independent of the Xilinx/Zynq platform
8.2.3 Better camera setup
8.2.4 Move to a 6 bit word length
8.2.5 Move to higher resolutions
8.2.6 Combine with an adaptive filter
8.2.7 Add-ons

A Video setup
A.1 Camera timing
A.2 HDMI timing

B Calculating the stereoscopic depth
B.1 Error sensitivity

C Comments about peak signal-to-noise ratio (PSNR)

D Digital quantization, signal-to-noise ratio and effective number of bits


List of Figures

1.1 System overview.
2.1 Rough system overview.
2.2 AXI4 bus interface between PS and PL.
2.3 A configurable logic block with its two slices.
2.4 Signals between the image sensors and the Zynq.
2.5 Signals between the Zynq and the HDMI transmitter.
2.6 (a) Front and (b) back view of the motherboard and camera boards.
2.7 The motherboard with all components soldered in place. (a) Front view. (b) Back view.
2.8 The complete setup with motherboard and camera boards with lenses attached.
3.1 Illustration of a Bayer filter sensor array.
3.2 Kernels for calculating RGB values, when the Bayer data is centered on a blue pixel. The kernels correspond to eqs. (3.1) to (3.3).
3.3 Kernels for calculating RGB values, when the Bayer data is centered on a green pixel in a blue row. The kernels correspond to eqs. (3.4) to (3.6).
3.4 Kernels for calculating RGB values, when the Bayer data is centered on a green pixel in a red row. The kernels correspond to eqs. (3.7) to (3.9).
3.5 Kernels for calculating RGB values, when the Bayer data is centered on a red pixel. The kernels correspond to eqs. (3.10) to (3.12).
3.6 A comparison showcasing the distortion introduced by bilinear interpolation. The image itself depicts a detail of a fire extinguisher.
3.7 Kernels for averaging a Bayer image by a factor three.
4.1 Two dimensional decision space with two categories.
4.2 Neural network with a single layer.
4.3 Block diagram of an artificial neuron.
4.4 Decision space accuracy is a logarithmic function.
5.1 Illustration of the tracking technique.
5.2 Displays how the system searches in different zoom levels.
5.3 Illustration of the graphics overlay produced in hardware.
5.4 (a) The original BMP image used for generating the ROM content. (b) The data stored in the ROM.
6.1 Illustrating how the scan window is connected to the frame buffers.
6.2 Different clock domains in the system.
7.1 The figures showcase a face being tracked from different positions.
7.2 The hardware utilization dependent on the number of patterns.
8.1 Different lighting conditions: (a) a bright day, and (b) a gloomy day.
A.1 Timing for DOUT from the cameras.
A.2 LINE_VALID compared to FRAME_VALID from the camera.
A.3 Active frame of a video signal.
A.4 Horizontal timing line for the HDMI output.
A.5 Vertical timing line for the HDMI output.
B.1 A face detected within the viewing angle of both cameras.
B.2 The cone formed by the left camera's field of view.
B.3 The cone formed by the right camera's field of view.
C.1 Test images for evaluating the performance of image processing algorithms.


List of Tables

2.1 Performance metrics of the Xilinx XC7Z045 SoC.
3.1 Comparison between two Bayer to RGB interpolation methods.
7.1 Number of lines of code for different languages.
7.2 Hardware utilization on the Zynq-XC7Z045 when using 150 patterns.
A.1 Standardized timing signals for HDMI 640 × 480 @ 60 Hz.


Notation

List of acronyms

Acronym   Definition
ANN       Artificial Neural Network
ASIC      Application Specific Integrated Circuit
AXI       Advanced eXtensible Interface
CLB       Configurable Logic Block
DDR3      Double Data Rate (memory), third generation
DFF       D-type Flip-Flop
DS        Deliberately Sloppy
ENOB      Effective Number Of Bits
FPGA      Field-Programmable Gate Array
FPS       Frames Per Second
GPIO      General Purpose Input/Output
HDMI      High-Definition Multimedia Interface
I2C       Inter-Integrated Circuit (bus)
IP        Intellectual Property
LUT       Look-Up Table
PCB       Printed Circuit Board
PL        Programmable Logic
PLL       Phase-Locked Loop
PS        Processing System
PSNR      Peak Signal-to-Noise Ratio
RBF       Radial Basis Function
SAD       Sum of Absolute Difference
SDRAM     Synchronous Dynamic Random Access Memory
SNR       Signal-to-Noise Ratio


1 Introduction

This thesis presents the work done by Fredrik Svensson and Lukas Vik in their final-year project. It was conducted as part of the master's program (sv: Civilingenjör) in Applied Physics and Electrical Engineering at Linköping University. The work was carried out during six months in Spring 2014.

1.1 Background

Our project, Real-time stereoscopic object tracking on FPGA using neural networks, was conducted at AnaCatum Design AB, or AnaCatum for short, a leading provider of analog mixed-signal IP solutions. Their product portfolio has historically consisted of analog-to-digital converters (ADCs), digital-to-analog converters (DACs), phase-locked loops (PLLs) and analog front-ends (AFEs). Lately they have started to venture into the lands of artificial neural networks (ANNs). This thesis is a part of that venture.

Pattern recognition and tracking algorithms, which this topic is centered around, can be found in a variety of different systems: for instance, recognizing how customers are using on-line services, or detecting objects in an automotive active safety system. The implementation and techniques may differ between applications, but the core idea is the same; we want to build a system that has the ability to recognize and categorize data in a way similar to how humans can.

The aim is to implement a stand-alone facial tracking product using ANNs on an FPGA platform. Instead of looking for possible hazards in traffic as in the automotive application, we will look for a face. The applications share some general similarities, even though we aim for simpler functionality and use a different technique to find our patterns.


Figure 1.1: The idea of how the system shall work with its major parts. Camera figure by Design Contest/iconfinder.com, CC BY 3.0. Motherboard figure by FatCow Web Hosting/iconfinder.com, CC BY 3.0 US.


The product is intended to be a technical demo of what can be achieved using artificial neurons. It is supposed to be the sort of demo that makes a non-technical person interested and convinced that the technology works, without them having to know all the technical details. For this reason, video is chosen as a medium. Apart from tracking a face in real time, the system should estimate the distance to the person. This is achieved by having two cameras next to each other and doing stereo matching on the two video streams. The arrangement with two cameras is what we call a stereoscopic setup.

An overview of the system, with the most basic components, is shown in fig. 1.1.

1.2 Problem approach

To solve the problem, it was naturally divided into a set of subproblems. They are listed below, in the order we executed them.

• Establish a simple way of accessing the system memory and registers from a remote PC

• Create an interface to the HDMI transmitter

• Implement an I2C interface for configuring the HDMI transmitter and other peripherals

• Manufacture and solder a PCB for mounting the cameras

• Pre-process the camera stream to a suitable format for the neurons

• Implement an algorithm for object tracking and tools for system learning

• Collect a database of training data for the neurons

One could of course debate whether this order was the correct one, but it was the order that we used. For example, it would have been better to work with the camera PCBs a bit earlier in the project, but delivery and soldering got delayed due to circumstances beyond our control.

Looking at the list of problems one can see that some, if not most, are not directly related to ANNs. While the ANN is the core of the system, many other functions besides the ANN require time and attention in order to make it into a stand-alone product.

1.3 Related work

There is some work related to neural networks and tracking. For instance, Yang and Paindavoine [20] presented a paper with a comparison between different embedded systems. The work is basically identical to ours, except that they only use one camera and thus do not have the depth estimation aspect.

On the topic of stereo vision Ahlberg and Ekstrand [1] show that it is possible to calculate the depth of stereo images fast enough on a low cost FPGA.

1.4 Thesis outline

In the first chapters we cover some theory related to the subject and present the hardware platform. This leads into implementation and design chapters where the most interesting parts are discussed. Finally we have a short discussion about performance and possible improvements. The detailed outline of this master's thesis is as follows:

Chapter 2 describes the hardware we have at our disposal in the project. It also briefly explains how an FPGA works and what peripherals are available on our FPGA board.

Chapter 3 discusses briefly how digital CMOS camera sensors work, and the implications this has for our system. A few other data preprocessing steps that are needed for tracking are also discussed.

Chapter 4 presents background theory regarding neural networks. The presentation is somewhat concise, but references to more detailed work are given as well.

Chapter 5 is an overview of the tracking algorithm used. It also shows the overlays that are used to indicate on screen where a face has been found.


Chapter 6 discusses a few hardware limitations and other aspects, for example some bounds on how many neurons we have space for.

Chapter 7 presents the resulting system and its performance. It also gives a summary of the thesis.

Chapter 8 contains a discussion about the result and presents some future work that could be relevant.

Appendix A presents the HDMI format and introduces concepts such as blanking and synchronization. It is not vital to read this part unless you are specifically interested in the HDMI format.

Appendix B contains a derivation of a compact formula for calculating the stereoscopic depth in an image. Some analysis is also done regarding how a few common errors affect the result.

Appendix C contains a short discussion about peak signal-to-noise ratio (PSNR), which is a performance measure for image conversion.

Appendix D introduces the concepts of signal-to-noise ratio (SNR) and effective number of bits (ENOB).


2 System hardware setup

The system consists of a number of parts and components which it has been designed around. Figure 2.1 shows a block diagram of the system and its major parts. In this chapter we will present the different hardware components that we have available to us, and list some of their capabilities.

We have two MT9V022 1/3-inch wide VGA digital image sensors, which are capable of 752 × 480 pixels at 60 frames per second [3]. The two cameras are mounted on custom made camera boards and connected to the FPGA board. The MT9V022 sensor has a number of control registers that can be accessed over a serial bus interface, which means that frame size, exposure, gain and other parameters can be programmed by the user.

The main platform used is a Xilinx ZC-706 evaluation board with the Xilinx Zynq-XC7Z045 SoC at its core [19]. This board offers most modern connections, such as Ethernet, USB, HDMI and PCI-Express. The board also has 2 GB of DDR3 SDRAM installed that can be accessed by the Zynq-XC7Z045 SoC. These features make the board very versatile and useful for prototyping.

Figure 2.1: Rough system overview.

Table 2.1: Performance metrics of the Xilinx XC7Z045 SoC.

  Programmable logic cells    350k
  Lookup tables               218600
  Flip-flops                  437200
  Block RAM                   2180 kB
  DDR3 SDRAM                  2 GB

As the final block, we have the ADV7511 HDMI transmitter, which is capable of handling most standardized audio and video formats [2]. From the SoC, synchronization signals and RGB data from the cameras are sent to the transmitter. Like the image sensor, the ADV7511 is configured over I2C to set the correct operating modes.

2.1 Xilinx Zynq-XC7Z045

Aside from the usual programmable logic (PL), the XC7Z045 also has a dual-core ARM Cortex-A9 processor, referred to as the processing system (PS). This gives the user the option to implement certain parts of the system in software instead of hardware. In this application that meant that all the system control and the Ethernet communication with our workstations could be implemented in the PS. The ARM Cortex processors run a Linux distribution, which makes it easy for the user to access the system over Ethernet and control it remotely. In table 2.1 the basic performance metrics for the FPGA are shown.

2.1.1 AXI4-Lite bus

The AXI4 bus protocol is created by ARM and exists in three versions for specific purposes [16]. The protocol has separate read and write channels and thus supports full duplex. On each transaction it can use a burst functionality, which means that the user can transmit up to 256 words in one address request. The protocol supports word lengths from 2^3 to 2^10 bits.

Since the PS is mainly used for setup and control, the communication between PS and PL is not considered to be performance critical. This means that the standard AXI4 interface may be unnecessary since we are not pushing large amounts of data between PS and PL. Compared to AXI4, AXI4-Lite has some limitations:

• No burst mode available.

• Fixed data bus width, 32 or 64 bits.

The restrictions in AXI4-Lite compared to AXI4 do not pose any problem in this application, since the PS is mostly used for setup and configuration. The AXI4-Lite interface can provide sufficient throughput and minimizes the amount of hardware required.


Figure 2.2: AXI4 bus interface between PS and PL.

Figure 2.3: A configurable logic block with its two slices.


On the AXI4 bus, both the PS and the PL can act as master and request a read or write, as in fig. 2.2. An example of this is when the user wants to take a photo and send it to the PC. The PS sets the system into photo mode, and then the PL acts as master when the data is transmitted over the AXI4 bus to the external DDR3 memory.
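As a minimal sketch of what this register interface looks like from the software side (not the project's actual code, which is a C program on the PS), the Python snippet below maps an AXI4-Lite slave into user space on Linux and reads and writes 32-bit registers. The base address and the register offsets are hypothetical placeholders; the real addresses come from the address map of the FPGA design.

# Minimal sketch: access a memory-mapped AXI4-Lite slave from Linux on the PS.
# AXI_BASE and the register offsets are hypothetical.
import mmap
import os
import struct

AXI_BASE = 0x43C00000   # hypothetical base address of an AXI4-Lite slave
SPAN = 0x1000           # map one 4 KiB page of register space

fd = os.open("/dev/mem", os.O_RDWR | os.O_SYNC)
regs = mmap.mmap(fd, SPAN, offset=AXI_BASE)

def write_reg(offset, value):
    """Write one 32-bit word to a register in the mapped window."""
    struct.pack_into("<I", regs, offset, value & 0xFFFFFFFF)

def read_reg(offset):
    """Read one 32-bit word from a register in the mapped window."""
    return struct.unpack_from("<I", regs, offset)[0]

# Example: set a (made-up) control bit and poll a (made-up) status register.
write_reg(0x00, 0x1)
print(hex(read_reg(0x04)))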

2.1.2 Configurable logic block (CLB)

A CLB, or configurable logic block, is the core of an FPGA and contains two slices. The slices are not connected to each other and work independently as can be seen in fig. 2.3. Advanced arithmetic functions are obtained by stacking CLBs together. The switch matrix shown in fig. 2.3 governs how CLBs are connected and how logic is routed on the chip.


2.1.3 Slice

A slice is a group of hardware resources that in its standard configuration consists of the following components.

• Four logic function generators

• Eight storage elements

• Carry logic

These three basic components are explored in the paragraphs below. The Zynq FPGA also features slices specialized for storing data using distributed RAM and performing shift operations [17]. The specialized slices basically have some of the resources discussed below left out, in favor of memory elements or arithmetic elements.

Each slice contains four lookup tables that are used as logic function generators. A LUT has six independent inputs, which means that it works as a 2^6-bit ROM and can hold any arbitrary six-input boolean function. In the case that we have two different five-input boolean functions that share some inputs, the LUT can be configured as two 2^5-bit ROMs. If two boolean functions contain less than three inputs each, it can also be configured as two separate ROMs.
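To make the "2^6-bit ROM" view of a LUT concrete, the sketch below models a 6-input LUT as a 64-entry truth table. It is only an illustration of the concept, not how the synthesis tools actually program the device.

# A 6-input LUT modeled as a 64-entry truth table (a 2^6-bit ROM).
def make_lut6(func):
    """Tabulate an arbitrary 6-input boolean function into 64 stored bits."""
    return [func(*[(i >> b) & 1 for b in range(6)]) & 1 for i in range(64)]

def lut6_read(table, a, b, c, d, e, f):
    """Evaluate the LUT by using the six inputs as the ROM address."""
    addr = a | (b << 1) | (c << 2) | (d << 3) | (e << 4) | (f << 5)
    return table[addr]

# Example: a 6-input majority function stored in one LUT.
majority = make_lut6(lambda *bits: int(sum(bits) >= 3))
print(lut6_read(majority, 1, 1, 0, 1, 0, 0))  # prints 1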

Each slice also holds eight storage elements; four are used as DFFs and four are configurable as either DFFs or latches. Control signals such as Clock, Clock-Enable, Set/Reset and Write-Enable are common to all storage elements in the same slice.

For fast arithmetic operations such as addition and subtraction, each slice provides two carry logic chains. The width of each chain is four bits, and slices can be combined to perform a wider operation by cascading slices on top of each other.

In a slice, the function generators (LUTs) can also be used to implement multiplexers. Since each slice has four LUTs available, and each LUT can hold a six-input boolean function, the following multiplexers can be implemented in one slice:

• Four 4:1 multiplexers, each using one LUT.

• Two 8:1 multiplexers, each using two LUTs.

• One 16:1 multiplexer, using four LUTs.

It is also possible to create larger multiplexers using more than one slice.

2.2 Aptina MT9V022

As briefly discussed in the beginning of this chapter, the image sensors can provide a resolution of 752 × 480 pixels at the most. With this resolution the sensors need to run at 26.6 MHz in order to reach the desired frame rate of 60 FPS [3]. In fig. 2.4 one can see the signals between the image sensors and the Zynq.


Figure 2.4: Signals between the image sensors and the Zynq.

The image sensor provides a new pixel on each rising edge of the pixel clock, which is an inverted version of SYS_CLK_REF. The sensor also generates the horizontal and vertical synchronization signals. In practice this means that the sensor acts as a master and will run continuously until the sensor is put in standby mode by the user.

The camera sensors give their output in a ten bit resolution. But since the number of I/Os is limited on our SoC, we have to truncate the output and only make use of eight bits. This will obviously deteriorate the image quality, but for our application it will be sufficient.

2.3 Analog Devices ADV7511

The ADV7511 by Analog Devices handles the HDMI output from the ZC-706 evaluation board. This component supports both HDMI 1.4 and high-resolution digital audio in various formats. Since the Aptina MT9V022 provides 752 × 480 active pixels, this needs to be scaled into a standardized format [6] to ensure compatibility with different screens and projectors. Figure 2.5 shows all signals between the ADV7511 HDMI transmitter and the Zynq.

The transmitter requires horizontal and vertical synchronization signals (HSYNC and VSYNC), which have to be generated on the FPGA. In the same way as the image sensors, the HDMI transmitter can be programmed and monitored over an I2C interface. The user can do the necessary configuration over the I2C interface to set up most standardized audio/video formats.

For simplicity the RGB color space is used, even though the ADV7511 supports other formats [2]. This is different from the camera output (RGGB), and thus the data needs some preprocessing, see chapter 3.


Figure 2.5: Signals between the Zynq and the HDMI transmitter.

2.4 Camera PCB

Prior to the project start, a camera motherboard PCB was designed by Björn Ärleskog, AnaCatum Design AB, and sent for fabrication. The PCB can be split into three parts, which can be seen in fig. 2.6: a motherboard with some support circuitry, and the cameras on separate boards.

The Aptina MT9V022 sensors come in a 52-ball IBGA package [3] and cannot be soldered by hand. Instead we soldered them at Licera AB, who has equipment for small-scale PCB manufacturing with pick-and-place machines and silicon vapor ovens. With their help the sensors and other components could be soldered in place on the PCBs. The bare sensor without a lens can be seen in fig. 2.6 on the parts labeled 3 and 4.

The board has one 160 pin and one 400 pin FMC connector soldered on it for connection with the Xilinx ZC-706 evaluation board. Notice that the board lacks some components in fig. 2.7. The reason is that the board was originally designed for use with a commercial neural network ASIC, before the decision was made to construct our own on the FPGA. Otherwise, most of the components on the board are either decoupling capacitors or pull-up resistors for the I2C bus.

Figure 2.8 shows the image sensors with the Sunex DSL210D-NIR-F2.0 lenses mounted. The DSL210D-NIR-F2.0 lens has fairly short focal length which makes it suitable for indoor use at moderate distances. In this application it is also important not to use a so called “fish-eye” lens. This type of lens would distort the image and not just make it harder to find the object, but also to extract the depth in the image.

Since the image sensors require a 3.3 V power supply, the voltage banks on the Xilinx ZC-706 needed adjustment, since they run at a 2.5 V supply by default. The on-board power supply is a Texas Instruments UCD90120A [15], which is configured over I2C. To configure a voltage over I2C the user needs to specify all parameters in the power supply, for instance duty cycles and switching frequencies.



Figure 2.6: (a) Front and (b) back view of the motherboard and camera boards.

(30)


Figure 2.7: The motherboard with all components soldered in place. (a) Front view. (b) Back view.


Figure 2.8: The complete setup with motherboard and camera boards with lenses attached.


For this project we acquired a USB-to-GPIO adapter so that we could connect the Xilinx ZC-706 evaluation board to a PC and access the UCD90120A. The power supply is then configured using a program by Texas Instruments called TI Fusion GUI. This program also offers power consumption monitoring and logging.

2.5 Summary

In this chapter we have briefly discussed the platform and the hardware available. A series of pictures describes the custom camera PCB and shows the finished board with lenses mounted. We have presented the basic building blocks within the Zynq-XC7Z045 SoC and their capabilities. We have also established the signal flow between the blocks and presented a way of configuring the image sensor and HDMI transmitter.


3 Frame preprocessing

The output from a CMOS image sensor is not in the traditional, widely used RGB format. The image is instead given in a mosaic pattern where each pixel holds information about only one color (red, green or blue). This format is known as a Bayer pattern, Bayer array or simply RGGB. In the world of system cameras, this format is what is known as RAW.

The reason for using this format is explained by the nature of CMOS image sensors. The value of each pixel is given by the intensity of the incoming light. In a modern CMOS image sensor the light is passed through an optical filter, where only one color of light can pass. This process is illustrated in fig. 3.1.

One nice side effect of this is that it saves bandwidth and memory space. Instead of sending 24 bits for every pixel (RGB), the camera only sends one color, typically in eight or ten bits.

As can be seen in fig. 3.1 there are twice as many green pixels as there are red or blue. This is due to the fact that the human eye is more sensitive to variations in green than other colors [4].

3.1 Bayer to RGB interpolation

To show the image on a screen, however, the image format needs to be converted. The system designer can choose different conversion methods depending on the needs of the particular system. A brief overview of different conversion schemes is given by Jean [9].

Since the interpolation should be done in hardware, and in real time, we are somewhat limited in our choice of method. For example, we will only consider linear methods, and we greatly prefer methods where divisions have a power-of-two denominator.



Figure 3.1: Illustration of a Bayer filter sensor array. Derivative of "Bayer pattern on sensor profile.svg" and "Bayer pattern on sensor.svg" by Cburnett/Wikimedia Commons, used under the GFDL license.


Table 3.1: Comparison between two Bayer to RGB interpolation methods.

                               Bilinear interpolation   High quality linear interpolation
  PSNR (dB)                    28                       33
  Interpolation area (pixels)  3×3                      5×5
  Number of kernels            4                        8


3.1.1 Comparison of interpolation algorithms

We have mainly considered two alternatives: bilinear interpolation and high quality linear interpolation [12].

Bilinear interpolation is the simplest and most straightforward method you could imagine. It uses a 3×3 square of Bayer pixels in order to interpolate the color components of the center pixel. The only operations that are involved are addition and divisions by two or four.

Of course it has some drawbacks as well. Because it uses so few pixels when interpolating, it will struggle in certain situations. For example, at sharp edges in the picture there will be significant distortion.

The high quality linear interpolation is designed to be better at handling edges and other high frequency areas. To do this it incorporates gradient correction by taking a larger area into account when interpolating. In theory the high quality linear interpolation has about 5 dB higher peak signal-to-noise ratio (PSNR), which is the standard measure of accuracy.

One downside is that since it takes a larger area into consideration, there will be more distortion around the screen edges, where neighboring pixels are not available. In our case we only have 480 vertical pixel lines, so losing four lines at the bottom and four lines at the top is actually quite a big deal. It also needs a bit more logic, and wider adders, but that is virtually irrelevant in our case.

3.1.2 Bilinear interpolation

As mentioned before, the bilinear interpolation algorithm uses a 3 × 3 square of Bayer pixels to calculate the RGB components. There are four possible pixel interpretations when running through the data: blue pixel, green pixel in a blue row, green pixel in a red row and, lastly, red pixel. All of these have different methods to calculate the color components. These methods are called kernels, and are visually illustrated in figs. 3.2 to 3.5.


For a blue pixel the RGB components are given by

  R_Blue = (M(0,0) + M(2,0) + M(0,2) + M(2,2)) / 4,                   (3.1)

  G_Blue = (M(1,0) + M(0,1) + M(2,1) + M(1,2)) / 4,                   (3.2)

  B_Blue = M(1,1),                                                    (3.3)

where M is the input data in a 3 × 3 square (“M” as in “Mosaic”). Please see fig. 3.2 for a visual representation of this.

For a green pixel in a blue row, the RGB components are given by the following expressions, visualized in fig. 3.3.

  R_Green in blue row = (M(1,0) + M(1,2)) / 2                         (3.4)

  G_Green in blue row = M(1,1)                                        (3.5)

  B_Green in blue row = (M(0,1) + M(2,1)) / 2                         (3.6)

Furthermore, for a green pixel in a red row we have

  R_Green in red row = (M(0,1) + M(2,1)) / 2,                         (3.7)

  G_Green in red row = M(1,1),                                        (3.8)

  B_Green in red row = (M(1,0) + M(1,2)) / 2.                         (3.9)

And lastly, for a red pixel the components are given by

  R_Red = M(1,1),                                                     (3.10)

  G_Red = (M(1,0) + M(0,1) + M(2,1) + M(1,2)) / 4,                    (3.11)

  B_Red = (M(0,0) + M(2,0) + M(0,2) + M(2,2)) / 4.                    (3.12)
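As a software reference for eqs. (3.1) to (3.12) (not the RTL implementation), the sketch below applies the four kernels to a 3 × 3 Bayer neighborhood M, indexed M[i][j] just like M(i, j) in the equations; integer division mimics the truncating divisions by two and four done in hardware.

# Reference model of the four bilinear kernels in eqs. (3.1)-(3.12).
def interp_blue(M):
    r = (M[0][0] + M[2][0] + M[0][2] + M[2][2]) // 4
    g = (M[1][0] + M[0][1] + M[2][1] + M[1][2]) // 4
    b = M[1][1]
    return r, g, b

def interp_green_in_blue_row(M):
    r = (M[1][0] + M[1][2]) // 2
    g = M[1][1]
    b = (M[0][1] + M[2][1]) // 2
    return r, g, b

def interp_green_in_red_row(M):
    r = (M[0][1] + M[2][1]) // 2
    g = M[1][1]
    b = (M[1][0] + M[1][2]) // 2
    return r, g, b

def interp_red(M):
    r = M[1][1]
    g = (M[1][0] + M[0][1] + M[2][1] + M[1][2]) // 4
    b = (M[0][0] + M[2][0] + M[0][2] + M[2][2]) // 4
    return r, g, b

# Example: a made-up neighborhood centered on a blue pixel.
M = [[10, 20, 12],
     [22, 200, 24],
     [14, 26, 16]]
print(interp_blue(M))  # prints (13, 23, 200)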

An example of an image converted using the bilinear interpolation method is shown in fig. 3.6. In the example we can see that edges cause significant distortion in the converted image. Though it may look quite bad in this comparison, it should be noted that the images on the right are scaled up considerably in order to show small details. It is also a worst case scenario, since it contains a very sharp edge. In real life the image looks quite good.

Especially in our case, where the converted frames are shown as 60 FPS video, the distortion is not noticeable.


Figure 3.2: Kernels for calculating RGB values, when the Bayer data is centered on a blue pixel. The kernels correspond to eqs. (3.1) to (3.3).

Figure 3.3: Kernels for calculating RGB values, when the Bayer data is centered on a green pixel in a blue row. The kernels correspond to eqs. (3.4) to (3.6).

Figure 3.4: Kernels for calculating RGB values, when the Bayer data is centered on a green pixel in a red row. The kernels correspond to eqs. (3.7) to (3.9).

Figure 3.5: Kernels for calculating RGB values, when the Bayer data is centered on a red pixel. The kernels correspond to eqs. (3.10) to (3.12).


(a) Image in the Bayer format. (b) Detailed image in the Bayer format.

(c) Image converted with the bilinear interpolation method. (d) Detailed image converted with the bilinear interpolation method.

(e) Original image without distortion. (f) Original image in detail without distortion.

Figure 3.6: A comparison showcasing the distortion introduced by bilinear interpolation. The image itself depicts a detail of a fire extinguisher.


Of course, in a side-by-side comparison a test person would say that the image from high quality interpolation looks better than the one from bilinear interpolation. But our impression is that the image from bilinear interpolation looks "good enough". Good enough, at least, that it does not motivate the time and effort required to implement high quality interpolation in RTL code.

3.2 Averaging

Before being passed to the neurons, the frame needs to be preprocessed even more. As we will discuss later, the neuron setup we use searches a 32 × 16 pixel area. If we are to find a face with that, the person needs to be very far away from the camera so that their face fits into these 32 × 16 pixels. The system should of course be able to find faces that are closer to the camera, so we need to average the frame into a lower resolution.

But since we should be able to find faces at all distances, we need to average by many different factors in parallel. At the time of writing the system averages by factors two, three and four, which accommodates all reasonable distances. The resulting smaller frames are placed into separate frame buffers, which the neural network can search through when it is needed.

We have designed a hardware optimized algorithm to do averaging of the Bayer pattern. This algorithm will be showcased using the example of averaging by a factor of three. When averaging by a factor of three, we use 6 × 6 = 36 pixels from the source to generate a 2 × 2 = 4 pixel result. The result consists of one blue pixel, two green pixels and one red pixel.

The pixel values are given by the mean of the corresponding color pixels in the original frame. The four kernels are shown in fig. 3.7. To expand further we analyze the kernel for the blue color.

The sum of the blue elements in the original frame is given by

  S_Blue = Σ (blue pixels) M(i, j)    (see fig. 3.7a)
         = M(0,0) + M(2,0) + M(4,0)
         + M(0,2) + M(2,2) + M(4,2)
         + M(0,4) + M(2,4) + M(4,4).                                  (3.13)

The output pixel is given by the mean of the input, which with eq. (3.13) gives the output blue pixel

  B' = S_Blue / 9.                                                    (3.14)



Figure 3.7: Kernels for averaging a Bayer image by a factor three.

A division by nine is a little troublesome. In order to realize it in hardware we need to employ some tricks. Since binary shift operations are cheap in hardware, we can rewrite eq. (3.14) as

  B' = S_Blue / 9 = (S_Blue / 9) × (1024 / 1024) ≈ (113 × S_Blue) / 1024 = (113 × S_Blue) >> 10,    (3.15)

where ( · ) >> 10 signifies a logical binary right shift by ten positions. The maximum value that S_Blue can take is 9 × 255, which means that it can be contained in a 12 bit word. This means that the division by nine in eq. (3.15) can be realized using only a 12 × 12 unsigned multiplier. The shift operation comes for free in hardware, since the shift simply means choosing which bits of the result form the output word. In our case that would be bits 17:10 of the multiplication result.

One thing to keep in mind is that since we are not dealing with any form of floating point arithmetic, the operation in eq. (3.15) will be lossy. The result of S_Blue / 9 is almost certainly not an integer, so implementing the operation using integer multiplication and shifts will truncate the result. To be mathematically accurate, the true value that we get in hardware is given by

  B'' = floor( (113 × S_Blue) / 1024 ).                               (3.16)

It should be noted that this operation could have been implemented just as well with a denominator other than 1024. Increasing the bit count would give better precision at the cost of a wider multiplier. We chose to stick with 2^10 = 1024 because it is a nice even number, and it has proven to work well in practice.
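The approximation in eqs. (3.15) and (3.16) is easy to check in software. The sketch below compares the constant-multiply-and-shift result against an exact truncating division by nine over the full 12-bit range of S_Blue; it is a sanity check of the idea, not production code.

# Check the multiply-and-shift approximation of a division by nine,
# as in eqs. (3.15) and (3.16): B'' = (113 * S_Blue) >> 10.
def div9_hw(s_blue):
    return (113 * s_blue) >> 10   # 113/1024 approximates 1/9; the shift is free in hardware

max_err = 0
for s in range(9 * 255 + 1):      # S_Blue fits in 12 bits (max 9 * 255 = 2295)
    exact = s // 9                 # truncating division, like the hardware result
    max_err = max(max_err, abs(div9_hw(s) - exact))

print(max_err)  # worst-case difference against an exact truncating division by 9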

3.3 Summary

This chapter highlights the steps needed to preprocess the video stream. The need for Bayer to RGB interpolation is motivated, and two methods are compared. After that we introduce the concept of averaging the frame, which is necessary for the neuron functionality. Averaging is made harder by the fact that we process raw Bayer data, but a fairly simple algorithm is presented.


4 Neural networks

Neural networks, and more specifically artificial neural networks, constitute a very extensive field. In this chapter we will present some of the core concepts and ideas. For a more extensive run-down we refer to Montavon et al. [13] or Malmgren [11]. More background material about FPGA implementations of ANNs is found in Omondi and Rajapakse [14].

The basis of an artificial neuron is a reference pattern vector. We call the reference pattern p = (p_1, ..., p_N), which is a constant vector that is stored in the neuron. In a neural network each neuron will have a unique reference pattern. The reference pattern is often referred to as the neuron's training data or prototype.

The function of the neuron is to compare the reference pattern vector to an input vector x = (x_1, ..., x_N). The output of the neuron is the distance between the vectors x and p. Using the distances from an array of neurons, the neural network can make a decision whether or not the system has found a match.

4.1 Neuron distance

In order to categorize the input vector we need to calculate the distance between x and p. In doing so we have a number of options to consider. The most intuitive way is the classical Euclidean distance, which is calculated according to

  d_Euclidean = sqrt( Σ_{k=1}^{N} |x_k − p_k|² ).                     (4.1)


A mathematician might like this approach, but since this is to be used on an FPGA, the expression is troublesome. The calculation contains a square root, which is very costly to perform in hardware. Also, squaring the absolute difference gives a long critical path and consumes a lot of hardware resources.

An alternative distance calculation would be to use the L1 norm instead, also known as the Manhattan distance:

  d_L1 = Σ_{k=1}^{N} |x_k − p_k|.                                     (4.2)

Equation (4.1) is the equation for a circle in two dimensions, and a (hyper)sphere in higher dimensions. Equation (4.2) on the other hand describes a tilted square or diamond in two dimensions, or a (hyper)cube in higher dimensions.

The consensus is that the Euclidean distance is the best and most accurate method to use. But due to its high complexity we choose the L1 norm instead. In software simulations the different methods have been compared, and although the Euclidean distance is more accurate, the L1 norm is deemed to be "good enough".
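A small software reference of the two distance measures (our own illustration, with made-up example values) makes the hardware trade-off concrete: the L1 distance of eq. (4.2) only needs subtract, absolute value and accumulate, while eq. (4.1) also needs squaring and a square root.

# Reference model of the neuron distance: L1 (Manhattan / SAD) per eq. (4.2),
# with the Euclidean distance of eq. (4.1) shown only for comparison.
import math

def sad(x, p):
    """Sum of absolute differences, the distance used in the hardware."""
    return sum(abs(a - b) for a, b in zip(x, p))

def euclidean(x, p):
    """Eq. (4.1); costly in hardware because of the squares and the square root."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, p)))

# Example: a short reference pattern and a noisy input (made-up values).
pattern = [10, 200, 45, 90]
sample = [12, 190, 50, 88]
print(sad(sample, pattern))        # prints 19
print(euclidean(sample, pattern))  # about 11.5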

4.2 Influence field and decision spaces

In fig. 4.1 a two dimensional decision space is shown. The input vectors are marked as dots and the reference pattern vectors are the centers of the diamonds. If an input vector is within a diamond, then that vector is close enough to the reference pattern. When this is the case, the neuron is said to have a match, or "spark".

The "radius" of the diamonds is the so called influence field. A larger radius would imply that the chance for the input vector to be situated within that diamond is larger, i.e. the neuron has a more prominent influence field. The influence field can be constant across all neurons in the neural network, or different for each neuron as shown in fig. 4.1.

From fig. 4.1 we can also identify three different scenarios:

• The input vector is not covered by any diamond, i.e. no match.

• The input vector is situated inside one diamond.

• The input vector is situated in an area where two diamonds are overlapping.

If we have overlapping areas we cannot correctly categorize the input vector without uncertainty. In order to minimize overlapping regions the system needs to have the appropriate reference patterns. If we for example want to use our system to categorize between a female and a male face, overlapping patterns would become a problem.


Figure 4.1: Two dimensional decision space with two categories.

4.3 Neural network

Let h = (h_1, ..., h_M) be a vector of M neurons, each with its own reference pattern p. In fig. 4.2 the broadcasting of the input vector x is depicted. Each neuron in h will calculate the L1 norm between its stored prototype pattern p and the input vector x according to eq. (4.2). The function f(x) will then interpret the output based on the input vector. The actual function depends on the application and could be more or less advanced.

From fig. 4.2 one sees that a neural network is parallel in nature. This means that the number of neurons in each layer can be increased without increasing the computational time. The parallelism is also what makes it suitable for a hardware implementation instead of software. One should note, however, that the function f(x) might grow in computational complexity when the number of neurons increases.

4.4 Radial basis function

Radial basis functions are used as activation functions for the neurons. Several functions exist, but all have similar characteristics, with a monotonically increasing or decreasing response from a central point. The method that we use, which will give the sort of binary behavior shown in fig. 4.1, is the Heaviside activation function.


  RBF_j = 1 if d ≤ r, and 0 otherwise,                                (4.3)

where r is the influence field and d is the neuron distance. Apart from the Heaviside activation, the most widely used one is the Gaussian function, given by

  RBF_j = exp(d² / r²).                                               (4.4)

The Gaussian function is most often used when the objective is to categorize. For example, let's say that we want to categorize between male and female. We have four neurons in the system, two for each category. Using an influence field of 0.5, the male neurons have distances d_1 = 0.3 and d_2 = 0.3, whereas the female neurons have distances d_3 = 0.2 and d_4 = 0.4.

The sorting function would in this case form two linear combinations of the RBF functions:

  σ_male = Σ (male neurons) RBF_j = exp(0.3² / 0.5²) + exp(0.3² / 0.5²) = 3.1        (4.5)

  σ_female = Σ (female neurons) RBF_j = exp(0.2² / 0.5²) + exp(0.4² / 0.5²) = 2.9    (4.6)

The conclusion would be that the input vector depicts a female, since that sum is the smallest.

4.5 An artificial neuron

The basis of this system is the ability to learn patterns and remember them. On an FPGA this means that the neuron memory will consist of a block RAM of a fixed size. The size of the block RAM will determine the size of the patterns we are able to store in a neuron, as well as the computational time required. As we discussed in section 4.1, we are interested in computing the sum of absolute differences, SAD. This means that the latency of a neuron is determined directly by the size of the pattern used.

In fig. 4.3 the basic building blocks of an artificial neuron are shown. The SAD box is the operation in eq. (4.2). Note that the shimming delay at the x node is there because of the delay from the memory.


Figure 4.3: Block diagram of an artificial neuron.

4.6 Training of ANN

The field of artificial neural networks and training methods is a large research field. Many complex algorithms exist, and the interested reader could for instance look into Montavon et al. [13]. For this application we will, however, consider some simpler, more suitable ways to train the system.

To form a decision space as in fig. 4.1, one has to assign the neurons the appropriate patterns p. The system needs to be trained in a structured way so that resources are not wasted and neurons of the same category are not overlapping. The user should specify the min/max of the activation function radius r, also called the influence field. If the minimum value is too small, the system may risk losing its ability to generalize.

When training a system, the typical work flow will be something like the following steps; a simplified sketch of the procedure is given after the list.

1. Assign the pattern to a neuron with the maximum influence field. If the pattern corresponds to a decision space covered by another category, decrease their respective influence fields.

2. If the neuron instead has overlapping decision spaces with another neuron in the same category, the information is redundant and the two patterns are too similar.

3. Continue refining the respective influence fields to create as good a decision space as possible.
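The sketch below is our own, simplified software rendering of steps 1 to 3 above: it uses the L1 distance from eq. (4.2), a fixed maximum and minimum influence field, and purely illustrative parameter values. The project's actual training tooling is separate from this sketch.

# Simplified sketch of the training work flow: commit a new prototype with the
# maximum influence field, shrink fields where categories collide, and skip
# patterns that are redundant within their own category.
R_MAX, R_MIN = 2000, 200   # illustrative influence field limits

def sad(x, p):
    return sum(abs(a - b) for a, b in zip(x, p))

def train(neurons, pattern, category):
    """neurons: list of dicts {'p': prototype, 'cat': category, 'r': influence field}."""
    new_r = R_MAX                                   # step 1: start from the maximum field
    for n in neurons:
        d = sad(pattern, n["p"])
        if n["cat"] == category and d < n["r"]:
            return neurons                          # step 2: redundant, patterns too similar
        if n["cat"] != category and d < max(n["r"], new_r):
            n["r"] = max(R_MIN, min(n["r"], d))     # steps 1 and 3: shrink the
            new_r = max(R_MIN, min(new_r, d))       # conflicting influence fields
    neurons.append({"p": pattern, "cat": category, "r": new_r})
    return neurons

# Example with tiny, made-up 4-element patterns:
net = []
net = train(net, [10, 20, 30, 40], "face")
net = train(net, [12, 22, 28, 41], "face")          # rejected as redundant
net = train(net, [200, 180, 90, 60], "background")  # shrinks the conflicting fields
print(len(net), [n["r"] for n in net])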


Figure 4.4: Decision space accuracy is a logarithmic function.

The decision space accuracy as a function of the number of neurons follows a logarithmic curve, as can be seen in fig. 4.4. This means that the network will reach a point where assigning more neurons improves the decision space only marginally.

4.7 Importance of good training data

Having good training data is vital for good performance; however, there are some pitfalls. One problem that sounds a bit counterintuitive is that one can actually "over-train" a system. This means that if we train the system too much, the result may be a loss in generalization.

Consider a simple example: We want to distinguish between cats and dogs based on size, i.e. two categories as in fig. 4.1. If we use Chihuahuas (small dogs) in the training data and assign them the category dog, the system would in a sense assume that dogs are rather small. Statistically, most dogs are larger than Chihuahuas. The result will be a system that performs well on the training data, since it consists of Chihuahuas, but performs badly in most real-world cases. If we disregard the Chihuahuas in the batch of training data, we run the risk of being unable to categorize them as dogs. On the other hand, far more cats would then run the risk of being identified as dogs because of the unsuitable training data.

To sum up this problem, all training should be performed with statistically representative data for each category. To test the effectiveness of the training and the ability to generalize properly, data different from the training data should be used.


4.8 Summary

We have presented the basic theory for ANNs in this chapter. An explanation is given for the choice of distance calculation and how we can use it to form a decision space. We also highlight the parallel nature of ANNs, which motivates a hardware implementation. Further, a basic training scheme is shown, a common pitfall related to the training of ANNs is explained, and we describe how a well-trained system should behave when increasing the number of neurons.


5 Tracking Algorithm

The idea behind the tracking algorithm is to scan the frame and compare parts of the frame with the patterns we have stored in the neurons. If a part of the frame has enough similarities with the pattern in a neuron, it will activate. The smallest Manhattan distance would then represent the closest match to the prototype stored in that neuron. Since we are tracking only one object we are only interested in the best match. This means that we need to store the coordinates of the best match and only replace the coordinates if a better match occurs.

The first and easiest way to implement the tracking algorithm is to always scan the entire frame in order to find the smallest Manhattan distance. However, this approach will require a lot of calculations. We assume that we will scan the frame with a scan window size of 32 × 16 pixels, and that each pixel takes one clock cycle to process. Scanning the entire 640 × 480 frame at 60 FPS would require

  60 = CLK / [(640 − 32 + 1) × (480 − 16 + 1) × 32 × 16]  ⇔  CLK ≈ 9 GHz    (5.1)

A clock running at 9 GHz is nowhere near realistic, so the number of calculations needs to be decreased. The first thing we did was to split the neurons into two parallel neurons that each compute their SAD on a 16 × 16 square. Together they form the 32 × 16 window we want. Doing this will double the speed, but it also has some other implications that are discussed in section 6.1.

The second step is to limit the part of the frame in which we are actually scanning. If we know where the target was in the last frame, we probably only have to search a small part of the new frame in order to find the target. This smaller search area, illustrated in fig. 5.1, is called the tracking window.


Figure 5.1: Illustration of the tracking technique.

We can also identify a basic principle in this problem:

• A smaller tracking window will require fewer calculations, and thus go faster.

• A faster system will only require a smaller tracking window, since the object will not have moved as much relative to the previous frame.

Another idea is to increase the step size that we use when moving the scan window. Doing so will also relax the demands on how fast we need to run the system, according to

  CLK = 60 × (# of calculations in frame) / (x_step × y_step).        (5.2)

Obviously, this solution introduces the risk of missing an object if the step size is too large. Since we are working with Bayer data, we are also limited to step sizes that are multiples of two. The reason is that we need to keep track of the color and do not want to mix for instance green and red pixels.

As discussed in section 3.2 the system uses different averaging steps, with factors two, three and four. With an averaging factor of two, we scan the frame at the lower resolution 320 × 240 pixels. The averaging stage is done in real-time when reading in from the camera stream. This means that the system only needs to scan a fourth of the original frame, at the cost of a one clock latency increase. With even higher averaging factors the number of calculations will decrease even more.


Finding an object for the first time will require a scan of the entire active frame. If we now calculate the clock frequency required for the system to run at 60 FPS with x_step = y_step = 2:

  60 = (CLK × x_step × y_step) / [(320 − 32 + 1) × (240 − 16 + 1) × 16 × 16]  ⇔  CLK ≈ 250 MHz    (5.3)

This is still a reasonably high frequency and would be hard to target on the Zynq FPGA. We can instead allow the system to be somewhat slower when it needs to scan the entire active frame, and look at the clock frequency needed once the object has been identified and the tracking window is used. Let's assume that the tracking window has a size of 80 × 80 pixels and the same step size as before:

  60 = (CLK × x_step × y_step) / [(80 − 32 + 1) × (80 − 16 + 1) × 16 × 16]  ⇔  CLK ≈ 12 MHz    (5.4)

This speed is not hard to reach on a modern FPGA, and it would be reasonable to assume that we can run the system at a clock speed between the ones calculated in eq. (5.3) and eq. (5.4). In that case, the tracking algorithm would be a little bit slower than 60 FPS when the object is identified for the first time. Once the system has found the object, the tracking algorithm will run at a speed greater than 60 FPS.
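The clock frequencies in eqs. (5.1), (5.3) and (5.4) are easy to reproduce with a short calculation; the snippet below is only a sanity check of those numbers, not part of the design.

# Sanity check of eqs. (5.1), (5.3) and (5.4): clock frequency needed for 60 FPS.
FPS = 60

def clk(frame_w, frame_h, step, cycles_per_position):
    positions = (frame_w - 32 + 1) * (frame_h - 16 + 1)   # 32x16 scan window
    return FPS * positions * cycles_per_position / (step * step)

print(clk(640, 480, 1, 32 * 16))   # eq. (5.1): full window, full frame, step 1 -> ~9 GHz
print(clk(320, 240, 2, 16 * 16))   # eq. (5.3): split neurons, averaged frame  -> ~250 MHz
print(clk(80, 80, 2, 16 * 16))     # eq. (5.4): 80x80 tracking window          -> ~12 MHz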

5.1 Zoom function

In order to be able to identify an object situated at different distances, a zoom function is required. Since we process the image in hardware, we need some predefined zoom levels. In this application three levels have been chosen, but it is easy to add further zoom capability.

In fig. 5.2 a flowchart of the different levels is shown. The system starts searching for the object in the smallest frame, i.e. closest to the camera. If no neuron activates, the system will start searching for the object further away from the camera. When the object is identified, the system will only search in that frame until no match exists.

For the stereoscopic depth calculation, appendix B, it is important that we search at the same level of zoom in both cameras. If the zoom level is not in sync between the cameras, we will introduce an error in the pixel difference.

5.2 Graphics overlay

When the object is identified and tracking is active, this should be displayed on the screen. This means that we need to produce some simple graphics in hardware. The overlay needs to be calculated at the same rate as the output frame rate in order to avoid tearing and glitches.


Figure 5.2: Displays how the system searches in different zoom levels.

Figure 5.3: Illustration of the graphics overlay produced in hardware.

We also want to display numbers and letters on the screen, which means that we need to produce a bitmap stored in a ROM. Since we have access to all counters and synchronization signals for the HDMI transmitter, as well as the coordinates of the tracked object, we can calculate the position within the frame.

When the algorithm receives a coordinate from a match, we use that coordinate plus a user defined "radius" to produce the overlay around the object. This is illustrated in fig. 5.3, where px and py are the horizontal and vertical pixel counters, respectively. If px and py are situated inside the gray area, we set a switch. When the switch is activated, we send #FF0000 as the RGB output to the HDMI transmitter, instead of the usual Bayer converted camera data.
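As a software illustration of the overlay switch (not the HDL used in the design), the per-pixel decision could look as follows. The pixel counters, the match coordinate and the user defined radius are the quantities named above, while the two pixel border thickness is an assumption made for the sketch.

RED = (0xFF, 0x00, 0x00)  # #FF0000 overlay colour

def overlay_pixel(px, py, cx, cy, radius, camera_rgb):
    """RGB value sent to the HDMI transmitter for pixel (px, py).

    (cx, cy) is the matched object coordinate and radius the user
    defined half-width of the box drawn around the object.
    """
    dist = max(abs(px - cx), abs(py - cy))
    on_box_border = dist <= radius and (radius - dist) < 2  # assumed border thickness
    return RED if on_box_border else camera_rgb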

The same reasoning applies to displaying the distance value, with the exception that the condition for setting the switch comes from the bitmap font stored in the ROM. In fig. 5.4 the number two is shown together with the content stored in the ROM for that number. The number is bit reversed in the memory to simplify the readout logic.


Figure 5.4: (a) The original BMP image used for generating the ROM content. (b) The data stored in the ROM, one 12-bit row per line:
001111111100
011111111110
011000000111
111000000011
111000000000
011000000000
011100000000
001111111100
000011111110
000000000111
000000000011
000000000011
000000000011
000000000011
000000000011
000000000011
111111111111
111111111111

It takes the system two clock cycles to produce the #FF0000 RGB output and three clock cycles to produce the font, because of a one clock cycle delay from the ROM. This is not considered a problem, since the only effect is that the font is shifted one pixel to the right.
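The bit reversal mentioned above can be illustrated with a short script that turns glyph rows such as the ones in fig. 5.4(b) into ROM words. This is only a sketch of the idea; in the design the ROM content is generated from the BMP image, and the 12-bit row width is taken from the figure.

def rows_to_rom(rows, width=12):
    """Convert '0'/'1' glyph rows into bit-reversed ROM words."""
    words = []
    for row in rows:
        assert len(row) == width
        words.append(int(row[::-1], 2))  # store the row with its bits reversed
    return words

# First three rows of the digit two in fig. 5.4(b)
print([f"{w:03X}" for w in rows_to_rom(["001111111100",
                                        "011111111110",
                                        "011000000111"])])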

5.3 Worst case scenario

Previously we have calculated some estimates regarding clock frequency depending on different tracking modes. We could conclude that reaching 60 FPS in the scan mode should not prove difficult. However, a worst case scenario arises when we have to identify the object for the first time, and the object is farthest away. In this situation we have no previous knowledge about the object's distance from the cameras, so we have to scan all zoom levels as depicted in fig. 5.2.

If we assume that we use a 50 MHz clock, which should be easy to implement on a modern FPGA, we calculate the worst case FPS as

\frac{CLK \times x\_step \times y\_step}{16 \times 16 \times (N_2 + N_3 + N_4)} \approx \frac{200 \times 10^6}{27 \times 10^6} \approx 7.4 \text{ FPS}, \quad (5.5)

where N_2 = (320 − 32 + 1) × (240 − 16 + 1), N_3 = (215 − 32 + 1) × (160 − 16 + 1) and N_4 = (160 − 32 + 1) × (120 − 16 + 1). We have also used x_step = y_step = 2 and a 32 × 16 scan window split into two 16 × 16 dittos. The implication of eq. (5.5) is that the absolute worst case delay in finding an object is just below 0.14 seconds.
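The arithmetic behind eq. (5.5) is easy to verify; the short Python check below only uses the numbers stated above.

x_step = y_step = 2
clk = 50e6  # the assumed 50 MHz system clock

n2 = (320 - 32 + 1) * (240 - 16 + 1)
n3 = (215 - 32 + 1) * (160 - 16 + 1)
n4 = (160 - 32 + 1) * (120 - 16 + 1)

fps = clk * x_step * y_step / (16 * 16 * (n2 + n3 + n4))
print(round(fps, 1))      # ~7.4 FPS
print(round(1 / fps, 3))  # ~0.135 s, i.e. just below 0.14 seconds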

5.4 Summary

In this chapter we have calculated that reaching 60 FPS should be feasible in most situations. We have also established a baseline for how step size and tracking window size relate to speed. In this chapter we also explain the basic ideas behind the zoom and graphics functions. We calculate the worst case detection time to be 0.14 seconds under reasonable assumptions. This implies that even first time detection would feel quite responsive and fast.


6 Hardware aspects

Video processing can be quite demanding in terms of hardware, and the system should be able to process large amounts of data in real time. The task is especially memory intense since we need to be able to hold the entire frame from the cameras, as well as the averaged ones for the different zoom levels, see section 3.2.

6.1 Frame buffers

One frame in 640 × 480 resolution with one byte per pixel would require

640 × 480 × 1 byte = 307200 bytes. (6.1)

If we compare eq. (6.1) and table 2.1, almost 15% of all available block memory is required to store a single frame. We also need to store the frame at the different averaging levels, or zoom levels, shown in fig. 5.2. This will further increase the required memory according to

((320 × 240) + (215 × 160) + (160 × 120)) × 2 bytes = 260800 bytes, (6.2)

where the factor of two comes from the fact that we use a 32 × 16 scan window. By design a 32 × 16 scan window uses two parallel neurons, which means that we need to read two values at the same time, with an offset relative to each other, from the memory.

The obvious solution would be to use a true dual port RAM, which supports two readouts at the same time. The problem with this type of memory is that it prevents us from using different clock domains on the input and output. This leads to a fundamental design issue when computing algorithm 1, which is a sequential computation. We could use one memory and compute the entire 32 × 16 SAD sum, or we could compute two 16 × 16 sums in parallel. Due to the memory limitations regarding the clock domains, the latter will require two memories, as we explain in section 6.3.

Figure 6.1: Illustrating how the scan window is connected to the frame buffers.

The solution using one memory will take twice the time to compute compared to using two memories with duplicated content. The choice between double speed and twice the memory is application dependent and quite a classical trade-off. From the calculations in eq. (6.1) and eq. (6.2) we obtain the memory requirement for one camera. For storing the data we need to process, we would require a total of 2 × (307200 + 260800) bytes, which is around 52% of the available block memory according to table 2.1.
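The memory budget can be summarised in a few lines; the 2180000 byte block memory total is the figure from table 2.1 used in section 6.2, everything else follows from eq. (6.1) and eq. (6.2).

BLOCK_RAM_BYTES = 2_180_000  # total block memory, from table 2.1

full_frame = 640 * 480 * 1                          # eq. (6.1)
zoom_frames = (320*240 + 215*160 + 160*120) * 2     # eq. (6.2), duplicated for two parallel reads
per_camera = full_frame + zoom_frames               # 568000 bytes

total = 2 * per_camera                              # two cameras
print(total, round(100 * total / BLOCK_RAM_BYTES))  # 1136000 bytes, ~52 %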

6.2 Number of neurons

Since each neuron has a memory of 256 bytes, we can at the most use

\left\lfloor \frac{2180000 \times 0.48}{256} \right\rfloor = 4087 \text{ neurons.} \quad (6.3)

From eq. (6.3) we can then compute the maximum number of patterns the system could hold, when using a 32 × 16 scan window and two cameras, as

\left\lfloor \frac{4087}{4} \right\rfloor = 1021 \text{ patterns.} \quad (6.4)

If we assume that not that many slices will be used for storing frames and various other functions, we can calculate some ballpark numbers for the slice utilization of the neurons. The main computation will be the distance calculation in eq. (4.2), which can be written as:

Algorithm 1 Computation of the L1 norm.
for k = 1:N do
    d_L1 = d_L1 + |x_k − p_k|
end for
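A behavioural model of Algorithm 1 is given below. It mirrors what each neuron computes for one window position and is not the hardware description itself.

def l1_distance(x, p):
    """L1 (SAD) distance between a scanned sub-window x and a stored pattern p.

    Both are sequences of 8-bit pixel values; for a 16 x 16 neuron N = 256,
    so the sum always fits in the 16-bit accumulator used in hardware.
    """
    assert len(x) == len(p)
    d_l1 = 0
    for xk, pk in zip(x, p):
        d_l1 += abs(xk - pk)
    return d_l1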

In this application both x_k and p_k are 8-bit wide and d_L1 is 16-bit wide to prevent overflow. From section 2.1 we also know that each slice supports 4-bit wide carry chains, which means that each distance computation will require 2 + 4 = 6 slices. This implies that we, in the ideal case with 54650 available slices [18], could instantiate

\left\lfloor \frac{54650}{6} \right\rfloor = 9108 \text{ neurons.} \quad (6.5)

Following the same reasoning as above with a 32 × 16 scan window and two cameras we can compute the number of patterns related to slices as:

\left\lfloor \frac{9108}{4} \right\rfloor = 2277 \text{ patterns.} \quad (6.6)

Obviously the result in eq. (6.6) will not be reached, because of the surrounding logic and functions. However, the result is sufficiently high that we can conclude that the number of slices will not limit the number of possible neurons. When targeting a low-cost FPGA with higher resolution, focus has to be put on how to store a frame effectively, since that is the limiting factor.
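As a sanity check, the ballpark figures in eq. (6.3) to eq. (6.6) can be reproduced directly from the numbers above:

neurons_from_bram = int(2_180_000 * 0.48) // 256    # eq. (6.3): 4087 neurons
patterns_from_bram = neurons_from_bram // 4         # eq. (6.4): 1021 patterns

neurons_from_slices = 54650 // 6                    # eq. (6.5): 9108 neurons
patterns_from_slices = neurons_from_slices // 4     # eq. (6.6): 2277 patterns

# Block RAM, not slice count, limits the number of patterns.
print(patterns_from_bram, patterns_from_slices)     # 1021 2277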

6.3 Clock domains

The system is somewhat complex because it needs to have three different clock domains. The three domains, illustrated in fig. 6.2, are:

• Pixel clock for the cameras
• Main system clock
• HDMI output timing clock

The block RAMs that we use are of simple dual port type, which means that reading and writing can be done at different clock rates. This also means that we are unable to read two values at the same time with this type of memory. In fig. 6.2, the data flow is from left to right. A frame is read from the cameras and saved into frame buffers, which are then read by the neural network. In order to avoid corrupt frames and inconsistent results, a few status bits are passed between the blocks. The neural network does not start scanning a frame until the read-in logic signals that a whole frame has been written, and a new frame is not read in until the neural network has finished processing the entire old frame.
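The handshake described above boils down to two status flags. The sketch below is a simplified software model of that behaviour; the flag names are invented for the illustration and do not appear in the design.

frame_written = False  # set by the read-in logic when a whole frame is in the buffer
scan_done = True       # set by the neural network when it has processed the frame

def read_in_may_write_new_frame():
    # A new frame is only written once the network is done with the old one.
    return scan_done

def network_may_start_scanning():
    # The network only starts once a complete frame has been written.
    return frame_written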

The frame that is sent to the HDMI output bypasses all this logic. That frame is continuously read in and sent to the monitor at 60 FPS. This means that even if there are delays in the neural network, the output shown on the monitor is always updated.

Not pictured in fig. 6.2 is the PS, which runs at a significantly higher clock speed (666 MHz), and the DDR3 memory, which runs at 533 MHz. Communication with these is done over the AXI4 bus protocol, which supports different master and slave clock frequencies.

The clock signals are generated with on-chip analog PLLs. The Zynq FPGA has three user-programmable PLLs that are based on a 33.33 MHz reference crystal oscillator.

Figure 6.2: Clock domains of the system (cameras, frame processing, neural network and HDMI interface; 25 MHz, 50 MHz and 26.6 MHz clocks).



6.4 Summary

From this chapter we can relate some design choices and performance metrics to the platform used. The different clock domains restrict the way we can use our memories and increase the overall memory cost. We also calculate the maximum ideal number of neurons we can utilize before all memory resources have been exhausted, and relate that to other hardware elements.


7 Results

In this project we lack some of the quantifiable results that other topics can produce and use to benchmark the system. One type of measurable result is speed, which we have covered in chapter 5 and which can quite easily be derived theoretically. The other measurable type of result would be some kind of detection performance. This is however trickier to derive, since the performance depends on the cameras, the bias and quality of the training data, lighting conditions, etc. Even if one could calculate a value for the detection performance, it would not actually tell us that much. The reason is that there are no other systems working under the same conditions, and the area also lacks an established evaluation model.

A video application is by definition also hard to visualize in a written report. Figure 7.1 shows the tracking system with neurons trained to recognize faces but, as we discussed earlier, the object could be literally anything. There are limits to how much you can turn or tilt your head without losing the object, for the simple reason that a face does not look the same from, for instance, the side as from the front. If we want to detect faces from the side, we could simply assign neurons the appropriate patterns for that and extend the decision space for how a face looks.

You can also see that the cameras are mounted very close to each other. The distance between the cameras is 5 cm, and the person is approximately 2 m from the cameras. This means that the precision in the depth calculation will be quite poor, but it can be improved by separating the cameras. How much the cameras should be separated, given a desired precision, can be calculated from eq. (B.15). Note that in this figure, we display the difference in number of pixels between the cameras.
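Equation (B.15) is not reproduced here, but the effect of the short baseline can be illustrated with the standard pinhole stereo relation, which may differ in detail from the derivation in appendix B; the 500 pixel focal length and the disparity values in the example are assumptions made only for this illustration.

def depth_from_disparity(baseline_m, focal_px, disparity_px):
    """Depth estimate z = f * B / d for a rectified stereo pair."""
    return focal_px * baseline_m / disparity_px

# With a 5 cm baseline and an assumed 500 px focal length, a person at ~2 m
# corresponds to a disparity of roughly 12-13 px, so an error of a single
# pixel moves the estimate by some 15 cm. Widening the baseline increases
# the disparity and thereby the depth precision.
print(depth_from_disparity(0.05, 500, 12))  # ~2.08 m
print(depth_from_disparity(0.05, 500, 13))  # ~1.92 m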


Figure 7.1: The figures showcase a face being tracked from different positions: (a) from the front, (b) tilted down, (c) looking to the right, (d) looking to the left, (e) tilted to the right, (f) tilted to the left.
