
Institutionen för systemteknik

Department of Electrical Engineering

Examensarbete

A Resource-Efficient and High-Performance

Implementation of Object Tracking on a Programmable

System-on-Chip

Examensarbete utfört i Datorteknik vid Tekniska högskolan vid Linköpings universitet

av

Alexander Mollberg LiTH-ISY-EX--15/4914--SE

Linköping 2016

Department of Electrical Engineering, Linköpings universitet, SE-581 83 Linköping, Sweden



Handledare: Johan Hedborg, SICK IVP AB
            Johan Pettersson, SICK IVP AB
            Andréas Karlsson, ISY, Linköpings universitet

Examinator: Andreas Ehliar, ISY, Linköpings universitet


Division, Department: Avdelningen för Datorteknik, Department of Electrical Engineering, SE-581 83 Linköping
Date: 2016-01-18
Language: English
Report category: Examensarbete (Master's thesis)
URL for electronic version: http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-124044
ISRN: LiTH-ISY-EX--15/4914--SE
Title (Swedish): En resurseffektiv och högpresterande implementation av objektföljning på programmerbart system-on-chip
Title: A Resource-Efficient and High-Performance Implementation of Object Tracking on a Programmable System-on-Chip
Author: Alexander Mollberg


Abstract

The computer vision problem of object tracking is introduced and explained. An approach to interest point based feature detection and tracking using FAST and BRIEF is presented and the selection of algorithms suitable for implementation on a Xilinx Zynq7000 with an XC7Z020 field-programmable gate array (FPGA) is detailed. A modification to the smoothing strategy of BRIEF which significantly reduces memory utilization on the FPGA is presented and benchmarked against a reference strategy. Measures of performance and resource efficiency are presented and utilized in an iterative development process. A system for interest point based object tracking that uses FAST for feature detection and BRIEF for feature description with the proposed smoothing modification is implemented on the FPGA. The design is described and important design choices are discussed.


Acknowledgments

I would like to thank my examiner Andreas Ehliar and my supervisors Andréas Karlsson, Johan Pettersson and Johan Hedborg for valuable discussions during the making of this work.

Linköping, January 2016 Alexander Mollberg


Contents

Notation

1 Introduction
  1.1 Image Processing
    1.1.1 Filtering
  1.2 FPGA Design
  1.3 Computer Vision
  1.4 Other Related Work
  1.5 Problem Description and Motivation
  1.6 Purpose
  1.7 Assumptions and Delimitations

2 Method
  2.1 Pre-study
  2.2 Measures
    2.2.1 Performance
    2.2.2 Resource Efficiency
  2.3 Implementation
    2.3.1 Reference Implementation
    2.3.2 Programmable Logic Infrastructure
    2.3.3 Tracker Module Architecture
    2.3.4 Scale Pyramid
    2.3.5 Pixel Buffer
    2.3.6 Detector
    2.3.7 Descriptor
    2.3.8 Feature Buffers
    2.3.9 Processor Application

3 Results
  3.1 Pre-study Results
  3.2 Performance
  3.3 Resource Efficiency
  3.4 Implementation
    3.4.1 Segment Test

4 Concluding Remarks
  4.1 Discussion
    4.1.1 Method
    4.1.2 Results
    4.1.3 Sources
    4.1.4 Future Work
    4.1.5 Environmental Impact
  4.2 Conclusions

Bibliography


Notation

Abbreviations

Abbreviation  Meaning

ASIC     Application-Specific Integrated Circuit
AXI      Advanced eXtensible Interface
BRAM     Block Random Access Memory
BRIEF    Binary Robust Independent Elementary Features [2]
CPU      Central Processing Unit
DDR      Double Data Rate (memory interface)
DMA      Direct Memory Access
DSP      Digital Signal Processor (FPGA primitive)
FAST     Features from Accelerated Segment Test [16]
FF       Flip-flop
FIFO     First In, First Out (queue)
FPGA     Field-Programmable Gate Array
HDL      Hardware Description Language
LUT      Look-Up Table
MLUT     Memory Look-Up Table
MRU      Max Resource Utilization
PL       Programmable Logic
PS       Processing System (CPU and related circuits)
RANSAC   RANdom SAmple Consensus [5]
RTL      Register Transfer Level
SIFT     Scale Invariant Feature Transform [11]


1 Introduction

There is a demand for computer vision systems that can process data from high-throughput sources like industrial high-speed camera sensors. Embedded systems with low power consumption and advanced digital signal processing capabilities are very suitable for this application. Application-specific integrated circuits (ASICs) further decrease power consumption and increase processing capabilities for the chosen application at the cost of programmability. Field-programmable gate arrays (FPGAs) represent a middle ground in terms of both power and processing capabilities combined with programmability. They represent a flexible application platform with unique opportunities and constraints. One opportunity is massive parallelizability. The many hardware primitives that make up the programmable logic grid can be configured to process large amounts of data in parallel, in pipelines and in other concurrent modes of computation. As such, an FPGA can be programmed to handle large amounts of streaming data in computer vision applications. To set the stage for this thesis, it is necessary to review some image processing theory in section 1.1. Section 1.2 introduces the FPGA design flow before section 1.3 introduces some computer vision algorithms and some existing implementations of these on FPGA.

1.1 Image Processing

Image processing offers tools useful for computer vision tasks such as object tracking. These tools handle visual information to extract knowledge from images. Visual information in photographic images is represented as a grid of picture elements: pixels. The image can have multiple channels to represent color, or a single channel to represent only light intensity. In the latter case, the image is said to be monochromatic. This thesis assumes only monochromatic images are processed.


1.1.1 Filtering

An important part of image processing is to manipulate images to eliminate obstructions such as noise and to focus on relevant visual information. As such, some image processing tasks require filtering of images as part of, or before, analyzing them.

Kernels

Filtering is a mathematical operation that in the static case involves an input signal (or image) and a filter kernel. The kernel is convolved with the image to produce the resulting filtered image. The properties of the filter are determined by the layout of the kernel, which in the 2D image case is a 2D grid of coefficients (a matrix). The filtered image $I'$ is calculated from the original image $I$ and the kernel $K$ of size $M \times N$ as

$$I'(x, y) = I(x, y) * K(t, u) = \sum_{t=0}^{M-1} \sum_{u=0}^{N-1} I(x + t - a_x,\; y + u - a_y)\, K(t, u)$$

where $(a_x, a_y)$ is the anchor point, i.e. the origin of the kernel matrix, typically the center.
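As an illustration, the filtering formula above can be transcribed directly into code. The following is a naive plain-Python reference sketch, not an FPGA-oriented design; the clamp-to-border handling of out-of-image pixels is an assumption, since the text does not prescribe a border policy.

```python
def filter2d(image, kernel, anchor=None):
    """Apply an M x N kernel to an image, following the filtering formula."""
    h, w = len(image), len(image[0])
    m, n = len(kernel), len(kernel[0])
    ax, ay = anchor if anchor else ((m - 1) // 2, (n - 1) // 2)

    def px(x, y):
        # clamp-to-border addressing for pixels outside the image
        return image[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for t in range(m):
                for u in range(n):
                    # I(x + t - ax, y + u - ay) * K(t, u)
                    acc += px(x + t - ax, y + u - ay) * kernel[t][u]
            out[y][x] = acc
    return out
```

With an identity kernel (a single 1 at the anchor) the output equals the input, which is a convenient sanity check.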

Blur filter

Different filters can be applied to images. An important filter for this thesis is the blur filter, which smooths out an image, removing small details and noise. It can be used before downsampling an image to avoid aliasing effects, and in that case is also known as an anti-aliasing filter.

The computational time complexity of filtering an image of size $W \times H$ using a kernel of size $M \times N$ is proportional to $WHMN$. Naively, each pixel in the resulting image requires multiplication by all $MN$ kernel coefficients followed by accumulation (summation) of the $MN$ resulting terms. However, if the kernel is separable, i.e. has the property that

$$K(t, u) = K_y(t)\, K_x(u)$$

for some column vectors $K_x$, $K_y$, then the computation can be simplified as follows:

$$I'(x, y) = I(x, y) * K(t, u) = I(x, y) * (K_y K_x^T) = (I(x, y) * K_y) * K_x^T.$$

In other words, the two-dimensional convolution can be simplified into two successive one-dimensional convolutions. The number of operations to calculate $I'(x, y)$ is thus reduced to $M + N$ multiplications and accumulations per pixel, resulting in faster computation.
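This equivalence is easy to check numerically. The sketch below (plain Python, integer arithmetic, and a "valid" output region only, so the comparison is exact) convolves once with the full outer-product kernel and once as two 1-D passes; the kernels and test image are arbitrary illustrations.

```python
def conv2d_valid(img, k):
    # Direct convolution, keeping only output pixels where the kernel fits.
    m, n = len(k), len(k[0])
    h, w = len(img), len(img[0])
    return [[sum(img[y + t][x + u] * k[t][u]
                 for t in range(m) for u in range(n))
             for x in range(w - n + 1)]
            for y in range(h - m + 1)]

ky = [1, 2, 1]                              # vertical 1-D kernel
kx = [1, 4, 1]                              # horizontal 1-D kernel
k2 = [[a * b for b in kx] for a in ky]      # full 2-D kernel Ky * Kx^T

img = [[(3 * r + c) % 7 for c in range(6)] for r in range(5)]

vertical = conv2d_valid(img, [[v] for v in ky])  # pass 1: 3x1 kernel
two_pass = conv2d_valid(vertical, [kx])          # pass 2: 1x3 kernel
one_pass = conv2d_valid(img, k2)                 # single 2-D convolution
```

By the distributive law the two results are identical element for element, while the two-pass version touches only $M + N$ coefficients per pixel.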


Gaussian filter

A Gaussian filter is a separable low-pass filter whose kernel is a sampled Gaussian distribution, i.e.

$$K(t, u) = \frac{1}{2\pi\sigma^2}\, e^{-\frac{(t - a_x)^2 + (u - a_y)^2}{2\sigma^2}}$$

where $(a_x, a_y)$ is the anchor point in the kernel matrix. As defined here, the coefficients of $K$ can only be approximately represented with rational numbers. With the additional desire to represent the coefficients using fixed-point numbers, other similar filters become more attractive.

Binomial filter

One such filter is the binomial filter, whose kernel is defined as

$$K(t, u) = \left[\frac{1}{2^{n-1}} \binom{n-1}{t}\right] \left[\frac{1}{2^{n-1}} \binom{n-1}{u}\right].$$

It is evident that the binomial filter is separable by construction. Also, the coefficients of $K$ are integers up to normalization. Note that since the normalization is by factors of two, it can be performed in fixed-point arithmetic as a single arithmetic right shift.
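As a sketch of how this plays out in fixed-point hardware: the 1-D coefficients are a row of Pascal's triangle, and the normalization is a single right shift. The function name and interface here are illustrative only.

```python
from math import comb

def binomial_kernel_1d(n):
    """Integer coefficients C(n-1, t) for t = 0..n-1, plus the normalizing shift."""
    coeffs = [comb(n - 1, t) for t in range(n)]
    shift = n - 1   # dividing by 2**(n-1) is an arithmetic right shift by n-1
    return coeffs, shift

coeffs, shift = binomial_kernel_1d(5)   # five-tap kernel: [1, 4, 6, 4, 1]
```

The coefficients sum to exactly $2^{n-1}$, so the shifted result stays correctly normalized with no multipliers needed for the scaling.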

Integral images

If a low-pass filter is desired and separable filters in general do not meet speed requirements, then integral images can be used. The idea of an integral image is to trade filter quality for speed by computing a set of intermediate values $I_i(x, y)$ such that

$$I_i(x, y) = \sum_{x'=0}^{x} \sum_{y'=0}^{y} I(x', y').$$

Note that $I_i(x, y) = I(x, y) + I_i(x, y-1) + I_i(x-1, y) - I_i(x-1, y-1)$, so $I_i$ can be computed successively across the image, requiring a constant number of operations per pixel. $I_i$ can thereafter be used to compute sums over arbitrary rectangular areas of the image as follows. Let $A, B, C, D$ be the corners of the desired rectangle, ordered such that $A$ has the lowest $x$- and $y$-coordinates and $D$ the highest. Then

$$\sum_{(x', y') \in ABCD} I(x', y') = I_i(A) - I_i(B) - I_i(C) + I_i(D).$$

The sum can thus be computed using only four terms once the integral image $I_i$ has been computed. This sum is equivalent to the result of convolving the original image with a kernel whose coefficients are all equal to unity. After normalization, such a kernel makes what is known as a box filter or moving average filter, which is a type of low-pass filter. Thus, an image can be low-pass filtered


using three accumulations per pixel to compute $I_i$. After setting $A$, $B$, $C$ and $D$ such that the desired kernel size is achieved, another three accumulations per pixel can be spent to compute $I'$, for a total of six accumulations per pixel. This can be compared to the general separable case of $M + N$ multiplications and accumulations per pixel. Since multiplications are generally more computationally expensive than additions, it is clear that even for small kernel sizes, integral images offer computation speed gains over generalized separable kernel convolutions. However, moving average filters are sub-optimal in terms of noise suppression versus kernel size. Binomial or Gaussian filters are much closer to the optimum [18] and can thus filter more noise for a given kernel width. Hence the trade-off between quality (for a given kernel width) and speed.
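The recurrence and the four-term rectangle sum can be sketched directly in code. This is a plain-software illustration with inclusive rectangle corners and no padding row or column (a common implementation convenience the text does not mandate), not the hardware realization.

```python
def integral_image(img):
    """Compute Ii using the recurrence: constant work per pixel."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ii[y][x] = (img[y][x]
                        + (ii[y - 1][x] if y > 0 else 0)
                        + (ii[y][x - 1] if x > 0 else 0)
                        - (ii[y - 1][x - 1] if x > 0 and y > 0 else 0))
    return ii

def box_sum(ii, x0, y0, x1, y1):
    """Sum of the image over the inclusive rectangle (x0, y0)-(x1, y1),
    using only four integral-image terms."""
    s = ii[y1][x1]
    if x0 > 0:
        s -= ii[y1][x0 - 1]
    if y0 > 0:
        s -= ii[y0 - 1][x1]
    if x0 > 0 and y0 > 0:
        s += ii[y0 - 1][x0 - 1]
    return s
```

Dividing `box_sum` by the rectangle area yields the box-filter (moving average) output discussed above.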

1.2 FPGA Design

To be able to program any larger useful system of synthesized logic, the programmer must be able to design the system at a sufficiently high level of abstraction, much higher than the low abstraction level of the flat, binary information that an FPGA must be loaded with to function. A consequence of this disparity is that FPGA design requires multiple stages, each descending down through levels of abstraction. Here follows a rudimentary description of this design flow. The input to the design flow is a collection of source code files written in a hardware description language (HDL) such as VHDL or Verilog. This source code can be organized into modules, packaged into IP cores and integrated into a highly abstract system design. When the system design has been completed, the design flow, which is largely automated, can be initiated.

Compilation

During compilation, the HDL source code files are parsed and translated into a register transfer level (RTL) design. It consists of a description of the system as a collection of registers connected by transfer functions described by the processes in the HDL sources.

Logic Synthesis

Logic synthesis lowers the level of abstraction by translating the registers and transfer functions in the RTL-level design into a gate-level design: a netlist that describes connections of physical components on the target FPGA. The logic synthesizer also optimizes the resulting combinational nets by applying certain logic transformations.

Physical Synthesis

The physical synthesis takes the gate-level design and maps it onto individual instances of hardware primitives on the target FPGA. These instances are grouped


into units called slices, which on the Zynq contain four LUTs, three multiplexers, an arithmetic carry block, and eight flip-flops [23]. The physical synthesis places the logic gates onto these and routes the nets while trying to conform to user-defined constraints. These constraints can be specified at a number of levels of abstraction; some are optional while others are not. In order to operate, the FPGA requires knowledge about clock frequency. The clock frequency represents a constraint on the timing and delay of signals propagating through the FPGA. One measure of how well a design meets timing constraints is the Worst Negative Slack. Slack is the time difference between the maximum delay permitted by the timing constraint and the actual delay. Negative slack implies a violated timing constraint. Therefore, the Worst Negative Slack indicates which area of the design must be improved, and by how much, before the timing constraints are met.

It is quite possible to create constraints that cannot be physically realized by the target FPGA. This may sound undesirable but is helpful when benchmarking different designs. For example, only a discrete set of clock frequencies can be synthesized by the clock circuits on the Zynq, but for purposes of benchmarking and experimenting, it is quite possible to enter a theoretical clock frequency and let the physical synthesis tool try to meet the timing requirements to generate a timing analysis report.

Depending on how well the synthesis tool is able to place the design, the resulting configuration of hardware primitives can be more or less efficient. One measure of how well a design has been placed is the number of LUTs that are listed as "route-through". Since a slice contains multiple different types of hardware primitives and the only inputs to a slice are via its LUTs, a situation can arise where the design tool must consume a LUT only to access a component inside a slice. This LUT is then called "routed-through". The number of route-throughs is therefore an indication of the efficiency of the design, in some sense. After the design has been placed and routed, it can be analyzed. Timing and power analysis are mentioned in this thesis. The timing analysis reports whether the design meets the timing constraints and can detail the delays of any signal path in the design. This is useful when optimizing the design for speed by raising the clock frequency. However, since it is the output of the physical synthesis, it can only be known for a completed system, and the development iteration cycles depend on how fast the design can be synthesized.

The power analysis uses a statistical model of the hardware primitives and the generated routes to estimate the power consumption of the system. Diverse factors such as toggle rates of flip-flops and ambient temperature can be taken into account and the estimate is only as accurate as the input values. Behavioral factors such as toggle rates can be extracted from simulations of the behavior of the system in order to provide accurate estimates.

Bitstream Generation

Once the physical synthesis has finished the design and it meets all constraints, it is time to export it into a format suitable for configuring the FPGA



Figure 1.1: A conceptual illustration of a scale pyramid. Several downscaled layers can be contained in one octave.

board. The design is serialized into a binary format for uploading onto dedicated on-board non-volatile memory. Once the memory has been programmed, the next time the board boots up it will load the bitstream from memory and configure the fabric components with it. Alternatively, the bitstream is uploaded to the FPGA directly without being written to non-volatile memory first.

1.3 Computer Vision

Since computer vision is a large and active field of research, this thesis narrows the focus to the computer vision task of object tracking. Visual object tracking is the process by which a system analyzes moving images to determine the location of an object at any time. To track objects, a vision system extracts features from the images. There are many types of features that can be used to track objects. Interest points, often corners, are one such type which has been studied extensively and utilized in many algorithms [6, 11, 16, 17].

To track objects across video frames, detected feature points which might have moved between frames must be matched so that a correspondence between them can be established and motion can be estimated. This can be facilitated by feature descriptors, which characterize a detected feature (interest point) as a vector in a way that is invariant under certain transformations, so that tracking can be performed in the presence of certain kinds of changes between video frames such as rotation, translation, perspective change, scaling and blurring.

Gaussian Scale Pyramids

Scale invariance of descriptors can be recovered if the processed image is shrunk to multiple scales and features are detected and described at each scale. This construction is known as a scale-space representation of the image, and one tool to create it is the Gaussian scale pyramid. The pyramid consists of layers of images in descending scale, see Figure 1.1. Layers which differ by a length factor of two are said to be one octave apart and may have many fractional octave layers between them. To create the pyramid, one can iteratively create each layer based upon the previous one. Each Gaussian pyramid step down is performed by filtering the image with a Gaussian filter kernel and subsequently downsampling the image.
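The blur-then-decimate structure can be sketched as follows. This toy version uses a small [1, 2, 1] binomial blur in place of the Gaussian and clamped borders; both are simplifying assumptions for illustration, not the filter actually used in the thesis.

```python
def blur121(seq):
    # 1-D [1, 2, 1] / 4 binomial blur with clamped borders
    n = len(seq)
    return [(seq[max(i - 1, 0)] + 2 * seq[i] + seq[min(i + 1, n - 1)]) / 4
            for i in range(n)]

def downscale(img):
    rows = [blur121(r) for r in img]                 # horizontal blur
    cols = [blur121(list(c)) for c in zip(*rows)]    # vertical blur
    blurred = list(zip(*cols))                       # back to row-major
    return [list(row[::2]) for row in blurred[::2]]  # decimate by two

def gaussian_pyramid(img, octaves):
    """Full-octave pyramid: each layer is the previous one blurred and halved."""
    layers = [img]
    for _ in range(octaves - 1):
        layers.append(downscale(layers[-1]))
    return layers
```

Each call to `downscale` produces the next full octave; fractional octave layers would require resampling by factors other than two.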


SIFT

One algorithm that uses a scale pyramid is SIFT (Scale Invariant Feature Transform), which was introduced by [11] and is an advanced feature detector and descriptor. It detects features called SIFT keys that fulfill a certain condition and then creates descriptors of them. The image is pre-processed by generating a Gaussian scale pyramid where each layer is downsampled by a factor of 1.5 per layer using Gaussian filtering, interpolation and resampling. The pixels in each layer are then subtracted from the layer below, which is filtered using a Gaussian function with larger variance.

The scale-space representation of the image is then searched, looking for pixels whose intensities are local extrema in their 8-neighborhood. For each such pixel, the corresponding pixels in the layers below and above are tested similarly. If all three pixels fulfill the requirements, then the middle pixel is identified as a SIFT key. It has a definite location, scale and rotation. The rotation is decided by calculating the apparent gradient direction θ in the 4-neighborhood using:

$$\theta(x, y) = \operatorname{atan2}\big(I(x, y+1) - I(x, y-1),\; I(x+1, y) - I(x-1, y)\big).$$

Once location, scale and rotation have been determined, the detected feature, the SIFT key, can be described in a coordinate system relative to those parameters. The apparent gradient direction and magnitude of all pixels in a square around the detected feature are calculated. They are then grouped into a number of subregions within the square. The magnitudes of the gradients of each pixel are then weighted by a Gaussian function centered on the detected feature. The weighted magnitudes are subsequently accumulated in a histogram of orientations with, for example, 8 bins spread over the 360 degrees. The result for every subregion is a histogram that consists of, for example, 8 rational numbers representing the gradient value in each respective direction. The collection of these values is then taken to be the descriptor for the detected feature.
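A stripped-down sketch of the histogram idea follows: apparent gradient direction and magnitude per pixel, accumulated into 8 bins over 360 degrees. The single region and the omission of Gaussian weighting and subregion grouping are simplifications for illustration, not the full SIFT descriptor.

```python
import math

def orientation_histogram(patch, bins=8):
    """Accumulate gradient magnitudes into orientation bins over 360 degrees."""
    h, w = len(patch), len(patch[0])
    hist = [0.0] * bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            dy = patch[y + 1][x] - patch[y - 1][x]
            dx = patch[y][x + 1] - patch[y][x - 1]
            magnitude = math.hypot(dx, dy)
            theta = math.atan2(dy, dx) % (2 * math.pi)   # map into 0 .. 2*pi
            hist[int(theta * bins / (2 * math.pi)) % bins] += magnitude
    return hist
```

On a horizontal intensity ramp every interior gradient points along the x-axis, so all magnitude accumulates in bin 0.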

It is evident that a large number of computations are needed to compute descriptors for every detected SIFT key, and that these calculations may be best suited to a floating-point implementation in order to capture the wide range of possible sums of gradient values. Indeed, many FPGA implementations of SIFT, such as [3], seem to have had to make a concerted effort to minimize the precision penalty of implementing it in fixed-point arithmetic.

FAST

Features from Accelerated Segment Test (FAST), invented by [15], detects corners by comparing the intensities of pixels in a ring with the intensity of a candidate pixel in the center, see Figure 1.2. The differences are then thresholded. A corner is said to be detected if 12 or more consecutive pixels in the ring are significantly darker or brighter than the candidate. This variant is known as FAST-12. There are other variants, such as FAST-10, where 10 consecutive pixels are sought. To combat the possibility that neighboring pixels are all detected as corners, it is necessary to calculate a score for each detected corner and pick only


(a) Full view. (b) Detail with FAST ring.

Figure 1.2: Illustration of the 16 pixels in the FAST ring and the corner candidate c. It is detected as a corner since there are 12 consecutive pixels (marked with dots) in the ring that are brighter than c.

the one with the highest score. This process is called non-maximal suppression and is discussed further in section 2.3.6.

As such, the FAST algorithm requires few computation steps compared to other detectors and works very well with fixed-point arithmetic. The proposed implementation of FAST is presented in section 2.3.6.
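A software sketch of the segment test is given below. The standard 16-pixel Bresenham ring of radius 3 is assumed, and score computation and non-maximal suppression are omitted; the function is an illustration of the test, not the proposed hardware implementation.

```python
# Offsets of the 16 ring pixels (radius-3 Bresenham circle), clockwise.
RING = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
        (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]

def fast12(img, x, y, thresh):
    """Segment test: 12 consecutive ring pixels all brighter, or all darker,
    than the candidate pixel by more than `thresh`."""
    c = img[y][x]
    ring = [img[y + dy][x + dx] for dx, dy in RING]
    for sign in (1, -1):                 # check brighter, then darker
        flags = [sign * (p - c) > thresh for p in ring]
        run = best = 0
        for f in flags + flags:          # doubled list handles wrap-around runs
            run = run + 1 if f else 0
            best = max(best, run)
        if best >= 12:
            return True
    return False
```

Only comparisons and additions appear, which is why the test maps so naturally onto fixed-point hardware.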

BRIEF

One algorithm for generating feature descriptors is BRIEF, which was introduced by [2]. BRIEF creates a feature vector in the form of a bit string by comparing the intensities of pixels in a predetermined arrangement around a detected feature with each other, see Figure 1.3. Each comparison results in a binary bit which is placed in the vector, which usually has a length of 128, 256 or 512. Comparison between different detected features can then be performed by examining the bits in the feature vectors. A close correspondence means that the regions around the candidates are similar, so the intensity comparisons should give the same results and thus most of the bits should be identical. A measure of the difference is commonly chosen to be the Hamming distance between the bit vectors. Like FAST, BRIEF is computationally simple and works well with fixed-point arithmetic. However, since the spatial arrangement of pixel pairs is often irregular and asymmetric, the same feature in a different orientation or at a different distance from the camera can give wildly different bit vectors. In other words, BRIEF is invariant under neither rotation nor scale.
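The descriptor and its matching metric can be sketched in a few lines. The uniformly random pair arrangement below is an illustrative stand-in (published BRIEF uses a fixed, pre-generated pattern, e.g. Gaussian-distributed), and the smoothing step discussed in the text is deliberately left out here.

```python
import random

def make_pairs(n_bits, radius, seed=0):
    """Fixed pseudo-random arrangement of pixel pairs (illustrative only)."""
    rng = random.Random(seed)
    return [tuple(rng.randint(-radius, radius) for _ in range(4))
            for _ in range(n_bits)]

def brief(img, x, y, pairs):
    """One bit per intensity comparison, packed into an integer bit string."""
    bits = 0
    for x1, y1, x2, y2 in pairs:
        bits = (bits << 1) | (img[y + y1][x + x1] < img[y + y2][x + x2])
    return bits

def hamming(a, b):
    """Number of differing bits between two descriptors."""
    return bin(a ^ b).count("1")
```

Two features with identical neighborhoods yield identical bit strings, i.e. Hamming distance zero, which is the basis for matching.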


Figure 1.3: A spatial arrangement of pixel pairs used in BRIEF. Each pixel pair is illustrated as a line around the interest point at the origin. The distribution of pixels being tested is roughly Gaussian.

1.4 Other Related Work

There are multiple existing implementations of feature detection and description on FPGA. [13] implement a system that uses an iterative architecture for corner detection using FAST and matching using BRIEF. However, they focus on optimizing for small area for implementation on an ASIC; they do not report FPGA primitive utilization, and they perform detection and description at only one scale.

[7] report FPGA primitive utilization and also focus on small area. They perform detection using FAST-12 and description using BRIEF on three scale octaves starting from a top-level resolution of 640 × 480. However, they do not report any smoothing before forming the descriptor vector.

[21] use SIFT for feature detection and BRIEF for description, and perform matching on the FPGA. For smoothing in BRIEF they use a Gaussian 15 × 15 pixel kernel that is separate from the Gaussian filtering performed for the Difference of Gaussians image pyramid. They use a Xilinx Virtex-5 to achieve a throughput of 60 frames per second at 1280 × 720 pixels resolution with a utilization of 4.605 Mb (91%) of BRAM. In comparison, the Xilinx Zynq7000 with the XC7Z020 part has 4.375 Mb of BRAM in total.

[20] utilize multiple integral images with different multiplication factors to approximate a second-order derivative of a Gaussian kernel as part of the detection step done in programmable logic. They perform feature description on a PowerPC CPU on board a Xilinx XC5VFX70 chip and achieve a throughput of about 10 frames per second at 1024 × 768 pixels resolution.

For a comparison of a processor core approach and a more free-form FPGA approach, one can look at [9]. They implement a Histogram of Gradients (HoG)


algorithm, which is part of the SIFT descriptor described above, for video analysis on an XC7Z020 by hand-coding an FPGA design and comparing it to a design that uses programmable processor cores. The result is that the processor core approach consumes more resources but achieves higher performance and took less time to implement, once the processor core IP was finished.

A comparison of the proposed system and these implementations is done in subsection 4.1.2.

1.5 Problem Description and Motivation

FPGAs represent a flexible application platform with unique opportunities and constraints. As such, not all algorithms that are suitable for applications on a CPU are suitable for applications on an FPGA. Some existing computer vision implementations suffer from a loss of quality when implemented on FPGA compared to their CPU implementations [4, 25]. It appears that trade-offs were made such that quality was reduced in order to promote other requirements. Though difficult, it is useful to create an implementation that produces results of equal quality, perhaps even identical results, as a CPU reference implementation. FPGAs offer high performance by supporting massive computation parallelism, which is attractive for high-throughput object tracking applications. These applications often include other system components or smaller hardware platforms and have demands for flexibility to tolerate varying vision scenarios. Thus, it is desirable for the object localization system to consume minimal resources. To state the problem explicitly:

How can a system for feature-based object tracking be implemented on an FPGA chip with demands for high throughput, undiminished precision and efficient resource use?

A major cause of quality loss when porting algorithms from CPU to FPGA is sub-optimal implementations of floating-point arithmetic on FPGA. Often, the resource demands for synthesizing floating-point adders and multipliers are very high, prompting a change to fixed-point arithmetic with a loss of dynamic range or precision. If the need for floating-point arithmetic can be eliminated, much of the quality can be preserved. If undiminished quality is a goal, the selection of tracking algorithms is dictated by their reliance on floating-point arithmetic. It is therefore of interest to investigate arithmetically simple algorithms such as FAST and BRIEF that perform well without floating-point arithmetic, and to investigate the possibility of reaching high performance while maintaining quality by implementing them.

One crucial detail of BRIEF is that since the intensity comparisons are done on individual pixels, noise can easily affect the result of a comparison if the intensity data is not filtered beforehand. Normally it is prohibitively slow to filter the image at each coordinate that is being accessed. One approach is to filter the image using a traditional filter kernel and use the result during feature description. However, this can be sped up by making use of integral images


that enable calculation of sums of arbitrary patches of pixels in the image. A sum (integral for continuous signals) can be calculated by accessing four precomputed values in the integral image instead of every pixel to be summed. This approach is fast but it is not optimal. Computing sums of rectangular pixel areas is equivalent to filtering with a constant kernel, a 2-dimensional moving average filter. Such filters are sub-optimal in terms of noise suppression versus kernel size. Binomial or Gaussian filters are much closer to optimum [18] and can thus filter more noise for a given kernel width.

Another place where Gaussian filters are used is in calculating Gaussian scale pyramids. A central idea now appears: If a scale pyramid is already constructed then much of the filtering necessitated by BRIEF is already performed. Instead of calculating integral images to smooth intensity values, perhaps BRIEF can use already-filtered intensities from a pyramid layer below. In other words:

Can the Gaussian scale pyramid filtering be utilized by BRIEF to meet the demands for undiminished precision and efficient resource use? The fact that the filter kernel in the pyramid may be smaller is counteracted if an efficient Gaussian filter is used, which might be able to compete with a larger but rectangular filter kernel. There is also the option of descending not one but two octaves down the pyramid. This should increase the smoothing, since the cascaded filtering and downsampling should result in a larger effective filter kernel. However, the effects of coordinate quantization due to downsampling may become too disruptive. These matters are investigated in section 2.1. The proposed implementation of BRIEF is presented in subsection 2.3.7.
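The idea can be stated in a few lines of Python. Everything here is illustrative (the names, the data layout, the right-shift coordinate mapping); the actual design is the subject of the thesis, not this sketch.

```python
def smoothed_sample(pyramid, layer, x, y, octaves_down=1):
    """Read the pixel (x, y) of `layer` from a coarser layer of the pyramid
    instead of smoothing on the fly: the coarser layer was already low-pass
    filtered when the pyramid was built. The right shift halves the
    coordinates once per octave descended, which is where the coordinate
    quantization discussed above comes from."""
    coarse = pyramid[layer + octaves_down]
    return coarse[y >> octaves_down][x >> octaves_down]
```

A BRIEF test pair at layer `k` would then compare two `smoothed_sample` reads rather than two raw (or separately smoothed) pixels, removing the need for a dedicated smoothing buffer.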

1.6 Purpose

This thesis aims to implement an efficient algorithm for feature tracking on a Xilinx Zynq-7000 with the XC7Z020 FPGA part, with memory resources shared with an embedded CPU. This work then investigates the consequences of modifications to the smoothing strategy of the selected descriptor algorithm that aim to improve memory resource utilization on the FPGA. If this modification incurs minimal quality losses while enabling higher performance, it can be of relevance to systems on both FPGA and CPU. This investigation is part of an evaluation of the performance, quality and resource efficiency of the implemented system.

1.7 Assumptions and Delimitations

To demonstrate the plausibility of an efficient implementation on FPGA with identical data output to one on PC, it is the aim of this thesis for the main system implementation to exactly replicate the results of the reference PC implementation. Therefore, some details of the implementation, like the exact spatial pixel arrangement of the BRIEF pairs, are taken to be theoretically sound to begin with. The results of the reference implementation are evaluated in terms of performance and quality in the pre-study.


Even though it may be interesting, the latency of the system is not measured. This simplifies the performance analysis somewhat. Based on previous application knowledge, it is deemed unlikely that latency becomes catastrophically high. Additionally, even though power consumption is estimated, it is not used as a basis for design decisions during development. This is because power consumption is a property of the integrated system and thus is unavailable or misleading before the system has been fully integrated.

Functionality-wise, some properties of the scale pyramid are limited due to development time constraints. Even though a scale pyramid can contain fractional octave scales, only full octave scale layers are computed in this work. The choice of the number of octaves is scenario-specific and this work does not target any specific scenario. Therefore, the number of four octaves is selected arbitrarily, taking into account the lack of fractional octave scales.

2 Method

Two implementations are part of this work: a reference implementation on PC and the main system implementation. The overall development method focuses on the different submodules that make up the main system tracking module. The development is divided into units called cycles and the general development cycle involves

1. proposing different versions of a yet unimplemented module;
2. identifying the design choices that must be made;
3. comparing the impacts of the design choices upon the performance and resource efficiency of the system with respect to the current system;
4. implementing the version which maximizes these;
5. measuring the new performance and resource efficiency of the system; and
6. based on these new data, re-evaluating design choices made in the current or previous modules.

This continues until all desired functionality is implemented, whereby additional cycles can be performed to improve resource efficiency and/or performance.

2.1 Pre-study

It is necessary to prove that replacing the integral image filtering with Gaussian-like kernel smoothing in the scale pyramid will not degrade the quality of the generated BRIEF descriptors. To achieve this, the alternatives are implemented in a Windows application on a PC. It generates interest points from a well-known set of test images [19] and uses the different descriptors to describe the features.


The resulting feature vectors are matched between sequential image frames using a brute-force method. To compare the quality of the descriptors, the precision of the matchings is measured.

The precision is defined as the fraction of correct matchings among all matchings. To decide which matchings are correct, a criterion is required. This criterion should reflect the real relation of features between the images: whether they correspond to the same point in the scene being portrayed. To determine that, the fundamental matrix can be used. It describes the orientation of two camera views relative to each other. By knowing the coordinate of a point in an image and the fundamental matrix, the epipolar line on which the corresponding point in the other image must lie can be calculated. Conversely, by knowing a number of pairs of corresponding points, the fundamental matrix can be calculated. If all matchings are correct they should therefore give rise to the same fundamental matrix. However, evaluating all possible selections of matchings is too time consuming, and an estimation algorithm such as RANSAC [5] can be used to speed up the process. RANSAC estimates the parameters (here: elements of the fundamental matrix) of a linear system by randomly selecting data points, calculating the parameters from those points and counting how many data points agree with this choice of parameters. These data points are known as inliers. The number of inliers is recorded and the algorithm repeats. Finally, the parameters which gave the most inliers are returned as the result of the algorithm. Importantly, RANSAC also gives a count of the number of inliers, which can be taken as an estimate of the number of correct matchings. Thus, the precision of the matchings can be measured. If the feature vectors are of high quality, they can be matched correctly between image frames, which increases the fraction of inliers in the estimation of the fundamental matrix, which in turn raises the calculated precision.
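The inlier-counting logic described above can be sketched for a much simpler model than the fundamental matrix. The following illustrative Python example fits a 2-D line with a RANSAC-style loop; the data and all parameter values are synthetic and hypothetical:

```python
import random

def ransac_line(points, iters=200, thresh=1.0, seed=0):
    """Minimal RANSAC for y = a*x + b: sample two points, fit the line
    through them, count inliers, and keep the model with the most inliers."""
    rnd = random.Random(seed)
    best_model, best_inliers = None, 0
    for _ in range(iters):
        (x1, y1), (x2, y2) = rnd.sample(points, 2)
        if x1 == x2:
            continue  # degenerate sample
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        inliers = sum(1 for (x, y) in points if abs(a * x + b - y) < thresh)
        if inliers > best_inliers:
            best_model, best_inliers = (a, b), inliers
    return best_model, best_inliers

# 80 points on y = 2x + 1 ("correct matchings") plus 20 gross outliers.
pts = [(x, 2 * x + 1) for x in range(80)] + [(x, 500 + 7 * x) for x in range(20)]
model, inliers = ransac_line(pts)
precision = inliers / len(pts)  # fraction of matchings deemed correct
assert abs(precision - 0.8) < 1e-9
```

The inlier count divided by the total number of data points is exactly the precision estimate used in the pre-study, only with a fundamental matrix in place of the line model.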

The smoothing strategies that are compared are (i) integral image smoothing, (ii) binomial kernel downscaling of one octave in the pyramid, and (iii) binomial kernel downscaling of two octaves in the pyramid. The idea of (iii) is that a greater depth in the pyramid means a wider effective filter kernel, which is beneficial for the quality of the descriptors [2]. A potential negative effect of (ii) and (iii) is that as one descends the pyramid, the coordinates become more quantized because of the downsampling. At downsampling factor $s$, the error between the true downscaled coordinate $x_d = x/s$ and the coordinate in the downsampled image $\tilde{x}_d = \lfloor x_d \rfloor$ is

$$\Delta x_d = x_d - \lfloor x_d \rfloor = \mathrm{frac}(x_d) < 1,$$

and the error in the reconstructed coordinate becomes

$$s \, \Delta x_d < s.$$

Therefore, since $s$ increases exponentially down the pyramid, so does the maximal quantization error. A comparison of the precision of the different smoothing strategies is given in Figure 3.1 in chapter 3.

If the modification does not incur a loss of quality it can be utilized to save memory and logic in the FPGA. The elements in an integral image need to be deeper than the 8-bit depth of the input intensity values, since many such pixels are to be summed and stored. In fact, the bit depth of the elements in the integral image is governed by the dimensions of the image frame. An image frame of dimensions 1280 × 1024 with a bit depth of 8 bits requires elements with 29 bits each. The BRIEF pixel test map in the reference implementation occupies 47 × 47 pixels and smoothing is required for pixels on the border of that area, which means that 54 image lines would need to be stored if a direct, fully parallel memory structure is to be adopted. For the top layer this means a memory requirement of 245 kB. The descriptor module needs to test 512 pixels in 256 pairs, each value being calculated by adding and subtracting four 29-bit values in the integral image and comparing the two results. This requires 56 4-bit adders per pixel pair for a total of 14336 adders per pyramid layer. A less naive approach would be to only calculate smoothed pixels from one column slice being output from block memory banks and buffer the results for 47 clock cycles. However, this incurs additional memory requirements and reduces the number of adders to a still excessive 3136 per pyramid layer.

Alternatively, if the pyramid is used, the bit depth can remain at 8 bits, and since pixels are retrieved from the downsampled octave below the current one, the BRIEF pixel test map is scaled down and only 25 image lines need to be stored if a direct, fully parallel memory structure is to be adopted. For the top layer this means a memory requirement of 31.3 kB, which is 12.8 % of the above memory requirement. Additionally, the descriptor module then needs only 2 adders per pixel pair for a total of 512 4-bit adders per pyramid layer.
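The arithmetic behind these figures can be reproduced directly; the following is a small Python check of the numbers stated above:

```python
import math

W, H, BITS = 1280, 1024, 8

# Integral-image elements must hold the largest possible sum of pixels.
ii_bits = math.ceil(math.log2(W * H * (2 ** BITS - 1) + 1))
assert ii_bits == 29

# 54 buffered lines of 29-bit elements for the full-resolution layer.
ii_bytes = 54 * W * ii_bits / 8
assert round(ii_bytes / 1024) == 245            # ~245 kB

# Pyramid reuse: 25 lines of plain 8-bit pixels suffice.
pyr_bytes = 25 * W * BITS / 8
assert abs(pyr_bytes / 1024 - 31.3) < 0.1       # ~31.3 kB
assert round(100 * pyr_bytes / ii_bytes, 1) == 12.8  # 12.8 % of the above
```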

This modification has the potential to increase performance on a CPU system if a scale pyramid is already constructed, since the processor then does not have to calculate an integral image. To show this, the alternative implementations (i) and (ii) above are modified to record the time spent in descriptor calculations and to present the sum total before termination. The change in execution time is used to calculate the speedup factor, defined in terms of execution times as $T_0/T$. The results of this experiment are presented in section 3.1.

2.2 Measures

To evaluate the performance and resource efficiency of the system, measures are specified and used in the development process.

2.2.1 Performance

To be able to handle data in real-time from a high data rate source such as an industrial camera sensor, the throughput of the system must be at least as high as the data rate of the source. Thus, throughput is a critical measure of performance.

2.2.2 Resource Efficiency

An efficient system maximizes performance in relation to the resources it utilizes, or conversely, utilizes the fewest resources in relation to the performance it exhibits. Regardless, the performance-to-resource-utilization ratio is a central measure of efficiency.

The question of how to determine the overall resource utilization of a system with multiple types of resources can be answered in many ways. [1] looks at the maximum rate of utilization $U_{max}$ of a number of measured resources, such as FPGA primitives. The most used primitives are typically

• lookup tables (LUTs), which function as combinational logic;
• flip-flops (FFs), which function as storage in sequential logic;
• memory LUTs (MLUTs), which are LUTs that have been configured to function as shift registers or distributed RAM;
• block RAMs (BRAMs); and
• digital signal processing blocks (DSP blocks), which are specialized units for multiplications and additions.

Such a measure would penalize systems that hog one resource while ignoring others, and favor systems that can be horizontally scaled, i.e. whose constituent modules can all be duplicated by a certain factor R. It can be argued that such a system is efficient in the sense that it only utilizes a portion of all resources and enables maximal performance increase with the currently available resources through horizontal scalability. Therefore, a definition of the resource efficiency of the system can be formulated in terms of the performance (throughput) $T$ with the minimum replication factor $R$ related to the utilization rate $U$ of each resource as

$$E = TR = \frac{T}{U_{max}}, \qquad U_{max} = \max(U_{LUT}, U_{MLUT}, U_{FF}, U_{BRAM}, U_{DSP}).$$

The unit of resource efficiency is thus 1 byte per second per maximum resource utilization, denoted 1 B/s/MRU. The resource with the highest utilization shall be denoted the critical resource.
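The measure can be sketched as follows; the utilization numbers below are hypothetical and serve only to show how the critical resource and $E$ are obtained:

```python
def resource_efficiency(throughput_Bps, utilization):
    """E = T / U_max, in bytes per second per maximum resource
    utilization (B/s/MRU). Also reports the critical resource."""
    u_max = max(utilization.values())
    critical = max(utilization, key=utilization.get)
    return throughput_Bps / u_max, critical

# Hypothetical utilization rates for the five measured primitives.
util = {"LUT": 0.42, "MLUT": 0.10, "FF": 0.25, "BRAM": 0.60, "DSP": 0.05}
E, critical = resource_efficiency(197e6, util)
assert critical == "BRAM"
assert abs(E - 197e6 / 0.60) < 1e-6
```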

2.3 Implementation

The implementation of the main system is performed in cycles. To aid the iterative design of the system, the impact of a changed design choice can be calculated using the proposed resource efficiency measure as $\Delta E = \Delta T / U_{max}$. In particular, if $U_{max} = U_A > U_B$ and the change incurs the new utilizations $U_A' = U_A - \Delta U_A$ and $U_B' = U_B + \Delta U_B$, the change is beneficial if $\Delta U_A > 0$ and $U_B' < U_{max}$, and the change that maximizes the margin in these inequalities is the one that is preferred.

Along with the cyclic development strategy it is natural to have a testing method that focuses on small units: unit testing. This can help verify the system in the perspective of its constituent components, which can help to quickly identify a fault, but unit tests can potentially be time-consuming to design for each component in the main system.

Figure 2.1: A high-level schematic of the components on programmable logic. The tracker module is highlighted in orange. Non-interface connections such as clock, reset and interrupt signals have been omitted for clarity.

2.3.1 Reference Implementation

A reference implementation is developed in a Windows application for an Intel PC. It serves both as a benchmark for comparing throughput and for verifying the correctness of the feature descriptors generated by the FPGA implementation. To save time and focus efforts on the FPGA implementation, the reference uses the freely available OpenCV library as a framework for processing the images. Just like the main FPGA implementation, the reference computes the scale pyramid using a 5 × 5 binomial filter kernel (approximating a Gaussian kernel), detects corner features using FAST-12 and describes them using BRIEF with 256 pixel pairs. The roughly Gaussian descriptor map is ported to the FPGA implementation without modification other than compensating for the coordinate downsampling by pre-scaling all the pixel coordinates. The standard smoothing strategy of the description algorithm is adopted from the implementation used in the pre-study. By default, OpenCV computes an integral image for every processed image and then uses that to compute smoothed pixel values, but the code is modified to reuse the scale pyramid pixel values instead.
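The 5 × 5 binomial kernel and its separability can be illustrated as follows (a NumPy sketch; OpenCV's actual filtering code is not reproduced here):

```python
import numpy as np

# 1-D binomial taps (Pascal's triangle row 4), normalized; their variance
# is 1, so they approximate a Gaussian with sigma = 1.
b = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0
assert abs(sum(b[i] * (i - 2) ** 2 for i in range(5)) - 1.0) < 1e-12

# The separable 5x5 kernel is the outer product of the 1-D taps.
K = np.outer(b, b)
assert abs(K.sum() - 1.0) < 1e-12

# Filtering horizontally and then vertically equals one 2-D pass with K
# (checked on the interior, away from border effects).
img = np.arange(49, dtype=float).reshape(7, 7)
h = np.array([np.convolve(row, b, mode="same") for row in img])
hv = np.array([np.convolve(col, b, mode="same") for col in h.T]).T
direct = np.array([[(img[y - 2:y + 3, x - 2:x + 3] * K).sum()
                    for x in range(2, 5)] for y in range(2, 5)])
assert np.allclose(hv[2:5, 2:5], direct)
```

Separability is what makes the two-step (horizontal, then vertical) filtering in the pyramid stage cheap: two 5-tap passes instead of one 25-tap pass per pixel.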

Here follows a description of the main system implementation, first detailing the programmable logic components and then focusing on the design of the tracking module.

2.3.2 Programmable Logic Infrastructure

A diagram of the programmable logic blocks is shown in Figure 2.1. All modules are instances of standard Xilinx IP cores except the tracker module (highlighted in orange), which has been implemented by the author. The central component in the programmable logic infrastructure is the DMA engine. Data flow to the tracker module is controlled by the DMA engine, which is configured by the CPU via a memory-mapped command buffer in block memory. The PL-side DMA engine reads commands from this memory via a block RAM controller, interfaces with the DDR3 memory banks and transmits sequential image frames from the memory-mapped interface to the tracker module on the streaming interface. The output from the tracker module is streamed into a FIFO block RAM with a memory-mapped interface towards the CPU, which can poll or receive interrupts on arrival of new data to fetch. All interfaces on the tracker module are AXI4-Stream interfaces, which enable unlimited burst lengths, require minimal control signals and support high clock frequencies. Alternative designs that make use of IP cores with a specialized accelerator interface were discarded due to increased control signal overhead. For instance, the LogiCORE AXI4-Stream Accelerator Adapter [24] that is bundled with Xilinx software exposes a custom interface of control signals that gives complete control over processing "jobs" submitted to the accelerator. However, the added control capability over the simpler streaming-oriented AXI4-Stream interface does not offset the added resource requirements of synthesizing the added IP core and additional control logic inside the tracking module.

The asymmetric setup of DMA input and FIFO output is suitable when considering the data volumes to and from the tracker module. The input consists of pixel intensity values that are coded as 8-bit fixed point numbers. To process enough pixels from a camera sensor with a frame rate of 150 Hz and a resolution of 1280 × 1024, this requires a flow of 197 Mpixel/s, thus 197 MB/s of continuous input data flow. The DMA is able to handle these volumes by interfacing directly with the DDR memory (a 32-bit DDR3-1066 module with a maximum theoretical bandwidth of 4.267 GB/s) and reading image frames in sequential bursts. In contrast, the tracker output data volume is governed by the number of features detected. Each feature is characterized by four data fields: the scale level, the x- and y-coordinates, and the descriptor vector. This amounts to 36 bytes per feature. Based on the reference implementation, the expected maximum number of features detected per frame is in the order of one thousand, which translates to 5.4 MB/s. This volume of data can likely be handled by uncached, memory-mapped block memories interfacing with the CPU. The CPU itself can likely handle that data volume as well, considering it has no other tasks to perform once the DMA is configured and running; see subsection 2.3.9. Alternative designs for the output data flow that use the DMA for writing to DDR memory were discarded because the DMA is designed neither to handle the extremely short data packets of individual features nor the sporadic nature of their detection process.
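The data-rate figures above can be verified with a few lines of arithmetic (illustrative Python):

```python
# Input side: one 8-bit pixel per value at full sensor rate.
W, H, FPS = 1280, 1024, 150
in_rate = W * H * FPS               # pixels/s, which equals bytes/s at 8 bpp
assert round(in_rate / 1e6) == 197  # ~197 MB/s of continuous input

# Output side: scale level, x, y and the 256-bit descriptor per feature.
feature_bytes = 36
features_per_frame = 1000           # expected order-of-magnitude maximum
out_rate = feature_bytes * features_per_frame * FPS
assert abs(out_rate / 1e6 - 5.4) < 1e-9  # 5.4 MB/s

# Each 288-bit feature crosses a 32-bit stream interface in 9 beats.
assert feature_bytes * 8 // 32 == 9
```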

2.3.3 Tracker Module Architecture

A high level schematic of the tracker module is shown in Figure 2.2. It displays the correspondence of different submodule instances to the scale pyramid layers.


Figure 2.2: A high-level schematic of the components in the tracker module.

A high-performance FPGA system exploits the potential for parallel computation. Therefore, a major architectural design choice that must be made is the mode of parallelism. One alternative is to implement a small microprocessor core that can be programmed to perform the necessary steps of computation iteratively. This core can then be horizontally scaled up to the desired level of parallelism, increasing throughput proportionally. Another alternative is to implement a deep pipeline where each step of computation executes each clock cycle for each data element in a sequence, similarly to working on an assembly line. This allows a constantly high throughput despite the presence of time-consuming computations. The latter alternative was selected early in the development because of the mentioned properties. Additionally, the microprocessor core approach requires advanced control structures to execute the different steps of computation in a programmed pattern, while the deep pipeline approach allows for much simpler control structures, and potentially shorter development time, since each submodule normally performs its corresponding computation step every clock cycle. The tracking module is divided into five groups of submodules: pyramid, pixel buffer, detector, descriptor and feature buffer.


2.3.4 Scale Pyramid

The scale pyramid consists of a number of layers of images, each a factor of two smaller in width and height. The layers are cascaded such that each one takes the output from the larger, previous layer and sends its output to the next, smaller image. The downscaling is performed by a filtering stage followed by a downsampling stage. The first stage filters the image with a separable binomial kernel in two steps: horizontal and then vertical. The downsampling stage halves the width and height of the resulting image by sampling the pixels in a rectilinear pattern of every other pixel on even lines and skipping all pixels on odd lines. Formally, the pixel at location $(x, y)$ is sampled iff $2 \mid x$ and $2 \mid y$; see Figure 2.3a. This results in simple control logic but sporadic output, since the image is scanned in a row-major fashion. Half the time (even image rows) one pixel is output every other input clock cycle, but the rest of the time (odd image rows) no pixels are output. An alternative sampling pattern with the same pixel density is a staggered pattern that samples $(x, y)$ iff $4 \mid (x + 2y)$; see Figure 2.3b. This distributes the output pixels better in time: one pixel is output every four input clock cycles, except during image border transitions, for all rows. This increased mean delay between output pixels is notable because all subsequent processing stages are governed by the output pattern of the pyramid, and the later stages can utilize the extra clock cycles between pixels to process current results to a higher extent without buffering. The staggered pattern was ultimately not used because the rectilinear alternative was used in the reference implementation and any other sampling pattern would modify the results slightly. The pixel output from the pyramid is fed to the pixel buffer.
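The two sampling patterns can be compared in a small sketch (illustrative Python); both keep one pixel in four, but only the staggered pattern produces output on every row:

```python
def rectilinear(w, h):
    """Sample (x, y) iff 2|x and 2|y."""
    return {(x, y) for y in range(h) for x in range(w)
            if x % 2 == 0 and y % 2 == 0}

def staggered(w, h):
    """Sample (x, y) iff 4|(x + 2y)."""
    return {(x, y) for y in range(h) for x in range(w)
            if (x + 2 * y) % 4 == 0}

W = H = 16
r, s = rectilinear(W, H), staggered(W, H)

# Same density: both patterns keep one pixel in four.
assert len(r) == len(s) == W * H // 4

# Rectilinear leaves odd rows empty; staggered samples every row.
assert all(y % 2 == 0 for (_, y) in r)
assert {y for (_, y) in s} == set(range(H))
```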

2.3.5 Pixel Buffer

The pixel buffer is the largest data structure in the tracker module and provides direct, fully parallel access to all the pixel data that is needed to detect and describe each pixel once per clock cycle. It consists of two major components: the pixel pad and the line buffer. The pixel pad is a 25 × 25 grid of cascaded shift registers and flip-flops that store pixel values in the current pyramid layer. The line buffer, which is implemented as parallel FIFO block RAM modules, shifts pixels into and out of the pixel pad. The entire pixel buffer is thus able to output any pixel at a distance of at most 25 from the center, sufficient for both the FAST segment test circle and the BRIEF test pairs. Recall that the test pairs are mapped from the pyramid octave above the current one, so the required patch side length is halved. Additionally, the pixel buffer is able to shift all pixels every clock cycle while maintaining a minimal level of shift register utilization.

2.3.6 Detector

The detector consists of two components: segment test and non-maximal suppression. The output from these components can be computed in parallel, as they do not depend on each other. The results are synchronized using a small buffer and subsequently passed to the feature buffer.

Figure 2.3: Visualization of different patterns of sampling during pyramid downscaling. (a) Rectilinear. (b) Staggered.

Figure 2.4: Conceptual illustration of the border and center pixels where the border pixels fulfill the requirements for a corner.

Segment test

The segment test component of the detector decides whether the current center pixel is a corner candidate based on the intensities of the border pixels. The design is divided into two variants: one that determines if there are 12 consecutive darker pixels on the circle and one that determines the same for brighter pixels. The following describes the darker variant; the brighter variant is constructed analogously.

The first stage tests each of the 16 pixels on the circle (monochrome shades in Figure 2.5) for whether it is darker than the center pixel or not and outputs a bit $p(k)$ accordingly (marked blue). The second stage forms tests $q(k)$ for all 16 sets of four consecutive pixels. Each set outputs boolean true if and only if all four pixels are darker (marked green). The third stage divides the 16 signals from the second stage into four classes labeled $r(k)$ according to the four-modulus of their location (visualized as movement of the green cells from a linear arrangement into a grid in Figure 2.5). For example, the signals from locations 1, 5, 9 and 13 form class 1. Each class outputs boolean true if and only if three or more of its constituent input signals are true. To detect 12 consecutive darker pixels, it is a necessary and sufficient condition to detect that any three signals in some equivalence class under modulus four (represented by columns in Figure 2.5) are boolean true. Therefore, the fourth stage, which tests whether any output signal from the third stage is true, completes the segment test module.

Figure 2.5: Conceptual illustration of the components, stages and data flows of the segment test.

Formally: each circle state $(p(0), p(1), p(2), \ldots, p(15))$ such that 12 consecutive (with wrap-around) pixels $p(n) = \cdots = p(n+11) = 1$ corresponds to one segment test state $k$ such that $r(k) = 1$, where

$$p(n) = \begin{cases} 1 & \text{if pixel } j \text{ on the circle is darker and } n \equiv j \pmod{16}, \\ 0 & \text{otherwise,} \end{cases}$$

$$q(k) = \begin{cases} 1 & \text{if } p(k) + p(k+1) + p(k+2) + p(k+3) = 4, \\ 0 & \text{otherwise,} \end{cases}$$

$$r(k) = \begin{cases} 1 & \text{if } q(k) + q(k+4) + q(k+8) + q(k+12) \geq 3, \\ 0 & \text{otherwise,} \end{cases}$$

and no other circle states correspond to such a $k$. An intuitive definition of $q(k)$ and $r(k)$ is that, under division by four, they access indices with the same quotient ($q$) and remainder ($r$). This can be contrasted with the naive approach of testing all configurations of 12 consecutive pixels directly, without this factorization into the two steps of $q(k)$ and $r(k)$.

Figure 2.6: Visualization of pixel pattern for non-max suppression. Scores of dark blue pixels are easily retrieved by buffering scores of blue pixels.

This algorithm can be accommodated to other versions of FAST, such as FAST-10, where 10 consecutive darker pixels are sought. In this case, $q(k) = 1$ if $p(k) + p(k+1) = 2$ and $0$ otherwise, and the third stage tests whether 5 or more of the signals in some class are boolean true.

The design exploits common factors between the circle circumference (16 pixels) and the length of the desired segment (e.g. 12 pixels) and saves resources when these common factors are high.

Verification

This novel algorithm needs to be verified in a unit test, which is mechanical in nature. There are many strategies for mechanical verification, for example brute-force verification and automated provers. Of the two, the easiest to implement is brute-force verification. If the domain is sufficiently small and/or if the problem contains symmetries that can be used to reduce the required number of trials, brute force can be beneficial in its simplicity and ease of integration.

The domain of the corner detector is the set of tuples of the 16 comparison results between the border and center pixels. Each border pixel corresponds to a variable with three possible values: darker, same and brighter. As such, the configuration of all 16 border pixels corresponds to one 16-tuple among $3^{16} = 43\,046\,721$ unique possible tuples. However, if the symmetry between darker and brighter is exploited, only two states remain, same and darker/brighter, and the number of unique tuples becomes a more manageable $2^{16} = 65536$. A reference for the verification is the naive algorithm of iterating over all possible start indices on the border and checking whether each is the start of 12 consecutive brighter/darker pixels.
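A brute-force unit test in this spirit can be sketched in a few lines. The following is an illustrative Python model of the staged segment test, exhaustively compared against the naive reference over all 2^16 darker/not-darker circle states:

```python
def segment_test(p):
    """Staged FAST-12 segment test: q over runs of four consecutive
    pixels, r over modulus-4 classes, as described above."""
    q = [all(p[(k + i) % 16] for i in range(4)) for k in range(16)]
    return any(sum(q[(k + 4 * i) % 16] for i in range(4)) >= 3
               for k in range(4))

def naive(p):
    """Reference: scan all start indices for 12 consecutive set bits."""
    return any(all(p[(n + i) % 16] for i in range(12)) for n in range(16))

# Brute force over every possible darker/not-darker circle state.
for state in range(1 << 16):
    p = [(state >> i) & 1 for i in range(16)]
    assert segment_test(p) == naive(p)
```

In the actual system the same comparison would of course run against the synthesized hardware (or its simulation) rather than a second software model, but the exhaustive-domain idea is identical.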

Non-maximal Suppression

The tracker chooses a corner among detected candidates by scoring them and picking the candidate with the highest score. There are several ways to define the score, some more computationally complex than others. The reference implementation, which is based on code provided by Rosten, calculates the highest threshold for which the point is still detected as a corner and uses this number as the score. This definition involves iterating the detection multiple times, where each iteration depends on the previous. As such, it is not ideal for implementation on a massively parallel platform such as an FPGA, and it is a highly quantized scoring measure [16]. The chosen definition of the score is instead based on the sum of the absolute differences between the border pixels and the center pixel.

Independent of the choice of scoring is the choice of candidate search area. A larger area results in fewer duplicate detections of the same interest point but requires more computations. A minimal strategy is to search only the 4-connected neighborhood of the central pixel, see Figure 2.6. This is the strategy chosen in the reference implementation and is also implemented here. To cope with a data rate of one pixel per clock cycle, it would be necessary to have five segment testing and scoring instances, each examining one candidate pixel. But since pixels are scanned in row-major order, it is easy to eliminate the need for two of the detector instances (left and middle) by buffering the result from another (right), see Figure 2.6. If memory resources permit, the number of instances can even be reduced from three to one, with buffering of two image lines' worth of scores.

Careful note has to be taken of a few edge cases when comparing identical scores of two adjacent candidates. The implementation provided by Rosten suppresses both candidates in such a scenario, but for the robustness of the detected features it may be desirable that one candidate is chosen in this case. This is accomplished in the proposed implementation by allowing the center pixel to be chosen even when its score equals that of the pixel to the right or the one below.

2.3.7 Descriptor

The BRIEF descriptor consists of an array of 256 comparators, each testing intensity values from two pixels on the pixel pad in the pyramid octave below. The quality of the generated feature vector depends on the spatial arrangement according to which these pixels are selected and compared. The reference implementation uses a Gaussian distribution (see Figure 1.3), which was shown experimentally by [2] to be preferable over other arrangements. The arrangement extends over a patch of 47 × 47 pixels at full resolution. In the downsampled lower layer this corresponds to a patch of 25 × 25 pixels.
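The comparator array can be modeled in software as follows (illustrative Python; the pair arrangement below is randomly generated for the sketch and is not the fixed, precomputed map used in the thesis):

```python
import random

def brief_descriptor(patch, pairs):
    """256 intensity comparisons packed into a bit vector; `pairs` holds
    (x1, y1, x2, y2) offsets relative to the patch centre."""
    cy = cx = len(patch) // 2
    bits = 0
    for i, (x1, y1, x2, y2) in enumerate(pairs):
        if patch[cy + y1][cx + x1] < patch[cy + y2][cx + x2]:
            bits |= 1 << i
    return bits

def hamming(a, b):
    """Descriptors are matched by Hamming distance on the bit vectors."""
    return bin(a ^ b).count("1")

# Hypothetical arrangement: Gaussian-distributed offsets clipped to the
# 25x25 patch (sigma chosen arbitrarily for illustration).
rnd = random.Random(0)
def clip(v):
    return max(-12, min(12, int(round(v))))
pairs = [(clip(rnd.gauss(0, 5)), clip(rnd.gauss(0, 5)),
          clip(rnd.gauss(0, 5)), clip(rnd.gauss(0, 5))) for _ in range(256)]

patch = [[rnd.randrange(256) for _ in range(25)] for _ in range(25)]
d = brief_descriptor(patch, pairs)
assert 0 <= d < (1 << 256)
assert hamming(d, d) == 0
```

Each comparison contributes one bit, so the whole descriptor fits in 256 bits and matching reduces to an XOR and a population count, which is what makes brute-force matching fast.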

One consequence of using a precomputed downsampled pixel arrangement is that the coordinate rounding error becomes larger than otherwise expected. The coordinate of a test pixel $x$ has two components: the central pixel coordinate $c$ and a coordinate $m$ inside the arrangement relative to the center pixel. If a precomputed downsampled pixel arrangement is used, then the quantization of $c$ and $m$ must be done separately. Thus, for the ideally downsampled coordinate $x_d$ and the coordinate $\tilde{x}_d$, which is an approximation because of the separate quantization, the following holds:

$$x_d = \left\lfloor \frac{c + m}{2} \right\rfloor, \qquad \tilde{x}_d = \left\lfloor \frac{c}{2} \right\rfloor + \left\lfloor \frac{m}{2} \right\rfloor, \qquad \Delta x_d = x_d - \tilde{x}_d, \qquad |\Delta x_d| \leq 1.$$

Therefore, the rounding error may be slightly larger than what is ideal.

2.3.8 Feature Buffers

The detector results on each layer must be synchronized with descriptor results from the layer below. The timing of the results from the detector on each layer is governed by the placement of the detector circles on the pixel pad. This offers some flexibility in timing, up to a limit. Experimentally, however, it is still not possible to synchronize the results from the detector and the descriptor without the use of a buffer.

The feature buffer is a dual-port asymmetric block RAM that functions as a FIFO queue for the detector results. By using the dual-port block RAM in an asymmetric configuration, it is possible to enqueue data with one width and dequeue data with a different width, addressing the same data differently on input and output. The detector outputs a single bit for each pixel that indicates the presence of a corner and stores it on the top interface of the block memory, which is configured for a bit width of 1. The bottom interface is configured for a bit width of 4, such that the results of the 2 × 2 patch that maps to the current pixel coordinate on the downsampled layer below can be accessed in one read operation. Apart from pairing detector and descriptor data, it is also the feature buffer's task to signal the image frame transitions. Since it has information about the coordinate of the current detector result, it is able to insert a synchronization signal when the first pixel of a new frame arrives. Once the detector and descriptor results have been synchronized, they are passed to the stream buffer.

The stream buffer is a two-stage buffer that schedules the output of results from each layer onto the AXI4-Stream interface between the tracker module and the block memory controller. The first stage has dedicated slots for each layer to accommodate features from multiple layers simultaneously. The features are read cyclically and written to the second-stage buffer, which is a traditional FIFO queue. The data width of the output AXI4-Stream interface is 32 bits and a feature is encoded with 288 bits. Therefore, the transmission of one feature takes 9 clock cycles, during which the next results are buffered in the second stage.

If either the first- or the second-stage buffer becomes full, a signal disables the earlier stages in the tracker module pipeline until the congestion has cleared.

2.3.9 Processor Application

The PS-side part of the system is a standalone application written in C. It generates the image frames from which objects are to be tracked by the PL-side tracker module and stores them in DDR memory. It subsequently issues transfer commands to the DMA engine and reads the resulting features back from the FIFO queue. Since the layers finish processing an image frame at different times, the order in which results from different image frames are enqueued in the FIFO and then sent to the PS can be scrambled. However, using the frame synchronization mechanism controlled by the feature buffer, the application is able to correctly assign the features to each image frame in the sequence.


3 Results

3.1 Pre-study Results

Over a dataset of 40 images [19] divided over five measurement runs of eight images each, a total of 242500 corners were detected using FAST and described using BRIEF on a PC with an Intel Xeon W3550 x64 processor with 6 GB DDR3 memory. The original smoothing algorithm, which used an integral image, took on average 3.85 µs per corner (with a standard deviation of 0.254 µs), while the proposed algorithm, which used pixels from one scale pyramid layer below (see section 1.5 and section 2.1), took on average 0.498 µs per corner (with a standard deviation of 0.0407 µs). The resulting speedup based on these averages is 7.73 (with a standard deviation of 0.818). When timing description as well as detection, the throughput of the PC application when using the original smoothing algorithm was on average 39.2 Mpixel/s (with a standard deviation of 6.72 Mpixel/s), while the throughput when using the proposed algorithm was 91.8 Mpixel/s (with a standard deviation of 9.61 Mpixel/s). The results of the precision measurements are shown in Figure 3.1.

3.2 Performance

The performance results were determined by the synthesis and place-and-route tools configured to target a Xilinx Zynq-7000 xc7z020clg484-1 chip. The system has a maximum clock frequency of 200 MHz. This translates to 200 Mpixel/s and 152 frames per second at a resolution of 1280 × 1024. The throughput is 200 MB/s = 190.7 MiB/s. The Worst Negative Slack is +0.077 ns, which is 1.54 percent of one clock cycle. The AXI DMA IP core contains the majority of the paths with low slack. The estimated total on-chip power consumption reported by the tool is 1.924 W. The power estimation was performed using the default

settings for voltage, toggling rates, and an ambient temperature of 25 °C.

[Figure 3.1: Precision for different smoothing strategies in the BRIEF descriptor. The image pairs are grouped by their separation, i.e. how far the images are separated in the sequence. The error bars indicate the standard deviation. Series: integral image, pyramid (1 octave down), pyramid (2 octaves down).]

In addition to being implemented using the fabric clock from the PS, the FPGA components were re-implemented using a virtual external clock. This configuration allowed the system to achieve a theoretical maximum clock frequency of 210.9 MHz (to within 10 kHz), which translates to 210.9 Mpixel/s and 160.9 frames per second at a resolution of 1280 × 1024. However, it should be noted that this configuration cannot be used on an actual chip, as the Zynq does not have external clock source pins; it only serves as an indication of the theoretical clock frequency limit of the PL components.

3.3 Resource Efficiency

The utilization of the different FPGA primitives is given in Table 3.1. Note that "post-implementation" refers to post-place-and-route results, which can be contrasted with post-synthesis results. No LUTs were reported as used exclusively for route-through. The system exhibits a performance of 589 MB/s/MRU.

3.4 Implementation

The development cycle that impacted resource efficiency the most was the replacement of two instances of the segment test and scoring components (in the detector submodule) with block memory buffers from the third instance. The
