Institutionen för systemteknik
Department of Electrical Engineering

Master's thesis (Examensarbete)

Polynomial Expansion-Based Displacement Calculation on FPGA

Thesis work carried out in Electrical Engineering at the Institute of Technology at Linköping University

by

Carl Ehrenstråhle

LiTH-ISY-EX--16/4981--SE

Linköping 2016

Department of Electrical Engineering, Linköpings universitet

Polynomial Expansion-Based Displacement Calculation on FPGA

Thesis work carried out in Electrical Engineering at the Institute of Technology at Linköping University

by

Carl Ehrenstråhle

LiTH-ISY-EX--16/4981--SE

Supervisors: Andreas Ehliar, ISY, Linköpings universitet
             Olof Kraigher, Synective Labs AB
Examiner: Oscar Gustafsson, ISY, Linköpings universitet


Division, Department: Division of Computer Engineering, Department of Electrical Engineering, SE-581 83 Linköping
Date: 2016-11-22
Language: English
Report category: Examensarbete (Master's thesis)
URL for electronic version: http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-132755
ISRN: LiTH-ISY-EX--16/4981--SE
Title: Polynomexpansions-baserad förskjutningsberäkning på FPGA (Polynomial Expansion-Based Displacement Calculation on FPGA)
Author: Carl Ehrenstråhle


Abstract

This thesis implements a system for calculating the displacement between two consecutive video frames. The displacement is calculated using a polynomial expansion-based algorithm. A unit-tested, bottom-up approach is successfully used to design and implement the system. The designed and implemented system is thoroughly elaborated upon. The chosen algorithm and its computational details are presented to provide context for the implemented system. Some of the major issues and their impact on the system are discussed.


Acknowledgments

I'd like to thank multiple people for helping me during my work on this thesis. Mattias Tiger, for helping me understand the Farnebäck algorithm. Olof Kraigher, for helping me with everything related to the Zynq and VUnit. Ludvig Vidlid, for being a good discussion partner and for helping me with the memory support modules. Lastly, I want to thank Synective Labs and the university for their patience during the whole process.


Contents

1 Introduction
1.1 Background
1.2 Purpose
1.3 Method
1.4 Choice of Algorithm
1.5 Limitations
1.6 Related Work

2 Algorithm
2.1 Polynomial Expansion
2.1.1 Cross Correlation
2.1.2 Inverse G
2.2 Displacement Calculation
2.2.1 A
2.2.2 Delta b
2.2.3 G
2.2.4 h
2.2.5 Averaging of G and h
2.2.6 d

3 System
3.1 Overview
3.2 Control
3.3 Polynomial Expansion
3.3.1 Cross Correlation
3.3.2 Sparse Matrix Multiplication
3.3.3 Control
3.4 Displacement Calculation
3.4.1 A and Delta b
3.4.2 G and h
3.4.3 G and h Cache
3.4.4 G and h Averaging
3.4.5 Displacement Calculator
3.4.6 Flow Container
3.5 Support
3.5.1 Video Generator
3.5.2 Video Processors
3.5.3 Memory Interfaces
3.6 Memory Usage
3.6.1 Polynomial Expansion
3.6.2 Displacement Calculation
3.6.3 Video Generation
3.6.4 System Total

4 Architectural Discussion
4.1 Memory Bandwidth vs. FPGA BRAM Resources
4.2 Block Overlap
4.3 Data Bit Widths

5 Conclusions
5.1 Results
5.2 Evaluation
5.2.1 Purpose
5.2.2 Method
5.2.3 Results
5.3 Further Work

A Cross Correlation Block Overlap
B Output Video Statistics


1 Introduction

1.1 Background

Synective Labs has a need to display its areas of expertise to potential customers. To showcase this, a system containing the relevant areas of expertise is designed and implemented. A real-time video processing system achieves this and produces a visually perceivable result. The type of video processing algorithm chosen for the system is optical flow. The computation of a dense optical flow is both computationally demanding and produces a result that can be visually rendered, which is desirable for a system used for demonstration.

1.2 Purpose

The purpose of this thesis can be described as taking an optical flow algorithm and pushing it to the limit, in terms of resolution and frame rate, to see where the bottlenecks appear. What architectural decisions does this evoke, in the context of the hardware platform used?

1.3 Method

The method used is coarsely divided into the following subsequent steps:

1. Algorithm choice
2. Software reference implementation
3. Hardware implementation
4. Driver implementation
5. System integration

Firstly, the algorithm to implement is chosen. This is done by analyzing the complexity of the freely available dense optical flow algorithms in order to find one suitable for FPGA implementation. Another criterion is the availability of a high-level reference implementation to speed up the prototyping of the bit-exact software reference. Section 1.4 contains more details about choosing the algorithm.

Two versions of the reference software are implemented: one using double-precision floating point in order to match the high-level reference, and a second using fixed-point arithmetic. This is to ensure correctness and to enable analysis of the results acquired by varying the number of bits used by the fixed-point representation.

The testing of the hardware is made easier by using the data produced by the bit-exact software reference. By dumping the intermediate results from the software reference, each stage of the hardware computational pipeline can be tested using real data. The data is re-used to ensure that any integrated parts of the system still produce the expected result after integration.

For testing, the unit testing framework VUnit (http://github.com/VUnit/vunit) is used. VUnit supplies some very useful tools when creating test benches. Examples of useful VUnit features are the array package and the OS-VVM (http://osvvm.org) integration. The array package is used to hold stimuli and expected result data, while the OS-VVM integration is mainly used to provide pseudo-random numbers to control jitter generation. Naturally, the checking functionality provided by VUnit is used as well.
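As an illustration, a minimal VUnit run script might look as follows; the library name and file globs are hypothetical, while VUnit.from_argv, add_osvvm, add_library, add_source_files and main are the framework's documented entry points.

```python
# run.py -- minimal VUnit entry point (library name and paths are
# illustrative, not the ones used in this project)
from vunit import VUnit

vu = VUnit.from_argv()             # parse command line options
vu.add_osvvm()                     # compile the bundled OS-VVM library
lib = vu.add_library("flow_lib")   # hypothetical library name
lib.add_source_files("src/*.vhd")
lib.add_source_files("test/tb_*.vhd")
vu.main()                          # compile, run all test benches, report
```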

The hardware is constructed using the bottom-up design methodology, which allows for early and thorough testing. With the support of the software reference and the VUnit testing framework, a high degree of confidence in the correctness of the implemented hardware is established from the start. This greatly aids the debugging process when doing system integration.

When debugging the hardware during the system integration phase, Xilinx ChipScope can be used to great effect. The ChipScope logic analyzer is built into the bit file produced by the synthesis tool, in this case Xilinx Vivado. This allows cycle-accurate measuring of the selected signals. ChipScope also allows triggering the logic analyzer via finite state machines; this is very useful when a fault condition is triggered due to non-trivial prerequisites.

1.4 Choice of Algorithm

Choosing the algorithm suitable for the thesis comes down to the following major criteria:

1. Real-time processing
2. Suitable complexity
3. Low usage of memory bandwidth
4. Parallelizable operations
5. Accurate enough result

The main criterion, 1, is to be able to process the video stream and produce the optical flow field in real time with the least amount of latency possible. Criteria 2–4 are necessary consequences of the main criterion. To fulfill the needs of Synective Labs, as described in Section 1.1, criterion 5 is introduced.

To find a suitable algorithm, the evaluation table provided with the KITTI Vision Benchmark Suite [Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago, 2015] is an excellent resource. Algorithms with very long runtime and high memory usage can easily be discarded.

The availability of reference source code is an important factor in choosing the algorithm to implement. This is due to multiple factors. Firstly, as mentioned in Section 1.3, the time to produce the bit-exact software reference is reduced. Secondly, the behavior of the algorithm can be observed when adjusting algorithm-specific parameters.

One of the prominent parameters to analyze is the number of iterations the algorithm uses to produce the final result. The reasoning behind this is that every iteration consumes either hardware resources or memory bandwidth. This in turn affects the final throughput of the whole system, which can jeopardize criterion 1.

1.5 Limitations

This thesis does not reflect upon the ethical impacts that the production of an optical flow system has; neither does it consider the impacts that the system might have on society. Due to tight time constraints, the thesis does not cover any optical flow algorithm specifics other than the details needed for computation.

1.6 Related Work

Optical flow calculation performed on an FPGA platform is not an unexplored area. Researchers have implemented different algorithms using different architectures.

Komorkiewicz et al. [2014] present an efficient pipelined dense optical flow calculation system implementing the Horn-Schunck algorithm. The article contains detailed information about the architecture and its performance. It also compares eight previous implementations of the same algorithm along with two implementations proposed by the authors. The comparison covers the number of iterations used by each implementation, the frame rate and resolution of the input, the throughput, eventual pre- or post-processing, and whether the result is verified. An example of the compared implementations is one by Gultekin and Saranli [2013], which uses a PC connected via RS232 as the video source and sink. The PC is also used to verify the result. The article provides a thorough analysis of the implemented system.

Architectures using modern SoC solutions have also been explored. A PowerPC and FPGA SoC is used by Claus et al. [2009] to accelerate an optical flow algorithm based on the census transform presented by Stein [2004]. The PowerPC cores are used for post-processing and presentation, while the FPGA is used to accelerate the algorithm beyond the performance of a pure software implementation.

To improve the accuracy of the algorithm introduced by Lucas and Kanade [1981], Barranco et al. [2012] use a multi-scale approach. The architecture successfully increases its accuracy by over 100% using two scales instead of one. Additional levels of scale result in a further increase of accuracy, though to a lesser degree than the step from one scale to two. The improvement in accuracy comes at a cost in frame rate. This approach can be applied to the polynomial expansion-based algorithm implemented in this thesis; Sections 7.6 and 7.7 of Farnebäck [2002] cover how this can be done.


2 Algorithm

This chapter introduces the algorithm implemented in this thesis. The relevant mathematical equations are presented together with references to the source paper, Farnebäck [2003], and dissertation, Farnebäck [2002]; when both references are provided the paper is referenced first.

2.1 Polynomial Expansion

The method used to calculate the polynomial expansion is the fast method introduced in Chapter 4.3 [Farnebäck, 2002].

2.1.1 Cross Correlation

By using the fast method, regular cross correlation can be used instead of the more computationally demanding normalized convolution. The use of regular cross correlation together with the Cartesian separability of the basis functions enables efficient computation of the polynomial coefficients. Cross correlations are performed in two passes, first horizontally then vertically. The cross correlation tree structure used is similar to the one displayed in Figure 4.4 [Farnebäck, 2002, p. 49]. The main difference lies in the signal dimensionality, which is only two-dimensional in this use case rather than three-dimensional as illustrated in the figure. With the extra cross correlations removed, the final computational tree becomes the one in Figure 2.1.

The kernel used in each cross correlation box in Figure 2.1 is a product of the chosen applicability and the corresponding basis function. The resulting vector of the cross correlations is referred to in (2.2) as f′.


Figure 2.1: Polynomial expansion cross correlation tree structure. The input f is correlated horizontally with the kernels for {1, x, x²}; the results are then correlated vertically, yielding f′₁, f′ᵧ, f′ᵧ², f′ₓ, f′ₓᵧ and f′ₓ².
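As a sketch of the tree (not code from the thesis; SciPy's correlate1d stands in for the hardware correlators, and the kernel size, applicability width and the image name img are assumed for illustration), the six correlations can be computed as:

```python
import numpy as np
from scipy.ndimage import correlate1d

n, sigma = 11, 1.5                    # assumed kernel size and applicability width
x = np.arange(n) - n // 2
a = np.exp(-x**2 / (2 * sigma**2))    # Gaussian applicability
k1, kx, kx2 = a, a * x, a * x**2      # applicability times basis 1, x, x^2

def corr(image, kernel, axis):
    # 1D cross correlation; data outside the signal is assumed to be 0
    return correlate1d(image, kernel, axis=axis, mode='constant', cval=0.0)

# First (horizontal) pass: three intermediate images
f1, fx, fx2 = (corr(img, k, axis=1) for k in (k1, kx, kx2))

# Second (vertical) pass: the six leaves of the tree in Figure 2.1
f_1, f_y, f_y2 = (corr(f1, k, axis=0) for k in (k1, kx, kx2))
f_x, f_xy = corr(fx, k1, axis=0), corr(fx, kx, axis=0)
f_x2 = corr(fx2, k1, axis=0)
```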

2.1.2 Inverse G

To calculate the final polynomial expansion coefficients, the result from the cross correlations is multiplied by the matrix G⁻¹, as displayed in Equation (4.9) [Farnebäck, 2002, p. 47]. The resulting structure of G⁻¹ turns out to be very favorable with respect to the amount of computation needed to perform the matrix multiplication. For a two-dimensional signal the structure of G⁻¹ is:

\[
G^{-1} =
\begin{bmatrix}
\alpha & 0 & 0 & \delta & \delta & 0 \\
0 & \beta & 0 & 0 & 0 & 0 \\
0 & 0 & \beta & 0 & 0 & 0 \\
\delta & 0 & 0 & \gamma & 0 & 0 \\
\delta & 0 & 0 & 0 & \gamma & 0 \\
0 & 0 & 0 & 0 & 0 & \varepsilon
\end{bmatrix}
\tag{2.1}
\]

As can be seen in Equation (2.1), a significant number of the elements are 0, which reduces the amount of computation needed to calculate the final polynomial expansion coefficients. A further reduction is obtained by not calculating the DC level of the polynomial expansion, since it is not used by the displacement calculation algorithm. The final polynomial expansion coefficients are computed using the following equations:

\[
\begin{aligned}
C_2 &= \beta f'_x \\
C_3 &= \beta f'_y \\
C_4 &= \delta f'_1 + \gamma f'_{x^2} \\
C_5 &= \delta f'_1 + \gamma f'_{y^2} \\
C_6 &= \varepsilon f'_{xy}
\end{aligned}
\tag{2.2}
\]
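Continuing the sketch above, applying (2.2) is then a handful of multiply-adds per pixel; beta, gamma, delta and eps stand for the programmed unique elements of G⁻¹ (illustrative names):

```python
# beta, gamma, delta, eps: unique elements of G^-1, computed by the driver
C2 = beta * f_x
C3 = beta * f_y
C4 = delta * f_1 + gamma * f_x2
C5 = delta * f_1 + gamma * f_y2
C6 = eps * f_xy
```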


2.2 Displacement Calculation

To reduce the amount of computational resources used by the displacement calculation part of the algorithm, a constant motion model is used. With any motion model other than the trivial constant model, the equation to solve is the one displayed in Equations (20) [Farnebäck, 2003, p. 4] and (7.38) [Farnebäck, 2002, p. 116]. Since the constant motion model is used, the equation to solve simplifies into the following [Farnebäck, 2003, p. 3, eq. 13][Farnebäck, 2002, p. 115, eq. 7.23]:

\[
\mathbf{d} = \Big( \sum w\, A^T A \Big)^{-1} \sum w\, A^T \Delta b
\tag{2.3}
\]

where w is a weight function. The matrices AᵀA and Aᵀ∆b have been assigned to the variables G and h respectively. This turns Equation (2.3) into:

\[
\mathbf{d} = \Big( \sum w\, G \Big)^{-1} \sum w\, h
\tag{2.4}
\]

More details about G can be found in Section 2.2.3; the details of h are located in Section 2.2.4.

To compact Equation (2.4) even further, the spatial averages can be denoted G_avg and h_avg respectively. This turns Equation (2.4) into the very compact:

\[
\mathbf{d} = G_{avg}^{-1} h_{avg}
\tag{2.5}
\]

The same form of the equation can be found at the end of the computational block diagram displayed in Figure 7.8 [Farnebäck, 2002, p. 118]. Equation (2.10) contains more information on how the displacement is calculated using Equation (2.5).

2.2.1 A

The A matrix is the first of the two direct results of the coefficients produced by the polynomial expansion part of the algorithm. The equation shown below is a simplified version of the ones found in relations (9) [Farnebäck, 2003, p. 3] and (7.15) [Farnebäck, 2002, p. 111], where C1 and C2 denote the polynomial expansion coefficients of the first and second image respectively.

\[
A =
\begin{bmatrix}
a_1 & a_2 \\
a_2 & a_4
\end{bmatrix}
= \frac{1}{4}
\begin{bmatrix}
2(C1_4 - C2_4) & C1_6 - C2_6 \\
C1_6 - C2_6 & 2(C1_5 - C2_5)
\end{bmatrix}
\tag{2.6}
\]

2.2.2 Delta b

∆b is the second direct result of computation using the polynomial expansion coefficient sets computed from two subsequent grayscale images. The corresponding equations are displayed as (10) [Farnebäck, 2003, p. 3] and (7.16) [Farnebäck, 2002, p. 111].

\[
\Delta b =
\begin{bmatrix}
db_1 \\
db_2
\end{bmatrix}
= \frac{1}{2}
\begin{bmatrix}
C1_2 - C2_2 \\
C1_3 - C2_3
\end{bmatrix}
\tag{2.7}
\]


2.2.3 G

The variable G is assigned to the product of the matrix multiplication between Aᵀ and A. The duplicate entry G2 reduces the computational burden and the amount of memory needed to store G.

\[
G =
\begin{bmatrix}
G_1 & G_2 \\
G_2 & G_4
\end{bmatrix}
= A^T A
\tag{2.8}
\]

2.2.4 h

The computation of h requires the following matrix multiplication:

\[
h =
\begin{bmatrix}
h_1 \\
h_2
\end{bmatrix}
= A^T \Delta b
\tag{2.9}
\]

Unlike G there are no trivial computational shortcuts available nor any duplicate entries in the resulting h vector.

2.2.5 Averaging of G and h

To transform Equation (2.4) into (2.5), the G matrix and h vector have to be spatially averaged using the weighting function w. By using a Gaussian as the weighting function, the 2D spatial averaging can be converted into two 1D spatial averages. This is possible due to the Gaussian's Cartesian separability. More information is available in [Farnebäck, 2002, p. 47, chap. 4.3.2]. The averaging is performed using cross correlation.

2.2.6 d

The final displacement vector is calculated by inverting G_avg and multiplying the inverse with h_avg. Since G_avg is a 2 × 2 matrix, the inverse is calculated using the adjoint of G_avg divided by the determinant of G_avg. Computational resources can be saved by applying the determinant division after the calculation of the matrix product d′.

\[
\mathbf{d} =
\begin{bmatrix}
d_1 \\
d_2
\end{bmatrix}
= G_{avg}^{-1} h_{avg}
= \frac{1}{\det(G_{avg})}
\begin{bmatrix}
G_{avg,4} & -G_{avg,2} \\
-G_{avg,2} & G_{avg,1}
\end{bmatrix}
h_{avg}
= \frac{d'}{\det(G_{avg})}
= \frac{1}{\det(G_{avg})}
\begin{bmatrix}
d'_1 \\
d'_2
\end{bmatrix}
\tag{2.10}
\]
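Written out, the adjoint-based evaluation of (2.10) is a handful of scalar operations per pixel. In the sketch below (illustrative names, not thesis code), g1, g2 and g4 are the elements of G_avg and h1, h2 those of h_avg:

```python
det = g1 * g4 - g2 * g2         # det(G_avg)
d1p = g4 * h1 - g2 * h2         # d' = adj(G_avg) * h_avg
d2p = g1 * h2 - g2 * h1
d1, d2 = d1p / det, d2p / det   # the division is applied last
```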

3 System

3.1 Overview

Figure 3.1: Optical flow system component diagram. Components: Camera IP, Downsampling, Polynomial Expansion, Optical Flow Calculation, Upsampling, Visualization, HDMI Output, Zynq Processing System, Memcon2 and Control.

Figure 3.1 depicts the implemented system in a coarse manner. Solid lines represent data and potential control transfer, while dashed lines correspond to control signals only; this notation is kept in the subsequent overview figures.

The implemented system consists of the main flow calculation system and multiple support modules. In addition to the developed modules there are several third-party IPs being used within the system.

The optical flow calculation system is divided into three main modules: the control module, the polynomial expansion module and the displacement calculation module.

3.2 Control

Figure 3.2: Optical flow controller structure. Components: address decoder, BRAM interface, synchronizing FIFO, state handler and data dispatcher; signals: BRAM address, BRAM data, control signals and kernel/matrix coefficients.

Figure 3.2 describes the control module, which essentially is an interface between the driver and the programmable logic. It handles the clock domain crossing between the AXI4-Lite clock domain and the flow system clock domain. Module-specific configurations, such as filter kernels and matrix coefficients, are dispatched via the control module. Some basic state handling is also done; every optical flow calculation is initiated by the control module. The polynomial expansion module is also brought out of its initial state when the first start command is received.

Figure 3.3: Optical flow controller FSM. States: Idle, Polynomial expansion and Flow calculation; transitions: driver start, poly complete, coldstart ∧ poly complete, driver stop and flow complete.


The state controller, as described by Figure 3.3, contains only the minimum number of states needed for operation.

3.3 Polynomial Expansion

Figure 3.4: Polynomial expansion component diagram. Components: mono camera, image cross correlators, memcon, sparse matrix multiplication and control.

The polynomial expansion module calculates the polynomial expansion of an image. This is done by cross correlating [Farnebäck, 2002, chap. 4.3] the image in 2D and then performing a sparse matrix multiplication. Due to the filter used in the cross correlator, the 2D cross correlation is turned into two 1D cross correlations with a matrix transposition of the intermediate result between the cross correlation passes. This transposition is performed using the main memory as intermediate storage. The resulting polynomial expansion is stored in a ring buffer in the main memory. Figure 3.4 illustrates the data flow and components of the polynomial expansion calculator. The Control module in Figure 3.4 refers to the system-wide control module.

3.3.1 Cross Correlation

As described in Section 2.1.1, the method used to compute the polynomial expansion coefficients is based on cross correlation. The structure illustrated in Figure 2.1 can be reduced by reusing the cross correlation cores between transposition passes. The image is fed through the first level of the structure line by line, with the resulting data getting transposed using the DDR via the memory connection interface provided by memcon. When the whole first pass has been completed, the stored intermediate data is read out via the memory connection interface and passed through the cross correlation structure a second time. Each level in the cross correlation structure corresponds to the active cross correlation cores used.


The kernels used by the cross correlators are calculated by the driver and individually programmed via the control module. Due to the nature of the basis functions used in the kernels, they only need to be programmed once, which saves control logic. The symmetry of each basis function combined with the symmetry of the Gaussian applicabilities enables extensive and effective usage of the DSP48E1 [Xilinx, 2014b] slices provided by the Zynq FPGA. By using the pre-adder functionality provided by the DSP slices, the number of DSP multipliers used is cut in half per cross correlation core. The resulting savings in utilized FPGA resources are tangible when the final implementation is inspected.
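The saving can be illustrated as follows (plain Python, an assumption of this presentation rather than thesis code; in hardware the pairing addition is done by the DSP48E1 pre-adder, with a pre-subtraction for antisymmetric kernels):

```python
def symmetric_corr_tap(window, kernel):
    # Pair mirrored samples before multiplying: for an N-tap symmetric
    # kernel this needs about N/2 multiplications instead of N
    n = len(kernel)
    acc = 0
    for i in range(n // 2):
        acc += kernel[i] * (window[i] + window[n - 1 - i])  # pre-add
    if n % 2:
        acc += kernel[n // 2] * window[n // 2]              # center tap
    return acc
```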

3.3.2 Sparse Matrix Multiplication

This module is responsible for the calculation described by relation (2.2). The unique elements of G⁻¹, displayed in Equation (2.1), are calculated and programmed by the driver using the control module. The upper left element of G⁻¹, α, is skipped since the displacement calculation algorithm does not use the constant part of the result. The output is stored in one of the ring buffer slots allocated inside the DDR for the polynomial expansion by the memcon module.

3.3.3 Control

Figure 3.5: Polynomial expansion controller FSM. States: IDLE, 1pre, 1pass, 1flush, 2pre, 2pass and 2flush; transitions: calculate poly, data counter limit, transpose complete and write complete.

The control module, described by Figure 3.5, acts as a container and controller for the polynomial expansion computation modules. Its main responsibility is handling control signals to the memory interface; see Section 3.5.3 for details. Since the image is transposed between cross correlator passes, the data vector length for the cross correlation module is adjusted during the pre states. During the pre states, control signals are altered and no data is read. The flush states wait for the completion of the memory writes, which is signaled by the memory interface.


3.4 Displacement Calculation

Figure 3.6: Displacement calculation system. Components: Memcon1, A & delta b, G & h, G & h cache, G & h averager, displacement calculator, Memcon2 and optical flow control.

The biggest part of the system is the optical flow calculator, described by Figure 3.6. It consists of both computational and memory submodules. The calculations made are mostly small matrix multiplications, but like the polynomial expansion, a 2D cross correlation is split into two 1D cross correlations. Unlike the polynomial expansion, the transposition is made in block RAM rather than in the DDR. Due to the limited amount of block RAM, the polynomial expansion data is divided into multiple blocks. This results in some inefficiency, as data has to be read twice along a part of the block edges to ensure correct calculation results from the cross correlations. This is somewhat mitigated by using a cache for one of the block sides. The cache is also implemented in block RAM and its width is dependent on the width of the filter kernel used in the cross correlation. The resulting optical flow is stored inside the main memory. A chart of the computational flow can be seen in Gunnar Farnebäck's PhD dissertation [Farnebäck, 2002, p. 118, fig. 7.8]. The implementation does not make use of a priori knowledge, which results in d̃ = 0. A constant motion model is used by the implementation of the optical flow calculator; the computational impacts can be seen at the end of Chapter 7.5 [Farnebäck, 2002, p. 116] in the dissertation. More information about the different motion models can also be found in the dissertation.

3.4.1 A and Delta b

The A and ∆b module is responsible for calculating the A and ∆b matrices via Equations (2.6) and (2.7) respectively. The matrices are calculated as presented in the aforementioned equations, replacing the divisions with arithmetic right shifts; the C matrices are the polynomial expansions of the first and second image respectively. The duplicate A2 entry is only computed and stored once.
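A minimal fixed-point sketch of this step follows; the coefficient planes c1 and c2 (indexed by coefficient number) are hypothetical names, and Python's >> stands in for the hardware's arithmetic right shift:

```python
# (2.6) and (2.7) with the divisions replaced by arithmetic right shifts
a1  = (c1[4] - c2[4]) >> 1   # 2*(C1_4 - C2_4)/4
a4  = (c1[5] - c2[5]) >> 1   # 2*(C1_5 - C2_5)/4
a2  = (c1[6] - c2[6]) >> 2   # (C1_6 - C2_6)/4, stored once
db1 = (c1[2] - c2[2]) >> 1
db2 = (c1[3] - c2[3]) >> 1
```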


3.4.2 G and h

The Gh module calculates the matrices G and h; both are detailed in Equations (2.8) and (2.9) respectively. The data used are the received A and ∆b matrices computed by the A and ∆b module.

The computation of h is done as a regular matrix multiplication. Due to the duplicate entry in A, the calculation of G2 can be reduced to a single addition and multiplication.
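In scalar form, the per-pixel computation looks like this sketch (names are illustrative, exploiting the symmetry of A):

```python
# G = A^T A with symmetric A = [[a1, a2], [a2, a4]]
G1 = a1 * a1 + a2 * a2
G2 = a2 * (a1 + a4)          # one addition and one multiplication
G4 = a2 * a2 + a4 * a4

# h = A^T (Delta b), a regular matrix-vector product
h1 = a1 * db1 + a2 * db2
h2 = a2 * db1 + a4 * db2
```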

3.4.3 G and h Cache

Since the optical flow calculator processes blocks, the cross correlations will induce edge effects along the block borders. To solve this problem, the block edges have to overlap. This requires the same areas to be read multiple times from the DDR, which leads to inefficiency. The G and h cache is introduced to reduce the required DDR bandwidth, resulting in increased performance. To save block RAM and logic resources, only the left side of a block is cached; however, there is additional speedup to be gained if the top side of a block is cached as well. In the current implementation a part of the top side is cached as a part of the left side, as displayed in Figure 3.7.

Figure 3.7: Cached area of typical block overlap

The width of the cache depends on the width of the filter kernel:

\[
Width_{cache} = Width_{kernel} - 1
\tag{3.1}
\]

3.4.4 G and h Averaging

Each of the G and h matrices' elements is spatially averaged both horizontally and vertically. This is done by a normalized Gaussian filter applied via cross correlation. Due to the Cartesian separability of the Gaussian filter, the spatial averaging is done in two passes. The module transposes the block after the first filter pass before the data is once again fed to the cross correlators for the second pass. All of the elements are averaged using the same filter. The result of the second pass is output to the last module of the calculation chain, the displacement calculator. The transposing memory is allocated in block RAMs. This directly affects performance since the block size is dictated by the size of the transposing memory. Just like the cross correlations performed when calculating the polynomial expansion, the cross correlations performed in this module use symmetric kernels. The symmetric kernels enable the usage of the pre-adding feature of the DSP48E1 slices in the same way as described in Section 3.3.1.
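For one element plane, say G1, the two-pass averaging amounts to the following sketch (reusing the corr helper and applicability a from the example in Section 2.1.1; illustrative only):

```python
g = a / a.sum()                                 # normalized Gaussian kernel
G1_avg = corr(corr(G1, g, axis=1), g, axis=0)   # horizontal pass, then vertical
```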

3.4.5 Displacement Calculator

The result of this module is the calculated displacement between the two images. The displacement vector is calculated as displayed in Equation (2.10). The matrix multiplication between the adjoint of G_avg and h_avg is performed in parallel with the determinant calculation. Both of these calculations are performed in fixed-point precision. The resulting vector d′ and det(G_avg) are converted into floating-point format before the last calculations are performed by the floating-point IP cores. The final d vector is then converted back into fixed-point format and presented as the module output.

3.4.6 Flow Container

Figure 3.8: Displacement calculation controller FSM. States: Idle, Block Request, Block Calculation, Block Confirmation and Post Calculation; transitions: calculate flow, data counter limit, last block, ¬ last block and memory write complete.

Due to the block-based calculation of the optical flow, the amount of logic in the optical flow control module is higher than in the polynomial expansion control module. Aside from state, it controls the validity of the output optical flow and the dimensions of the input and output blocks. Since the block edges are overlapped to ensure the correct result from the cross correlation, there will be additional unwanted data produced by the G and h averaging. The validity of this data is filtered via a process that inspects the current data's position inside the block and the block's position within the image. Depending on the position of a block, its dimensions change. These dimensions are calculated during elaboration and applied during the post block state, denoted Post Calculation in Figure 3.8. They depend on the image dimensions, the chosen transposition memory dimensions and the maximum width of the filter kernel. Figure A.1 contains all the different cases and their output.

3.5 Support

The support modules utilized in the system are focused on memory interfacing and video stream processing. There are three different video stream processors and three different memory interfaces in use.

3.5.1 Video Generator

This module is not really a part of the optical flow calculation system, since it only processes the calculated optical flow and generates a video output. The generated output video is formatted according to the AXI4-Stream Video specification [Xilinx, 2014a, chap. 1]. The 2D displacement is mapped to the green and blue RGB color channels using the following formula:

\[
\begin{bmatrix} R \\ G \\ B \end{bmatrix}
=
\begin{bmatrix} 0 \\ 122 + 4 d_x \\ 122 + 4 d_y \end{bmatrix}
\]

Values outside of [0, 255] are clamped.
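As a sketch, the mapping with clamping is:

```python
def flow_to_rgb(dx, dy):
    # Displacement mapped to green/blue around mid-level 122 with gain 4
    clamp = lambda v: max(0, min(255, int(v)))
    return (0, clamp(122 + 4 * dx), clamp(122 + 4 * dy))
```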

3.5.2 Video Processors

Two of the video stream processors used are focused on up- and downsampling of the video stream, both in the temporal and spatial domain. The third module is a video stream multiplexer.

Downsampler

The downsampler works by discarding every other pixel in a line and every other line in a frame. It can also be instructed to discard whole frames to reduce the output frame rate.
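In array terms the spatial part is a simple stride (a sketch; frame is a NumPy image, and frame_index and k for the temporal part are illustrative names):

```python
small = frame[::2, ::2]           # keep every other pixel and line
emit  = (frame_index % k == 0)    # optionally drop whole frames too
```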

Upsampler

The upsampler does the opposite of the downsampler. It simply duplicates every other pixel in a line and every other line in a frame. Functionality to trigger the transaction of another frame is also implemented in a way that respects the timing of the target frame rate. The current implementation requires the source frame rate to be half of the target frame rate.

Multiplexer

To be able to adjust the camera, a video multiplexer is used to switch between the optical flow and the camera input. It supports two input video streams. The multiplexer acts as a simple sink for the channel that is not being forwarded, while the forwarded channel is subject to any potential back pressure generated via the AXI4-Stream ready signal. If the selected channel is switched, the multiplexer will start forwarding the newly selected channel when it presents a valid start-of-frame. This causes the current frame to be prematurely finished, which is handled by the AXI4-Stream Video protocol [Xilinx, 2014a].

3.5.3 Memory Interfaces

The purpose of the memory interfaces is to act as a layer of abstraction between the computational modules and the storage medium used, in this case the on-board DDR memory. Unlike the computational core modules, the memory interfaces are tightly coupled to the overall architecture of the system.

Polynomial Expansion Read and Write

Memcon is an abstraction layer between the optical flow system and the transposing and storing of the final polynomial expansion coefficients. It connects to one of the AXI-HP ports on the Zynq processing system to store data in the DDR. There are two modes of operation included in the memcon support module: regular storage and transposed storage. When the transposing mode is used, the previously stored data from the current buffer is read out.

Polynomial Expansion Read

The memcon1 module provides an interface for block-based concurrent reading of two sets of polynomial expansions. This is used to deliver data to the optical flow calculation subsystem. Due to the high amount of bandwidth needed to provide this service, the module uses two of the available AXI-HP ports on the Zynq processing system. The output polynomial expansion coefficients are synced, and the output can be suppressed via back pressure.

Optical Flow Read and Write

Memcon2 provides block-based writing and linear reading of the final optical flow. The module utilizes one of the AXI-HP ports to communicate with the DDR. The written data is delivered by the displacement calculator, while the read data is delivered to the displacement-to-video module.


3.6 Memory Usage

3.6.1 Polynomial Expansion

The polynomial expansion module accesses the memory for two different purposes. The first purpose is to transpose the intermediate results of the cross correlations; refer to Section 2.1.1 for details. The second purpose is to store the final polynomial expansion coefficients so they can be used to calculate the displacement.

To transpose the intermediate results of the 2D cross correlations, data is first written and then read, resulting in a factor of 2 in the following expression. As displayed in Figure 2.1, there are three intermediate result matrices that are transposed.

\[
2 \times 3 N_{calc,databits} \times Width_{frame} \times Height_{frame}
\tag{3.2}
\]

The resulting polynomial expansion consists of 5 matrices which are written to the memory. Equation (2.2) displays how these matrix elements are calculated.

\[
5 N_{poly,databits} \times Width_{frame} \times Height_{frame}
\tag{3.3}
\]

Summing (3.2) and (3.3) together results in the total bandwidth used per polynomial expansion.

\[
Polyexp_{total} = Width_{frame} \times Height_{frame} \times (6 N_{calc,databits} + 5 N_{poly,databits})
\tag{3.4}
\]

3.6.2 Displacement Calculation

The displacement calculation module performs two sets of memory operations. Firstly, it reads two subsequent polynomial expansions. Secondly, it writes the calculated displacement.

Due to the block overlap issue discussed in Section 4.2, the following expression is not identical to expression (3.3).

\[
2 \times 5 N_{poly,databits} \times \Big( Width_{frame} \times Height_{frame} + \sum Overlap_{block} \Big)
\tag{3.5}
\]

Writing the displacement is a straightforward write of the two matrices containing the x and y displacement.

\[
2 N_{flow,databits} \times Width_{frame} \times Height_{frame}
\tag{3.6}
\]

Combining (3.5) and (3.6) yields the total bandwidth consumed per displacement calculation.

\[
Displacement_{total} = 10 N_{poly,databits} \times \Big( Width_{frame} \times Height_{frame} + \sum Overlap_{block} \Big) + 2 N_{flow,databits} \times Width_{frame} \times Height_{frame}
\tag{3.7}
\]


3.6.3 Video Generation

To generate the video, the latest calculated displacement is used, which makes the equation identical to (3.6).

\[
Videogen_{total} = 2 N_{flow,databits} \times Width_{frame} \times Height_{frame}
\tag{3.8}
\]

3.6.4 System Total

The total memory bandwidth used by the system can be calculated using the following equation:

\[
BW_{total} = Freq_{input} \times (Polyexp_{total} + Displacement_{total}) + Freq_{output} \times Videogen_{total}
\tag{3.9}
\]

Since the video output runs at a different frame rate than the video input, the sum is split into two parts.
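To make the model concrete, the sketch below evaluates (3.4), (3.7), (3.8) and (3.9); the example numbers (25-bit words throughout, 960 × 540, the measured 31.5 fps input and an assumed doubled output rate, zero overlap) are illustrative, not results from the thesis:

```python
def total_bandwidth_bits(w, h, n_calc, n_poly, n_flow,
                         overlap_sum, f_in, f_out):
    polyexp = w * h * (6 * n_calc + 5 * n_poly)                       # (3.4)
    disp = 10 * n_poly * (w * h + overlap_sum) + 2 * n_flow * w * h   # (3.7)
    video = 2 * n_flow * w * h                                        # (3.8)
    return f_in * (polyexp + disp) + f_out * video                    # (3.9)

# Illustrative: 960x540, 25-bit words, zero overlap, 31.5 fps in, 63 fps out
bw = total_bandwidth_bits(960, 540, 25, 25, 25, 0, 31.5, 63.0)
print(bw / 8e9, "GB/s")   # order-of-magnitude check against DDR bandwidth
```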


4 Architectural Discussion

This chapter discusses some topics which fit the description in Section 1.2.

4.1 Memory Bandwidth vs. FPGA BRAM Resources

The most resource-demanding modules in the optical flow calculation system are the two 2D cross correlation modules. They consume a majority of the used DSP slices to perform the multiplications required for the cross correlations. In addition to the DSP slices, the modules also require memory to transpose the matrices between the two cross correlation passes. This is done differently in the two 2D cross correlation modules.

The one residing within the polynomial expansion calculation uses the DDR to transpose the whole matrices between the cross correlation passes. Doing the matrix transpose in the RAM increases the memory bandwidth consumed per frame by:

\[
2 N_{matrices} \times Width_{matrices} \times Height_{matrices} \times N_{databits}
\tag{4.1}
\]

To transpose the second set of matrices, computed by the displacement calculation system, the BRAM is used. Ideally this results in no extra memory bandwidth used to transpose the matrices. In reality the BRAM is not large enough to contain the matrices. One way to solve this problem when using the BRAM is to divide the matrices into rectangular sub-matrices. However, this approach causes some issues with the cross correlations, resulting in the following consumption of memory bandwidth per frame:

\[
N_{matrices} \times \sum_{x,y \in \dim(Blocks)} \Big( Overlap_{block}(x, y) \times N_{databits} \Big)
\tag{4.2}
\]

As long as the following relation holds, this approach saves memory bandwidth compared to transposing using the DDR memory.

\[
\sum Overlap_{block} < Width_{matrices} \times Height_{matrices}
\tag{4.3}
\]

Lower usage of the memory bandwidth is preferable with respect to criterion 3 listed in Section 1.4, which makes this approach worthwhile.

4.2 Block Overlap

Equation (4.2) introduces the issue of block overlap. This is an effect of the type of cross correlation implemented in the system. The implemented cross correlation assumes that any data outside the signal fed into it is 0. When an image frame is divided into blocks, the cross correlation will, in most cases, erroneously assume that the data outside the block is 0.

To solve this issue, the data bordering two blocks has to be read twice, once per block sharing the data. The issue manifests itself both vertically and horizontally. For an illustration of all the cases and their impact on the computations, please refer to Appendix A.

The issue of block overlap can be partially or wholly mitigated by caching the overlapping regions; however, this comes at a cost of even more hardware resources. The implemented system contains a partial cache using the block RAM; more details can be found in Section 3.4.3. It becomes an interesting trade-off between cache and block sizes when using the block RAM to implement the cache as well as the transposing memory. By using some of the block RAM resources to create a cache, the block size decreases, which might lead to an increase in the total number of blocks per image frame. The result of having more blocks per frame is increased usage of memory bandwidth, as shown by Equation (4.2).

The dimensions of the overlapping regions depend on the dimensions of the block and the width of the filter kernel. Figure 3.7 together with Equation (3.1) illustrates these relationships. The actual memory bandwidth cost of the overlap when using a cache can be described by:

\[
Overlap_{block}(x, y) = Overlap_{horizontal}(x, y) - Cache_{horizontal}(x, y) + Overlap_{vertical}(x, y) - Cache_{vertical}(x, y)
\tag{4.4}
\]

Using (4.2) and (4.4) it is clear that, as long as the block size is sufficiently large, memory bandwidth will be saved by using the block RAM instead of the main memory. With a large enough cache, the extra bandwidth used can be further reduced or even eliminated.
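The combined effect of (4.2) and (4.4) can be sketched as a small cost function (hypothetical helper; blocks iterates over the per-block overlap and cache dimensions, in samples):

```python
def extra_ddr_bits(n_matrices, n_databits, blocks):
    # (4.2) with the cached portions of (4.4) subtracted per block
    return n_matrices * n_databits * sum(
        (ov_h - cache_h) + (ov_v - cache_v)
        for ov_h, cache_h, ov_v, cache_v in blocks)
```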

4.3 Data Bit Widths

The number of bits used for data representation in the system affects more than just the accuracy of the performed calculations. Equations (4.1) and (4.2) both give a hint of the impact that the bit width has on the usage of memory bandwidth. All transactions between the optical flow system and the DDR memory are affected by the number of bits used per data word.

The issues with memory bandwidth go a bit deeper than just the amount used. The AXI-HP ports used to transfer data between the DDR memory and the FPGA fabric only accept data packed in chunks of 64 bits [Xilinx, 2016, p. 12]. This can result in a waste of bandwidth. There are two obvious ways to prevent this inefficiency. The first way is to align the bit width of the data words with the bit width of the memory bus, which can lead to other issues. The second way is to pack the data words into a bigger package with a size close to a multiple of 64. The catch to this approach is that the layout of the consumed memory becomes that of the transaction package, which may or may not be desirable.
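As an illustration of the packing option (numbers assumed for the example, not taken from the thesis): with 25-bit words, one word per 64-bit beat carries only 25/64 ≈ 39% payload, while five words packed into two beats carry 125/128 ≈ 98%:

```python
import math

def bus_efficiency(word_bits, words_per_package, bus_bits=64):
    # Fraction of the AXI-HP beats that carries payload bits
    payload = word_bits * words_per_package
    beats = math.ceil(payload / bus_bits)
    return payload / (beats * bus_bits)

# bus_efficiency(25, 1) -> 0.390625, bus_efficiency(25, 5) -> 0.9765625
```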

With regard to memory, it is not only the amount of bandwidth used that is affected. The amount of BRAM consumed by the block-based approach, discussed in Section 4.2, is directly tied to the width of the data words.

\[
BRAM_{block} = Block_{width} \times Block_{height} \times N_{databits,calculation} + Cache_{dimensions} \times N_{databits,cache}
\tag{4.5}
\]
The BRAM is not the only limited FPGA resource that is affected by the bit width used for the computations. The number of DSP slices used is directly tied to the width of the input data, as shown in Figure 2-1 [Xilinx, 2014b, p. 14]. The cross correlation cores are heavily affected since they use both the multiplier and the pre-adder functionality of the DSP slices.


5 Conclusions

5.1 Results

The system was successfully implemented on a ZedBoard [Avnet, 2015] containing a Xilinx Z-7020 [Xilinx, 2016]. Video I/O was performed with the help of an FMC module connected to a VITA-2000 [ON Semiconductor, 2013] powered camera. With the FPGA clock running at 150 MHz, the system was able to produce 31.5 frames per second at a resolution of 960 × 540. More video statistics are presented in Appendix B.

The FPGA resource utilization of the implemented system is presented in Table 5.1. The values presented include the required peripheral IPs. The system was implemented with a 25-bit data path. This bit width was chosen to maximize the input bit width of the DSP blocks in the cross correlation modules.

The results are briefly evaluated in Section 5.2.3.

Resource               Utilization   Available   Share
Slice LUTs as logic          18416       53200   34.61%
Slice LUTs as memory          1071       17400    6.15%
Slice registers              37126      106400   34.89%
Block RAM tiles               87.5         140   62.50%
DSP48                          147         220   66.82%

Table 5.1: Total FPGA utilization


5.2 Evaluation

5.2.1 Purpose

The chosen algorithm was implemented on the hardware listed in Section 5.1. The system reached a memory bandwidth bottleneck, which restricted the capabilities of the system both in terms of frame rate and resolution. The algorithm relies heavily on cross correlations, which could be accelerated by using the pre-adder within the DSP blocks as described in Section 3.3.1. Some of the more prominent memory issues and their solutions are discussed in Chapter 4.

5.2.2 Method

The method described in Section 1.3 proved to be useful when working on the thesis. By completing the steps described by the method in order, the project could proceed without any major problems occurring.

The combination of having a bit-exact model together with the thorough unit testing proved to be a success. By building the system using a bottom-up approach, the full value of the unit testing could be utilized. Almost every error during integration was a product of the integration itself rather than the fault of an underlying module. By having confidence in the underlying modules and their correctness, the debug efforts could be redirected to the integration, which sped up the debug process.

One downside of the method is its thoroughness. Every step consumed a large amount of time and work, leading to a very strained schedule.

5.2.3 Results

The project can be considered a success with the results presented in Section 5.1. The algorithm is running on the targeted hardware in real time, without using all the resources available. The only concern is that the implementation is not capable of performing at 30 fps when the resolution is 1920 × 1080.

The performance of the system could be increased by massively increasing the amount of block RAM used. This would require a much larger FPGA. Increasing the clock frequency on the used hardware to improve performance is not feasible with the current design, since it is already only barely meeting the required timing constraints. Section 5.3 briefly introduces further actions which could be taken to improve performance.

5.3 Further Work

The partitioning of the write AXI-HP ports can be optimized to improve the performance of the system. This optimization would enable the system to process higher resolutions or more frames per second at the same clock speed.


The bit widths used in each module can be individually tweaked. This could improve accuracy if the computations require additional bits. This could also lead to reduced FPGA resource utilization.


A Cross Correlation Block Overlap

Figure A.1 illustrates the Cartesian separable 2D cross correlation performed by the G and h averager (Section 3.4.4). Every row displays a different scenario that occurs when processing an image frame. The rows are ordered starting with the top left block of the frame and continuing with the top middle block and so on.

Black represents the lack of sample data which can be assumed to be 0, i.e. outside the image frame. The dark shade of gray corresponds to the lack of sample data which cannot be assumed to be 0, i.e. outside the block but still within the image frame. White symbolizes valid data. Incorrect results are illustrated using the lighter shade of gray.


Figure A.1: Block overlap scenarios. Each row shows the processing chain cross correlation → matrix transpose → cross correlation for one block position.


B Output Video Statistics

The following statistics were generated using a hardware counter. Duration in this context is the number of cycles used to present a single frame. Delta is the number of cycles between two consecutive frames.

---- DURATION ----
Frame duration min: 2291987 at: 771, max: 2300665 at: 623, diff: 8678
Average frame duration (cycles): 2295449, vs min/max [3462, 5216]
Average frame duration (s): 0.015302993333333334
Average frames per second with respect to duration: 65.34669252072253

---- DELTA ----
Frame delta min: 2376061 at: 621, max: 2388557 at: 531, diff: 12496
Average frame delta (cycles): 2380901, vs min/max [4840, 7656]
Average frame delta (s): 0.015872673333333333
Average frames per second with respect to delta: 63.001359569339506

---- COMPARISONS ----
Target vs Duration - Min: 88965.38095238106 (0.037365460000000045), Max: 80287.38095238106 (0.03372070000000005)
Target vs Delta - Min: 4891.380952381063 (0.0020543800000000466), Max: -7604.619047618937 (-0.0031939399999999533)


Bibliography

Avnet. ZedBoard, 2015. URL http://zedboard.com/product/zedboard.

Francisco Barranco, Matteo Tomasi, Javier Diaz, Mauricio Vanegas, and Eduardo Ros. Parallel architecture for hierarchical optical flow estimation based on FPGA. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 20(6):1058–1067, 2012.

Christopher Claus, Andreas Laika, Lei Jia, and Walter Stechele. High performance FPGA based optical flow calculation using the census transformation. In IEEE Intelligent Vehicles Symposium 2009, pages 1185–1190. IEEE, 2009.

Gunnar Farnebäck. Polynomial expansion for orientation and motion estimation. PhD thesis, Linköping University, Computer Vision, The Institute of Technology, 2002.

Gunnar Farnebäck. Two-frame motion estimation based on polynomial expansion. In SCIA 2003, Gothenburg, Sweden, number 2749 in Lecture Notes in Computer Science, pages 363–370, 2003.

G.K. Gultekin and A. Saranli. An FPGA based high performance optical flow hardware design for computer vision applications. Microprocessors and Microsystems, 37(3):270–286, 2013.

Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago. KITTI vision benchmark suite, April 2015. URL http://www.cvlibs.net/datasets/kitti/eval_stereo_flow.php?benchmark=flow.

Mateusz Komorkiewicz, Tomasz Kryjak, and Marek Gorgon. Efficient hardware implementation of the Horn-Schunck algorithm for high-resolution real-time dense optical flow sensor. Sensors (Basel), 14(2):2860–2891, 2014.

Bruce D. Lucas and Takeo Kanade. An iterative image registration technique with an application to stereo vision. In International Joint Conference on Artificial Intelligence (IJCAI), volume 81, pages 674–679, 1981.

ON Semiconductor. VITA 2000 2.3 Megapixel 92 FPS Global Shutter CMOS Image Sensor, June 2013. NOIV1SN2000A/D.

Fridtjof Stein. Efficient computation of optical flow using the census transform. In Pattern Recognition, pages 79–86. Springer, 2004.

Xilinx. AXI4-Stream Video IP and System Design Guide, April 2014a. UG934.

Xilinx. 7 Series DSP48E1 User Guide, November 2014b. UG479.

Xilinx. Zynq-7000 All Programmable SoC Overview, January 2016. DS190.



Copyright

The publishers will keep this document online on the Internet — or its possible replacement — for a period of 25 years from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for his/her own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/
