
Implementation and Evaluation of Concurrency on Parallella

GUSTAV ENGSTRÖM AND MARCUS FALGERT

DEGREE PROJECT IN INFORMATION AND COMMUNICATION TECHNOLOGY, SECOND LEVEL

STOCKHOLM, SWEDEN 2014


Implementation and Evaluation of Concurrency on Parallella

Gustav Engström and Marcus Falgert

2014-08-01

Bachelor’s Thesis

Examiner

Mats Brorsson

Academic adviser

Artur Podobas


Abstract

The question asked is what optimizations can be done when working with the Parallella board from Adapteva and how they differ from other concurrent solutions. Parallella is a small supercomputer with a unique 16-core co-processor that we were to utilize. We have worked on parallelizing image manipulation software and analyzing the results of a number of tests. The goal is to conclude how to properly utilize the Epiphany accelerator, and also to see how it performs in comparison to other CPUs. This project is part of the PaPP project, which will utilize Parallella, and the work can be seen as an initial evaluation of the board.

We have tested the board to see how it holds up, made our best efforts to adapt to the hardware, and explained our way of working.

This report is worth reading for anyone with little experience of Parallella who wants to learn how well it works and what it is good for. There are descriptions of all libraries used and detailed thoughts on how to implement software solutions for Epiphany. This is a bachelor level project and was performed with no prior knowledge of Parallella.


Contents

1 Introduction
    1.1 Background
    1.2 Problem
    1.3 Purpose
    1.4 Goals, Benefit, Ethics and Sustainability
    1.5 Methods
    1.6 Delimitations
    1.7 Outline

2 Theoretical Background
    2.1 Parallella Specifications
    2.2 Epiphany Specifications
    2.3 Hardware Background
    2.4 Software Background

3 Work Methods
    3.1 Tools
    3.2 Plan
    3.3 Methodology
    3.4 Progress
    3.5 Method Evaluation

4 Implementation
    4.1 Comparison
    4.2 Concurrency
    4.3 Local Memory Allocation

5 Code
    5.1 Host
    5.2 Kernel

6 Analysis of Results
    6.1 Test Methods
    6.2 Results

7 Conclusion
    7.1 Project Summary
    7.2 Result Conclusions
    7.3 Goal Fulfilment


1 Introduction

This project was carried out as a part of the PaPP project [1], a project for creating portable software solutions applied to different kinds of hardware. The PaPP project involves a multitude of countries as well as research institutes and IT companies. This thesis project is a 15 hp bachelor's degree project, performed by two students at the ICT school of KTH in Kista, and took place at SICS [2] in Kista. The work was supervised by Mats Brorsson, professor at KTH and head of the PaPP project.

1.1 Background

The Parallella [3] board from Adapteva [4] has a dual-core ARM processor and a 16-core Epiphany [5] accelerator co-processor. The former is a 32-bit processor that handles all basic operations and is used by default. The latter reaches 600 MHz on each core and can theoretically perform 22 billion floating point operations per second. Maximizing efficiency, however, relies heavily on concurrency, since the cores individually aren't that powerful. A fully parallel program can run almost 16 times as fast on Epiphany as it would on a single core; if the program has inherently serial parts, meaning parts that have to run on one core, the strength of the Epiphany is wasted.

1.2 Problem

As previously explained, libraries and programs that are fully or mostly parallelizable benefit far more from the Epiphany processor. The main focus of the project is to determine whether the board can increase the efficiency of some useful libraries. The primary objective is the image manipulation tool OpenCV [6], which contains many modules for analysing and transforming images. These use many independent operations and can therefore be parallelized on many cores.

1.3 Purpose

The purpose is for the board to be available for use in future programming projects, and for parallel programs running on it to use the extreme calculation speed of the device to its full extent. There are already many programs that run on similar modules, and even programs that with a slight change will run on the Epiphany, but the purpose here is to take advantage of the huge calculating power of the co-processor.


1.4 Goals, Benefit, Ethics and Sustainability

The product of this project is, excluding this report, code meant to illustrate how to efficiently parallelize applications using Epiphany. In our work we are striving to reach as general a solution as possible, to enable easy application in other contexts. Hopefully the conclusions of the project and this report will be of aid to the progress of the PaPP-project and the research at SICS.

Since the goals are still vague and not entirely specified, their fulfilment will be a question of where our delimitations are put. A successful execution of the relevant programs should be rather accomplishable, while a thorough analysis of the efficiency of different solutions might need to be regarded as a secondary goal.

The goals can be summarized as follows.

• Set up and test Parallella

• Create software solutions for Epiphany

• Evaluate the efficiency of Parallella

1.5 Methods

Our method is to create software applications in OpenCL [7] with C/C++. OpenCL is a library for C/C++ used as a framework for heterogeneous platforms, which we will explain in more detail later. We have created code that can tell the board which cores are to be used for what and how work is to be distributed. OpenCL has many tools for this and makes it possible to force the use of certain hardware. Consequently, we can make the execution divide tasks over the Epiphany cores.

1.6 Delimitations

Our focus is parallelization with OpenCL; other programming languages are secondary goals and preliminarily excluded. The aim is to evaluate the usability and performance of the Parallella board, for which other parallelizing methods aren't necessary, even though they might add more to the analysis. OpenCL has a structure for dividing tasks that fits Epiphany very well and, in conjunction with the stdcl [12] library, makes implementation relatively easy. For a broader testing of the board, other methods might be needed, for example to test how the board performs with other programming models.


1.7 Outline

2. Theoretical Background

Information about the Parallella board and the libraries we have used.

Focus is on concretely and succinctly describing the hardware and giving statistics on all the software. Performance, components and measurements are to be included; usage of components and detailed descriptions of their purposes are not.

3. Work Methods

Programming languages, development methods, ways of working and problem solving are important points to be taken up. What we want to present here is, however, not only programming and parallelization, but also how we planned the work through the different stages of the project. Detailed plans, methods of solving problems, and the transformation and changes of the project are to be included. Applications, code and descriptions of performed work do not belong here.

4. Implementation

Parallelization is the core of this project. This section thoroughly goes through how Epiphany works in practice and which methods are the most efficient. Focus lies on what can actually be regarded as progress, and on relevant ideas about how the cores are put to use. In that sense, this point is an analysis, not of the project as a whole, but rather of Parallella and the parallelization of big calculations. A memory allocation algorithm created specifically to run programs faster on Epiphany is also explained here.

5. Code

There may be parts of the written code that work as perfect examples of general software solutions. We aim to bring up the purposes of the different parts of our code and explain why it was implemented in that certain way. Naturally, we will not include all our code, but we will focus on what bears significant meaning to the project as a whole and to parallelization on this platform.

6. Analysis of Results

The results of our work and a subsequent analysis of them. This is not a final analysis of the project, but more of a detailed walk-through of what all ideas and tests resulted in. Execution results and performance comparisons are brought up here, as well as discussion of what these mean in reality.


7. Conclusion

This point is the final and conclusive analysis of the project. We compile all work, all important results and essential data into an evaluation of what has actually been accomplished. The degree of accomplishment and completion of goals comes in here, as well as a discussion of future areas of use for this work. We also bring up the progress of the work. Lastly, we give a conclusion of what the project has led to, both for SICS and for us as students.


2 Theoretical Background

Adapteva created Parallella as a portable and easy to use supercomputer, made for big, parallelizable calculations.

2.1 Parallella Specifications

Parallella is small and doesn't have the most advanced hardware. Proper programming can make good use of the 16 processor cores on Epiphany, but unmodified programs written for other hardware will not, since the use of Epiphany has to be forced by the program.

This is an overall specification of the hardware and software of the Parallella board. These are the components of every board created, not specific to our board or project.

• 8.6 x 5.5 cm Form Factor

• Zynq-7000 Series Dual-core ARM A9 CPU

• 16-core Epiphany Multicore Accelerator

• 1 GB RAM

• MicroSD Slot

• MicroHDMI Port

• 2x USB 2.0

• 10/100/1000 Ethernet

• Ships with Ubuntu OS

2.2 Epiphany Specifications

With high speed and low power usage (2 W), the Epiphany accelerator can be described as efficient considering its size. It's not the fastest processor on the market, and for sequential programs it is unlikely to perform well because of the low individual core frequency. But if you know its strengths and limits and adapt to them, it is strong for such a small device.

This refers to the physical components and values of the Epiphany accelerator. The following list includes the most important parts and numbers.


• 512 GB/s Local Memory Bandwidth

• 64 GB/s Network-On-Chip Bisection Bandwidth

• 8 GB/s Off-Chip Bandwidth

• 0.5 MB On-Chip Distributed Shared Memory

• 2 Watt Maximum Chip Power Consumption

2.3 Hardware Background

Parallella is promoted as a cheap, effective solution in supercomputing. While still a prototype, the Parallella board has high potential and can, despite its small size, finish calculations close to the speed of a modern PC. It is inspired by the Raspberry Pi [8], but has its focus on high-speed calculations.

In this project we have been working with the 16-core version of Epiphany, but there is currently a Parallella board with 64 Epiphany cores, and co-processors with even more cores are to come. There is no structural difference between the versions, but the 64-core model would of course have the potential for bigger calculations. If the work is not of a magnitude worth dividing into 64 tasks, such as an 8x8 matrix, the work of calling 64 cores will most likely take more time than the work on each core, making the full use of the accelerator redundant.


Figure 2: Bottom view of the Parallella board.

The prototype state of the Parallella limits the user interface, which is practically non-existent for the moment, and detailed user guides are mostly written by a community of enthusiasts. It can be said, though, that for anyone comfortable working in a Linux terminal, there should be no large obstacles in this regard. The Parallella will most likely become more user friendly in its final state; with the right tools it could be used in the same manner as a regular Linux machine. Worth noting is that we did not have these tools during our project.

The Epiphany is a co-processor. This means that it's not the main processor of the computer, but serves as an extra set of cores that work can be offloaded onto. While the host program runs on the ARM, it can spawn new threads that make use of the Epiphany cores. Memory sharing between the two processors is relatively fast, since there is no cache on the Epiphany and its built-in memory is rather small. Accessing shared memory is still a problem for the Epiphany, though, and if the software implementation doesn't contain a solution for this, the speed could suffer.


When using Epiphany, a great deal of the effort has to be provided by the software developer. The board is not overly complex, but any program made for another module will most likely not scale on the Epiphany. First of all, unless the use of Epiphany is forced, it will not happen; this is the case with any GPU or accelerator. Usually such a processor has its own architecture and needs very specific code. OpenCL is a library used to run code on GPUs and accelerators without having to adjust to the architecture. While this is not always going to be an optimal solution, it works well enough for us.

The second problem with porting a program made for other hardware is the lack of cache on Epiphany. If the program relies heavily on the hardware cache, it might be very slow even if it is otherwise optimized for 16 cores and forces the use of Epiphany. Implementing some sort of software cache can be crucial. This means that a small part of the data is loaded into the local memory, allowing quicker access to it. While a hardware cache does this without the programmer having to implement it, a software cache is created in the code. Just as with a hardware cache, memory locality impacts its efficiency.
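As a minimal sketch of the idea, with illustrative names and a tile size assumed to fit in local memory (this is not the project's kernel code, which appears in section 5), a software cache simply copies a block of frequently used data into a small local buffer before working on it:

#include <string.h>

#define TILE 64  /* tile side length; chosen so the tile fits in local memory */

/* Hypothetical software cache: copy a square tile of a large image
   into a small local buffer, work on the fast copy, write it back. */
void process_tile(unsigned char *image, int width, int tile_x, int tile_y)
{
    unsigned char local[TILE * TILE];  /* the "software cache" */

    /* load the tile row by row from the large, slow buffer */
    for (int y = 0; y < TILE; y++)
        memcpy(&local[y * TILE], &image[(tile_y + y) * width + tile_x], TILE);

    /* operate on the local copy; every access here is cheap */
    for (int i = 0; i < TILE * TILE; i++)
        local[i] = 255 - local[i];  /* placeholder operation */

    /* write the results back to main memory */
    for (int y = 0; y < TILE; y++)
        memcpy(&image[(tile_y + y) * width + tile_x], &local[y * TILE], TILE);
}

As with a hardware cache, the benefit depends on how many times each loaded byte is reused before it is written back.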

The structure of Epiphany makes it best suited for independent calculations; image processing and matrix operations are perfect examples. If each core can be assigned its part of the work and complete it without synchronization or interrupts, you have an ideal scenario. There is support for inter-process communication, but the shared memory is very limited and the cores excel at calculating.


Figure 3: Parallella high level architecture.

2.4 Software Background

In order for the user to be able to utilize Parallella properly, code must be written specifically for the Epiphany in some manner. The programming model we used to accomplish this was OpenCL, which we used to write cl code together with a library called stdcl.

Since the Parallella board is still a relatively new piece of hardware, support for the more popular libraries and tools for simplifying parallelization on Epiphany, such as OpenMP [9], is still missing. Recently, however, support for Epiphany in OpenCL has been added in the form of stdcl, a library meant to make coding applications for OpenCL easier. OpenCL is a library used for writing kernels. Kernels contain code which is to be executed on a specific device, such as Epiphany in this case. By offloading execution to Epiphany in this manner, the cores of Epiphany can be instructed to perform certain calculations.


If we take a matrix of numbers as an example, we want to perform some manner of operation on each element in the matrix. We then offload execution to Epiphany with 16 threads, the same number of threads as there are cores. Each thread has an identifier, which the OpenCL kernel can read. This identifier is the only thing that differentiates the threads from each other, so the task delimitations must be calculated using it. This technique is called SPMD [10], and is the most common method of parallelizing. It means that every thread runs the same code, while not necessarily the same parts of it or with the same parameters. In the case of a matrix, the identifier can be divided into two-dimensional coordinates, representing each core in a 4x4 grid. This makes visualising the task management easier, and it becomes clearer which cores are to be used.

Figure 4: An illustration of how the work of performing operations on each element in a matrix can be shared among the cores of the Epiphany.
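A sketch of the SPMD pattern just described (the kernel name and the doubling operation are made up for illustration; the real kernel appears in section 5.2): each thread reads its two identifiers and derives the part of the matrix it owns.

__kernel void scale_matrix(__global float *m, int cols, int rows)
{
    /* which of the 16 threads in the 4x4 grid is this? */
    int i = get_global_id(0);
    int j = get_global_id(1);

    /* every thread runs this same code (SPMD); only the identifiers
       differ, so they alone determine the assigned sub-matrix */
    int w = cols / 4;
    int h = rows / 4;
    for (int y = j * h; y < (j + 1) * h; y++)
        for (int x = i * w; x < (i + 1) * w; x++)
            m[y * cols + x] *= 2.0f;  /* placeholder operation */
}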


In order to get OpenCL support for Epiphany, an additional library called stdcl has to be included. The programmer has to check the stdacc identifier, which will recognize Epiphany as an available accelerator if it is present. Code can then be offloaded to execute on Epiphany by passing stdacc as an argument when launching the kernel. There are other benefits to stdcl as well, such as cleaner and easier to understand functions in many cases. As OpenCL can be cumbersome to program for, this was a welcome addition. Overall, it was necessary for getting code to work properly on Epiphany.
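A minimal sketch of that check, assuming stdcl's pre-initialized stdacc context pointer is null when no accelerator is found (the error handling here is our own):

#include <stdio.h>
#include <stdcl.h>

int main(void)
{
    /* stdcl pre-initializes stdacc as the accelerator context;
       if no accelerator such as Epiphany is present, it is null */
    if (!stdacc) {
        fprintf(stderr, "No accelerator available\n");
        return 1;
    }
    printf("Epiphany found; kernels can be launched with stdacc\n");
    return 0;
}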

The code in the final program was based on OpenCV. OpenCV is a large library containing many different modules for manipulating images, and has become the de-facto standard for this purpose. Since the plan was to get OpenCV modules working on Epiphany, it naturally had to be included in our code. OpenCV has a modular structure, which means that the package includes several shared or static libraries. Imgproc, for example, is an image processing module that includes linear and non-linear image filtering, geometrical image transformations (re-size, affine and perspective warping, generic table-based remapping), colour space conversion, histograms, and so on. Video, on the other hand, is a module that includes motion estimation, background subtraction, and object tracking algorithms. The whole library is very modular and flexible overall.

OpenCV also already contains some manner of support for multicore CPUs. While Epiphany is not supported, a lot of existing modules could be coded to support it. Modules such as the filter we wrote contain a lot of parallelizable calculations, meaning that those parts could run a lot faster if more cores were used, such as the 16 cores of the Epiphany. The main problem with doing this for Epiphany is its memory limitations. As Epiphany does not have a cache and each core only has 32 kB at its disposal, storing images on the actual Epiphany is problematic. Therefore, a sort of memory allocation on Epiphany has to be coded in some manner, which we explain in more detail later.


3 Work Methods

Our work is centered around understanding and using the Parallella board. Optimizing parallel programs in general might not translate that easily to optimizing on the Parallella board. While parallelizing as a concept focuses on finding the dividable and non-dependent tasks and putting different threads to work on them, the use of Epiphany adds an extra element: high speed calculations. Because of the capacity of the co-processor, smaller calculations are practically instant. Finding a balance between dividing the work and making use of this capacity is the key to using Parallella to its full potential.

3.1 Tools

Here follows a list of tools, both hardware and software, that were needed to get the Parallella board working correctly and to optimize the results of our efforts.

USB table fan A cheap, simple fan connected to a computer, cooling the board. While not of very high quality, it is sufficient, since the Parallella board generally doesn't heat up that much. We never encountered any problems with heat buildup, so we assume this should be enough.

Standard Ethernet cable No specifications should be needed. This is a regular ethernet cable that connected the board to a data port.

16 GB MicroSD Card Made by SanDisk. We installed Linux Ubuntu on the card to get the board running.

WinSCP We used WinSCP to connect our computers to the board over ssh, thereby getting a file directory interface. This was a simple and easy way for us to edit and create folders and files, even though the tool lacks some utilities.

PuTTY With a PuTTY terminal and an Xming server running, we could run visual applications on our Parallella.


3.2 Plan

Since some objectives were unclear and the purpose of the project not entirely settled, it was difficult to make a specific plan for our work. The reason for these problems was delays from other parties, which made the software originally intended for the project unavailable. With the main purpose still within reach, but not for the moment, we were assigned the task of parallelizing other programs, without exact specifications. Our main goal in the early stages was therefore to understand the module and its role as a supercomputer. We made several installations and tests to determine the limits and possibilities of the hardware and software we were dealing with.

Our approach was to first read about the board, making sure we understood the platform we were using. We got it running very quickly and managed to explore the file system with the help of WinSCP and ssh. Once this was done, our next step was to make simple C programs run on it. We made sure we had all necessary libraries and tools, so that the later programming work would not circle around getting minor libraries installed, but could instead be focused on implementing our objectives without distractions.

Once the programming environment had been set up, the aim was to make sure it worked by trying to run basic OpenCL programs. In order to make sure advanced OpenCL programming wouldn't be an issue, we needed to know that all OpenCL functions and tools could be utilized. Since our goal was to use the Epiphany to its full potential, we also needed to ensure that applications could indeed make use of all the 16 cores. When all of this had been established, we could start coding applications ourselves.

After all the setup, our plan was basically to learn enough about OpenCV to implement its modules with OpenCL on Epiphany. We chose Median Blur as a primary target and made it our first task to learn how it worked. This would prove very hard, and we later reevaluated our priorities and decided to take smaller steps at a time.


3.3 Methodology

While the project was mainly conducted through individual research and experimenting, we chose pair programming as our way of working, for the utility the method has. It is highly focused on finding solutions and getting past obstacles, and two people working on the same code, one of them dedicated only to finding problems and errors, will reach the goal with more ease. If we had split up the work, the individual progression of the code would have been slower, and dividing the work in this way is not optimal for a project with a quite singular goal, such as ours. While it was possible to program simultaneously from two different computers with the help of ssh, we would still be accessing the same code and would need a very specific change log to avoid overwriting anything.

Our choice to instead work together took away the need for a log, since we had constant communication and mutual access to all documents. Comments and documentation are consequently for outside readers.

3.4 Progress

The progress could initially be considered slow. The original task was that a programming environment for the PaPP project was to be ported for use on the Epiphany. In the early stage of the project we were informed that this was not possible, since the programming environment had not yet been delivered. Our objective was changed and was unfortunately not as clear. Since we were still working on setting everything up, however, no work was in vain.

We were later informed that parallelizing and evaluating OpenCV code was of interest, and began reading up on how to best implement modules to take advantage of Epiphany. We found one module in particular, namely a median filter, which seemed like a good fit for the 16-core Epiphany. Median filters can be used to eliminate noise in images by smoothing them out. The code for the OpenCV Median Blur module lacked comments and had a lot of calls to other modules, and it took us some time to realize we weren't making enough progress.

We figured that we needed to rework our priorities. We had previously focused on one goal, one that we did not seem able to complete. At this point we had no problem understanding the median blur algorithm itself, just the OpenCV implementation. Our focus, instead of making the OpenCV module work on Epiphany, became to create a median blur algorithm of our own that could run concurrently on Epiphany.


We wrote code for the filter, based on OpenCV, and parallelized the essential parts where a great deal of performance could be gained from adapting the code to best fit Epiphany, since most parts of the filter consist of smaller individual calculations.

After some issues, we managed to get the module working on Parallella, executing on all 16 cores of the Epiphany. While we saw an increase in performance for the module running on Epiphany over the ARM, it still seemed lackluster, and we felt it could run faster than it did. After going over the way we had implemented the allocation of memory, we managed to speed up our application significantly, yielding quite acceptable performance.

Evaluating and analyzing the results is also a big part of the project. We ran several tests on different platforms in order to be able to compare them to what we had accomplished with Parallella and Epiphany. We also made sure to include tests performed on fewer than 16 Epiphany cores, in order to observe the speedup gained from utilizing its concurrency. More on this in the Analysis of Results section.

3.5 Method Evaluation

We would consider our way of working a success for the most part throughout the project. As we did most of the work as a pair, we were always in synchronization with each other and rarely had communication issues. Pair programming also seemed like the way to go, as we were able to identify errors in the code more easily and could subsequently make progress faster. In larger groups it could be more problematic, but for just the two of us it worked out nicely.

The problems we had were caused by unclear circumstances. As our plan changed several times, we should probably have anticipated our final direction earlier. Our task was first of all to attempt running OpenCV modules concurrently on Epiphany, and we made it our main goal to decipher and use the Median Blur code, which was a mistake. The lack of comments and guides for OpenCV made this a very big task, something we were not likely to complete in the given time. Our method should have been to start with smaller goals from the beginning; our failure to do so resulted in a lot of time spent reading code beyond our grasp.

Once we decided to change method, we made a lot more progress. This led to a deeper understanding of Epiphany and OpenCL, opening up possibilities to create more solutions and perform more tests. Because of our slow early progress, our end product is less than it could have been.


4 Implementation

As the Parallella platform consists of a dual-core ARM CPU and a 16-core Epiphany RISC array co-processor, being able to implement applications correctly is important.

4.1 Comparison

The Epiphany cores are not particularly fast; 600 MHz is quite a low frequency by modern standards, and the average modern Intel processor can reach over 3 GHz on each core, just to give some perspective. An effective way of speeding up a computer today is to parallelize, and this is exactly what the Epiphany was made for. No program is fully parallelizable, and even if there were such a program, perfectly synchronizing it would be an extremely difficult task. The possibilities of Epiphany are nevertheless worth examining, as the potential is there.

What it essentially comes down to is whether it can be used well enough. The essence of this project is really to answer that question.

The Parallella board itself takes up very little space, uses little power, and is a relatively cheap piece of hardware to acquire. Its use is not massive programs in large software projects, but embedded hardware and smaller hardware modules with a need for fast calculations. Our focus in this project, image manipulation software, is a great example: if a small module needs face recognition, utilizing the Parallella board is a good idea, since the co-processor will run the software fast, and you will not have to commit an expensive computer to this small module.

For larger software, parallelizing has more or less become a requirement. What we strive to answer here is whether the parallelizing method is different when working with Parallella. Since most processors commonly used in ordinary computers don't have as many as 16 cores, there is clearly a difference. With four or six cores, a larger program can simply distribute its tasks so that none of the cores are overused. This might be a simplification of the matter, but the fact remains that equally dividing the tasks amongst the cores is more important when the cores are greater in number but individually slower. A parallel program finishes when the last thread finishes; this is called the critical path [11]. On 16 cores, the run-time multiplier caused by bad parallelizing is much bigger. One also needs to take into consideration that any sequentially implemented part that could have been implemented concurrently will slow down the program by a higher factor.
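This reasoning is captured by Amdahl's law: if a fraction $s$ of a program is inherently serial, the speedup on $n$ cores is bounded by

$$S(n) = \frac{1}{s + (1 - s)/n}$$

With $s = 0.1$ and $n = 16$, for example, $S(16) = 1/(0.1 + 0.9/16) \approx 6.4$, so even a small serial fraction costs far more of the potential on a 16-core accelerator than on a quad-core CPU.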


The Epiphany also does not have a cache memory. This makes a significant difference to the programmer's approach, since there is no relying on fast memory access. Any memory you want the Epiphany cores to access quickly, you have to allocate in the kernel code and store in the processor core memory. This memory, like a cache, is rather small: each core has 32 kB at its disposal, which is insufficient to store an image. Since the Epiphany structure is otherwise optimal for operations on large matrices, this makes for somewhat of a problem.

To work as fast as possible on large amounts of memory, the program needs some sort of algorithm for allocating often-accessed bits in the processor core memory, without ever exceeding the maximum storage capacity. One might consider the use of Epiphany unnecessary in cases where very short operations are to be performed on very large amounts of data, since memory access might take more time than the actual operations.

4.2 Concurrency

We have been working to create image manipulation software, specifically to run in parallel on the Epiphany accelerator. To solidly prove that we could utilize the Epiphany, we needed to see a speedup of about 16 times when all cores were used compared to when only one was. As previously mentioned, this is not as easy to do in practice as it is to theorize about. With the median blur algorithm already having a strict purpose, our flexibility was only in how we chose to divide the work that had to be done. A picture is essentially a matrix, where every element is a pixel. The usual way to represent a coloured picture is with RGB, meaning a number between zero and 255 for each of the colours red, green and blue. One byte holds 256 values, which makes it ideal for storing one colour channel.

Median blur blurs an image by setting each pixel to a median of its surrounding pixels. While this might sound simple on the surface, it becomes more complex when dealing with very limited memory, a problem described in the Comparison section. Basically, the program needs to loop through all the pixels and, for each colour channel of each pixel, gather all the pixels within the filter range. The filter size is the number of pixels on the side of the square that is taken into account when determining the median. Once these are gathered, they are sorted; from a sorted array, only a basic operation is needed to get the median. The respective channels are then set to the median received. This decreases the difference between neighbouring pixels and smooths the image, which is the purpose of any blur algorithm.
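A sequential sketch of the algorithm just described, in our own illustrative C (the actual parallel, memory-aware kernel is in section 5; median_blur and its single-channel layout are assumptions made for clarity):

#include <stdlib.h>

#define FILTER_SIZE 3  /* must be odd so the pixel sits in the middle */

static int cmp_uchar(const void *a, const void *b)
{
    return *(const unsigned char *)a - *(const unsigned char *)b;
}

/* Median blur of one colour channel: for every pixel, gather the
   surrounding filter square, sort it, and take the middle value. */
void median_blur(const unsigned char *src, unsigned char *dst,
                 int width, int height)
{
    int half = (FILTER_SIZE - 1) / 2;
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            unsigned char window[FILTER_SIZE * FILTER_SIZE];
            int n = 0;
            for (int dy = -half; dy <= half; dy++)
                for (int dx = -half; dx <= half; dx++) {
                    int xx = x + dx, yy = y + dy;
                    if (xx >= 0 && xx < width && yy >= 0 && yy < height)
                        window[n++] = src[yy * width + xx];
                }
            qsort(window, n, 1, cmp_uchar);      /* sort the gathered pixels */
            dst[y * width + x] = window[n / 2];  /* the median */
        }
    }
}

Note that the sketch reads only from src and writes only to dst, which is exactly the dependency-free property exploited for parallelization below.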


Figure 5: An illustration of how median blur works. The median value of the surrounding pixels determines the new value of the pixel.


Figure 6: Original image.


Solving how to divide the work on Epiphany is a fairly simple matter: the task for one core should be one 16th of the whole body of calculations. As median blur works on a matrix, it comes down to letting each thread take one 16th of the matrix and apply the algorithm to its designated part. This is a slightly simplified view of the problem, but it describes the beginning of a solution.

For any parallelizing of this sort to work, of course, no task performed can be dependent on another. One cannot easily predict the order of work in OpenCL programs, and the way OpenCL simply divides the work based on numbers makes dependencies hard to work around. Median blur is very fitting for testing the Epiphany, because it has no such dependencies: every new colour is based on the old pattern around it, and is completely unaffected by the changes to surrounding pixels. Therefore, an old version of the picture is stored and never changed, while all changes are done to a copy. With this implementation, there are no dependencies.

The width of the image is divided into four parts, and the same goes for the height. Each of the 16 threads is assigned two coordinates and thereafter calculates an area within which it performs all changes. It may not change anything outside this area. It will, however, need to read pixels outside the area, because the pixels on its edge need pixels outside it to get their median value. The program, now knowing which parts of the matrix to alter, proceeds to go through all the pixels in this area. In our implementation, the matrix is actually represented by an array of unsigned characters, meaning every index in the array has to be calculated from the given coordinates. The three colour channels of every pixel also need to be taken into account. Values of the same colour channel are gathered from the pixels in an area around the current one, with the size of that area depending on the filter size. The array where they are stored is sorted with quicksort, and then the middle element is used to alter the pixel in focus.


Figure 8: Each core handles one part of the image.

Overall, our solution is based on assigning areas to be processed to the different cores, each one getting an equally large workload. We did this in the most logical way possible, to make evaluating and changing the solution as simple as possible.

4.3 Local Memory Allocation

With the algorithm we created, any image can be processed; size and colours make no difference. The problem is that it's quite slow. The obstacle is, as mentioned before, memory access. There are no exact numbers given on this, but based on our results we can establish that it takes at least double the time to access data in main memory compared to data allocated on Epiphany. Since this memory is accessed frequently, it makes a significant difference. To speed up the program, the data needs to be allocated in the Epiphany cores' memory. Of course, a whole image cannot fit into this small memory, and in the case of really large images, there isn't even space for a 16th of the image.


The solution is to divide the work into even smaller pieces. Copying the appropriate amount of pixels to a smaller matrix allocated in the OpenCL kernel allows faster access; the question comes down to which parts are worth copying. If the data is only accessed once, a transfer does no good. Therefore the copy of the matrix, where the results are stored, is not transferred: it is only written to when changing a colour, which is done once per byte. In the original matrix, on the other hand, every pixel is read multiple times, since it is part of the area around many other pixels. The bigger the filter, the more times each pixel is read. Theoretically, this means that the change makes for a larger speedup when the filter is larger.

The frames loaded into local memory need to be small enough not to interfere with other necessary storage. It has to be taken into consideration that a frame must contain not only the pixels to be changed, but also the pixels that have an effect on the change; how many these are naturally depends on the filter size. Dividing the area for a specific core into smaller frames is not the simplest task either. While we implemented nested for-loops to iterate frames in a two-dimensional grid as a general solution, this was unnecessary in cases of small areas, where only one to three frames were needed. Simply dividing the width and height of the area by the respective dimensions of a square frame would result in more frames than actually needed. Instead, in these cases the width is divided by the needed number of frames, so that no redundant frames are used.


Figure 9: If a part of the image does not fit in a core's local memory, it needs to be split up further. This image illustrates what happens if two allocations per core are needed: each part is split into two, and only one of them is kept in local memory at a time.

With the general algorithm, it is unavoidable that every core has to load pixels into its memory that it was not tasked to change. This is because these pixels are needed for the calculations, since the new value depends on surrounding pixels. Some conditions are needed to assure that no pixel is changed by multiple cores, something that is better explained in section 5.


Figure 10: The speedup gained with the new local memory allocation. Speedup is calculated by dividing execution time for the old code by that of the new. Filter size 3 means a filter of size 3x3 was used.

To ensure that our memory allocation algorithm was beneficial, we tested it for different filter sizes. Even for the smallest filter it gives a decent speedup, as can be seen in figure 10. It is only logical that the effect becomes more apparent for larger filters: transferring the data takes some time, and if the pixels are read fewer times, the transfer is not as rewarding. Overall, proper usage of the local memory available on each core definitely makes a noticeable difference.


5 Code

A basic OpenCL program is built of a C or C++ program that sets everything up, called the host, and an OpenCL kernel with the calculations to be performed on an accelerator such as Epiphany, called the device. All concurrency should therefore appear in the kernel. This naturally makes the host code completely sequential, and it should contain as little of the work as possible.

Here we go through our code, explaining what it does and why we chose to do it this way. Mainly, this section focuses on the parts that are important for making a program use Epiphany, and on the parts implementing concurrency on the accelerator.

5.1 Host

We include OpenCV modules for reading an image into a matrix and displaying matrices as images. This is of course outside the main function.

#include "opencv2/imgproc/imgproc.hpp"
#include "opencv2/highgui/highgui.hpp"

We create the source matrix and the copy where the result will be stored. Mat is an object in OpenCV, representing a matrix and some metadata.

Mat mat_src;
Mat mat_dst;

The image is read to the first matrix, which is then copied to the second matrix.

mat_src = imread("/home/linaro/Test/lena.png", 1);
mat_dst = mat_src.clone();

Memory for the two matrices is allocated by calling clmalloc, which allocates OpenCL device-shareable memory within a specific context; in this case the context is Epiphany, identified by the stdacc argument. Epiphany can then access this memory later. The images are also copied to the allocated space by calling memcpy. This is part of the stdcl library.

cl_uchar *srcData = (cl_uchar *) clmalloc(stdacc, sizeof(cl_uchar) * mat_src.rows * mat_src.cols * 3, 0);
memcpy(srcData, mat_src.data, sizeof(cl_uchar) * mat_src.rows * mat_src.cols * 3);

cl_uchar *dstData = (cl_uchar *) clmalloc(stdacc, sizeof(cl_uchar) * mat_src.rows * mat_src.cols * 3, 0);
memcpy(dstData, mat_dst.data, sizeof(cl_uchar) * mat_dst.rows * mat_dst.cols * 3);


The buffers are then synchronized to device memory, making the image data visible to Epiphany.

clmsync(stdacc, 0, srcData, CL_MEM_DEVICE | CL_EVENT_NOWAIT);
clmsync(stdacc, 0, dstData, CL_MEM_DEVICE | CL_EVENT_NOWAIT);

The number of threads is set. This will create a 4x4 grid of threads.

clndrange_t ndr = clndrange_init2d(0, 4, 4, 0, 4, 4);

This operation executes the kernel. It creates 16 threads that all run the kernel code, but with different IDs. Variables to be used are passed as arguments as well, enabling the kernel to identify them.

clexec(stdacc, 0, &ndr, median_kern_mem_final, srcData, dstData, srcCols, srcRows);

The program waits for all threads to finish and synchronization to be completed.

clmsync(stdacc, 0, dstData, CL_MEM_HOST | CL_EVENT_NOWAIT);

The resulting matrix is copied back to the image matrix.

memcpy(mat_dst.data, dstData, sizeof(cl_uchar) * mat_dst.rows * mat_dst.cols * 3);

Allocated memory is freed.

clfree(srcData);
clfree(dstData);

5.2 Kernel

The kernel name needs to match the one specified in the host. Variables that were sent as parameters from the host also need to be specified as arguments here in order for them to be properly accessible.

__kernel void median_kern_mem_final(__global uchar *srcData, __global uchar *dstData, int srcCols, int srcRows, int srcStep, int dstStep) {

The filter size is a constant value that determines the height and width of the area around a pixel that is evaluated. It needs to be an odd number, for the pixel in question to be in the middle. This should actually be in the host code for the program to be user friendly, but it was initiated here for testing purposes.


const int MEM_SIZE = 16875;
const int MEM_FRAME = 75;

These operations set i and j to different values depending on the core that is running them. i and j represent a location in the two-dimensional grid of threads, and are used to know which data to process.

i = get_global_id(0);
j = get_global_id(1);

Here we calculate the intervals from i and j.

int w4 = srcCols / 4;
int xfirst = i * w4;
int xend = xfirst + w4;
int h4 = srcRows / 4;
int yfirst = j * h4;
int yend = yfirst + h4;

The integer bytes is set to the number of bytes that the area takes up. Then we calculate how many frames fit into this area.

bytes = 3 * (yend - yfirst) * (xend - xfirst);
int fit = (bytes / MEM_SIZE) + 1;

Depending on the number of frames needed, different methods of dividing the area are used, as mentioned in section 4.3. xfit and yfit are the numbers of iterations to be done over the width and height of the area. MEM_FRAME was set so that a square frame of MEM_FRAME by MEM_FRAME pixels, with three bytes per pixel, takes up exactly MEM_SIZE bytes (75 * 75 * 3 = 16875). The reason for some values being increased by one is that integer division rounds downwards. In the rare cases where no rounding occurs, this code might slow the program slightly, but the effect becomes negligible because of the boundaries in the iteration algorithm.

if (fit == 1) {
    height = yend - yfirst;
    width = xend - xfirst;
    xfit = 1;
    yfit = 1;
}
else if (fit == 2) {
    height = yend - yfirst;
    width = (xend - xfirst) / 2 + 1;
    xfit = 2;
    yfit = 1;  /* one frame vertically; the area is split along the width */
}
else {
    height = MEM_FRAME;
    width = MEM_FRAME;
    xfit = ((xend - xfirst) / width) + 1;
    yfit = ((yend - yfirst) / height) + 1;
}

As stated in section 4.3, the frame needs to include not only the pixels to be changed, but also a few pixels around them for the filter. This is based on the filter size.

int frameHeight = height + FILTER_SIZE - 1;
int frameWidth = width + FILTER_SIZE - 1;

Allocate memory on the Epiphany core for the frame.

uchar loc_srcData[frameHeight * frameWidth * 3];

These nested loops iterate through all the frames needed. The offsets are how far into the image the frame begins. Array indexes are calculated from x and y values. The if statements then set end values so that the iteration doesn't go outside the designated area for the core.

for (yFrame = 0; yFrame < yfit; yFrame++) {
    for (xFrame = 0; xFrame < xfit; xFrame++) {
        int yOffset = yFrame * height + yfirst;
        int xOffset = xFrame * width + xfirst;
        int index, frameIndex;
        int ymin, xmin;
        if (yOffset + height > yend) { ymin = yend; }
        else { ymin = yOffset + height; }
        if (xOffset + width > xend) { xmin = xend; }
        else { xmin = xOffset + width; }

For each frame, the program copies data to the allocated memory. This is done pixel by pixel, colour channel by colour channel. While this may seem inefficient, the number of functions that can be used in OpenCL kernels is limited, and currently this is the only way.

for (y = yOffset - (FILTER_SIZE - 1) / 2; y < ymin + (FILTER_SIZE - 1) / 2; y++) {
    for (x = xOffset - (FILTER_SIZE - 1) / 2; x < xmin + (FILTER_SIZE - 1) / 2; x++) {
        if (x >= 0 && x < srcCols && y >= 0 && y < srcRows) {
            for (k = 0; k < 3; k++) {
                index = 3 * (x + y * srcCols) + k;
                /* copy the pixel into the local frame buffer */
                frameIndex = 3 * (x - xOffset + (FILTER_SIZE - 1) / 2
                           + (y - yOffset + (FILTER_SIZE - 1) / 2) * frameWidth) + k;
                loc_srcData[frameIndex] = srcData[index];
            }
        }
    }
}


We iterate through the surrounding pixels and their colour channels, the number depending on the filter size. For each pixel, we find its index in the matrix, and if the index is not outside the image, its value is added to the array of pixels used for getting the median later on.

for (y = yOffset; y < ymin; y++) {
    for (x = xOffset; x < xmin; x++) {
        for (k = 0; k < 3; k++) {
            index = 3 * (x + y * srcCols) + k;
            int xFilter, yFilter;
            uchar pixels[FILTER_SIZE * FILTER_SIZE];
            int size = 0;
            for (xFilter = 0; xFilter < FILTER_SIZE; xFilter++) {
                for (yFilter = 0; yFilter < FILTER_SIZE; yFilter++) {
                    frameIndex = 3 * (xFilter + x - xOffset
                               + (yFilter + y - yOffset) * frameWidth) + k;
                    if (xFilter + x - (FILTER_SIZE - 1) / 2 >= 0
                            && xFilter + x - (FILTER_SIZE - 1) / 2 < srcCols
                            && yFilter + y - (FILTER_SIZE - 1) / 2 >= 0
                            && yFilter + y - (FILTER_SIZE - 1) / 2 < srcRows) {
                        pixels[size] = loc_srcData[frameIndex];
                        size++;
                    }
                }
            }

The array is then sorted so that the median can be extracted. Since the amount of data may vary, quicksort is the most reliable method of sorting.

quicksort(pixels, size);
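The quicksort helper itself is not shown in the report; since OpenCL C does not allow recursion, a kernel-side version would need an explicit stack. A hypothetical sketch with the signature used above:

/* Hypothetical quicksort matching the call above; not the thesis'
   actual implementation. An explicit stack replaces recursion. */
void quicksort(uchar *a, int n)
{
    int stack[256];  /* pending [lo, hi] ranges; ample for these small arrays */
    int top = 0;
    stack[top++] = 0;
    stack[top++] = n - 1;
    while (top > 0) {
        int hi = stack[--top];
        int lo = stack[--top];
        if (lo >= hi)
            continue;
        uchar pivot = a[(lo + hi) / 2];
        int i = lo, j = hi;
        while (i <= j) {  /* partition around the pivot */
            while (a[i] < pivot) i++;
            while (a[j] > pivot) j--;
            if (i <= j) {
                uchar tmp = a[i]; a[i] = a[j]; a[j] = tmp;
                i++; j--;
            }
        }
        stack[top++] = lo; stack[top++] = j;   /* left range  */
        stack[top++] = i;  stack[top++] = hi;  /* right range */
    }
}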

The value of the pixel's colour channel in the destination image is set to the median of all extracted values.

if (size % 2 != 0) {
    dstData[index] = pixels[(size - 1) / 2];
}
else {
    dstData[index] = (pixels[size / 2 - 1] + pixels[size / 2]) / 2;
}
            }  /* end k loop */
        }      /* end x loop */
    }          /* end y loop */
    }          /* end xFrame loop */
}              /* end yFrame loop */
}              /* end kernel */


6 Analysis of results

These are the results of our work and our analysis of them. We go through our tests on the Epiphany and analyse those results.

6.1 Test methods

Besides coding an application to utilize Epiphany as efficiently as possible, measuring execution time and seeing how it compares to other platforms was an important part of the project. We have performed a variety of tests on different platforms, choosing to test what we believe is most relevant in order to observe how well Epiphany can perform. Each individual test was performed 10 times, and we used the mean of these 10 runs in order to get more reliable data. For timing the tests, we took a time-stamp just before applying our filter and another one immediately afterwards. When timing Epiphany, we started the timer on the ARM before any memory allocation was made, and stopped it when Epiphany had finished all calculations and all memory had been freed. The results of these tests were then visualized in graphs.
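The timing code is not shown in the report; a minimal sketch of the described procedure, assuming a POSIX clock and a hypothetical run_filter() standing in for the whole allocate-execute-free sequence:

#include <stdio.h>
#include <time.h>

extern void run_filter(void);  /* hypothetical stand-in for the filter run */

int main(void)
{
    struct timespec t0, t1;
    double sum_ms = 0.0;

    for (int run = 0; run < 10; run++) {      /* each test was performed 10 times */
        clock_gettime(CLOCK_MONOTONIC, &t0);  /* stamp just before the filter */
        run_filter();                         /* allocation, kernel, freeing */
        clock_gettime(CLOCK_MONOTONIC, &t1);  /* stamp when everything is done */
        sum_ms += (t1.tv_sec - t0.tv_sec) * 1000.0
                + (t1.tv_nsec - t0.tv_nsec) / 1e6;
    }
    printf("mean execution time: %.2f ms\n", sum_ms / 10.0);
    return 0;
}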

Seeing how Epiphany can speed up an application compared to the ARM CPU on the Parallella is one of the main comparisons we wanted to perform, as it illustrates how much of a difference Epiphany can make if it is utilized. We therefore tested the filter running on only the ARM without Epiphany, meaning the ARM performed all the calculations, as opposed to letting Epiphany perform them. Speedup can then be measured by dividing the ARM-only execution time by the Epiphany execution time.

We have also chosen to include tests performed on a Windows machine running an Intel i5 dual-core (4 hardware threads) CPU. What is important to note, however, is that the tests performed on the ARM on Parallella and on the i5 were done with a sequential version of the program. We realize that the results might seem skewed in Epiphany's favor that way, but this is the way an application such as our filter would normally have been run in OpenCV, and writing specific code for more platforms than Parallella with Epiphany was not in our scope from the beginning.

Another important aspect is to measure execution time on fewer than all of Epiphany's 16 cores, in order to observe how much speedup can be gained from utilizing more cores. We therefore performed tests on 1, 2, 4, 8 and 16 cores, which would hopefully yield a speedup near proportional to the number of cores used.


6.2 Results

Figure 11: Execution time vs Filter Size. All cores are used on Epiphany in parallel. Filter size 3 means a filter of size 3x3 was used.

The first test performed was on how execution time varies with different filter sizes. Larger filter sizes mean more calculations have to be made for each pixel. In figure 11 we can see that while the choice of processor or algorithm hardly makes any difference for the smaller filters, the time for the ARM increases far more than for Epiphany as the filter grows, and the speedup for larger filters is clearly visible. The filter is two-dimensional, so the numbers are squared: a 9x9 filter should be nine times the work of a 3x3 filter. The run time only multiplies by five on the Epiphany, which is probably caused by setup and data transfer taking up a larger proportion of the time.

It is to be expected that Epiphany does not run 16 times as fast as the ARM. The ARM cores have cache memory and a different architecture, so hoping for the Epiphany cores to individually run as fast as one of the ARM cores is naive, and thinking that concurrency could actually make it run 16 times as fast is even more of a stretch. Perfect concurrency is simple in theory, but in practice the case is different.


On smaller filters, the i5 processor has just about the same speed as Epiphany; for larger ones its time doubles. Since this processor has the fastest individual cores in the test, it is not strange that it can perform close to the same level as Epiphany, even when running a sequential version of the same program. The reason for the later divergence should, even in this case, be the constant portion of work on each Epiphany core.

Figure 12: The speedup Epiphany offers. Speedup is calculated by dividing execution time for the ARM by that of Epiphany. Data used is the same as in figure 11.

To better view the exact difference Epiphany makes, we calculated the speedup: how much faster Epiphany is compared to the ARM alone. In figure 12 it becomes even clearer that the OpenCL parallelizing is more efficient when working with larger filters, going from a 3.5 times speedup on the 3x3 filter to an 8.5 times speedup on the 9x9 filter. This is clearly a diminishing curve, most likely approaching a limit value. This limit is theoretical, since we currently cannot test very large filters with the algorithm. Important to note is that all tests with varying filter size were done with the same image, which is 512x512 pixels, so only the filter size varied between these tests.


Figure 13: Execution time vs Image Size. Image size 256 means an image of 256x256 pixels was used. A filter of size 5x5 was used for all image sizes.

In figure 13 we can see how execution time depends on image size. Interestingly enough, it looks almost exactly the same as the graph over filter size; the run time and the difference between the different platforms seem to depend entirely on how much work there is to be done. All pictures are tested with a 5x5 filter, so the image size is the only thing changing between tests. The 1024x1024 image has 16 times the pixels of the 256x256 image, meaning 16 times the work.


Figure 14: The speedup Epiphany offers. Data used is the same as in figure 13.

In figure 14 we can see that the speedup actually turns out to be slightly lower for the largest image than for the largest filter, while it is basically the same for the smallest, despite the fact that the work is 16 times larger instead of nine. As the image gets bigger, more memory has to be retrieved and stored locally, which could be a contributing factor to the speed difference. Speedup could therefore depend both on the size of the workload and on the amount of references made to external memory.


Figure 15: The speedup offered by using more cores. Speedup is calculated by dividing execution time for one Epiphany core by that of 2 or more cores. Data used is the same as in figure 13.

Another interesting test is how the number of cores used affects overall performance. Figure 15 illustrates the speedup gained when more cores share the workload, for different image sizes. Larger images mean more pixels, resulting in more memory needing to be allocated, as well as more iterations of the actual filter. A filter of size 5 was used for every image size.

While the results are more even for smaller image sizes, the difference is remarkable when the size of the image increases, as the application running on 16 cores is nearly 16 times as fast for the largest image. Larger images bring more calculations, meaning more parts which can be parallelized. Miscellaneous operations, such as initialization and inherently serial parts, shrink in significance in comparison. In other words, more parallelizable calculations bring higher speedup.


7 Conclusion

After nine weeks of research and work, we have managed to make use of the parallelism of Epiphany, create an image filter with the help of OpenCV, and run extensive tests of its capabilities. Our objectives changed during the course of the project and we did not end up doing what we had planned from the beginning, but we still have solid results and a much better understanding of how the board works.

7.1 Project Summary

The project circled around setting up and testing the Parallella board, a prototype platform with extremely high calculating potential for its size and power usage. Its strength lies in the Epiphany accelerator, which for our model had 16 processor cores.

We began with the goal of parallelizing the PaPP programming environment on Parallella, but were soon informed that we would not have the opportunity to do this, and instead started working with OpenCV modules. The OpenCV modules of interest were the image processing algorithms, which are the key parts of the library, because the task of manipulating thousands of pixels is easily split into 16 independent tasks.

Since OpenCL is a relatively simple tool for dividing tasks on GPUs and accelerators, it was a fitting library for parallelizing.

We came to the conclusion that, because of heavy usage of calls between different OpenCV modules and a lack of comments, we could not learn and parallelize OpenCV in the designated time. We therefore decided to implement the same type of function as one of the OpenCV modules ourselves, making it run concurrently on Epiphany. The algorithm of our choice was median blur, as it is relatively easy to understand but has the potential of accelerating greatly on a multitude of cores.

At this, we succeeded. We made the image smoothing run significantly faster on Epiphany. We also implemented usage of the local Epiphany core memory to further accelerate the algorithm, which gave the expected results.

To conclude our work, we performed multiple tests, comparing the concurrent Epiphany algorithm with sequential versions running on the ARM processor on Parallella and on a regular i5 processor. The results showed that the code scaled very well on Epiphany.


7.2 Result Conclusions

After spending a lot of time with the Parallella board, doing research and writing code, we can conclude a few things. First of all, OpenCL together with stdcl provides an easy to use environment, and does not actually differ as much from regular C/C++ programs as we initially thought. The main difference is that the program is split into two parts: one for the host, written in C/C++, and one for the device, meaning Epiphany in our case, written in OpenCL C. Apart from some OpenCL/stdcl specific function calls, such as making sure memory is synchronized between host and device and specifying how the kernel is to be executed, the cl code performing all the calculations is basically just C code adapted to fit Epiphany, with a few calls to identify which core is executing.
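As a minimal sketch of what the device side can look like (the kernel and variable names here are illustrative, not our actual kernel), an OpenCL C file for Epiphany is essentially C plus work-item identification:

/* minimal device-side sketch: each of the 16 work-items, one per Epiphany
   core, handles its own band of rows; the filter itself is omitted here */
__kernel void median_kern_sketch(__global uchar *src, __global uchar *dst,
                                 int cols, int rows)
{
    /* identify which work-item (core) this is within the 4x4 range */
    int id = get_global_id(1) * get_global_size(0) + get_global_id(0);
    int ncores = get_global_size(0) * get_global_size(1);

    /* each core processes a contiguous band of rows */
    int band = rows / ncores;
    int first = id * band;
    int last = (id == ncores - 1) ? rows : first + band;

    for (int y = first; y < last; y++)
        for (int x = 0; x < cols * 3; x++)                  /* 3 channels per pixel */
            dst[y * cols * 3 + x] = src[y * cols * 3 + x];  /* filtering goes here */
}

On the host side, the clndrange_init2d(0, 4, 4, 0, 4, 4) call in our program (see the appendix) is what lays out this 4x4 grid of work-items, one per core.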

Even though OpenCL code does not differ much from C/C++ code, it still had to be adapted to Epiphany. OpenCL code can be made more portable, meaning it would work on more than one platform, but in our case we had to write it specifically for Epiphany. The main issue is the lack of a cache, as explained earlier, with each core only having 32 KB at its own disposal. In order to achieve good performance, each core has to locally store the data it needs instead of repeatedly reading from the memory shared with the host; otherwise, reading data takes a lot of time. This can be seen in some of our graphs, where the new memory allocation algorithm made a huge difference.
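The gist of that memory allocation change can be sketched as follows (illustrative code, not our actual kernel; the buffer size is an assumption, and in practice a slice that does not fit must be processed in chunks):

/* sketch of the local staging pattern: stage data into the core's own
   memory in bulk, work on the local copy, write the results back in bulk.
   The real budget is the 32 KB per core, which must also hold stack and code. */
#define LOCAL_BYTES 8192

__kernel void median_kern_local_sketch(__global uchar *src, __global uchar *dst,
                                       int cols, int rows)
{
    __private uchar buf[LOCAL_BYTES];   /* lands in the core's local SRAM */

    int id = get_global_id(1) * get_global_size(0) + get_global_id(0);
    int ncores = get_global_size(0) * get_global_size(1);
    int band = rows / ncores;
    int first = id * band;

    int bytes = band * cols * 3;                     /* this core's slice */
    if (bytes > LOCAL_BYTES) bytes = LOCAL_BYTES;    /* stay within local memory */

    /* one bulk read from slow shared memory into fast local memory */
    for (int i = 0; i < bytes; i++)
        buf[i] = src[first * cols * 3 + i];

    /* ... the filter would operate on buf here ... */

    /* one bulk write of the results back to shared memory */
    for (int i = 0; i < bytes; i++)
        dst[first * cols * 3 + i] = buf[i];
}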

After performing all the tests, we can conclude that Parallella with Epiphany certainly boasts impressive performance, if the code is written properly. One could not expect Epiphany to run 16 times as fast as the ARM, but we have still obtained results with a nine times speedup. This not only shows that Epiphany lives up to its promises, but also that OpenCL is a legitimate tool for parallelizing on it. Since time is only measured for the calculations, and these are basically fully parallelizable, a 16 times speedup could theoretically be expected; our 14 times speedup on 16 Epiphany cores compared to one core shows that the potential is there, but architecture and overhead simply make the theoretical ideal impossible to achieve in practice.
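A standard way to reason about this gap is Amdahl's law: if a fraction $p$ of the execution is parallelizable, the speedup on $n$ cores is bounded by

$$S(n) = \frac{1}{(1 - p) + p/n}.$$

We never measured $p$ directly, but as an illustration, even $p = 0.99$ on $n = 16$ cores caps the speedup at $1/(0.01 + 0.99/16) \approx 13.9$, which is in line with the roughly 14 times speedup we measured.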

Increasing the image size has a fairly predictable effect on the results, and the tests indicate better scaling for larger input. This is perfectly logical, considering the overhead for setting up threads and communication, and more than anything it proves that the program behaves as any concurrent program should. The simplicity and generality of OpenCL compared to other GPU or accelerator languages makes it a suitable tool for future projects of this type, considering the desirable results we have received.


It should be mentioned that our results have one additional implication. It might be questioned whether image processing is a reasonable use for Parallella. While other hardware can perform better, we have concluded that image processing can, with the right algorithm, be done effectively on the board. Even if we did not succeed in porting OpenCV modules, we have at least paved the way for it by testing the same kind of algorithm and making it accelerate on Epiphany.

In conclusion, our results are positive. They should be easy to replicate for anyone willing to commit to it, and can hopefully give some insight into the benefits of using Epiphany and OpenCL.

7.3 Goal Fulfilment

As explained earlier, the goal changed over the course of the project. We were initially going to get a programming environment working on Parallella with Epiphany, but as we lacked the tools to do so, we instead set out to parallelize OpenCV modules to take advantage of Epiphany. From doing this we have been able to observe how to write code that runs efficiently on Epiphany, especially with OpenCL.

Even though we concluded that a full port of OpenCV was well beyond what was manageable in our limited time, we have been able to show how modules can be written specifically for Parallella and Epiphany, making ports of other code easier in the future. By performing different tests we have also been able to observe how Parallella and Epiphany measure up to other platforms, which could be of interest when deciding whether Parallella is a worthwhile platform.

Our goal to parallelize efficiently on Epiphany has been fulfilled, even if not in the way we initially aimed to fulfil it. Since we were the pioneers of Parallella usage within SICS, our results should prove useful as benchmark values for what the platform is capable of, and the code should serve as an example of how OpenCL can be used to utilise the potential of Epiphany.

Under slightly different circumstances we could have done more than we did, but as far as the goals go, we strove to fulfil them and succeeded by the standards we set up.

7.4 Final Conclusion


We greatly appreciated the opportunity to work with something that hadn’t been done before, but communication and information were lacking at times.

Our failure to make a quicker decision and problems at SICS caused the project to turn out less useful than it could have been. There was no one to tell us that our plan wasn’t realistic, and we didn’t have the experience to immediately see this for ourselves.

Overall the project has been an enlightening experience, both as a programming job and as an exercise in adjusting to problems. We thank Mats Brorsson for providing the project and giving us the tools necessary, and Artur Podobas for being our supervisor and helping us whenever we asked.


8 Appendix

Listing 1: Host Code

/* median_host_mem_final.cpp */

#include <iostream>
#include "opencv2/imgproc/imgproc.hpp"
#include "opencv2/highgui/highgui.hpp"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdcl.h>
#include <sys/time.h>

using namespace std;
using namespace cv;

/* image window name */
char window_name[] = "Median Blur";

/* time to display image */
int DELAY_CAPTION = 250;

/* display image for delay amount of milliseconds */
int display_dst(int delay);

/* opencv image matrices, one for the original image and one for the modified image */
Mat mat_src;
Mat mat_dst;

int main()
{
    /* variables used for time measuring */
    struct timeval tv;
    double start, end;

    /* initialize window */
    namedWindow(window_name, WINDOW_AUTOSIZE);

    /* read image and also make a copy of it */
    mat_src = imread("/home/linaro/Test/lena.png", 1);
    mat_dst = mat_src.clone();

    /* display image */
    if (display_dst(DELAY_CAPTION) != 0) { return 0; }

    /* get image data and allocate space for it, do the same for the copy */
    cl_uchar *srcData = (cl_uchar *)clmalloc(stdacc,
        sizeof(cl_uchar) * mat_src.rows * mat_src.cols * 3, 0);
    memcpy(srcData, mat_src.data,
        sizeof(cl_uchar) * mat_src.rows * mat_src.cols * 3);
    cl_uchar *dstData = (cl_uchar *)clmalloc(stdacc,
        sizeof(cl_uchar) * mat_dst.rows * mat_dst.cols * 3, 0);
    memcpy(dstData, mat_dst.data,
        sizeof(cl_uchar) * mat_dst.rows * mat_dst.cols * 3);

    int srcCols = mat_src.cols;
    int srcRows = mat_src.rows;

    /* sync data with device memory */
    clmsync(stdacc, 0, srcData, CL_MEM_DEVICE | CL_EVENT_NOWAIT);
    clmsync(stdacc, 0, dstData, CL_MEM_DEVICE | CL_EVENT_NOWAIT);

    /* start timer */
    gettimeofday(&tv, NULL);
    start = tv.tv_sec + (tv.tv_usec / 1000000.0);

    /* apply filter */
    clndrange_t ndr = clndrange_init2d(0, 4, 4, 0, 4, 4);
    clexec(stdacc, 0, &ndr, median_kern_mem_final,
        srcData, dstData, srcCols, srcRows);

    /* sync data with host memory */
    clmsync(stdacc, 0, dstData, CL_MEM_HOST | CL_EVENT_NOWAIT);

    /* block until co-processor is done */
    clwait(stdacc, 0, CL_ALL_EVENT);

    /* stop timer */
    gettimeofday(&tv, NULL);
    end = tv.tv_sec + (tv.tv_usec / 1000000.0);

    /* copy modified data back to image matrix */
    memcpy(mat_dst.data, dstData,
        sizeof(cl_uchar) * mat_dst.rows * mat_dst.cols * 3);

    /* print time */
    printf("\ntime: %f\n", ((end - start) * 1000));

    /* display filtered image */
    if (display_dst(DELAY_CAPTION) != 0) { return 0; }

    /* free allocated memory */
    clfree(srcData);
    clfree(dstData);
}

/* display image */
int display_dst(int delay)
{
    imshow(window_name, mat_dst);
    int c = waitKey(delay);
    if (c >= 0) { return -1; }
    return 0;
}

References
