IT 17 027
Degree project 15 credits, June 2017
Parallel Performance Comparison between Encore and OpenMP using Pedestrian Simulation
David Escher
Department of Information Technology, Faculty of Science and Technology, Uppsala University
Abstract
Parallel programming has become ubiquitous and more than a preference, it is a necessity for utilizing hardware resources (multiple cores) in order to speed-up computation. However, developing parallel programs is difficult and error-prone. In order to make development of parallel programs easier and less error-prone, new programming languages are being developed. This thesis compares Encore, a new programming language, against C++ with OpenMP for parallelizing program execution. The focus of this comparison is the respective run-times and parallel scalability, when running a
pedestrian simulation program.
Subject reader: Alexandra Jimborean
Supervisor: Kiko Fernandez Reyes
Contents
1 Introduction
2 Background
  2.1 Pedestrian simulation
  2.2 OpenMP using C++
  2.3 Encore
3 Problem
  3.1 Problem specification
  3.2 Reference implementation
4 Design
  4.1 Baseline
  4.2 Optimizing the baseline
  4.3 Shared Lock Matrix
  4.4 Passive Agents and Active Regions
    4.4.1 Manager implementation
    4.4.2 Region implementation
5 Experiments and Results
  5.1 Parameter test
  5.2 Core usage test
  5.3 Performance analysis
  5.4 Hardware utilization test
6 Related work
  6.1 Survey of Active Objects
    6.1.1 Erlang and Akka
    6.1.2 Pony
7 Design Limitations
  7.1 Pedestrian simulation limitations
  7.2 Encore
  7.3 Actor reference memory usage
8 Discussion
  8.1 Object orientation and threads
  8.2 Lightweight threads
9 Conclusions
10 Future work
11 Appendix
  11.1 Encore Version
  11.2 XML instance data
  11.3 Benchmark Machine
  11.4 Tables
1 Introduction
In this thesis we attempt to give some indication of whether Encore's [4] parallel performance makes it a viable alternative among programming languages. We measure parallel scalability and hardware utilization. As a reference for comparison, an existing C++ [24] implementation using OpenMP is used. C++ is a language used in many high-performance settings, and OpenMP [10] is a technology developed to make it easier to parallelize code.
2 Background
The Encore language is currently under development, yet earlier tests on several benchmarks [26] show that Encore is on par with Erlang and Scala in terms of speed, but has higher memory usage in some cases.
2.1 Pedestrian simulation
Pedestrian simulations are typically used for simulating people moving in spaces, and can be used for testing the architecture of locations for cases like emergencies [5]. To compare Encore and C++, a rudimentary pedestrian simulation is used as a benchmark comparing different aspects of the languages' performance, such as run-time, strong scalability, and hardware utilization. The requirements of this simulation are described in further detail later in this paper (Section 3.2).
2.2 OpenMP using C++
OpenMP [10] is a framework for parallel programming, which mainly provides fork-and-join style parallel abstractions. In this thesis we use the C++ [24] binding for the framework. C++ is a language widely used in industry, especially in cases where performance is an important priority.
2.3 Encore
Encore is a programming language under development at Uppsala University. Its paradigm is called active-object programming: a combination of the actor model from concurrent programming and the object-oriented programming paradigm. Encore's so-called active objects are the main way to express concurrent computation in the language; an active object is an object with its own single thread of control for manipulating its data.
The language uses lightweight [7] threads, similar to those found in, for example, the Erlang run-time, and binds each of them to run the methods of one object. One big advantage of this is that an active object's data will only ever be handled by that thread; any other object that wants to read or write that data needs to send a message to the actor holding the object and await a reply in order to use the object's data. This, in combination with the Kappa [6] type system, prevents low-level data races and requires programmers to write explicit methods for sharing data.
As an abstraction for asynchronous method invocation, Encore uses futures, which are wrappers around possibly undelivered data. Every active-object method, when called by another active object, returns a value of type Future t, where t is the type of the value returned by the method. The main advantage of this is that it makes all method calls on active objects asynchronous by default, while synchronous behavior can still be easily achieved.
As an example, a clock implemented as an active object in Encore might have a method time() which returns the current time. If some other object calls clock ! time(), the call is translated to a remote method invocation: the active object clock runs time() locally and sends the result back to the caller. The calling thread immediately gets a value of type Future Time, representing the fact that, at some later point, a value of type Time will be delivered to it.
Encore supports several operations on futures. The two used in this work are get :: Future t -> t, which blocks until the value is delivered and then returns it, and await :: Future t -> unit, which suspends the current active-object method, allowing the object to run other methods; when the future is fulfilled, execution resumes from where await was called.
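Encore's get has a rough analogue in std::future from the C++ standard library, the reference language of this thesis. The sketch below is our own illustration, not Encore code: the clock service, its method names, and the returned time string are all invented for the example. (C++ has no direct analogue of await, which suspends only the current method rather than the whole thread.)

```cpp
#include <chrono>
#include <future>
#include <string>
#include <thread>

// Hypothetical clock service, analogous to an Encore active object: calling
// async_time() is like `clock ! time()` -- it returns a future immediately,
// while the value is produced on another thread and delivered later.
std::future<std::string> async_time() {
    return std::async(std::launch::async, [] {
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
        return std::string("12:00");
    });
}

// Analogous to Encore's get :: Future t -> t: block until the value is
// delivered, then return it.
std::string blocking_get() {
    return async_time().get();
}
```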
Encore also supports what it calls passive objects, which are very similar to objects [23] in many other object-oriented languages, such as C++ and Java [2]. As Encore is in development, future versions of the language might provide additional useful constructs and might behave differently. However, those future versions are not considered in this thesis, mainly because the true implications of future language extensions are difficult to predict.
3 Problem
In order to compare Encore to C++, this thesis uses a reference implementation of a pedestrian simulation written in C++ using OpenMP, and implementations in Encore mimicking the reference as closely as possible. The implementations are then compared in terms of runtime, hardware utilization, and scalability in terms of parallelism.
3.1 Problem specification
This specific pedestrian simulation has been used as an assignment program to parallelize in a course at Uppsala University, and the reference implementation is from that course. It is an adaptation of Christian Gloor's "pedsim" project¹. From the course instructions: "This project includes software from the PEDSIM simulator, a microscopic pedestrian crowd simulation system, that was originally adapted for the Low-Level Parallel Programming course 2015 at Uppsala University."
The pedestrian simulation's coordinate system is two-dimensional and uses discrete coordinates, similar to a chessboard. Each agent occupies one coordinate pair, and only one agent can occupy a given pair at a time; much like chess pieces on a board, there can be only one piece per square, and a piece is inside exactly one square. Agents move one unit in any of eight directions, the same movement pattern as a king in chess: one square, to any of the eight squares surrounding it. Agents move in a time-sliced fashion: every agent moves once (or attempts to move and decides not to) before any of them attempts to move again. They move towards way-points defined in the instance data and can be moved in an arbitrary order.
An agent attempts to move towards its current way-point by checking first the most direct route, and then the two neighboring locations. As an example, if an agent's way-point is directly to its right, it will first attempt to move to the right; if that location is taken, it moves to either its bottom-right or top-right location, if either is available. If none of these locations is available, it will not move. The case where an agent discovers that it cannot move to a location is referred to as a collision in this paper.
An agent is considered to have reached its way-point when it is within the way-point's radius, and will then start to move towards its next way-point. If its way-points run out, it starts over by moving towards its first way-point again.
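The movement rule above can be sketched in C++, the reference language. This is our own illustration, not the reference code: the function and type names are invented, and `taken` stands in for whatever collision check an implementation uses.

```cpp
#include <functional>
#include <utility>

using Pos = std::pair<int, int>;

// Pick an agent's next position: try the most direct step towards the
// way-point first, then the two neighboring alternatives; if all three are
// occupied the agent stays put (a "collision" in this paper's terminology).
Pos next_position(Pos agent, Pos waypoint, const std::function<bool(Pos)>& taken) {
    auto sgn = [](int v) { return (v > 0) - (v < 0); };
    int dx = sgn(waypoint.first - agent.first);
    int dy = sgn(waypoint.second - agent.second);
    if (dx == 0 && dy == 0) return agent;   // already at the way-point

    Pos dirs[3];
    if (dx != 0 && dy != 0) {       // diagonal step: fall back to the two axes
        dirs[0] = {dx, dy}; dirs[1] = {dx, 0}; dirs[2] = {0, dy};
    } else if (dy == 0) {           // horizontal step: fall back to diagonals
        dirs[0] = {dx, 0}; dirs[1] = {dx, 1}; dirs[2] = {dx, -1};
    } else {                        // vertical step: fall back to diagonals
        dirs[0] = {0, dy}; dirs[1] = {1, dy}; dirs[2] = {-1, dy};
    }
    for (const Pos& d : dirs) {
        Pos cand{agent.first + d.first, agent.second + d.second};
        if (!taken(cand)) return cand;
    }
    return agent;                   // all candidates occupied: do not move
}
```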
3.2 Reference implementation
The reference implementation uses XML files (Section 11.2) describing starting positions and way-points for agents as input data, and simulates agents for 10000 time steps, where one time step is each agent attempting to move one step.
The implementation uses "fork-and-join" [22] parallelism, moving all agents in a parallel code region and then joining before moving all agents again. A join is performed between each time step to ensure that all movement of one time step is completed before the next time step starts. In
¹ PEDSIM project website, http://pedsim.silmaril.org/, accessed 2017
the parallel code region, the simulation space is divided into regions that are processed in parallel by different threads. Regions split into smaller or merge into larger regions to balance out their size; this is done in a serial code region before each tick starts. For agents moving across region borders, atomic operations are used.
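The fork-and-join time-step structure described above can be sketched with OpenMP as follows. This is a minimal sketch with placeholder move logic, not the course's actual code; the region re-balancing is omitted.

```cpp
#include <vector>

struct Agent { int x = 0; int y = 0; };

// One parallel region (fork) per time step; the implicit join at the end of
// the `parallel for` guarantees that every agent has finished moving before
// the next step starts. Region splitting/merging would happen serially
// between the steps.
void simulate(std::vector<Agent>& agents, int time_steps) {
    for (int t = 0; t < time_steps; ++t) {
        // serial section: split/merge regions to balance their size
        #pragma omp parallel for
        for (long i = 0; i < static_cast<long>(agents.size()); ++i) {
            agents[i].x += 1;   // stand-in for one agent's move attempt
        }
        // implicit join here: all threads synchronize before the next step
    }
}
```

Compiled without OpenMP support the pragma is ignored and the loop runs serially, which is also the semantics the join enforces.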
The main challenge in making an efficient implementation for this problem is handling agent collisions effectively. According to Orion Sky Lawlor [17], collision detection for this type of problem is possible in O(N) time, even if the worst case can be O(N²), where N is the number of agents per time step. This is important to know when designing the simulation, as any design with a worse theoretical time complexity is very unlikely to perform well. While this problem differs from the one in Lawlor's paper, many of the same principles apply. One of the main differences is that Lawlor is trying to detect collisions, whereas we are trying to avoid them.
4 Design
Because of Encore’s design, the problem has to be re-modeled from the
”fork-and-join” reference implementation. This is mainly as Encore does not support operations like ”fork” and ”join” effectively, and although they can be emulated, they would most likely not be efficient. In the following sections, each presents a different way to design a pedestrian simulation in Encore.
When considering these designs, we focus on some key properties. One is a low number of messages sent per time step, as message sends and receives are synchronization constructs, which should be avoided if possible. We also want a design that can do a lot of work in parallel, which means many actors with messages to process. There is an important distinction here: we want many actors to have messages to process, but we do not want to send many messages, so messages that do more work will likely make the implementation more efficient. Lastly, we want a design with low time complexity, although it can be difficult to discern the time complexity of some of these algorithms in a meaningful way.
4.1 Baseline
As a baseline implementation in Encore, for comparison with other Encore implementations, all agents are modeled as active objects, with no shared data structure for detecting collisions. The agents are required to send messages to all other agents, checking if the positions they want to enter are clear.
Considering that the baseline implementation requires all agents to broadcast to all other agents on each step to detect collisions, the number of messages sent and the time complexity of the implementation are O(T·N²), where N is the number of agents and T is the number of time steps. This is because an agent, before it moves, needs to send one message to each of the other agents to learn its position. The time complexity of each move operation is therefore O(N), which makes the time complexity of all move operations in one time step O(N²).
4.2 Optimizing the baseline
In order to reduce the amount of message passing required, agents can keep a local list of the minimum distance between themselves and other agents, and only send messages to update that list and detect collisions when the minimum distance is small. This list contains the Manhattan distance between the agent and each other agent, and each entry is reduced by two in every time step, pessimistically assuming the agents are moving towards each other. Under that assumption, an agent pair cannot collide while their Manhattan distance is greater than 1. When the distance is equal to or smaller than 1, the agent sends a message to recalculate the actual distance and thereby avoid potential collisions.
This increases the memory complexity of the agents (as each needs a local distance list) while it can decrease the required amount of message passing significantly. Depending on the instance data, this optimization can cause very few messages to be sent, but the worst case is still O(T·N²).
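The distance-list bookkeeping can be sketched as follows. This is our own C++ illustration of the idea described above, with invented names; the actual Encore prototype exchanges messages between active objects instead.

```cpp
#include <cstddef>
#include <vector>

// Each agent keeps a lower bound on its Manhattan distance to every other
// agent. The bound shrinks by 2 per time step (pessimistically assuming both
// agents moved one step towards each other); only when it drops to 1 or
// below must the other agent be messaged for its actual position.
struct DistanceList {
    std::vector<int> lower_bound;          // one entry per other agent

    // Advance one time step; return the agents that must be re-queried.
    std::vector<int> tick() {
        std::vector<int> must_query;
        for (std::size_t i = 0; i < lower_bound.size(); ++i) {
            lower_bound[i] -= 2;
            if (lower_bound[i] <= 1) must_query.push_back(static_cast<int>(i));
        }
        return must_query;
    }

    // Reset a bound once the other agent's actual position is known.
    void refresh(std::size_t i, int actual_manhattan) {
        lower_bound[i] = actual_manhattan;
    }
};
```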
The problem with this design is that its memory complexity is O(N²), and that the data contains a lot of active-object references. This has a special interaction with the Encore run-time, as discussed later in this paper (Section 7.3).
4.3 Shared Lock Matrix
Instead of agents sending messages directly to other agents, a shared data structure maintaining information about the agents' positions can be used: in this case, a matrix of "positions", each representing one particular coordinate. This implementation can have very high memory complexity if the space agents can occupy is big. Its main advantage is that each agent is guaranteed to be able to make its move using a constant number of messages, which gives an O(N) worst-case time and message complexity. Its performance also depends on how effectively this large matrix can be shared between agents.
This is implemented in Encore as a matrix of active objects, each holding a boolean representing whether or not a position is taken. All agents, which are still active objects, send messages to the "position" agents, requesting to move into their positions. This guarantees that no two agents are in the same position, as the "position" active objects only allow an agent to move to their position if it is not taken by another agent. It also guarantees that an agent's move can be done in constant time, by sending messages to all of its desired positions, giving us the O(N) worst-case time and message-passing complexity.
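In C++ terms, the position matrix behaves like a grid of occupancy flags claimed atomically. The sketch below is our own analogy with invented names: where the Encore design uses one tiny active object per cell to serialize "may I move here?" requests, compare_exchange_strong plays that role here.

```cpp
#include <atomic>
#include <vector>

// Shared lock matrix: one flag per discrete coordinate. At most one agent
// can ever claim a given cell, mirroring the guarantee the "position"
// active objects provide by serializing their message queue.
class LockMatrix {
    int width_;
    std::vector<std::atomic<bool>> taken_;
public:
    LockMatrix(int width, int height) : width_(width), taken_(width * height) {
        for (auto& t : taken_) t.store(false);
    }

    // Returns true if this caller won the cell, false if it was occupied.
    bool try_claim(int x, int y) {
        bool expected = false;
        return taken_[y * width_ + x].compare_exchange_strong(expected, true);
    }

    void release(int x, int y) { taken_[y * width_ + x].store(false); }
};
```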
This design exhibits the same kind of memory usage behavior as the previous implementation. Because of this, any similar implementation was also considered infeasible to implement effectively. See Section 7.3 for more details.
4.4 Passive Agents and Active Regions
A different approach to this problem is to divide the space the agents traverse into regions. Each region is an active object that knows about its neighbors and is responsible for moving the agents inside it. When modeling the problem this way, messages are only needed for synchronizing the regions between moves and for handling agents crossing the border between regions.
This is also the approach used by Orion Sky Lawlor [17] as an efficient parallel collision detection algorithm. It is the only design that has been fully implemented and used in the benchmarks, as it was significantly faster than all other designed prototypes.
4.4.1 Manager implementation
As the regions are only allowed to move all their agents once before moving them again, some entity is required to synchronize the regions. We call this a manager; it is also responsible for creating all the regions, assigning them parts of the coordinate space, and introducing them to their neighbors.
In order to determine how big the 2D space is, the manager finds the extreme values of the coordinates of all agents' starting positions and way-points.
When subdividing the space, it is important to remember which run-time we are working with. As Encore operates using lightweight threads, our main concern is to create enough regions. According to Orion Sky Lawlor [17], the best results for similar run-times using similar abstractions are typically achieved with somewhere between 10 and 10000 regions per core of the target machine. This is of course highly dependent on the specific run-time and problem, but in general the number of active objects should be much higher than the number of cores used on the target system.
It’s also worth noting that the cost of having uneven workload between
the regions is quite cheap. This is because the run-time will try to find a
even load balancing between the cores using active objects as work units so
having one region take more time than other is only a problem if the time
Figure 1: An example subdivision with parameter value 4, the 4 by 4 grid of rectangles all have the same size (red), except for the topmost (yellow) and rightmost row (blue) that have differing heights or widths. Only one rectangle (green) can have both a differing height and width
used by that active object is a big in relation to the total load of one core.
Therefore, we split the 2D space into many uniform rectangles, organized in a grid pattern. In order to make the code adaptable to different architectures, it takes a parameter proportional to the number of regions created: for parameter value N, N² rectangles are created. The regions' width is the total width of the simulation divided by N, and their height the total height divided by N. This division is not always exact, as we are working with discrete coordinates, so to handle uneven subdivisions the topmost row and rightmost column of regions are slightly larger or smaller than the rest, as illustrated in Figure 1.
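The subdivision can be sketched as follows; this is our own C++ illustration of the scheme just described, with invented names, in which the rightmost column and topmost row absorb the division remainder.

```cpp
// Compute the rectangle of region (i, j) in an n-by-n subdivision of a
// W-by-H discrete space. All rectangles are floor(W/n) wide and floor(H/n)
// tall, except that the rightmost column and topmost row take whatever
// remains, as in Figure 1.
struct Rect { int x, y, w, h; };

Rect region_rect(int i, int j, int n, int W, int H) {
    int rw = W / n, rh = H / n;
    Rect r{i * rw, j * rh, rw, rh};
    if (i == n - 1) r.w = W - i * rw;   // rightmost column: differing width
    if (j == n - 1) r.h = H - j * rh;   // topmost row: differing height
    return r;
}
```

For example, splitting a width of 10 with parameter 4 gives columns of width 2, 2, 2 and 4, tiling the space exactly.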
Each of these rectangles is assigned a region and filled with the agents residing within its coordinates. When the simulation starts, the manager sends a message to all regions telling them to move all their agents one step, awaits all of them doing so, and then sends another set of messages. It is also responsible for visualization and timing, when those are used.
4.4.2 Region implementation
As movement is eight-directional, each region needs to know about up to eight adjacent regions to handle agents; fewer if it is on the edge of the simulation. In addition, it needs some way of efficiently handling local agent movement.
For internal movement, each region needs a data structure storing all the agents' current positions, with quick look-up and update times. The chosen data structure is a quadtree; its main advantage is that it is very effective for storing sparse and empty regions [21].
The regions also need a second data structure for holding the agents. As agents might move from one region to another during any time step, quick remove and insert times are needed. Because agents can be moved in an arbitrary order, but only once in each time step, a quick way to iterate over this data structure is also required. The choice made for this structure is a linked list.
Each region has four linked-list pointers: New, pointing to the first newly arrived agent; Last, pointing to the most recently arrived agent this time step; Old, pointing to the first of the current agents in the region; and Current, pointing to the first agent left to move in the region.
When one of a region's neighbors sends a request to move an agent into the region, collision detection is done by the receiving region looking up the desired position of that agent in the quadtree. If that spot in the quadtree is empty, it is filled and the agent is added to the New linked list. This means that the agent will not be moved by this region (as it has already been moved from the neighbor), but its collisions will still be detected (as it is inside the quadtree used for this).
The region moves its agents one at a time, using Current as an iterator. In case one of the agents wants to move outside of the current region, the region determines which of the neighbors contains the desired position of that agent, and sends a request for that agent to move there. The region then awaits the reply, a language construct meaning that it will only handle incoming requests from other regions and not attempt to make any local progress until its request is fulfilled. Once it is time to move another step, the Old list is appended after Last, and New becomes the new Old pointer.
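The list bookkeeping described above can be sketched as follows. This is our own C++ illustration; the pointer names follow the text, everything else is invented.

```cpp
// A region's agent list: a singly linked list with four pointers. Old holds
// the agents owned at the start of the tick, Current the next agent left to
// move, New the agents that arrived from neighbors this tick, and Last the
// tail of the New list, for O(1) appends.
struct AgentNode { int id; AgentNode* next = nullptr; };

struct RegionLists {
    AgentNode* old_head = nullptr;
    AgentNode* current  = nullptr;
    AgentNode* new_head = nullptr;
    AgentNode* last     = nullptr;

    // A neighbor moved an agent into this region: append it to the New list.
    void arrive(AgentNode* a) {
        a->next = nullptr;
        if (last) { last->next = a; }
        else      { new_head = a; }
        last = a;
    }

    // End of time step: Old is appended after Last, New becomes the new Old.
    void end_of_tick() {
        if (last) {
            last->next = old_head;
            old_head = new_head;
        }
        new_head = last = nullptr;
        current = old_head;   // every agent is left to move next tick
    }
};
```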
This design is made in part to keep the complexity of the regions low, as having multiple outstanding requests at once would require more logic and overhead in the regions. It also guarantees that the maximum number of unfulfilled requests in the system is lower than the number of regions, which eliminates the possibility of a large number of messages consuming large quantities of memory under some circumstances. One of the main concerns if regions could have more than one unfulfilled request is that it might be difficult to find an upper bound for how many unfulfilled requests can exist in the system. This in turn can lead to unpredictable memory usage if, for example, some unlikely scheduling causes many requests to be created but not fulfilled.
Most agents will not move over the edge of any region, and therefore their movement can be done completely locally, by simply updating the local quadtree.
Figure 2: All the data structures contained in a region
5 Experiments and Results
Experiments were conducted on a 64-core "bulldozer" machine. More details about the machine can be found in the Appendix (Section 11.3). In the tests, min is the smallest time measured, max is the greatest, and geomean is the geometric mean, i.e. the N-th root of the product of N samples: the geomean of samples x_1, x_2, ..., x_N is (x_1 · x_2 · ... · x_N)^(1/N). All experiments are conducted using the Passive Agents and Active Regions implementation, as earlier testing showed that it was the only viable design in terms of both speed and memory usage. The data used for creating the graphs in this section can be found in the Appendix (Section 11.4).
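For reference, the geometric mean can be computed via logarithms so the intermediate product cannot overflow. This is a small C++ sketch of the definition above, not the scripts actually used for the benchmarks.

```cpp
#include <cmath>
#include <vector>

// Geometric mean: the N-th root of the product of N samples, computed as
// exp(mean(log(x_i))) to avoid overflowing the running product.
double geomean(const std::vector<double>& xs) {
    double log_sum = 0.0;
    for (double x : xs) log_sum += std::log(x);
    return std::exp(log_sum / static_cast<double>(xs.size()));
}
```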
The two scenarios tested are using 4000 and 16000 agents with densities of 0.12 agents per area unit and 0.44 agents per area unit respectively.
5.1 Parameter test
In this experiment we vary the number of regions in order to find the optimal number for this architecture. All 64 cores were used, and the results are displayed in Figure 3 and Figure 4.
This test quite clearly shows that the optimal configuration for the Encore implementation uses only one parallel region, which is roughly equivalent to a serial implementation (although not nearly as efficient, due to overheads from the manager). This is unexpected, as the implementation is designed to be parallel and should be able to use all 64 cores with a different parameter value. It suggests that the parallelization overhead is greater than the speedup achieved by parallelization for the Encore implementation.
Figure 3: Run-times for Encore varying the number of regions, reporting the lowest run time out of 10 independent runs.
Figure 4: Run-times for Encore varying the number of regions, reporting the geomean, minimum and maximum run time of 10 independent runs
Figure 5: Run-times for Encore and OpenMP implementations, varying the number of cores. Tests show the lowest, greatest and geomean run times of 10 independent runs
5.2 Core usage test
In this test, the number of cores used was varied in order to investigate the optimal number for both the reference and Encore implementations. The number of regions used by the Encore implementation is the optimum found in the previous test, excluding parameter value one, since with only one parallel region there is no possibility of a performance increase from more cores. The parameter values are therefore 16, which translates to 16² = 256 regions, for 4000 agents, and 9, i.e. 9² = 81 regions, for 16000 agents.
The results are displayed in Figure 6 and Figure 5. The best time achieved for 4000 agents is 10 seconds for OpenMP using one core, and 110 seconds for Encore using 25 cores. The best time achieved for 16000 agents is 48 seconds for OpenMP using one core, and 365 seconds for Encore using 32 cores. This gives us some more important information. First, it shows that the Encore implementation does benefit from an increased number of cores, at least for the tested parameter values. It also shows that the reference implementation does not benefit from using many cores; it actually slows down quite significantly as the number of cores increases. This seems to indicate that, for the reference implementation, the parallelization overhead outweighs the speedup from using many cores.
Figure 6: Run-times for Encore and OpenMP implementations, time in seconds, varying the number of cores. Tests show the lowest recorded time of 10 independent runs
5.3 Performance analysis
In order to investigate the dominant factors in the Encore implementation's run-time, performance analysis was done using a tool called perf [25]. This produced a breakdown, in percent of runtime, per method and function in the implementation (Figure 8). The breakdown is difficult to understand without seeing the code and knowing the language; therefore, Figure 9 explains what the perf results mean. The most interesting entry is the bottom one in the breakdown. As a quadtree lookup is done every single time an agent attempts to move, it is expected to account for a significant portion of the total region-local computation. This is also what we see in the breakdown: it is the most time-consuming Encore method (i.e. a method or function written by the programmer rather than the language implementers). However, it accounts for a total of 1.7% of runtime, less than two different garbage-collection methods and one message-passing method, each of which consumes an order of magnitude more runtime.
The number of messages passed by the system is non-deterministic, as the simulation is non-deterministic and messages are only sent when agents cross the border between regions, plus once per time step between the manager and the regions. This gives a trivial upper bound of O((N + R) · T) messages, where N is the number of agents, T is the number of time steps and R is the number of regions. However, in practice
Figure 7: Various hardware counters, blue is using 4000 agents, 1 core, and parameter value 9, orange is 16000 agents, 64 cores, and parameter value 9
the number of messages sent is much lower, as most agents will not move between regions and can therefore be moved without any message passing.
This means we can assume that quadtree lookups are done more often than message sends. Using this estimate for the scenario profiled in perf gives (16000 + 81) · 10000 ≈ 1.6081 · 10⁸ as an upper bound on how many messages are passed during the run. Dividing by the geomean runtime for this configuration measured in the core test, 402.7 seconds, gives the number of messages sent per second, 399329.52. Dividing by the number of cores used (64) gives an upper bound of 6239.5 messages per second per core. As we know from perf that message passing takes at least 20% of the total runtime, this is interesting; however, with nothing to compare it to, it is difficult to say whether that is good or bad.
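The arithmetic above can be reproduced directly; the following is a sketch with an invented function name, using the figures from this section.

```cpp
// Upper bound on messages per second per core: at most (N + R) * T messages
// in total, divided by the measured runtime and the number of cores used.
double messages_per_core_per_second(long agents, long regions, long steps,
                                    double runtime_s, int cores) {
    double upper_bound = static_cast<double>(agents + regions) * steps;
    return upper_bound / runtime_s / cores;
}
```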
The perf method breakdown supports the idea that parallelization overhead dominates the runtime. The exact values obtained from perf vary depending on the scenario and parameter value, but the ordering of the methods seems fairly stable between configurations, and we have yet to find a configuration where quadtree lookup consumes more than 5% of the runtime.
5.4 Hardware utilization test
In this test, both implementations were run and performance data was gathered from a number of hardware counters using perf.
Figure 8: Perf runtime breakdown, 16000 agents, 81 parallel regions. This image has been edited to only show the most time-consuming methods.
Figure 9: This figure shows the methods from Figure 8, with their names replaced by their function within the application
This test shows that Encore has more branch misses and cache-load misses than the reference implementation. This is expected, as the Encore implementation uses linked lists and quadtrees, which tend to be less cache-friendly than the arrays mostly used in the reference implementation.
As the Encore and OpenMP implementations are so different, and the differences, while significant, are expected given the differences between the implementations, we cannot draw any conclusions about the relative hardware utilization of the languages.
6 Related work
This work is not the first attempt to benchmark Encore's performance against other languages; it is designed to complement the earlier work of Mikael Östlund [26]. In that paper [26], Encore is compared to Scala and Erlang over three different benchmarks.
In the Big benchmark, agents are divided into large groups and send a lot of messages between random actors in the group. In that benchmark Encore, while faster than the other languages, does not achieve speedup for all configurations, with the 3-core configuration resulting in a slowdown compared to the one-core configuration. Memory tests of the benchmark show that Encore consumes about one order of magnitude more memory than Scala and Erlang for all problem instances. The second benchmark, Chameneos, is designed to measure fairness for shared resources; here Encore has no memory issues and performs about the same as the other two languages in terms of run time and scalability, while for the fairness measurement Encore seems better than the other two. The last benchmark is a parallel version of the Fannkuch benchmark, designed to test parallel performance. In this test, Scala is faster than the other two, but all three are very close in terms of speedup.
The pedestrian simulation used in this thesis is derived from PEDSIM, a pedestrian simulation package designed by Christian Gloor, which was used as an assignment project in the course Low-Level Parallel Programming during the spring of 2016. Christian Gloor has published several papers [14] [13] about pedestrian simulation. These papers largely focus on techniques for realistically simulating pedestrians, and on models that are realistic enough to be useful while performing well enough to be usable.
However, these models include much more complex agent behavior, such as social forces and acceleration, which makes their techniques difficult to apply directly to the pedestrian simulation used in this thesis. Also, while parallel computation is mentioned in Christian Gloor's papers, there is no data documenting its benefits, and the quote "As of now, the update is sequential; for our problems, no large differences to parallel update were observed" (page 5) [13] seems to indicate that parallelizing pedestrian simulation is not necessarily beneficial even for other pedestrian simulators.
6.1 Survey of Active Objects
Encore is one of the few active-object programming languages currently in existence. In this section we explore some of the similarities and differences between Encore and two existing technologies for actor-based parallel programming, Erlang and Akka, as well as Pony, the language associated with the runtime Encore is based on.
6.1.1 Erlang and Akka
Erlang [1] is a programming language originally developed for the telecoms industry. Erlang is a functional programming language [3] and supports lightweight threads [7]. Erlang threads communicate using asynchronous messages, and the language has an explicit receive construct for reading them.
A common way to structure Erlang code is to have the lightweight threads act as servers: each holds some data, waits for messages, and replies to requests. This is very similar to how active objects operate. However, because Erlang supports explicit messages and lightweight threads, this is by no means the only way to structure code.
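The server pattern described above can be sketched in plain Java (used here instead of Erlang purely for illustration). The class, the message types, and the single-threaded mailbox loop are all hypothetical stand-ins for an Erlang process mailbox and its tail-recursive receive loop:

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.IntConsumer;

// A single-threaded sketch of the Erlang "server" pattern: an actor
// holds some state and reacts to mailbox messages one at a time.
// The names CounterServer, Increment and Get are invented for this sketch.
class CounterServer {
    interface Msg {}
    record Increment(int by) implements Msg {}
    record Get(IntConsumer reply) implements Msg {}

    private final Queue<Msg> mailbox = new ArrayDeque<>();
    private int count = 0;

    // "Sending" enqueues a message; in Erlang this would be the ! operator.
    void send(Msg m) { mailbox.add(m); }

    // Drain the mailbox, updating state and replying to requests,
    // mimicking a tail-recursive Erlang receive loop.
    void run() {
        while (!mailbox.isEmpty()) {
            Msg m = mailbox.poll();
            if (m instanceof Increment inc) count += inc.by();
            else if (m instanceof Get get) get.reply().accept(count);
        }
    }
}
```

The state is only ever touched from the server's own loop, which is precisely the property that makes this pattern resemble an active object.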
Akka [15] is a framework whose design goals overlap considerably with Erlang's, as it also focuses on building fault-tolerant programs using actors. Akka is a framework for the JVM (the Java Virtual Machine) and can be used in combination with either Java [2] or Scala [20], which both compile to the JVM. Depending on which language is used, the code can be object-oriented, functional, or both.
In order to prevent data races in both Erlang and Akka, only immutable data should be shared; however, in Akka there is currently no way to enforce this [19]. Erlang and Akka messages are simply data, delivered asynchronously to some actor.
Akka and Erlang both use an explicit receive construct for reading messages from their mailbox. One important difference is that Akka's receive construct is an object method, which implies that there is only one per actor. This is different from Erlang's receive, which is a flow-control construct similar to a match statement, except that the data to match comes from the mailbox. This matters because Erlang threads can encounter multiple receive clauses matching different message types, whereas Akka actors can only have one. Akka receive patterns are also required to be exhaustive [18] [15], meaning that an Akka actor must be able to handle any type of message sent to it. Erlang receives have no such requirement, and can therefore leave messages that never get processed. Akka typically builds fault-tolerant systems by designing a hierarchy of actors, which in Erlang is called the supervisor pattern.
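The difference between the two receive styles can be illustrated in Java (the Msg, Ping and Stop types are invented for this sketch, not Akka or Erlang API): an Akka-style handler must produce an answer for every message, while an Erlang-style selective receive simply does not match some of them.

```java
import java.util.Optional;

// Akka-style vs Erlang-style receive, sketched with hypothetical messages.
class ReceiveDemo {
    interface Msg {}
    record Ping() implements Msg {}
    record Stop() implements Msg {}

    // Akka-style: the one handler per actor covers every message type;
    // an unhandled message would be a design error.
    static String akkaStyle(Msg m) {
        if (m instanceof Ping) return "pong";
        if (m instanceof Stop) return "stopping";
        throw new IllegalStateException("Akka requires every message handled");
    }

    // Erlang-style: a selective receive matches only some messages;
    // anything else would remain in the mailbox, possibly forever.
    static Optional<String> erlangStyle(Msg m) {
        if (m instanceof Ping) return Optional.of("pong");
        return Optional.empty(); // Stop is simply never consumed here
    }
}
```

In real Erlang the unmatched message stays queued rather than being returned as empty, but the asymmetry in coverage is the point of the sketch.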
One important difference between Akka actors and most other Java objects is that they are not automatically garbage collected, but must be destroyed manually once they have outlived their purpose [18] [15]. This is similar to Erlang, as Erlang actors are just lightweight threads that terminate when they run out of code to execute; Erlang therefore has no concept of actors being garbage collected, as they are not data structures. Both differ from Encore, where active objects are garbage collected when no longer needed. One effect of this is that Encore actor references are guaranteed to reference a live actor, while Erlang and Akka actor references are not.
Both Java and Scala have type systems, and we will not go into the details of either of them in this thesis. Erlang, however, does not have a static type system, although there are tools such as Dialyzer [16] that can detect many of the problems a type system would.
Because both Java and Scala support objects, Akka actors are also Java objects, meaning that they have state and methods. However, there are some big differences between Akka's object-based actor model and Encore's active-object model.
One of the main differences is that Encore messages cannot easily be handled in any order other than first come, first served: Encore messages are generated by calling methods asynchronously, so there is no receive construct in which such an order could be specified. Encore messages also have compile-time defined types, as an active object can only ever be sent messages matching the parameters of its methods. Erlang and Akka actors, by contrast, can be sent arbitrary messages (any term, and any object that should by convention be immutable, respectively).
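This typing difference can be roughly sketched in Java (TypedCounter and untypedMailbox are illustrative names, not Encore or Akka API):

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;

// Encore-style vs Erlang/Akka-style message typing, sketched in Java.
class TypingDemo {
    // Encore-style: a "message" is an asynchronous method call, so the
    // payload type is checked at compile time. CompletableFuture stands
    // in for the future such a call returns.
    static class TypedCounter {
        private int count = 0;
        CompletableFuture<Integer> add(int n) {
            count += n;
            return CompletableFuture.completedFuture(count);
        }
        // A call like add("oops") would be rejected by the compiler.
    }

    // Erlang/Akka-style: a mailbox of Object accepts arbitrary messages;
    // a wrong payload is discovered, if ever, only at run time.
    static final Queue<Object> untypedMailbox = new ArrayDeque<>();
}
```

The trade-off runs both ways: the typed form rules out malformed messages statically, while the untyped mailbox allows protocols that evolve without recompiling every sender.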
6.1.2 Pony
Pony is a language from which Encore has taken some components. Unlike Akka and Erlang, Pony is still a language under development. It promises [9] [11] to be type, memory, and exception safe, as well as data-race and deadlock free. The Pony language does seem to rest on quite solid theoretical ground [9] [8]. However, as it is still in development it has some problems; to quote its website: "If your project is going to succeed or fail based on the size of community around the tools you are using, Pony is not a good choice for you. While it's possible to write stable, high-performance applications using Pony, you will have to do a decent amount of work" [11]. Just like Encore, Pony supports embedding C libraries [11], which does make it easier to use, but it comes nowhere close to the amount of tools and libraries available for the JVM or Erlang.
The main difference in focus between Pony and Erlang/Akka is that Pony wants to catch all errors at compile time, while the Erlang/Akka approach is to fail fast [1] and then recover from failure quickly and easily. This is what makes Pony a bold new idea, and only time will tell whether catching all problems at compile time is a feasible and useful approach to program development.
7 Design Limitations
7.1 Pedestrian simulation limitations
The pedestrian simulation specification used for this performance benchmark does not seem to offer any potential for parallel performance improvement, at least not for any of the tested configurations. This means that the design is far from the fastest Encore program for this particular problem: a simple serial implementation, which avoids the overhead of message passing, is faster and uses less memory, as it does not create any additional actors.
7.2 Encore
As Encore is in constant development, the results of these experiments might change with future updates to the language. There are also several other abstractions for parallel computation in development that might improve the language's performance. One of these is "hot objects", a type of multi-threaded active object that could be used to implement a parallel quadtree (or similar structure), making implementations similar to the shared lock matrix implementation viable. Another useful language extension is parallel combinators [12], which offer a different way to describe parallel computations.
Later versions of Encore will also likely feature better optimizations, as well as features like a standard library offering highly optimized data structures. All of these are expected to make Encore implementations faster relative to other languages.
7.3 Actor reference memory usage
In the current version of Encore, indirect references to active objects have a memory cost, in the sense that if you have, for example, an array of N active objects, each reference to that array increases the program's memory usage by an amount proportional to N. An indirect reference is here defined as follows: if an active object A can send a message to another active object B, then A has an indirect reference to B.
This is one of the reasons the shared lock matrix and the other baseline optimizations are infeasible to get working at any decent efficiency. The behavior is a side effect of how the garbage collection system operates, and might change in future versions of Encore. It has been reported to the language developers 2 .
8 Discussion
8.1 Object orientation and threads
One of the main differences between using OpenMP and Encore is that in OpenMP thread synchronization constructs are built in, and thread communication is not directly available to the programmer. In Encore, by contrast, the programmer has to set up a sound message-passing system between their active objects. This has the advantage of being very flexible. However, it also means that Encore programmers need to be able to build sound message-passing systems, which can be difficult. For example, designing a message-passing system where forward progress is guaranteed and an upper bound for memory usage can be calculated might seem trivial, but it really is not, and without such guarantees I would not call a design sound.
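One way to obtain a calculable upper bound on memory usage is to cap the mailbox and let sends fail (or otherwise apply back-pressure) when the cap is reached. The following Java sketch is hypothetical; it is not how Encore, Erlang, or Akka mailboxes behave by default, where mailboxes are unbounded:

```java
import java.util.ArrayDeque;
import java.util.Optional;
import java.util.Queue;

// A mailbox with a hard capacity, giving the system a calculable upper
// bound on buffered-message memory: at most `capacity` messages per actor.
class BoundedMailbox<A> {
    private final Queue<A> queue = new ArrayDeque<>();
    private final int capacity;

    BoundedMailbox(int capacity) { this.capacity = capacity; }

    // Back-pressure: the sender learns immediately when the box is full,
    // instead of the queue growing without bound.
    boolean send(A m) {
        if (queue.size() >= capacity) return false;
        queue.add(m);
        return true;
    }

    Optional<A> receive() {
        return Optional.ofNullable(queue.poll());
    }
}
```

A rejected send forces the sender to decide what to do (retry, drop, or slow down), which is exactly the kind of explicit design decision the paragraph above argues a sound message-passing system requires.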
One of the main problems encountered during the development of this benchmark is the risk that a message of an unexpected type arrives.
In particular, the region implementation's worker threads have two different states: one while working on moving agents, and one while waiting for the next "move" message. While in the state of moving agents, a worker needs to be able
2