
Degree project in Computer Science, second cycle

Stockholm, Sweden 2013

Load-balancing of Molecular Dynamics by Subdivision And Distribution of Domains in Large Multi-Core Systems

PATRIK SEIM


Master's Thesis at NADA
Supervisor: Michael Schliephake

Examiner: Prof. Erwin Laure

TRITA xxx yyyy-nn


Abstract

Molecular dynamics simulates the behaviour of particles in interaction by calculating their trajectories using computationally demanding numerical methods. A molecular dynamics application often takes advantage of modern computer architectures with large multi-core systems by subdividing the simulated scenario into smaller domains. The subdivided domains are distributed over a number of instances running in parallel to gain computational performance. A problem with domain decomposition is uneven particle distribution, which can cause load imbalance. This report suggests a solution based on further subdivision and distribution of domains at the application level.


Referat

Load balancing of molecular dynamics by subdivision and distribution of domains in large multi-processor systems

Molecular dynamics simulates the behaviour of interacting particles by calculating particle trajectories using computationally heavy numerical methods. A molecular dynamics application often takes advantage of large multi-processor systems by subdividing the simulation scenario into smaller domains. The subdivided domains are distributed over instances that run in parallel. A problem that can arise from domain decomposition is uneven load due to the particle distribution in the simulation scenario. This report proposes a way to solve this through further subdivision and distribution of heavily loaded domains.


Acknowledgements

I would like to express my gratitude to my supervisor Michael Schliephake for the useful discussions, advice and engagement throughout the learning process of this master's thesis. I would also like to thank my parents, who supported me financially and made it possible to finish the project.


Contents

1 Introduction
  1.1 Background
  1.2 Purpose
  1.3 Scope
  1.4 Method
  1.5 Problem Description

2 Theoretical Reference
  2.1 Molecular Dynamics
    2.1.1 The Linked Cell Method
    2.1.2 Domain Decomposition
  2.2 Execution Environment
    2.2.1 Hardware Description
    2.2.2 Software Description
    2.2.3 Measured Network Characteristics
  2.3 Measurement Concepts
    2.3.1 Speed versus Efficiency
    2.3.2 Scalability
  2.4 Related Work
    2.4.1 In The Area of Load Balancing
    2.4.2 In Other Areas

3 Design and Implementation
  3.1 Preliminaries
    3.1.1 Communication Restrictions
    3.1.2 Other Restrictions
  3.2 Reference Model
    3.2.1 General Description
    3.2.2 Calculation of Particle Data
    3.2.3 Domain Decomposition
    3.2.4 Border Exchange
  3.3 Movable Domain Model
    3.3.1 General Description
    3.3.2 Tree Management
    3.3.3 Domain Decomposition
    3.3.4 Buffer Management
    3.3.5 Border Exchange
    3.3.6 Workload Scheduling
    3.3.7 Domain Factors
  3.4 Implementation
    3.4.1 The Module Description
    3.4.2 The Functional Description

4 Evaluation
  4.1 Measurements
    4.1.1 Measurement Method
    4.1.2 Full Volume Comparison
    4.1.3 Half Volume Comparison
    4.1.4 Quarter Volume Comparison
    4.1.5 Speedup And Efficiency Measurement
    4.1.6 Strong Scaling Measurement
    4.1.7 Weak Scaling Measurement
    4.1.8 Weak Scaling Comparison
    4.1.9 Dynamic Movement Comparison
    4.1.10 Real Scenario Comparison
  4.2 Visual Comparison

5 Epilogue
  5.1 Conclusion
  5.2 Future Works
    5.2.1 System Dependent
    5.2.2 Structure Dependent
    5.2.3 Implementation Dependent

References


Definitions

This list gives a brief explanation of common definitions and terms used throughout the document.

model A theoretical description of a method to solve a specific problem.

application The implemented version of the model used for evaluation and performance measurement.

instance A copy of the application running on its own processor core with a unique rank as identifier, executing a subdivided part of the total computational work.

grid A pattern of cells with uniform dimension which divides an area or volume.

domain A property containing the boundary and dimension of the grid.

subdomain A subdivided domain or recursively subdivided subdomain containing the boundary, dimension and neighbour properties of the local grid.

distributed grid A subdivided subdomain with corresponding grid hosted by another instance than the instance owning the subdomain from which the grid was created.

MPI A standard message passing interface to simplify communication among instances running in parallel through identification of a unique rank.

processor A unit containing one or several cores used for computational purposes, sharing the same external memory.


Chapter 1

Introduction

This chapter gives a brief description of the background and purpose of the research.

The chapter also defines the problem and the scope of the investigations.

1.1 Background

Molecular dynamics is a method to numerically simulate the behaviour of molecules and particles using Newton's laws of motion. For as long as molecular dynamics has been a research area, scientists and researchers have improved the methods of simulation. The computers used for running molecular dynamics applications have improved, and the computation has become faster. The computer architecture has also developed towards large multi-core systems containing thousands of computational units. A computational unit consists of one or several processors sharing the same memory, and each processor has one or several cores for computing.

Computational units are normally connected to each other via a fast network.

The development of the hardware has forced the software to take advantage of the architecture through parallelisation. The computational work is subdivided among instances of the application running in parallel on separate cores. The computed result is communicated among the instances. Running a number of instances in parallel with varying amounts of computational work causes the execution time of the whole application to be at least the time of the slowest running instance. A more even distribution of workload among instances can therefore improve the total execution time.

A question that arises is why bother with balancing the workload when there are thousands of cores to run the application on. The answer is that a large multi-core system also has many users sharing the computational units. The degree of utilisation has to be as high as possible for efficiency. Claiming a large part of the computational units for one job means that it can take days to empty the system before the job can run. Many computational units become idle during this draining phase, which lowers the throughput of jobs. It can be more efficient to run the job on a part of the system with good balancing of the workload.


1.2 Purpose

The purpose of the work is to find a method of load balancing using task-based distribution. The work is based on a solution with task-based distribution of computational workload for a runtime system [25] developed by the Centre for High Performance Computing (PDC) at the Royal Institute of Technology in Stockholm, but simplified and adapted to the application level for molecular dynamics.

1.3 Scope

The model is limited to the study of task-based distribution of workload; the goal is not to find the most efficient way to balance the load. The model should also be adaptable to modern high-performance computer architectures with multi-core processors and high-speed network communication.

1.4 Method

The considered method is to implement an application with the functionality of the theoretical model and measure the performance of the implementation. This approach has some disadvantages. The most efficient way to implement a solution will probably not be found, which affects the performance measurements. The measurement results could also be affected by the workload and utilisation of the computer.

A reference application with the same functionality regarding calculation and communication must also be implemented. The reference application should be implemented without any mechanism for balancing the load. A good approach is to implement the reference application first and then modify a copy with the load-balancing model.

1.5 Problem Description

Many applications take advantage of modern computer architecture with computational units and multi-core processors using workload decomposition and parallelisation. A molecular dynamics application often makes the decomposition at the domain level. The computational work is generated by the calculation of properties for particles. Particles are often randomly distributed in a domain, which can cause load imbalance under domain decomposition.

We have strived to develop a model that works at the application level for portability between platforms. The only requirements should be a C compiler and support for MPI 2.0. There are two main goals for the development.

• The first goal is to preserve the method of domain decomposition.

• The second goal is to preserve the method of border communication.


Chapter 2

Theoretical Reference

This chapter gives a brief theoretical background to the area of research. There are three main areas of interest.

• Molecular dynamics

• Execution environment

• Measurement concepts for parallel applications

The chapter also gives a brief introduction to related work in the area of load balancing.

2.1 Molecular Dynamics

Molecular dynamics has been a research area for almost sixty years, with many contributors who have improved the methods and performance of computer-based simulations involving particles and molecules [14]. The basic algorithm is still the same for a simulation using the Monte Carlo method [2]. A particle acts as a potential in interaction with other particles. The potential creates a force. The interaction with the other particles in the simulation generates a resulting vector, which is the sum of all forces affecting the particle. The resulting force creates a velocity which affects the movement of the particle. With the created velocity, a new position is calculated from the previous position over a short interval. The algorithm has a complexity of $O(n^2)$ since the interaction is calculated between all pairs of particles in the simulation [14]. The calculation cycle is shown in figure 2.1.

There are methods that can increase the performance of the particle calculation. One of the most common methods is the linked cell method.

2.1.1 The Linked Cell Method

The linked cell method builds on the fact that the force calculated from the potential of a particle in interaction with another particle is short ranged.


Figure 2.1. The figure shows the calculation cycle for the force $F$, velocity $v$ and position $x$ of a particle $i$ in a molecular dynamics simulation. The distance $r_{ij}$ is calculated as the difference between the positions of two particles, $\|x_j - x_i\|$. The potential $U_{ij}$ of a particle creates a force depending on the distance. The force affecting a particle creates a velocity. A new position can be calculated after a time interval $\Delta t$. Particles leaving a cell are moved to a new cell. The calculation of particles is the core functionality of a molecular dynamics application.

The distance between two particles beyond which the calculated force is regarded as negligible is referred to as the cutoff radius $r_{\mathrm{cutoff}}$ [14].

Molecular dynamics using the linked cell method has a number of cells of uniform dimension arranged as a grid over the simulated domain. A particle in the simulation is represented by mass, velocity, position and force. Particles are assigned to cells depending on their current position in the domain. The dimension of a cell is typically at least the cutoff radius. Setting the cell dimension according to this criterion makes it possible to limit the number of cells included in the force calculation of a single particle. This is shown in figure 2.2, where only the neighbouring cells are included in the force calculation.

Calculations are made cyclically with a time interval $\Delta t$ according to figure 2.1. When all positions have been calculated, the assignments to cells are checked. If a calculated position corresponds to another cell in the grid than the one currently assigned, the particle is moved to the new cell. A particle leaving one side of the grid is reentered from the opposite side of the grid to maintain a constant number of particles in the


Figure 2.2. Short range refers to the maximum distance $r$ between two particles $x_i$ and $x_j$ at which the force $F$ generated by the potential of the particles still has an impact on the resulting force vector. The calculation can be made more efficient by using a grid to subdivide particles according to position and only calculating forces between the particles within the radius $r_{\mathrm{cutoff}}$ in the neighbouring cells. The gray area indicates the cutoff radius of the particle in the centre. The size of each cell is equal to or larger than the cutoff radius. The force generated from interaction between the particle in the centre and particles outside the gray area is regarded as negligible.

simulation [14].
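To make the reassignment step concrete, the sketch below shows one way to compute the cell of a particle from its position and to reenter a particle that has left the grid from the opposite side. It is only a minimal illustration under assumed names (reassign_cell, a 2D grid of nx by ny cells) and is not the implementation described in [14].

    #include <math.h>

    /* A minimal sketch (2D, hypothetical names): nx x ny cells of side
       cell_size >= r_cutoff cover a periodic domain of size
       (nx*cell_size) x (ny*cell_size). The function wraps a particle
       that has left the domain back in from the opposite side and
       returns the index of the cell it now belongs to. */
    static int reassign_cell(double pos[2], int nx, int ny, double cell_size)
    {
        double len[2] = { nx * cell_size, ny * cell_size };
        int idx[2];
        for (int d = 0; d < 2; d++) {
            pos[d] = fmod(pos[d], len[d]);      /* periodic wrap-around  */
            if (pos[d] < 0.0)
                pos[d] += len[d];
            idx[d] = (int)(pos[d] / cell_size); /* cell index in dim. d  */
        }
        return idx[1] * nx + idx[0];            /* row-major cell number */
    }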

2.1.2 Domain Decomposition

The usage of the linked cell method divides the simulation domain into a large grid with many cells. The linked cell method also narrows the force calculation down to a cell and its closest neighbouring cells. This makes it possible to parallelise the application by subdividing the grid into several smaller grids, as shown in figure 2.3. Several copies of the application run a smaller part of the particle calculation, each on a separate core. The drawback is that the border cells of the neighbouring instances' grids have to be included in the force calculation. The instance must also transfer particles leaving cells in the vicinity of the borders to the neighbouring instances. The border cells can be communicated using message passing, shared memory or replicated data [8][12].

A molecular dynamics application spends approximately 90 percent of the execution time in force calculation according to performance profiling. The execution time can therefore benefit greatly from domain decomposition and parallelisation. A drawback is the increasing border communication among instances competing for shared resources. Communication over shared resources has an impact on execution time when the number of instances grows large.


Figure 2.3. The figure shows a subdivision of the grid structure containing the simulation scenario. (1) The complete simulation scenario executed on a single processor. (2) The multi-core version, where the grid is subdivided into instances running on separate processor cores, each calculating a part of the total workload and communicating borders [14]. The concept is known as domain decomposition [8][12].

2.2 Execution Environment

The implementation of an application has to take advantage of the architecture of the execution environment to achieve good performance. The design depends on the memory architecture, processor architecture, file system and network topology. Other essential factors are the operating system and the software available for I/O.

2.2.1 Hardware Description

The execution environment is Lindgren [26] at the Royal Institute of Technology in Stockholm, a Cray XE6 machine [9] which consists of 1516 computing nodes connected to each other by a 3D torus network. Each computing node consists of two twelve-core AMD Opteron processors (Magny-Cours with a clock speed of 2.1 GHz) connected to each other via two HyperTransport links, as figure 2.4 shows. Each processor has two dies with six execution cores sharing a 12 MB L3 cache. Each processor is connected to a separate 16 GB DDR3 memory module, giving a total of 32 GB of memory per computing node with non-uniform memory access (NUMA). This gives a total of 24 cores per computing node.


Figure 2.4. Schematic view of a computing node. The node consists of two physical CPU units, each with two dies of six cores sharing the L3 cache. The whole machine consists of 1516 such computing nodes connected via a 3D torus network.

2.2.2 Software Description

The Cray XE6 machine consists of 1516 computing nodes running Compute Node Linux (CNL) and 20 service nodes running SUSE Linux Enterprise Server 11 SP1 as operating system. The PGI compiler pgi/12.5.0 and the environment Pgi-Env/4.0.46 are used for compilation of the software. The MPI implementation uses MPICH 2.0 as the platform for the communication system.

2.2.3 Measured Network Characteristics

When the number of processors grows large, communication among instances of an application running in parallel is one factor that has a large impact on the execution time. Using a test application based on MPI [1] to measure communication times among computing nodes during a communication-intensive session gives the characteristics shown in figure 2.5. The test application sends 500000 messages, each with 10000 bytes of data, from all even core numbers to all odd core numbers, and measures the communication time between core eight on one computing node and core eight on all other computing nodes. The communication time differs significantly between two runs with the same configuration (white and gray bars). The number of computing nodes involved in the communication also has a great impact on the execution time (shown by the dark bars). The computing nodes' location in the network also has an impact on the measured result.


Figure 2.5. The figure shows the communication time among computing nodes on the network (execution time for 500000 messages in seconds, per computing unit in sequence number). All cores with even core numbers send 500000 messages of size 10000 bytes to cores with odd core numbers. The communication time is measured from core eight to the corresponding core on each other computing node. The measurement is performed for ten computing nodes (white and gray bars, measurements 1 and 2) and twenty computing nodes (dark bars, measurement 3). As shown in the figure, the execution time can vary significantly depending on the computing node configuration and the message intensity.

Since the computing nodes for the execution of a job are assigned by the scheduler, performance measurement results can differ between runs of an application. The deviation in the measurement results also increases with the number of computing nodes (shown by the white and gray bars compared to the dark bars).

The test shows that performance is affected by heavy communication, which could have an impact on the measured results. It also shows that the location of the computing nodes in the network has an impact on the measured result, since the assigned set of nodes differs between measurement 1 and 2. This is an extreme scenario constructed to show the impact of heavy communication in the network.

2.3 Measurement Concepts

There are some basic concepts in performance measurement that describe the behaviour of a parallel application. Parameters of importance are speedup, efficiency and scalability. The following sections give a brief introduction to these parameters.


2.3.1 Speed versus Efficiency

Running an application in parallel by subdividing the computational workload can make the application faster in terms of wall-clock time, but it increases the amount of computational resources used. Communicating particle data cyclically between instances means that a sequential part is introduced into the computation through synchronisation. This leaves the application with a minimum execution time which cannot be improved by parallelisation. The execution time of the sequential part can also increase due to growing communication and management overhead when a large number of instances are running. The sequential part is what prevents the speedup of the application from scaling with the number of processing elements.

Amdahl's law [4][24] defines the speedup $S_p$ that can be gained from parallelising an application distributed over $p$ processing elements, with regard to the sequential part $f$, as:

$$S_p = \frac{1}{f + \frac{1-f}{p}} < \frac{1}{f}$$

Measured in execution time, the speedup is the time $t_1$ it takes for the best sequential version [24] to execute compared to the execution time $t_p$ of a parallel version on $p$ processors:

$$S_p = \frac{t_1}{t_p}$$

The efficiency $E$ relates the speedup gained from increasing the number of processors to the number of processors used for the computation [24]. The following relation between the speedup $S_p$ gained from parallelisation and the number of processors $p$ holds:

$$E_p = \frac{S_p}{p}$$

As the formula indicates, an efficiency of one is achieved by a sequential application. Since there is almost always a part of the application that has to run sequentially, an efficiency of less than one can be expected from a parallelised version of the same application.
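As a small worked example (the numbers are chosen only for illustration and are not measurements from this work): with a sequential part $f = 0.1$ and $p = 24$ cores, Amdahl's law gives

$$S_{24} = \frac{1}{0.1 + \frac{0.9}{24}} \approx 7.3, \qquad E_{24} = \frac{S_{24}}{24} \approx 0.30,$$

so even a modest sequential part keeps both speedup and efficiency far below the number of cores.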

2.3.2 Scalability

Scalability describes the behaviour of an application when the number of processing elements is increased. The gain in speedup becomes smaller the more processing elements are added, but can increase temporarily if the size and workload of the simulation scenario are increased [24].

Gustafson's law [24][15] defines the speedup $S_p$ gained from increasing the problem domain for a specific number of processors $p$, with regard to the sequential part $f$, as:

$$S_p = p - f(p - 1)$$


Gustafson's law makes the assumption that the sequential part $f$ is constant and independent of the size of the problem [24][15]. The speedup increases with the time the application runs and approaches $p$ as the execution time goes to infinity, because the sequential part $f$ becomes negligible. Compared to Amdahl's law, the sequential part still leaves a minimum execution time, but it becomes less significant the longer the application runs.

The scalability is measured with two basic methods, weak scaling and strong scaling.

Weak scaling

The weak scaling measurement assumes that the workload per processor is constant while the number of processors increases [24]. The method reveals overhead in the computation that can appear due to a larger number of instances. Increased border communication and instance matching are two factors that affect the execution time under this condition.

Strong scaling

The strong scaling measurement assumes that the total workload of the application is constant and thus decreases per instance with an increased number of processors [24]. The method reveals limits due to constraints in the problem domain. The size of the grid and the possibility to split the grid further is one of them. The ratio of calculation workload to border communication is another source of increased execution time measured under this condition.

2.4 Related Work

This section gives a brief description of the different techniques used to solve the load-balancing problem in several well-known molecular dynamics applications. Some of the methods are suitable for a large number of processors. The section also gives a description of work that has been done in other areas of interest.

2.4.1 In The Area of Load Balancing

Most modern applications for molecular dynamics provide some form of load balancing where the workload is distributed among instances of the application running in parallel. One approach is that of the application NAMD2 [18], where the workload is represented as patches and force calculations distributed among the instances. The technique is implemented using a programming library known as CHARM++ [13], which has been in development since 1980 and is available in many multi-core environments. NAMD2 uses one initial load balancing and dynamic load balancing throughout the simulation.


Another approach is the technique used by earlier versions of GROMACS [16] with staggered grids. The application measures the execution time of the force calculation and adjusts the size of the grid in the appropriate direction. Adjusting the grid means that a part of the workload close to the grid border is distributed to another instance on another processor. The workload is balanced dynamically. A similar technique with flexible domain borders, which traces and divides particle congestion, is also described in [11].

EulerGromos, described in [8], uses a technique of domain decomposition to balance the load. The workload is calculated after a predefined number of steps and compared globally among all instances. The balancing is achieved by adjusting the size of the domain.

The PMD application [27] uses Voronoi decomposition instead of the more conventional cubic decomposition to balance the workload. Each processor is assigned a position in the grid, which defines a polyhedron relative to the other processor positions. The polyhedra can be varied depending on the time each processor spends in potential evaluation, providing both static and dynamic load balancing.

MarDyn [21] uses a k-d tree to decompose the domain into cubes, taking the cost of communicating the divided sections into account. The k-d tree [7] is a structure that uses a binary tree to search for the closest point to a reference point. According to the report, the MarDyn application scales extremely well for up to 100000 processor cores.

The usage of a binary-tree-controlled flexible grid to adapt to particle congestion is also described in [20], where the size of an area in the grid is adjusted to minimise the workload. This method relies more on shared memory locally while communicating border adjustments globally.

A similar spatial domain decomposition along all axes is also used in Sigma, described in [23], where border adjustments are based on the estimated future work calculated from previous measurements.

The usage of subdivided domains for load balancing is described in [10]. The implemented application uses one fine-grained grid for domain decomposition and one coarse-grained grid to which the processors are connected. The balancing is performed by assigning domains to processors through adjustment of borders in the coarse-grained grid to include or exclude the locations of domains.

Load balancing has also been implemented at the Centre for High Performance Computing (PDC) at the Royal Institute of Technology in Stockholm as a runtime system, scheduling programmer-selected computational tasks on other processors according to the technical report [25]. The system consists of three main components for gathering, analysing and monitoring the execution of an application. The implementation has been tested together with a molecular dynamics application and measured for performance.

A common thread regarding load balancing is the movement of computational workload from a processor with a high workload to a processor with a lower workload. Load balancing is performed both statically, to provide an initially even workload distribution, and dynamically, to respond to a changing scenario.


2.4.2 In Other Areas

The usage of a tree to control the domain decomposition is a method to balance the workload used in [20][23]. The method of recursive tree decomposition by splitting the grid area or grid volume has been described in [6]. The split technique is used to subdivide an area or volume orthogonally over all dimensions. Using a split technique subdivides an area or a volume faster than common binary division.

The domain decomposition for parallelisation builds on [8], which stores the complete simulation on each processor and replicates data, and on the geometric parallelisation described in [12], with cubes assigned to specific processors. Both models are refined in [14], which presents a domain decomposition using domain factors for the number of processors used in the simulation. Only the data that correspond to the subdivided domain are stored in the memory used by the instance for that domain.

A theoretical model of a finite automaton is described and implemented in the book [17] for the interpretation of languages. The theoretical model of an instruction-generating machine is described in the book [22], where instructions are translated to their meaning and executed on an abstract machine. Parsing, interpretation and context-free grammars are also described in the book [5], where expressions and statements are expanded and replaced with new expressions and statements.


Chapter 3

Design and Implementation

This chapter gives an introduction to the theoretical model that was developed in the research. The chapter starts with a discussion of some limiting elements that affect the design. It continues with a description of the reference model used for measurement comparison and the functionality it provides. The new and extended functionalities of the developed model are then presented. Finally, the main issues regarding the implementation of the application are presented.

Two models are presented. One is a reference model where the core functionality for molecular dynamics and domain decomposition is implemented. Improvements for reading and writing data are added, as well as functionality for domain factor calculation.

The other model, referred to as the movable domain model because of its ability to distribute parts of a subdomain to other instances, is an extension of the reference model. It contains the core functionality, but a number of new functionalities are added and some existing ones are extended.

3.1 Preliminaries

This section describes the prerequisites that restrict the choice of design for the application model. There are basically two elements that have an impact on the implementation: the architecture of the hardware and the way the application communicates.

3.1.1 Communication Restrictions

The architecture of the execution environment is one major factor to take into consideration in the design of the application. The application uses domain decomposition as the method to parallelise the execution. Each subdivided domain executes on a separate core and exchanges particle data using border communication.


Thread Based Communication

Communication among threads [3] is limited to shared data structures, which in some cases have to be accessed through critical sections. Within a computing node, this access is provided by the shared memory. To extend the communication beyond the computing node, other technologies, such as message passing using MPI or shared memory with SHMEM, have to be used. There is a restriction in the execution environment of Lindgren that prevents simultaneous access to MPI from multiple threads. This limits the usability of threads for communication between computing nodes.

There are other limitations regarding threads. Using threads forces the usage of barriers, which tests have shown to be very slow. The barriers are used to protect common data structures such as the linked cell structure. They are also used to synchronise each step in the calculation cycle to eliminate the possibility that a thread uses old data for calculation.

MPI Based Communication

MPI provides a mechanism for communication which is independent of the location of an application in the execution environment. It is also a common interface standard that exists on most multi-core high-performance computers. MPI has its own mechanisms to preserve data integrity both within a computing node and between computing nodes. It also defines blocking operations for synchronous communication as well as non-blocking operations for asynchronous communication.

3.1.2 Other Restrictions

Since the research method builds on a comparison between the developed model and a reference model, some basic functionality has to be implemented in the same way in both. The functionalities affected by this restriction are the calculation part, the particle distribution part and the selection of parameters.

Particle Distribution

Particle data used in the simulation are read from a separate file in order to make it possible to easily change the simulation scenario. The particle data are read once by the instance with the lowest rank and distributed to the other instances based on the position $x_j$ of the particle $j$. The rank of the destination instance is calculated from the total number of instances and how the domain decomposition is performed, taking into account the size of the subdomain $s_s$, the size of the cell $c_s$ and the dimension $dim$ of the subdomain:

$$i = \left\lfloor \frac{x_j}{c_s} \right\rfloor \cdot \frac{1}{s_s}$$

and the index $i$ gives the rank of the instance as follows:

$$rank = i_{[1]} + \sum_{d=1}^{dim-1} \left( np_{[d]} \cdot i_{[d+1]} \right)$$

where $np$ is the instance mesh and $\prod_{d=1}^{dim} np_{[d]}$ is the number of subdomains.
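A minimal C sketch of this rank calculation for the two-dimensional case is given below. The function name and the row-major ordering of the instance mesh are assumptions made for illustration, not the thesis implementation.

    #include <math.h>

    /* Destination rank of a particle at position x (2D case).
       cs    : cell size
       ss[d] : subdomain size in cells in dimension d
       np[d] : number of instances (subdomains) in dimension d */
    static int destination_rank(const double x[2], double cs,
                                const int ss[2], const int np[2])
    {
        int i[2];
        for (int d = 0; d < 2; d++)
            i[d] = (int)floor(x[d] / cs) / ss[d];  /* subdomain index */

        /* rank = i[1] + np[1] * i[2] in the notation of the formula above */
        return i[0] + np[0] * i[1];
    }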

Domain Factors

The domain subdivision factors are calculated by the application based on the dimension and are a multiple of the number of instances running in parallel. The selection of factors from possible candidates tries to create a divisible grid with the shortest possible side. This functionality is further described in the movable domain model.

3.2 Reference Model

This section gives an introduction to the reference model used for comparison of measured data. The model is used as a base model where the techniques for calculation and communication are kept for every derivative.

3.2.1 General Description

The reference model builds on chapters 3 and 4 of the book by Griebel et al. [14]. The model was chosen because it is adapted to modern computer architecture using domain decomposition. The book also contains a thorough description of how to build the basics of molecular dynamics.

The basic implementation of the algorithms for force, velocity and position calculation using the linked cell method is used. The vector structures for storing and handling particle data are also used. The model is parallelised using the described technique for domain decomposition, and the border communication is also kept from the algorithms in the book.

The functionalities that have been added to the reference model which are not described in the book are the following.

• automatic selection of domain factors

• reading and distribution of particle data

• reading and distribution of parameter data

• gathering of simulation data to be written

The functionalities are the same in both the reference model and the developed model to avoid differences in measured execution time.


3.2.2 Calculation of Particle Data

The calculation of particle data is performed cyclically using the linked cell method. The particle data consist of the mass $m$, force $F$, velocity $v$ and position $x$ of a particle. The data are stored as linked lists in a vector: a linked list represents a cell, and the vector represents the domain grid of cells.
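A minimal C sketch of such a storage layout is given below; the type and field names are hypothetical and only mirror the description above (a vector of cells, each cell a linked list of particles).

    #define DIM 2   /* 2 for a two-dimensional simulation, 3 for three */

    typedef struct {
        double m;            /* mass                                    */
        double x[DIM];       /* position                                */
        double v[DIM];       /* velocity                                */
        double F[DIM];       /* force of the current time step          */
        double F_old[DIM];   /* force of the previous time step, kept
                                for the velocity update                 */
    } Particle;

    /* One node of the linked list representing a cell. */
    typedef struct ParticleNode {
        Particle p;
        struct ParticleNode *next;
    } ParticleNode;

    /* The grid of a (sub)domain: a plain vector with one list head per
       cell, indexed as described in section 3.2.3.                     */
    typedef ParticleNode **CellVector;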

Force Calculation

The force calculation is based on the Lennard-Jones 12-6 potential with the parameters epsilon $\epsilon$ and sigma $\sigma$. The formula for the potential with regard to the distance $r_{ij}$ between the particles $i$ and $j$ is given as follows:

$$U(r_{ij}) = \alpha \epsilon \left( \left( \frac{\sigma}{r_{ij}} \right)^{12} - \left( \frac{\sigma}{r_{ij}} \right)^{6} \right)$$

where $\alpha = \frac{1}{12-6}\left(\frac{12^{12}}{6^{6}}\right)^{1/(12-6)} = 4$. The $\sigma$ gives the zero crossing of the potential function and the $\epsilon$ gives its depth. The resulting force affecting a particle is calculated as the sum of the potentials from all other particles located within the cutoff distance. The force is the sum of the gradients of the potential for each interacting particle with regard to the position [14]:

$$F_i = \sum_{j=1,\, j \neq i}^{N} -\nabla_{x_i} U(r_{ij})$$

This gives the following formula for the force calculation:

$$F_i = 24 \epsilon \sum_{j=1,\, j \neq i}^{N} \frac{1}{r_{ij}^{2}} \left( \frac{\sigma}{r_{ij}} \right)^{6} \left( 1 - 2 \left( \frac{\sigma}{r_{ij}} \right)^{6} \right) r_{ij}$$

Velocity Calculation

The calculation of the velocity is performed using the Velocity-Störmer-Verlet algorithm [14]. The velocity is updated from the force using the following formula:

$$v_i^{n} = v_i^{n-1} + \left( F_i^{n} + F_i^{n-1} \right) \frac{\Delta t}{2 m_i}$$

Position Calculation

The position is updated using the Velocity-Störmer-Verlet algorithm [14]. The update of the position is calculated according to the following formula:

$$x_i = x_i + \Delta t \left( v_i + \frac{F_i \Delta t}{2 m_i} \right)$$
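A compact C sketch of these two updates for a single particle is shown below, reusing the hypothetical Particle layout sketched earlier in section 3.2.2. In the thesis implementation the updates are applied to every particle of every cell in the grid; here only the per-particle arithmetic is illustrated.

    /* Position update of the Velocity-Stoermer-Verlet scheme. The old
       force is saved before the new force is computed for the next step. */
    static void update_position(Particle *p, double dt)
    {
        for (int d = 0; d < DIM; d++) {
            p->x[d] += dt * (p->v[d] + p->F[d] * dt / (2.0 * p->m));
            p->F_old[d] = p->F[d];
        }
    }

    /* Velocity update, combining the old and the newly computed force. */
    static void update_velocity(Particle *p, double dt)
    {
        for (int d = 0; d < DIM; d++)
            p->v[d] += (p->F[d] + p->F_old[d]) * dt / (2.0 * p->m);
    }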


Figure 3.1. The domain index of the subdomain implemented in the reference model. (1) lower global index indicates the location of the subdomain in the original grid. (2) start index of the simulation area local to the instance grid. (3) stop index local to the instance grid. (4) number index indicates the width of the simulation area including borders for communication. Using the indices in this way makes it possible to use a vector to implement the cell system.

3.2.3 Domain Decomposition

The reference model uses the domain decomposition method shown in figure 2.3 as the only method to gain computational speed. The domain is subdivided into smaller subdomains based on the domain factors. The subdomains are built as vectors of linked lists, with indices defining the simulation space and the border space according to figure 3.1. The index of the cell system in the grid consists of a tuple representing the subdomain space in each dimension. The lower global index is the lower left corner of the simulation space and is mapped to the index in the original domain which corresponds to the subdomain. The start and stop indices give the boundary of the simulation space. The number index gives the boundary of both the simulation and the border space, and the product over the dimensions $dim$ represents the whole vector of linked lists of particle data.

To reference an index in the vector from an index $g$ in the subdomain grid, the following mapping formula is used:

$$index = g_{[1]} + \sum_{d=1}^{dim-1} number_{[d]} \cdot g_{[d+1]}$$
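A direct C transcription of this mapping (with assumed names) could look as follows; for $dim = 2$ it is the usual row-major layout of the cell vector.

    /* Map a grid index g[0..dim-1] to a position in the flat cell vector,
       where number[d] is the extent of the grid in dimension d,
       simulation cells and border cells included. */
    static long vector_index(const int *g, const int *number, int dim)
    {
        long index = g[0];
        for (int d = 1; d < dim; d++)
            index += (long)number[d - 1] * g[d];
        return index;
    }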


Figure 3.2. The figure shows the copying of particles leaving the subdomain. Particles leaving the domain enter the gray border area, from which they are transferred to the neighbouring simulation area. The copy is performed in both directions.

3.2.4 Border Exchange

The borders of a subdomain are the continuation of the simulation space into the neighbouring subdomains. The subdomain keeps a table of which border is mapped to which neighbouring subdomain. Particles entering the border space from the simulation space are copied to the corresponding neighbouring subdomain. This is shown in figure 3.2.

Molecular dynamics using the linked cell method needs the particles in the enclosing cells to calculate the resulting force on a particle. This means that particles residing in cells close to the border must have the particle data from the corresponding neighbouring subdomain cells. Particle data to support the force calculation are copied from the vicinity of the border to the corresponding neighbouring subdomain border. The particles copied to the border space exist only during the force calculation phase and are then removed to avoid redundant particle data. This is shown in figure 3.3.

Since the number of particles copied in border communication is not known in advance, the number has to be negotiated when communicating. The number of cells in the border space is known due to the evenly sized domain decomposition. This limits the number of messages per send to two. The first message contains a vector of the same size as the number of cells in the border, with the number of particles in each cell at each index. The next message contains a vector with the particle data. The first vector is used when unpacking the particle data and assigning each particle to the corresponding cell.
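The sending side of this two-message exchange can be sketched in C with MPI as below. The packed particle buffer, the Particle type from the earlier sketch and the tag values are assumptions made for illustration; the thesis implementation is not reproduced here.

    #include <mpi.h>

    /* Send one border to a neighbouring instance: first the per-cell
       particle counts, then the packed particle data itself. */
    static void send_border(const int *counts, int n_border_cells,
                            const Particle *packed, int n_particles,
                            int neighbour, MPI_Comm comm)
    {
        /* Message 1: number of particles in each border cell, used by
           the receiver to assign the unpacked particles to cells. */
        MPI_Send(counts, n_border_cells, MPI_INT, neighbour, 0, comm);

        /* Message 2: the particle data, sent as raw bytes for brevity. */
        MPI_Send(packed, (int)(n_particles * sizeof(Particle)), MPI_BYTE,
                 neighbour, 1, comm);
    }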


Figure 3.3. The figure shows the copying of particles to support force calculation. Particles to support force calculation are copied from the vicinity of the border in the subdomain to the neighbour's border. The copy is performed in both directions.

3.3 Movable Domain Model

This section describes the important outlines of the movable domain model and the characteristics that make it differ from the reference model. The movable domain model contains new functionality and extensions to the reference model.

The movable domain model is comprised of a number of functionalities that together make the model work. The most important functionalities are listed below.

• The tree management (new functionality)

• The domain decomposition (extension to reference model)

• The buffer management (new functionality)

• The border exchange (extension to reference model)

• The workload scheduling (new functionality)

• The domain factors (common functionality)

The concept of each functionality is described in a separate section.

3.3.1 General Description

The adaptation of a molecular dynamics application to modern multi-core machines often builds on the concept of subdividing the simulation domain into several smaller subdomains running in parallel, as parts one and two in figure 2.3 show [14][23]. The workload of a complete simulation executed on a single processor consists for the largest part of calculations.


Figure 3.4. A schematic view of the functional principle of the movable domain model. The domain of an instance with a high workload can be subdivided into smaller subdomains. From a subdomain, a distributed grid can be created and sent to an instance with a lower workload. Shown in the figure is the tree controlling the subdomains with boundary information of the grid. The background shows an instance hosting a distributed grid. The distributed grid must still communicate its borders with the borders of the empty area in the original grid.

According to profiling measurements, a molecular dynamics application running on one processor core spends approximately 90 percent of the execution time in force calculations. By using domain decomposition and creating n instances of an application running on separate processor cores, this workload can theoretically be lowered by a factor $n^{-1}$. To benefit from this method, the workload must be evenly distributed among all instances, because the execution time of the application is never shorter than that of the slowest instance. This is true because the execution time of an application contains the execution time of all its instances, including the slowest. It also means that the execution time of an application using domain decomposition and suffering from an uneven distribution of workload can benefit from a decrease of the execution time of the slowest running instance.

Certain simulations in molecular dynamics, such as simulations of liquids and solid bodies, contain scenarios where the particles occupy only a part of the simulation domain. In such cases, the simulation can suffer from a type of load imbalance that can be difficult to balance by moving the edges of an instance grid towards a neighbouring instance. That method has only a local load-balancing effect.


Movable Subdomains

The movable domain model is a model for molecular simulation which builds on the concept that a domain assigned to an instance of an application can be further subdivided into several smaller subdomains. Smaller grids can be created from the subdomains, as shown in figure 3.4, and distributed to other instances, as shown in figure 3.5. This concept can be especially useful to reduce the impact of a load imbalance where the particles occupy only half the simulation scenario. An instance started with a large number of particles compared to other instances can cut away a part of the workload by dividing the domain into several smaller subdomains and distributing one or several of these to instances with less workload.

One criterion that needs to be fulfilled is that the distributed grid communicates its borders with the original grid, in order to keep the subdomain updated as well as to keep its own borders updated with the current simulation situation. Another criterion is that distributed grids can be exchanged efficiently among the instances, in order to avoid loss of data or data redundancy. The grid of a subdivided subdomain is either separated and distributed to another host or merged into the original subdomain grid during the same time step of a calculation cycle.

3.3.2 Tree Management

The movable domain model has a tree managing the subdivision of the local grid [20][23]. The tree is of the type described in [6] with recursive subdivision, but it subdivides the grid on cell boundaries instead of calculated boundaries. The root of the tree consists of a subdomain containing the boundary of the local grid, the location of the local grid in the simulation domain and the neighbouring instances of each border. The subdomain also contains the instance identity when the subdomain is hosted by another instance. A subdomain can be further subdivided into smaller subdomains with new boundaries within the original domain. The subdomains create branches in the tree from the parent domain and inherit its characteristics.
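A minimal C sketch of such a tree node is given below, reusing the DIM constant from the earlier sketch. The field names are hypothetical and only mirror the properties listed above: the unique code and level, the boundary and location of the grid, the neighbours of each border, the hosting instance and the children of the subdivision.

    #define MAX_CHILDREN 8        /* 4 for a quadtree, 8 for an octree */

    typedef struct Subdomain {
        long code;                /* unique path code (see Unique Code)  */
        int  level;               /* depth of the subdomain in the tree  */
        int  lower_global[DIM];   /* location of the grid in the original domain */
        int  start[DIM];          /* boundary of the simulation space    */
        int  stop[DIM];
        int  neighbour[2 * DIM];  /* neighbouring instance of each border */
        int  host;                /* hosting rank (own rank if not distributed) */
        struct Subdomain *parent;
        struct Subdomain *child[MAX_CHILDREN];  /* subdivided subdomains */
    } Subdomain;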

Unique Code

Each subdomain is assigned a unique code identifying the location of the subdomain in the tree. The method of creating a unique code is based on the method used in chapter eight of the book [14] for identifying a unique search path. The code consists of a path to the location and the level, which corresponds to the depth of the subdomain. The root has no path and its level is zero.

Tree Type

The domain is subdivided based on the dimension of the cell system in the grid. A two-dimensional grid generates a quadtree and a three-dimensional grid an octree. The quadtree uses two bits in a long integer data structure to create a unique path per level of depth. An octree uses three bits for the same purpose.


Figure 3.5. The figure shows the movable domain concept. (1) The simulated scenario is decomposed into several instances running a part of the calculation on separate cores. The dark area represents a solid body of particles generating workload. (2) Each instance can be further subdivided and distributed to instances with lower workload while still communicating the borders.

Figure 3.6. The identification code of a subdomain. The code consists of a unique path together with the level of the subdomain in the tree. The path is used as a search path to find the subdomain at the specified depth.


The bits create a code for each subdomain in the tree which is unique within each instance, according to figure 3.6. There is also a smallest size of the cell system in the grid that still makes it possible to divide. This smallest size sets a limit on the maximum depth of a tree. The usage of bits causes the search for a domain to cost log n on average, but never more than the height of the tree.
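The sketch below illustrates how such a path code can be built and followed with two bits per level in the quadtree case; it reuses the hypothetical Subdomain type from the previous sketch and is an illustration of the bit encoding described above rather than the thesis code.

    /* Append one quadtree child number (0..3) to a path code at a level. */
    static long path_append(long path, int level, int child)
    {
        return path | ((long)(child & 0x3) << (2 * level));
    }

    /* Follow a path code from the root down to the subdomain at 'level'.
       The cost is at most the height of the tree. */
    static Subdomain *find_subdomain(Subdomain *root, long path, int level)
    {
        Subdomain *node = root;
        for (int l = 0; l < level && node != NULL; l++) {
            int child = (int)((path >> (2 * l)) & 0x3);
            node = node->child[child];
        }
        return node;
    }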

Subdomain State

A subdomain also contains a host identifier that indicates whether the corresponding grid is distributed to another instance or not. The identifier of the own instance indicates that the particle data reside in the own grid, while a foreign identifier indicates a distributed grid managed by another instance.

A domain of an instance with a workload above the average can be selected for subdivision. The subdivision results in an even split of the original boundary, which creates new subdomains in the tree. Any of the newly created subdomains can be selected for distribution. When a subdomain is selected for distribution, a new local grid is created with a size that corresponds to the selected subdomain. The boundaries are recalculated to local boundaries matching the created grid. Particle data are collected from the original grid corresponding to the boundaries of the selected subdomain, which leaves an empty hole in the original grid.

An instance of the application with a workload below the average can be matched to an instance with higher workload and become the receiver of subdivided grids.

The subdivided grid is transferred while the domain is marked as external. The responsibility for particle calculations within the boundary of the domain is transferred to the receiving instance. The external domain is only responsible for receiving cyclic updates of border data and keeping track of the hosting instance.

3.3.3 Domain Decomposition

The domain model takes advantage of the computer architecture through domain decomposition. Each instance is started with a domain corresponding to the local grid. The boundaries of the domain are mapped directly to the grid and consist of a start and a stop index for the simulation space, with coordinates local to the instance, as in the reference model. The number index gets the value of the total number of cells in the local grid, and a lower global index indicates the lower index where the local grid is located in the original grid, as shown in figure 3.7.

When a domain is split, the local grid remains intact while the start and stop indices of the new subdomains are recalculated to global start and stop indices for the local grid. The subdomains thereby cover the whole area of the local grid without overlapping each other's new boundaries. Each new subdomain has its location in the grid indicated by a lower offset to the first index of a virtual border of the subdivided domain (marked as 5 in figure 3.7).


Figure 3.7. The domain index of the different data structures implemented in the application. (1) lower global index indicates the location of the subdomain in the original grid. (2) start index of the simulation area local to the instance grid. (3) stop index local to the instance grid. (4) number index indicates the width of the simulation area including borders for communication. (5) offset index local to the subdomain grid indicating a distributed grid’s location in the subdomain grid. (6) outer start index local to the subdomain based on offset. (7) inner start index local to the subdomain and separates the border area from the inner area. (8) inner stop local to the subdomain and (9) outer stop local to the subdomain. The lower global index and the offset are constant reference values for both subdomain grid and distributed grid.

The Distributed Subdomain

When a grid for distribution is created from a subdomain, the start and stop indices of the domain (2 and 3 in figure 3.7) are recalculated to be local to the new distribution grid, while the number index (4 in figure 3.7) is recalculated to reflect its size. As a transform index, the offset index (5 in figure 3.7) of the subdomain is used. When moving particles in the grid, the position of a particle is used as a reference to determine which cell it belongs to. Dividing and rescaling grids cause some problems, because the application has to keep track of the cell to which a particle is assigned independently of the local index of the cell. For this purpose the lower global index together with the offset is used (1 and 5 in figure 3.7).


The Border Domain

A distributed grid must communicate its borders with the original grid to maintain the current simulated situation. The border, containing an outer and an inner row, is used for this purpose. The outer section is used to send particles leaving the distributed grid, while the inner section is used to provide the owner of the distributed grid with particles for calculation. When communicating data from the local original grid to the distributed grid, the purposes of the sections are reversed. The border contains start and stop indices for both the outer section and the inner section (6, 7, 8 and 9 in figure 3.7), which are recalculated from local to global coordinates and back using the offset index (5 in figure 3.7). Sending borders from a distributed grid requires recalculation from local to global coordinates, while sending from the local original grid requires recalculation from global to local coordinates.

Border Policy

Particles leaving the boundaries of the original grid are reentered from the other side in order not to lose any particle data in the simulation [14]. Particles reentered from the opposite side must have their positions recalculated before they are moved. Particles in the border neighbourhood of the original grid which are copied to the opposite border must also have their positions recalculated. In order to keep track of whether the positions of particles close to the border neighbourhood should be recalculated, a combination of the lower global index and the offset index is used.

3.3.4 Buffer Management

Besides the cyclic communication of border data with the closest neighbours, each instance has to be able to request any other instance to exchange a distributed grid. The asynchronous message reception requires each instance to have a server part waiting for incoming messages. Sending messages asynchronously also requires each instance to have a requester part which operates independently of the cyclic particle calculation.

Asynchronous Communication

The transfer of a subdivided grid for distribution must be performed in several steps using synchronous communication. The reason for this is that the size of the grid and the number of particles assigned to each cell are not known by the receiver. The size and the numbers have to be negotiated. The grid must also be distributed between two instances during the same step in the calculation cycle in order not to become inconsistent. The model uses non-blocking communication in the parts where asynchronous message distribution is necessary. The non-blocking communication is tested cyclically, and a message is received only when one is available. This allows the cyclic calculation and communication to be performed in sequence with sporadic message reception.
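The cyclic test for pending messages can be sketched with MPI as below. MPI_Iprobe is used here only because the text describes testing for a message without blocking; the tag value, the integer order payload and the enqueue callback are assumptions made for illustration.

    #include <mpi.h>

    /* Called once per calculation cycle: an order is received only if one
       has actually arrived, so the cyclic calculation is never blocked. */
    static void poll_orders(MPI_Comm comm, int order_tag,
                            void (*enqueue)(int order, int from))
    {
        int flag = 0;
        MPI_Status status;

        MPI_Iprobe(MPI_ANY_SOURCE, order_tag, comm, &flag, &status);
        if (flag) {
            int order;
            MPI_Recv(&order, 1, MPI_INT, status.MPI_SOURCE, order_tag,
                     comm, MPI_STATUS_IGNORE);
            enqueue(order, status.MPI_SOURCE);  /* put it in the buffer */
        }
    }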


Figure 3.8. A schematic view showing order instructions subdivided into a set of operational instructions and stored in the buffer. ORDER instructions generate a subset of operational DO instructions depending on the current state of the application. The buffer content is read and executed in sequence with the cyclic calculation.

Different Message Types

An incoming message contains an order to perform a specific task. The steps necessary to carry out the task depend on the state of the receiving instance. An order to merge a subdomain could mean that the instance has to recall a number of distributed grids, but it could also mean that the subtree can be removed directly. A split order could mean that the instance has to tell other instances to recall distributed grids hosted by the instance, but it could also mean that the own grid can be subdivided directly. The reception of orders must also be performed in sequence with the cyclic calculation due to the lack of support for multi-threaded MPI-based communication. The receiving instance has to interpret an order based on the current state and generate appropriate DO instructions. Several messages with different orders can also be received at the same time.

The Abstract Machine

A way to solve the problem of adapting to the current state is to use a reception buffer as an abstract machine [17]. Received orders are interpreted and instructions are generated according to figure 3.8 on page 28. One order or instruction is read from the buffer for each calculation cycle. Only the instructions can change the state of an instance. The instructions are entered into the buffer in sequence and executed, leading to new states for the instance [22]. New orders are received continuously during execution and entered into the buffer on arrival. The received orders are interpreted when read from the buffer, and further instructions are generated.
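The idea can be pictured as a small instruction queue that is advanced by one entry per calculation cycle, as in the sketch below; the enum values mirror the orders and instructions described in the following subsection, while the helper functions are placeholders and not part of the actual implementation.

    /* Hypothetical sketch of the reception buffer acting as an abstract machine.
     * One entry is consumed per cycle; ORDER entries are expanded into DO
     * entries by the interpreter, DO entries change the state of the instance. */
    enum instr { ORDER_SPLIT, ORDER_MERGE, ORDER_MOVE, ORDER_RECALL,
                 DO_REQUEST, DO_RECALL, DO_SPLIT, DO_MERGE, DO_MOVE };

    #define BUF_CAP 256
    static enum instr buffer[BUF_CAP];
    static int head = 0, tail = 0;

    static void push(enum instr i)              /* append an entry to the buffer */
    {
        buffer[tail] = i;
        tail = (tail + 1) % BUF_CAP;
    }

    void interpret_order(enum instr order);     /* expands orders, sketched below */
    void execute_instruction(enum instr i);     /* placeholder: state transition  */

    static void step_buffer(void)               /* called once per calculation cycle */
    {
        if (head == tail)
            return;                             /* buffer empty this cycle */
        enum instr i = buffer[head];
        head = (head + 1) % BUF_CAP;
        if (i <= ORDER_RECALL)
            interpret_order(i);                 /* generates new DO entries      */
        else
            execute_instruction(i);             /* may change the instance state */
    }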


Order Interpretation

The orders and the instructions they can generate are listed below, each followed by a brief description.

• ORDER_SPLIT is given to request a domain to be subdivided into several smaller domains. The command could generate the following instructions:

DO_REQUEST is generated to invite another instance to communicate using blocking operations.

DO_RECALL is generated if the domain to be split is distributed and needs to be recalled to the owning instance.

DO_SPLIT is generated to split the domain into four or eight subdomains depending on the dimension.

• ORDER_MERGE is given to request a subdomain which has subdivided domains to be merged into one subdomain. The command could generate the following instructions:

DO_REQUEST is generated to invite another instance to communicate using blocking operations.

DO_RECALL is generated if any domain in its subtree is distributed and needs to be recalled.

DO_MERGE is generated to merge the subdomain by removal of the subtree.

• ORDER_MOVE is generated to distribute a grid to another host. The command could generate the following instructions:

DO_REQUEST is generated to invite another instance to communicate using blocking operations.

DO_MOVE is generated to extract the content of the native grid and distribute the created grid to a matched host.

• ORDER_RECALL is given to recall a distributed grid to the owning instance. The command could generate the following instructions:

DO_REQUEST is generated to invite another instance to communicate using blocking operations.

DO_RECALL is generated to recall a distributed domain from its host and merge it into the original grid.

The DO_REQUEST instruction is necessary for all orders in order to organise the movement of a distributed grid through blocking communication. The instruction is executed together with the DO_MOVE or DO_RECALL instruction during the same calculation cycle.
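Building on the buffer sketch above, the expansion of an order into DO instructions could look roughly as follows; the state queries are hypothetical placeholders that only indicate where the decisions described in the list are made.

    /* Hypothetical expansion of received orders into DO instructions. */
    int own_grid_is_distributed(void);          /* placeholder state query */
    int subtree_has_distributed_grid(void);     /* placeholder state query */

    void interpret_order(enum instr order)
    {
        switch (order) {
        case ORDER_SPLIT:
            if (own_grid_is_distributed()) {    /* recall before splitting */
                push(DO_REQUEST);
                push(DO_RECALL);
            }
            push(DO_SPLIT);
            break;
        case ORDER_MERGE:
            if (subtree_has_distributed_grid()) {
                /* one DO_REQUEST/DO_RECALL pair would be generated per
                 * distributed grid in the subtree */
                push(DO_REQUEST);
                push(DO_RECALL);
            }
            push(DO_MERGE);
            break;
        case ORDER_MOVE:
            push(DO_REQUEST);
            push(DO_MOVE);
            break;
        case ORDER_RECALL:
            push(DO_REQUEST);
            push(DO_RECALL);
            break;
        default:
            break;
        }
    }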


Message Reception

All instances of the application have a server part which tests for non-blocking receptions in sequence with the cyclic calculation. There are two different types of non-blocking reception.

• The request reception receives a request to either send or receive a distributed grid.

• The order reception receives an incoming order to split, merge, move or recall a domain.

The server part is protected by two barriers to ensure the access order of requests and receptions. The request reception is executed after the second barrier and before the order reception. The request check for blocking communication is performed after the first barrier but before the borders are communicated, to ensure that the recipient has the request in its reception queue before the second barrier is reached.
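A possible per-cycle ordering of these steps is sketched below; every function name is a placeholder, and only the relative order of the barriers, the request check, the border exchange and the two receptions reflects the description above.

    #include <mpi.h>

    /* Placeholder operations assumed to exist elsewhere in the application. */
    void check_and_issue_request(void);   /* posts a blocking get/put request     */
    void exchange_borders(void);          /* cyclic border communication          */
    void receive_pending_requests(void);  /* non-blocking test, request reception */
    void receive_pending_orders(void);    /* non-blocking test, order reception   */
    void integrate_time_step(void);       /* ordinary particle calculation        */

    /* Hypothetical sketch of one calculation cycle in the server part. */
    static void server_cycle(void)
    {
        MPI_Barrier(MPI_COMM_WORLD);      /* first barrier */

        check_and_issue_request();        /* request check: before borders, so the
                                             peer sees it before the next barrier */
        exchange_borders();

        MPI_Barrier(MPI_COMM_WORLD);      /* second barrier */

        receive_pending_requests();       /* request reception first ... */
        receive_pending_orders();         /* ... then order reception    */

        integrate_time_step();
    }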

Order Request

When a DO_REQUEST instruction is on top of the buffer and the request check is performed, the buffer pointer is incremented, which reveals the next instruction in the buffer. If a DO_RECALL instruction follows, the instance issues a get-grid request to the instance hosting the distributed grid. This causes the receiving instance to execute a blocking MPI send operation, while the requesting instance executes a blocking MPI receive operation to exchange the grid.

A DO_REQUEST instruction followed by a DO_MOVE instruction on top of the buffer causes the requesting instance to issue a put-grid request. This makes the receiving instance execute a blocking MPI receive operation, while the requesting instance executes a blocking MPI send operation to transfer the subdivided grid.
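The requester side of these two cases could be sketched as follows; the tags, the request codes and the flat particle buffer are assumptions made only for illustration.

    #include <mpi.h>

    #define TAG_REQUEST 101               /* assumed tags, for illustration */
    #define TAG_GRID    102

    enum { REQ_GET_GRID, REQ_PUT_GRID };

    /* Hypothetical requester side: want_recall is non-zero when DO_RECALL
     * follows the DO_REQUEST, zero when DO_MOVE follows.  The grid is
     * assumed to be packed into a flat buffer of doubles. */
    static void issue_request(int peer, int want_recall,
                              double *grid_buf, int count)
    {
        int req = want_recall ? REQ_GET_GRID : REQ_PUT_GRID;

        MPI_Send(&req, 1, MPI_INT, peer, TAG_REQUEST, MPI_COMM_WORLD);

        if (want_recall)                  /* get grid: peer sends, we receive */
            MPI_Recv(grid_buf, count, MPI_DOUBLE, peer, TAG_GRID,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        else                              /* put grid: we send, peer receives */
            MPI_Send(grid_buf, count, MPI_DOUBLE, peer, TAG_GRID,
                     MPI_COMM_WORLD);
    }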

Order Transfer

An order to split, merge or move which is intended for another instance of the application can also be generated and entered into the buffer. The order could have been issued by the instance itself, received from another instance, or received from an external interface. Before the order is interpreted, the instance rank number in the order is matched against the instance's own rank number, and if they differ, the order is sent to the intended instance using an ordinary MPI send operation.
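The forwarding test could look roughly like the sketch below, reusing the illustrative order record and tag from the probing sketch earlier in this section.

    /* Hypothetical forwarding of an order: if the order is addressed to
     * another instance it is sent on, otherwise it is buffered locally. */
    void enqueue_order(const order_t *order);   /* placeholder: enter into buffer */

    static void dispatch_order(const order_t *order, int my_rank)
    {
        if (order->rank != my_rank)
            MPI_Send((void *)order, (int)sizeof(*order), MPI_BYTE,
                     order->rank, TAG_ORDER, MPI_COMM_WORLD);
        else
            enqueue_order(order);
    }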

Order Move

A move order can also be issued to transfer a distributed grid from the hosting instance back to the owner instance. The instance which receives the move order checks if it has distributed grids and issues a recall order for the owner of the distributed grid. The order is transferred to the owner of the grid, which interprets the order and generates a do-request instruction for communication and a do-recall instruction for the distributed grid. The message exchange may seem somewhat roundabout, but it is necessary to ensure that each distributed grid is managed by the owning instance.

Order Split

A split order sent to an instance which hosts distributed grids is interpreted to mean that the instance's workload is too high, and a recall order is issued for the last received distributed grid. If a new split order is received, the instance continues to issue recall orders for the hosted grids until all distributed grids have been sent back to their owners. The instance is then prepared to split its own grid and move the resulting subdomains to other instances with lower workload.

3.3.5 Border Exchange

In order to maintain the connection to a distributed grid, border particle data has to be exchanged between the instance hosting the distributed grid and the owner instance. This is done cyclically for each time step, together with the ordinary border particle data exchange among neighbour instances. Border data for distributed grids are sent by each instance hosting such grids. After the send is performed, a receive for border data is executed by all instances that have one or several subdivided grids distributed.

Border data are also distributed from the instance owning a distributed grid to the instance hosting the grid. The owning instance sends particle data from the original grid according to the boundaries of each distributed grid's subdivided domain. The border data are received by each hosting instance, which updates the border of the corresponding grid. The code according to figure 3.6 on page 24 together with the host identifier of the subdivided external domain are used to identify the distributed grid.
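On the hosting side, the per-cycle exchange for one distributed grid could be sketched as below; using the code from figure 3.6 as the message tag, and packing the border particles into flat buffers, are assumptions made only for this illustration.

    /* Hypothetical host side of the border exchange for one distributed grid.
     * The host first sends its border data to the owner and then receives
     * the updated border of the corresponding subdomain; the owner performs
     * the operations in the opposite order, so the exchange cannot deadlock. */
    static void exchange_distributed_border(int owner, int grid_code,
                                            double *send_buf, int send_n,
                                            double *recv_buf, int recv_n)
    {
        MPI_Send(send_buf, send_n, MPI_DOUBLE, owner, grid_code, MPI_COMM_WORLD);
        MPI_Recv(recv_buf, recv_n, MPI_DOUBLE, owner, grid_code,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }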

Restrictions

The procedure works as long as an instance does not host any distributed grids while at the same time having its own grids distributed. Such a case could lead to deadlock in the border communication. A restriction is therefore put on each instance: the subdomain of an instance cannot be split as long as it is hosting distributed grids.

A request to split the subdomain of an instance hosting distributed grids causes the instance to issue a recall order for the latest received grid. The recall order is sent to the owner instance of the distributed grid.

There is also another situation which leads to the same problem: an instance can distribute grids to another instance which already has its own grids distributed.

This problem is prevented by the scheduling system which calculates the workload
