
IT 08 007

Master's thesis, 30 credits (Examensarbete 30 hp)

February 2008

A patch-based partitioner for Structured Adaptive Mesh Refinement

Implementation and evaluation

Abbas Vakili



Abstract

A patch-based partitioner for Structured Adaptive Mesh Refinement

Abbas Vakili

To increase the speed of computer simulations we solve partial differential equations (PDEs) using structured adaptive mesh refinement (SAMR). During the execution of an SAMR application, finer grids are superimposed dynamically on coarser grids where a more accurate solution is needed. To further decrease the computation time, we use parallel computers and divide the computational work between the processors. This gives rise to a challenging load balancing problem: the arithmetic workload, the communication volume and the data migration should simultaneously be kept as low as possible.

In this work we implement a load balancing algorithm that specifically targets the arithmetic workload. Efforts are also made to restrict the amount of communication using space filling curves. The algorithm is evaluated using a number of SAMR applications. The load imbalance is very low for all applications. The communication volumes are generally large but not prohibitively large. Overall, the results are very satisfactory compared to other algorithms. The implemented algorithm can significantly reduce the run-time for parallel SAMR applications.


Contents

1 Introduction
2 Background
   2.1 Differential Equations
      2.1.1 Ordinary Differential Equations
      2.1.2 Partial Differential Equations (PDE)
   2.2 The Finite Difference Method (FDM)
   2.3 Structured Adaptive Mesh Refinement
      2.3.1 Grid Description
      2.3.2 Integration Algorithm
      2.3.3 Grid Generation and Clustering
3 Load balancing algorithms
   3.1 Patch-based approach
   3.2 The domain-based approach
   3.3 The hybrid approach
   3.4 Ordering grids to increase locality
4 Implementation of the partitioner
5 Applications
   5.1 Ramp — Mach reflection at a wedge
   5.2 ShockTurb — Planar Richtmyer-Meshkov instability
   5.3 ConvShock — Converging/diverging Richtmyer-Meshkov instability
   5.4 Spheres — Cylinders in hypersonic flow
6 Performance metrics
   6.1 Number of blocks
   6.2 Load imbalance
   6.3 Volume of communications
7 Performance results
   7.1 Number of blocks
   7.2 Load imbalance
   7.3 Intra-level communication
   7.4 Inter-level communication
   7.5 Total communication
   7.6 Impact of SFC
   7.7 Scalability of the implemented patch-based partitioner
8 Summary and conclusion
References


1 Introduction

A computer simulation is a process of imitating real phenomena that can be hard to describe analytically. Computer simulations have become a useful tool for modelling many systems in astrophysics [5, 11], chemistry [14] and biology [1]. A concrete example is car body design, where simulations are used to calculate the robustness and safety of the car in collision situations [6].

Many phenomena can be approximated by computer simulations. The phenomena are often described mathematically by partial differential equations (PDEs). A PDE is a differential equation: a relation involving an unknown function of several independent variables and its partial derivatives with respect to those variables.

Most methods used for solving PDEs employ a discrete computational grid, and the solution is then computed on this grid. The grid can either be structured, where it is described with Cartesian coordinates, or unstructured, where it is usually built from triangles or tetrahedrons. The resolution of the grid is of great importance for the accuracy of the solution. A higher resolution generally yields a more accurate solution, but it is computationally more expensive. Many PDEs are, at a given time, only in need of high accuracy on limited parts of the grid. In this case, it is a waste of computational resources to compute a solution with a high and uniform accuracy on the whole grid. Instead, the grid can be refined at the parts that are in need of greater accuracy. At these parts, overlapping grids with higher resolution are added. The remaining pieces of the grid are left with a lower resolution. The acronym for this method is SAMR - Structured Adaptive Mesh Refinement [3]. SAMR can under favorable conditions result in a runtime reduction in the vicinity of a factor of 1000 [15].

However, even with SAMR, a single computer is often not capable of computing the solution of a PDE within a realistic time interval. To decrease the simulation time further, we divide the computations between a number of processors; we thus parallelize the computations. The parallel computational time for each time step is determined by the processor with the largest amount of work. A badly balanced workload results in poor utilization of the computational resources and a longer execution time. We also need to consider that the solution can depend on data that resides on other processors. In such cases, the involved processors must communicate partial solutions with each other before the computations can proceed. High communication volumes also result in longer execution times.

To divide the workload between the processors, three common types of algorithms are used:

• Patch-based algorithms distribute the grid patches on each level of refinement between processors without considering the workload on other refinement levels [10, 23]. These algorithms can result in low load imbalance but high communication.

• Domain-based algorithms partition the physical domain and assign all overlaid patches to the same processor [12, 16, 21]. These algorithms often result in low communication but high load imbalance.

• Hybrid algorithms combine ideas from both patch-based and domain-based algorithms to balance the impact of load imbalance and communication [18].

In this thesis we implement a patch-based partitioner that aims to achieve the lowest possible load imbalance, while we try to limit the amount of communication.

2 Background

2.1 Differential Equations

A differential equation is an equation where the derivatives of a function appear as variables. They express relationships involving rates of change. The solution of a differential equation is a function whose derivatives satisfy the equation.

The mathematical theory of differential equations has developed together with the sciences from where the equations originate and where the results find application. Diverse scientific problems often give rise to similar differential equations. In such cases, the mathematical theory can unify distinct scientific fields.

The order of a differential equation is the same as the highest derivative that it contains. For instance, a first order differential equation contains only first order derivatives. Most differential equations contain derivatives in space, but time derivatives can also occur. Differential equations are divided into two types, based on the number of independent variables they depend on. These two groups are called Ordinary Differential Equations (ODEs) and Partial Differential Equations (PDEs).

2.1.1 Ordinary Differential Equations

An ordinary differential equation contains functions of only one independent variable, and one or more of its derivatives with respect to that independent variable. A simple example is Newton's second law of motion, which leads to the differential equation

$$m \frac{d^2 x}{dt^2} = f(x)$$

for the motion of a particle of mass m. Here, the force f only depends upon the position x of the particle. Thus, the unknown variable x appears on both sides of the differential equation, as is indicated in the notation f(x).

2.1.2 Partial Differential Equations (PDE)

A partial differential equation contains functions of several independent variables and their partial derivatives with respect to those variables. These equations are used to formulate and solve complex problems such as the propagation of sound or heat, electromagnetics, electrodynamics, fluid flow, and elasticity. Completely distinct physical problems may have similar mathematical formulations.

An example of a partial differential equation with two independent variables is

$$\frac{\partial u}{\partial x} + \frac{\partial u}{\partial y} = f(x, y)$$

A solution of a partial differential equation is generally not unique. Additional conditions must generally be specified on the boundary of the region where the solution is defined. Initial values might also be needed.

2.2 The Finite Difference Method (FDM)

Solving PDEs analytically is generally impossible. They are therefore solved approximately, or numerically, using computers. The numerical solution of partial differential equations involves a discrete domain where approximations of the PDEs are computed. One standard method is to use a Cartesian grid and estimate the values of the unknowns at the grid points. The resolution of the grid influences the local error and hence the accuracy of the solution. The resolution also influences the number of computations to be made. The properties of the grid are thus important for the behaviour of a given method. To solve PDEs numerically, the finite difference method (FDM) can be used.

Figure 1: A structured grid, described with Cartesian coordinates. Note that ∆x and ∆y can have different sizes.


The FDM approximates the derivatives in the PDE by difference operators that use the values at the grid points and at neighbouring grid points. By substituting the difference operators into the PDE, an approximation of the solution can be obtained. For time-dependent PDEs, the FDM can also be used in the time dimension.

A forward difference operator, denoted by $D_+$, is an expression of the form

$$D_+ f(x) = \frac{f(x+h) - f(x)}{h}$$

where h is the step size. This operator approximates the first derivative. A backward difference operator, denoted by $D_-$, is an expression of the form

$$D_- f(x) = \frac{f(x) - f(x-h)}{h}$$

where h is again the step size. This operator also approximates the first derivative. A second order derivative can be approximated by the following operator:

$$D_+ D_- f(x) = \frac{f(x+h) - 2f(x) + f(x-h)}{h^2}$$

For the approximate solution to converge to the correct solution of the PDE, two conditions must be fulfilled:

• Consistency: the local truncation error should go to zero as the step sizes go to zero. This means that the discrete problem approximates the continuous problem.

• Stability: the approximated solution should remain bounded.

When solving a PDE using the FDM on a parallel computer, the processors need to exchange data at every processor boundary. An FDM-operator always involves neighboring grid points. If those neighboring points reside on different processors, grid data is exchanged between the processors. Figure 2 illustrates this situation.

As an example of the FDM, consider the heat equation

$$u_t = c\, u_{xx}, \quad 0 \le x \le 1,$$
$$u(0, x) = f(x), \quad u(t, 0) = \alpha, \quad u(t, 1) = \beta.$$

We let $u_j^k$ denote the numerical approximation of the solution at grid point $x_j$ and time level $t_k$. The difference scheme then reads

$$\frac{u_j^{k+1} - u_j^k}{\Delta t} = c\, \frac{u_{j+1}^k - 2u_j^k + u_{j-1}^k}{(\Delta x)^2}, \quad j = 1, \ldots, n,$$
or
$$u_j^{k+1} = u_j^k + \frac{c\, \Delta t}{(\Delta x)^2}\left(u_{j+1}^k - 2u_j^k + u_{j-1}^k\right), \quad j = 1, \ldots, n. \qquad (1)$$

Figure 2: Two grid patches are assigned to processors P0 and P1, respectively. To compute the value at the point (i, j) using the $D_+$ operator, P0 needs the data at point (i, j+1). Thus, P0 must communicate with P1 to receive the value at the point (i, j+1).

The boundary conditions give us $u_0^k = \alpha$ and $u_{n+1}^k = \beta$, and the initial condition provides the starting values $u_j^0 = f(x_j)$ for all $j$, so that we can march the numerical solution forward in time using the difference scheme.
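To make the scheme concrete, the following minimal sketch advances the discretized heat equation using scheme (1). The grid size, the coefficient c, the boundary values and the initial condition f are illustrative assumptions, not values taken from the thesis.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// One explicit step of scheme (1) for u_t = c*u_xx:
// u_j^{k+1} = u_j^k + c*dt/dx^2 * (u_{j+1}^k - 2*u_j^k + u_{j-1}^k).
// u[0] and u[n+1] hold the fixed boundary values alpha and beta.
void heat_step(std::vector<double>& u, double c, double dt, double dx) {
    std::vector<double> next(u);
    const double r = c * dt / (dx * dx);   // stability requires r <= 1/2
    for (std::size_t j = 1; j + 1 < u.size(); ++j)
        next[j] = u[j] + r * (u[j + 1] - 2.0 * u[j] + u[j - 1]);
    u.swap(next);
}

int main() {
    const double pi = 3.14159265358979;
    const int n = 100;                     // interior grid points (assumed)
    const double dx = 1.0 / (n + 1);
    const double c = 1.0;                  // assumed heat coefficient
    const double dt = 0.4 * dx * dx / c;   // chosen to satisfy the stability bound
    std::vector<double> u(n + 2);
    for (int j = 0; j < n + 2; ++j)        // initial condition f(x) = sin(pi*x),
        u[j] = std::sin(pi * j * dx);      // which also gives alpha = beta = 0
    for (int k = 0; k < 1000; ++k)
        heat_step(u, c, dt, dx);
    std::printf("u(0.5) = %f\n", u[(n + 2) / 2]);
}
```

Note the stability restriction visible in the code: the explicit scheme only remains bounded when $c\,\Delta t / (\Delta x)^2 \le 1/2$, which is why the time step is chosen proportional to $(\Delta x)^2$.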

2.3 Structured Adaptive Mesh Refinement

For many problems, a grid with uniform resolution gives satisfactory results. However, there are classes of problems where the solution is more difficult to estimate in some regions due to e.g. discontinuities, steep gradients, and shocks. One could use a uniform grid that has a spacing fine enough so that the local errors in these regions are acceptable. This approach is computationally very costly, and it is difficult for most problems to predict a grid resolution that will give good results while keeping the amount of computations low.

With adaptive mesh refinement, the grid resolution is only fixed for the base grid and is determined locally for the subgrids according to the properties of the problem. Adaptive mesh refinement can under favorable conditions result in a runtime reduction in the vicinity of a factor of 1000 [15].

Using parallel computers, AMR gives rise to a challenging load balancing problem. The level of refinement can increase or decrease at any given time. The area occupied by the refinements can also grow larger or smaller. In short, the grids are constantly changing. This has the consequence that a partitioning of the grid that yields a good load balance at a certain time step almost certainly results in a bad load balance later on.

2.3.1 Grid Description

The basis for SAMR is a large uniform and structured grid that encompasses the whole computational domain. The resolution of the grid conforms to the lowest acceptable accuracy of the solution. Every refined grid is uniform and rectangular in shape. A refined grid can overlap several coarser, lower-level grids, but it must always be inside the domain of these grids.

In the adaptive mesh refinement technique, we start with a coarse grid. As the solution proceeds we identify the regions requiring more resolution by some parameter characterizing the solution, say the local truncation error. We superimpose finer subgrids only on these regions. Finer and finer subgrids are added recursively until either a given maximum level of refinement is reached or the local truncation error has dropped below the desired level. Thus, in an adaptive mesh refinement computation, grid resolution is fixed for the base grid only and is determined locally for the subgrids according to the requirements of the problem. See Figure 3 for an illustration of the grids.

2.3.2 Integration Algorithm

The SAMR integration algorithm consists of two main components:

• The actual time integration step which is performed with a finite difference scheme (FDM).

• Error estimation.

Time integration


(a) Illustration of the basic concept of SAMR; subgrids lying above have a higher resolution than those lying below.

(b) Illustration of a real problem, a shock wave hitting a wedge. Each rectangle is a patch. Smaller rectangles generally have a higher resolution. Note how the refined patches follow the area in need of high accuracy.

Figure 3: Refinement levels on a computational grid

The boundaries of the refined grids generally lie inside the computational domain. This means that they cannot use the boundary conditions supplied by the PDE. Thus, information must be exchanged with grids on the next lower level in order to obtain the boundary conditions. Also, when a finer grid is integrated, the result should be projected down to the coarser grids before the computations are resumed on the coarser levels; this gives the most accurate solution, since a finer grid uses a smaller time step. Taking both of these factors into account, the order of integration is described in Algorithm 1.


Algorithm 1 Integration algorithm

Begin advance(l, k):

take one step of size k on level l

if l=L then

return

else

interpolate from level l to l+1

for i = 1 to 2 do

advance(l+1, k/2)

end for

project from level l+1 to l

end if

End advance

As an example, assume every level is refined by a factor of two in time, and consider a hierarchy with a base grid and two refinement levels. One step of size k on the base grid is followed by two steps of size k/2 on level 1, each of which is in turn followed by two steps of size k/4. The solution on level 2 is finally passed down to level 1, and the new solution on this level is in turn passed down to the base grid. If we did not update the coarser levels as described above, the solution might become both dispersed and dissipated. As a consequence, the apparent need for refinement could gradually vanish, leading to a grid with a lack of refinement. Also, the update operations prevent bad values in the coarse approximation of the boundary conditions.
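A minimal code sketch of the recursion in Algorithm 1, assuming a refinement factor of two in time; step, interpolate and project are placeholder hooks for the actual numerics and are not names from the thesis or from AMROC.

```cpp
#include <cstdio>

// Placeholder hooks for the actual numerics (illustrative names only).
void step(int l, double k)         { std::printf("step: level %d, k=%g\n", l, k); }
void interpolate(int from, int to) { /* coarse -> fine boundary data */ }
void project(int from, int to)     { /* fine -> coarse solution */ }

// Recursive time integration over levels 0..L (mirrors Algorithm 1),
// assuming each level is refined by a factor of two in time.
void advance(int l, double k, int L) {
    step(l, k);                  // take one step of size k on level l
    if (l == L) return;          // finest level: nothing to recurse into
    interpolate(l, l + 1);       // supply boundary conditions to level l+1
    for (int i = 0; i < 2; ++i)  // two half-size steps on the finer level
        advance(l + 1, k / 2, L);
    project(l + 1, l);           // pass the more accurate solution down
}

int main() { advance(0, 1.0, 2); }  // base grid plus two refinement levels
```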

Error Estimation

Error estimation is performed every few time steps, and the grid is then adjusted according to the resulting estimates. A common method is to use Richardson extrapolation. In order to increase the interval between the error estimations, each grid is made larger than needed. Because of this, a moving phenomenon does not necessarily mean that we must re-grid each time step.

2.3.3 Grid Generation and Clustering

After the error estimation, the flagged grid points are clustered into new rectangular patches. The efficiency of the clustering is the fraction of the grid points in a patch that are actually flagged for refinement: the higher efficiency we demand, the more and smaller patches we get - and the more patches we have, the more communication is needed. The optimal efficiency depends on the application, but it is often in the range of 75-90%. Also, if the solution has a moving phenomenon, a very high efficiency will lead to more frequent re-gridding operations, as there is a higher probability of the phenomenon moving outside the grid.

3 Load balancing algorithms

Solving large PDEs numerically on a computer with a single processor can take too long to be feasible. Instead, we divide the computation between a number of processors, even when SAMR is used. Most existing SAMR frameworks support parallel execution [7, 13, 23]. Efficient use of parallel SAMR typically requires that the dynamic grid hierarchy is repeatedly partitioned and distributed over the participating processors. Several performance issues arise during the partitioning of the workload. As information flows in the grid hierarchy, processors need to exchange data. Intra-level communication appears as grid patches are split between processors and data is exchanged along the borders. Inter-level communication can occur for overlaid patches when the solution is projected down to the coarser level and a finer patch lacks boundary data. Both types of communication can result in longer execution times.

A synchronization delay may occur when a processor is busy computing while holding data needed by other processors. Until the processor has finished its computations, other processors might be unable to proceed as they lack data. Synchronization delays can be severe: the time spent waiting for data can be of the same magnitude as the actual computational time [19]. The number of delays often grows as the number of processors is increased.

To get optimal performance, the partitioner needs to simultaneously minimize data migration, load imbalance, communication volumes and synchronization delay. Typically, this is impossible. Instead, the partitioner needs to trade off the metrics in accordance with the characteristics of the application and the computer. Ultimately, partition quality is determined by the resulting application execution time.

The common algorithms for partitioning SAMR hierarchies can be categorized as patch-based, domain-based or hybrid.

3.1 Patch-based approach

In the patch-based approach, each refinement level is partitioned separately and the patches on the level are distributed between the processors [10, 23].

In theory, the patch-based approach results in perfect load balance. In practice, some load imbalance can be expected due to sub-optimal patch aspect ratios, integer divisions and constraints on the patch size. Partitioning can be performed incrementally, as only patches created or altered since the previous time step need to be considered for repartitioning. However, patch-based algorithms often result in high communication volumes and communication bottlenecks. The communication volume generally increases when a patch is subdivided into blocks to achieve a lower load imbalance.

3.2 The domain-based approach

For domain-based algorithms, only the base grid is partitioned [12, 16, 21]. Initially, the workload of the refined patches is projected down onto the base grid, reducing the problem to partitioning a single grid with a heterogeneous workload. Thus, the domain is partitioned along with all contained grids on all refinement levels. The minimum block size is determined by the size of the computational stencil on the base grid. As the base grid stencil corresponds to many grid points on highly refined patches, the workload of a minimum sized block can be large. The domain-based approach is illustrated in Figure 4.

As overlaid grid blocks reside on the same processor for domain-based algorithms, inter-level communication is eliminated. A complete re-partition might be necessary when the grid hierarchy is modified. The imbalance is often high for deep grid hierarchies, because it is hard to divide patches from many levels at once. A problem with domain-based algorithms is "bad cuts": many small blocks with bad aspect ratios. These blocks occur when patches are cut in bad places, assigning only a tiny fraction of a patch to one processor while the majority of it resides on another processor.

Domain-based algorithms often result in low communication volumes and high load imbalance.

3.3 The hybrid approach

For some application states, neither the patch-based nor the domain-based approach can create well-performing partitions. For these states, a combination of the patch-based and domain-based algorithms can be used. This type of algorithm is called hybrid [18]. Hybrid algorithms try to inherit the good properties of both the domain-based and the patch-based techniques, while avoiding their main drawbacks. We do not describe the details of any hybrid algorithm in this thesis.

3.4 Ordering grids to increase locality


(a) In the patch-based partitioning algorithm, a grid hierarchy is divided between the participating processors on each level of refinement separately.

(b) In the domain-based partitioning algorithm, a grid hierarchy is divided between the participating processors on all levels of refinement at the same time.

Figure 4: Dividing a grid hierarchy between four processors with the patch-based partitioning algorithm (a) and the domain-based partitioning algorithm (b).


The German mathematician David Hilbert (1862-1943) was the first to give the algebraically described space-filling curves a geometric interpretation [4]. He divided the unit square into four sub-squares. This splitting is recursively repeated for the emerging sub-squares, leading to the curves in Figure 5. An important property of SFCs is called full inclusion: whenever we further split a subinterval, the mapping to the corresponding finer sub-squares stays within the former interval.

Figure 5: Hilbert curves, first and second order. The numbers in the figure are the SFC-indices.

SFCs can be used to order the patches in a grid hierarchy. They are computationally efficient, locality preserving recursive mappings from d-dimensional space to 1-dimensional space, i.e. $R^d \to R^1$, such that each point in $R^d$ is mapped to a unique point or index in $R^1$. Two properties of space filling curves make them particularly suitable for partitioning SAMR grid hierarchies. Digital causality implies that points close together in d-dimensional space will be mapped to points close together in 1-dimensional space, i.e. the mapping preserves locality. Self-similarity implies that, as a d-dimensional region is refined, the refined sub-regions can be recursively filled by curves having the same structure as the curves filling the original region, but possibly with a different orientation. Figure 6 illustrates this property for a 2-dimensional region. In the case of SAMR grid hierarchies, the SFC mapping is exploited to maintain locality on the same refinement level for the patch-based approach, and on both the same and different refinement levels for the domain-based approach.
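For illustration, the mapping from a cell position (x, y) on an n × n SFC-grid to its Hilbert index can be computed with the classic bit-manipulation algorithm sketched below; this is a standard formulation and not necessarily the routine used in the implementation or in [4].

```cpp
// Map cell coordinates (x, y) on an n-by-n grid (n a power of two) to the
// corresponding Hilbert curve index in [0, n*n). At each scale s we pick
// the quadrant, accumulate its contribution to the index, and rotate/flip
// the coordinates into that quadrant's local frame.
int hilbert_index(int n, int x, int y) {
    int d = 0;
    for (int s = n / 2; s > 0; s /= 2) {
        int rx = (x & s) ? 1 : 0;
        int ry = (y & s) ? 1 : 0;
        d += s * s * ((3 * rx) ^ ry);
        if (ry == 0) {                // rotate the quadrant,
            if (rx == 1) {            // flipping when needed
                x = s - 1 - x;
                y = s - 1 - y;
            }
            int t = x; x = y; y = t;  // swap x and y
        }
    }
    return d;
}
```

On the first-order 2 × 2 curve this yields the indices 0, 1, 2, 3 for the cells (0,0), (0,1), (1,1), (1,0), matching the ordering shown in Figure 5.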

4 Implementation of the partitioner


(a) level 1 (b) level 2 (c) level 3

Figure 6: An illustration of the self-similarity of space filling curves.

The partitioner takes a grid hierarchy as input, and the hierarchy is assumed to be two-dimensional (the partitioner is easily extended to also work for grids in 3D). The output is a partitioned grid hierarchy where each refinement level is divided into p parts, where p is the number of processors. Each part is assigned to a specific processor.

The partitioner initially receives the grid hierarchy from a SAMR framework. Each grid patch in the hierarchy is transformed into an object of type BBoxInfo. BBoxInfo is a custom-made data structure that stores all properties of a grid patch. Table 1 presents a specification of the BBoxInfo class. The use of BBoxInfo makes the partitioner independent of any data structure used by the SAMR framework.

The first time the partitioner is invoked during a simulation, it reads the size of the base grid hierarchy. To order the patches, the partitioner then creates an empty square grid that is used to assign SFC-indices to the patches.

The size of this SFC-grid is determined by computing the first power of two that is equal to or larger than the longest side of the base grid, up to a maximum size of 512 × 512 grid points. For example, if the largest side is 240 grid points, the SFC-grid will be of size 256 (2^8) grid points. Each cell in the SFC-grid is then indexed according to the Hilbert space filling curve.

For each grid patch, the partitioner maps its coordinates onto the SFC-grid and finds a corresponding cell. It then assigns the SFC-index of this mapped cell to the SFC-index field in the corresponding BBoxInfo instance. The SFC-grid is created once and is reused at every subsequent repartitioning. Next, the partitioner orders the BBoxInfo objects by their SFC-indices. For the ordering, we use the quicksort algorithm.

We limited the resolution of the SFC-grid to 512 × 512 because the time complexity for creating an SFC-grid grows exponentially. The resolution 512 × 512 was shown to produce good results for all our test applications.
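Expressed in code, the sizing rule for the SFC-grid might look as follows; this is a sketch of the rule just described, with the 512-point cap as in the implementation.

```cpp
#include <algorithm>

// First power of two that is equal to or larger than the longest side of
// the base grid, capped at 512 (the chosen SFC-grid resolution limit).
int sfc_grid_size(int base_nx, int base_ny) {
    int longest = std::max(base_nx, base_ny);
    int size = 1;
    while (size < longest && size < 512)
        size *= 2;
    return size;                  // e.g. longest side 240 -> 256
}
```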

BBoxInfo

Type   Name            Description
int    nx, ny          Number of grid points in the x and y dimensions
int    lbx, lby        Lower bound coordinates
int    ubx, uby        Upper bound coordinates
int    stepx, stepy    Number of steps between two grid points in the x and y
                       dimensions, computed in the coordinate system of the
                       highest refinement level
int    level           Refinement level
int    index           SFC-index

Table 1: Instance variables in the class BBoxInfo. All variables except the last one (index) are used to transform a grid patch into a BBoxInfo object; index is used to order an array of BBoxInfo objects.
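Rendered as code, the record in Table 1 could look like the sketch below; it is a reconstruction from the table, not the class from the actual implementation.

```cpp
// Sketch of the BBoxInfo record from Table 1 (reconstruction, not the
// original class). One instance describes a single grid patch.
struct BBoxInfo {
    int nx, ny;        // number of grid points in the x and y dimensions
    int lbx, lby;      // lower bound coordinates
    int ubx, uby;      // upper bound coordinates
    int stepx, stepy;  // steps between grid points, in the coordinate
                       // system of the highest refinement level
    int level;         // refinement level
    int index;         // SFC-index, used to order patches
};
```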

Since the SFC-mapping preserves locality, patches that lie close to each other in the computational domain also lie close to each other when they are ordered (see Section 7).

Next, the partitioner starts to assign grid patches to the processors. If the optimal processor workload is exceeded when we assign a grid patch to a processor, the partitioner divides the offending grid patch into two parts. The patch is cut perpendicular to its longest side. One of the new parts has the size that the current processor needs to reach at least the optimal processor workload. This part is assigned to the current processor and the other part is saved for the next processor.

If dividing a grid patch would result in a patch smaller than a specified threshold, the whole patch is assigned to the current processor. The threshold is determined by the size of the FDM-operators. By restricting the minimum patch size, we avoid the creation of many small patches that can potentially cause a lot of unnecessary communication. If the current processor is the last processor, the partitioner assigns the whole grid patch (and any remaining grid patches) to it.

Since the last processor receives all remaining patches, it can accumulate a large excess workload, and as the most loaded processor determines the parallel computational time, this is a serious issue. As a remedy, we try to scatter this excess workload between all processors, rather than letting it be accumulated at the last processor. To achieve this, we aim to make the processor workload around 5% larger than the optimal processor workload. The factor 1.05 was found through experiments to consistently produce good results. Algorithm 3 illustrates the mechanism of dividing the grid hierarchy between the processors.

Algorithm 2 Read grid patches

Read the size of the base grid in the input grid hierarchy
Create SFC-grid
while patches left in hierarchy do
   read next patch
   create BBoxInfo object
   store object in array
   if no more patches on this level then
      compute the optimal procWorkLoad on this level
      sort the BBoxInfo array based on SFC-index
   end if
end while
goto algorithm Assign grids

Algorithm 3 Assign grids

for i = 0 to maxLevel do
   while patches left on this level do
      // Add patch to proc
      if procWorkLoad + workLoadPatch <= 1.05*optProcWorkLoad OR lastProcessor then
         assign current patch to processor
         procWorkLoad += workLoadPatch
         // current processor is full
         if procWorkLoad >= optProcWorkLoad AND !lastProcessor then
            procWorkLoad = 0; processorID++;
         end if
      // Need to divide this patch
      else
         goto algorithm Divide
      end if
   end while
end for

Algorithm 4 Divide

Determine direction of cut
if Xsize > Ysize then
   direction of cut = y
   compute x-coord for cut
else
   direction of cut = x
   compute y-coord for cut
end if
if resulting patches < minSize then
   procWorkLoad = 0; processorID++;
   goto algorithm Assign grids
else
   assign the first (the "left") part to the current processor
   replace the original grid by the second (the "right") part
   procWorkLoad = 0;
   processorID++;
   goto algorithm Assign grids
end if
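The following condensed sketch combines Algorithms 3 and 4 for a single refinement level. It assumes that a patch's workload is simply its number of grid points (ignoring time refinement for brevity), that patches arrive in SFC order, and that a patch too small to cut is given in full to the current processor, as described above; all names are illustrative.

```cpp
#include <algorithm>
#include <cmath>
#include <deque>
#include <utility>
#include <vector>

// Illustrative patch: workload is taken to be nx*ny grid points.
struct Patch {
    int nx, ny;
    double work() const { return double(nx) * ny; }
};

// Cut perpendicular to the longest side so that the first part carries at
// least 'want' workload (Algorithm 4). Returns {left, right}.
std::pair<Patch, Patch> split(const Patch& q, double want) {
    if (q.nx >= q.ny) {
        int cut = std::min(q.nx - 1, std::max(1, (int)std::ceil(want / q.ny)));
        return {{cut, q.ny}, {q.nx - cut, q.ny}};
    }
    int cut = std::min(q.ny - 1, std::max(1, (int)std::ceil(want / q.nx)));
    return {{q.nx, cut}, {q.nx, q.ny - cut}};
}

// Greedy assignment of SFC-ordered patches to p processors (Algorithm 3).
// A processor may exceed the optimal workload by 5% before cutting starts.
std::vector<std::vector<Patch>> assign(std::deque<Patch> patches,
                                       int p, int minSize) {
    double total = 0;
    for (const Patch& q : patches) total += q.work();
    const double opt = total / p;

    std::vector<std::vector<Patch>> parts(p);
    int proc = 0;
    double load = 0;
    auto nextProc = [&] { if (proc < p - 1) { ++proc; load = 0; } };

    while (!patches.empty()) {
        Patch q = patches.front();
        const bool last = (proc == p - 1);
        if (load + q.work() <= 1.05 * opt || last) {
            patches.pop_front();                 // patch fits (or last proc)
            parts[proc].push_back(q);
            load += q.work();
            if (load >= opt) nextProc();         // processor is now full
        } else {
            auto [left, right] = split(q, opt - load);
            if (std::min({left.nx, left.ny, right.nx, right.ny}) < minSize) {
                patches.pop_front();             // too small to cut:
                parts[proc].push_back(q);        // take the whole patch
                nextProc();
            } else {
                parts[proc].push_back(left);     // left part fills this proc
                patches.front() = right;         // right part: next processor
                nextProc();
            }
        }
    }
    return parts;
}
```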

5 Applications

To evaluate the performance of the implemented partitioner, we use four applications from the Virtual Test Facility (VTF) [8, 20]. VTF, developed at the California Institute of Technology, is a software environment for coupling solvers for compressible computational fluid dynamics (CFD) with solvers for computational solid dynamics (CSD) [8]. The purpose of VTF is to simulate highly coupled fluid-structure interaction problems. The selected applications are restricted to the CFD domain of VTF, as the CSD solver is implemented with unstructured grids and the finite element method.

For the CFD problems in VTF, the framework AMROC [2, 9] is used. AMROC is an object-oriented SAMR framework that conforms to the algorithm formulated by Berger and Colella [3]. AMROC is based on DAGH (Distributed Adaptive Grid Hierarchies), a data storage package for parallel grid hierarchies [13]. In AMROC, separate grid levels are allowed to use different degrees of refinement.

As an example, consider a grid hierarchy with two refinement levels and refinement factors 2 and 4. A patch on the second refinement level then contains 64 times as many grid points per unit area as for the base grid ((2 × 2) × (4 × 4)). Because we generally also refine in time, several iterations are performed on a refined patch during one time step on the coarsest level. Using our example and assuming equal refinement in both space and time, we will perform two iterations for each patch on the first refinement level and eight iterations for the patches on the second level. Thus, for each computation on a grid point on the base grid we will perform 64 × 2 × 4 computations on the highest refinement level. The refinement factors are set by the user before the application is executed. The metric workload measures the aggregate number of grid points calculated on during a time step on the coarsest level. Because of the refinement in time, a refined grid has a larger workload than its number of grid points.
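As a sketch, the workload of a patch under space and time refinement can be computed as below, assuming equal refinement factors in space and time as in the example above; the function and parameter names are illustrative.

```cpp
#include <cstdio>
#include <vector>

// Workload of a patch: its grid points times the number of iterations the
// patch performs during one time step on the coarsest level. With
// refinement factors r_1..r_l (equal in space and time), a level-l patch
// takes r_1 * ... * r_l iterations per coarsest-level step.
double patch_workload(int nx, int ny, int level,
                      const std::vector<int>& factors) {
    double iterations = 1;
    for (int i = 0; i < level; ++i)
        iterations *= factors[i];
    return double(nx) * ny * iterations;
}

int main() {
    // Example from the text, factors {2, 4}: a level-2 patch covering one
    // base grid point has (2*2)*(4*4) = 64 points and runs 2*4 = 8
    // iterations per coarse step, i.e. 64 * 2 * 4 = 512 computations.
    std::printf("%g\n", patch_workload(8, 8, 2, {2, 4}));
}
```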

5.1 Ramp — Mach reflection at a wedge

Ramp simulates the reflection of a planar Mach 10 shock wave striking a 30 degree wedge. A complicated shock reflection occurs when the shock wave hits the sloping wall.

Both the workload and the number of grid patches grow almost linearly during execution, due to the growing reflection pattern behind the shock wave. Both the incident and the reflected shock wave have a sharp and unscattered refinement pattern. The initial grid size is 480x120 grid points and the application uses three levels of refinement with refinement factors 2, 2, 4. The maximum number of grid points is 722,924. A density plot for time t=0.2 is shown in Figure 7.

Figure 7: Density plot for Ramp at t=0.2.

5.2 ShockTurb — Planar Richtmyer-Meshkov instability

When a shock wave strikes a perturbed interface between two gases of different densities, the result is called a Richtmyer-Meshkov instability. The Richtmyer-Meshkov instability finds applications in stellar evolution and supernova collapse, pressure wave interaction with flame fronts, supersonic and hypersonic combustion, and inertial confinement fusion.

In the simulation, an incident Mach 10 shock wave causes vortices along a sinusoidally perturbed gas interface (five symmetric perturbations). The geometry is rectangular and closed, except at the left-most end. The gases are air and SF6 (sulfur hexafluoride). The simulation is motivated by physical experiments [22].

The initial grid size is 240x120 grid points and the application uses three levels of refinement with a constant refinement factor of two. The maximum number of grid points is 787,076. A density plot for time t=0.5 is shown in Figure 8.

Figure 8: Density plot for ShockTurb at t=0.5

5.3 ConvShock — Converging/diverging Richtmyer-Meshkov instability

ConvShock simulates a Richtmyer-Meshkov instability in a spherical setting. The gaseous interface is spherical and sinusoidal in shape. The interface is disturbed by a Mach 5 spherical and converging shock wave. The shock wave is reflected at the origin and drives a Richtmyer-Meshkov instability with reshock from the apex.


Figure 9: Density plot for ConvShock at t = 0.6.

5.4 Spheres — Cylinders in hypersonic flow

In the Spheres application, a constant Mach 10 flow passes over two spheres placed inside the computational domain. The flow results in steady bow shocks over the cylinders. This is a realistic flow problem with complex boundaries.

The initial grid size is 200x160 grid points and the application uses three levels of refinement with a constant refinement factor of two. The maximum number of grid points is 689,688. A density plot for time t=3 is shown in Figure 10.

6 Performance metrics


Figure 10: Density plot for Spheres at t=3.

Application   Initial     Levels of    Refinement   Max         Total
              grid size   refinement   factors      grid size   workload
Ramp          480x240     3            {2,2,4}      722,924     1.78 x 10^9
ShockTurb     240x120     3            {2,2,2}      787,076     1.21 x 10^9
ConvShock     200x200     4            {2,2,4,2}    695,244     1.42 x 10^9
Spheres       200x160     3            {2,2,2}      689,688     0.60 x 10^9

Table 2: Application data. The maximum grid size denotes the number of grid points at the time step when the grid was at its largest. The total workload also considers refinement in time.

6.1 Number of blocks

To achieve low load imbalances, grid patches are often divided into smaller blocks. Splitting patches into many parts generally results in larger faces between the blocks, inducing more communication. Having many blocks also results in other types of overhead, e.g. larger start-up costs, more "book-keeping" and often higher cache miss rates. For the number of blocks metric, we compute the maximum number of blocks assigned to a processor.

6.2 Load imbalance


Figure 11: Workload and the number of grid patches for the four test applications.

The load imbalance is defined as

$$\text{Load imbalance (\%)} = 100 \cdot \frac{\max\{\text{processor workload}\}}{\text{average workload}} - 100$$

Since we generally also refine in time, the workload is larger than the total grid size. We use the workload of the most loaded processor, as all processors must finish their computations before the solution can be advanced to the next time step.
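As a sketch, the metric can be computed directly from the per-processor workloads:

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Load imbalance (%) = 100 * max(workload) / average(workload) - 100.
// Returns 0 for a perfectly balanced partition.
double load_imbalance_percent(const std::vector<double>& workload) {
    const double maxw = *std::max_element(workload.begin(), workload.end());
    const double avg = std::accumulate(workload.begin(), workload.end(), 0.0)
                       / workload.size();
    return 100.0 * maxw / avg - 100.0;
}
```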

6.3 Volume of communications

The FDM-operators may need the values of grid points that reside on other processors. Therefore, processors have to communicate the values of these grid points before they can compute the values at the current grid points. The communication between processors can be time consuming and can slow down the execution, especially if the interconnect is slow. The communication metric is divided into three groups:

• Intra-level communication: exchange of data residing on the same level of refinement.

• Inter-level communication: exchange of data residing on different levels of refinement.

• Total communication: the sum of the intra- and inter-level communication.

7 Performance results

We have evaluated the implemented patch-based partitioner using four real-world SAMR applications. For each application (see Section 5), we use a trace file that contains all information about the resulting grid hierarchies. The trace files were partitioned by both the implemented patch-based partitioner and a domain-based partitioner from the SAMR framework AMROC [2, 9]. The resulting partitions were used as input to a SAMR simulator [17]. The SAMR simulator mimics the execution of the common Berger-Colella SAMR algorithm [3]. For each time step, the simulator calculates metrics like the number of blocks, the arithmetic load imbalance and the communication volumes. The metrics (see Section 6) are independent of computer characteristics. All results were obtained using 16-processor configurations.

To evaluate the scaling characteristics of the patch-based algorithm, we also present results from representative 32 processor configurations.

7.1 Number of blocks

It is important to limit the number of blocks because a smaller number of blocks potentially results in smaller communication volumes. When fewer blocks are assigned to each processor, it is less likely that a processor will need data from grid points assigned to other processors. The results are presented in Figure 12.

Intuitively, one could expect that the domain-based algorithm, which cuts through all refinement levels at once, would result in a smaller number of blocks. However, since the patch-based partitioner is implemented to only divide the patches that actually cause the load on a processor to overflow, the number of blocks is kept low.


Figure 12: Number of blocks, p=16, patch-based vs domain-based

7.2 Load imbalance

The results are presented in Figure 13. The patch-based partitioning algorithm never produces a load imbalance larger than five percent. As we described in Section 4, we allow a load imbalance of five percent to avoid excess workload being accumulated on the last processor. This mechanism seems to work flawlessly.

The domain-based partitioning algorithm produced high load imbalances because of restrictions on where to place the cuts that divide the hierarchy. The cuts can only be placed at grid points on the base grid. Thus, large amounts of load imbalance can arise because the larger workloads on higher levels of refinement cannot be evenly divided.

Figure 13: Load imbalance, p=16, patch-based vs domain-based: (a) ConvShock, (b) Ramp, (c) ShockTurb, (d) Spheres.


7.3 Intra-level communication

For the intra-level communication, the patch-based partitioning algorithm gives better results for ShockTurb (see Figure 14). The patch-based and domain-based partitioning algorithms give comparable results for Spheres, while the domain-based partitioning algorithm results in lower communication for ConvShock and Ramp. Thus, the patch-based partitioning algorithm is generally comparable to the domain-based partitioning algorithm with regard to the intra-level communication.

Looking at the number of blocks, we expected that the patch-based partitioning algorithm would perform best for the ConvShock and Ramp applications. For these two applications, the patch-based algorithm produced a significantly lower number of blocks than the domain-based algorithm. However, the best performance of the patch-based algorithm was instead achieved for the Spheres and ShockTurb applications, where both algorithms produced a similar number of blocks. From these results, we draw the conclusion that the behavior of the intra-level communication depends on many other properties than the number of blocks.

We expected the domain-based partitioning algorithm to produce substantially lower volumes of intra-level communication than the patch-based partitioning algorithm. For the patch-based algorithm, the patches on each level of refinement are ordered by mapping them onto a small SFC-grid (see Section 4). Because of the limited size of the SFC-grid, it is possible that different patches are mapped to the same location on the SFC-grid, and thus assigned the same SFC-index. The data structure used by the domain-based partitioning algorithm allows each patch to be assigned a unique SFC-index, resulting in a more accurate mapping. Despite the fact that the domain-based partitioning algorithm in theory has better locality preserving properties, the two algorithms resulted in similar amounts of intra-level communication.

7.4 Inter-level communication

We only present inter-level communication for the patch-based partitioning algorithm in Figure 15. For the domain-based partitioning algorithm, all overlapping areas of a grid hierarchy are assigned to the same processor. Hence, there is no inter-level communication between the processors when the domain-based partitioning algorithm is used.

Figure 14: Intra-level communication, p=16, patch-based vs domain-based: (a) ConvShock, (b) Ramp, (c) ShockTurb, (d) Spheres.

Applications with higher degrees of refinement have larger amounts of communication because of the larger workloads that result from their higher degrees of refinement.

To reduce inter-level communication, overlapping areas of the grid hierarchy should be assigned to the same processor. This is a difficult task that requires advanced algorithms and large amounts of post-processing after the initial assignment of blocks to the processors. To our knowledge, this issue has not been seriously addressed in the context of patch-based partitioners, and it is also outside the scope of this thesis.

7.5 Total communication

Figure 15: Inter-level communication for the patch-based partitioning algorithm: (a) ConvShock, (b) Ramp, (c) ShockTurb, (d) Spheres.

For the patch-based partitioning algorithm, the total communication volume is dominated by the inter-level communication, since the amount of intra-level communication is much lower than the inter-level communication (see Figure 16). The domain-based algorithm results in a very low amount of total communication since the algorithm does not give rise to any inter-level communication. Thus, to reduce the amount of communication for the patch-based algorithm, the effort should be directed at decreasing the inter-level communication.

7.6 Impact of SFC

To evaluate the impact of the SFC-ordering, we compare partitionings computed with and without SFC-mapping on the Ramp application.

Figure 16: Total communication, patch-based vs domain-based: (a) ConvShock, (b) Ramp, (c) ShockTurb, (d) Spheres.

Using the SFC to increase locality significantly reduces the intra-level communication volumes, by up to 75% (see Figure 17). The SFC-ordering assigns blocks that lie close to each other to the same processor. We also hoped that the amount of inter-level communication would be reduced, since the SFC orders the blocks in a similar pattern on each refinement level. This can potentially result in overlapping patches being assigned to the same processor. However, the reduction in inter-level communication is very small and cannot be regarded as significant. Turning to load imbalance and the number of blocks, we see that the use of the SFC has a minor impact on these metrics.

Since communication generally is the most performance-inhibiting factor for patch-based partitioners, the better locality will help reduce the computational times.

7.7 Scalability of the implemented patch-based partitioner

In this section we present the results from 32 processor configurations to examine the scalability characteristics of the patch-based partitioner. We limit ourselves to the Ramp application, since the other applications have a similar behavior. For each metric we also present the corresponding result for 16 processors. The results are presented in Figure 18.

The amount of load imbalance is limited to 5% for both processor configurations. The partitioner can thus generally be assumed to produce a load imbalance lower than five percent.

The number of blocks grows when the number of processors is increased. This was expected, since we subdivide the patches that cause a processor to overload; the more processors we have, the more patches will be subdivided. However, the increase in the number of blocks is less than the increase in the number of processors.

The intra-level communication is marginally affected by the increase from 16 to 32 processors, even though the number of blocks is almost doubled. This means that the implemented SFC-mechanism scales very well with regard to intra-level communication.

For the inter-level communication, we observe a growth in the communication volume. The increase is proportionally less than the increase in the number of processors. Since the inter-level communication often is the most performance-inhibiting factor, this is a good result. We believe that the SFC-ordering slightly decreases the communication volumes.

Because of the good properties of the intra-level communication, the total communication does not grow as much as one could expect. However, since the inter-level communication is much larger than the intra-level communication, the amount of communication is still high.

Figure 17: Impact of the SFC-ordering for Ramp, p=16, with vs without SFC: (a) number of blocks, (b) inter-level communication, (c) intra-level communication, (d) load imbalance, (e) total communication.

Figure 18: Scalability of the patch-based partitioner for Ramp, p=16 vs p=32: (a) number of blocks, (b) inter-level communication, (c) intra-level communication, (d) load imbalance, (e) total communication.

8 Summary and conclusion

To maintain good performance for parallel SAMR applications, the resulting dynamic grid hierarchies are repeatedly partitioned and distributed over the participating processors.

In this thesis, we implemented a patch-based partitioning algorithm for SAMR applications. We specifically targeted the load imbalance while using space filling curves to limit the amount of communication.

The algorithm was evaluated using four real-world SAMR applications. The results are compared to results from a common domain-based algorithm. From the evaluation we draw the following conclusions:

• The patch-based algorithm produces a very low and stable load imbalance compared to the domain-based algorithm.

• The partitioner produces a comparable number of blocks with regard to the domain-based partitioner.

• For intra-level communication, the patch-based partitioning algorithm produces communication volumes comparable to the domain-based partitioning algorithm.

• For the patch-based algorithm, the volume of inter-level communication is substantially larger than the amount of intra-level communication. The domain-based algorithm does not produce any inter-level communication, since all overlapping areas are assigned to the same processor.

• The total communication volume is much larger for the patch-based partitioning algorithm compared to the domain-based algorithm. This was expected, since the domain-based algorithm only produces intra-level communication.

Experiments with different numbers of processors show that the implemented patch-based algorithm scales well when the number of processors is increased. For large numbers of processors, communication might nevertheless become a limiting factor, despite the good scaling properties.


References

[1] A. H. Elcock, R. R. Gabdoulline, R. C. Wade, and J. A. McCammon. Computer simulation of protein-protein association kinetics: acetylcholinesterase-fasciculin. Journal of Molecular Biology, 291:149–162, 1999.

[2] AMROC - Blockstructured adaptive mesh refinement in object-oriented C++. http://amroc.sourceforge.net/index.htm, Oct. 2006.

[3] M.J. Berger and P. Colella. Local adaptive mesh refinement for shock hydrody-namics. Journal of Computational Physics, 82:64–84, May 1989.

[4] Greg Breinholt and Christoph Schierz. Algorithm 781: Generating Hilbert's space-filling curve by recursion. ACM Transactions on Mathematical Software, 24:184–189, June 1998.

[5] Greg L. Bryan. Fluids in the universe: Adaptive mesh refinement in cosmology. Computing in Science and Engineering, pages 46–53, Mar-Apr 1999.

[6] Niklas Bylund. Simulation driven product development applied to car body design. Doctoral thesis, Lund University, Aug. 2004.

[7] Chombo - Infrastructure for adaptive mesh refinement. http://seesar.lbl.gov/ANAG/chombo/, Dec. 2006.

[8] R. Deiterding, R. Radovitzky, L. Noels, S. Mauch, J. C. Cummings, and D. I. Meiron. A virtual test facility for the efficient simulation of solid material response under strong shock and detonation wave loading. To appear in Engineering with Computers, 2006.

[9] Ralf Deiterding. Detonation simulation with the AMROC framework. In Forschung und wissenschaftliches Rechnen: Beiträge zum Heinz-Billing-Preis 2003, pages 63–77. Gesellschaft für Wiss. Datenverarbeitung, 2004.

[10] Zhiling Lan, Valerie E. Taylor, and Greg Bryan. Dynamic load balancing of SAMR applications on distributed systems. In Proceedings of 30th International Conference on Parallel Processing, 2001.


[12] Manish Parashar and James C. Browne. On partitioning dynamic adaptive grid hierarchies. In Proceedings of the 29th Annual Hawaii International Conference on System Sciences, 1996.

[13] Manish Parashar, James C. Browne, Carter Edwards, and Kenneth Klimkowski. A common data management infrastructure for adaptive algorithms for PDE solutions. In Proceedings of Supercomputing, 1997.

[14] Å. Petersson, H. Karlsson, and S. Holmgren. Predissociation of the Ar-I2 van der Waals molecule, a 3D study performed using parallel computers. Technical report, Department of Quantum Chemistry, Uppsala University, Sweden, 2001.

[15] James J. Quirk. A parallel adaptive grid algorithm for computational shock hydrodynamics. Applied Numerical Mathematics, 20:427–453, 1996.

[16] Jarmo Rantakokko. Data Partitioning Methods and Parallel Block-Oriented PDE Solvers. PhD thesis, Uppsala University, 1998.

[17] Mausumi Shee, Samip Bhavsar, and Manish Parashar. Characterizing the performance of dynamic distribution and load-balancing techniques for adaptive grid hierarchies. In Proceedings of the IASTED International Conference on Parallel and Distributed Computing and Systems, 1999.

[18] Johan Steensland. Efficient Partitioning of Dynamic Structured Grid Hierarchies. PhD thesis, Department of Scientific Computing, Information Technology, Uppsala University, Oct. 2002.

[19] Johan Steensland. Irregular buffer-zone partitioning reducing synchronization cost in SAMR. International Journal of Computational Science and Engineering (IJCSE), 2006. Special issue, to appear.

[20] The Virtual Test Facility. http://www.cacr.caltech.edu/asc/wiki, Oct. 2006.

[21] M. Thuné. Partitioning strategies for composite grids. Parallel Algorithms and Applications, 11:325–348, 1997.

[22] M. Vetter and B. Sturtevant. Experiments on the Richtmyer-Meshkov instability of an air/SF6 interface. Shock Waves, 4(5):247–252, 1995.
