Parallel implementation of the projected Gauss-Seidel method on the Intel Xeon Phi processor – Application to granular matter simulation.

N/A
N/A
Protected

Academic year: 2021

Share "Parallel implementation of the projected Gauss-Seidel method on the Intel Xeon Phi processor – Application to granular matter simulation."

Copied!
44
0
0

Loading.... (view fulltext now)

Full text

(1)

the Intel Xeon Phi processor – Application to granular matter

simulation.

Emil Rönnbäck

18th August 2014

Master’s Thesis in Computing Science, 30 credits

Supervisor at Algoryx and UMIT Research Lab: Claude Lacoursière
Supervisor at CS-UmU: Stefan Johansson

Examiner: Fredrik Georgsson

Umeå University

Department of Computing Science

SE-901 87 UMEÅ


Abstract

Granular matter occurs everywhere in nature and industry, in forms such as gravel, corn, and snow. Research and development of new, more accurate, and faster methods to simulate even more complex materials with millions of particles are needed. In this thesis, a typical scene containing thousands of particles has been used for analysing simulation performance using the iterative Gauss-Seidel method adapted to the specifications and capabilities of the Intel Xeon Phi coprocessor. The work began with analysing the performance (wall-clock time and speedup) of a method developed by Algoryx Simulation.

The work continued with finding the parts of the code causing bottlenecks and implementing improvements such as a distributed task scheduler and vectorization of operations. In the end, this resulted in shorter execution time and linear speedup using more than 40 threads, compared to 20 in the initial state. We also investigated the benefit of other techniques, such as cache prefetching and usage of huge page sizes, but found no performance gain from these. It is well known that the Xeon Phi coprocessor performs well when executing highly parallel applications, but overload may occur if an excessive amount of data is requested by many threads simultaneously. To tackle this issue, the convergence rate of the Gauss-Seidel method during simulation has been measured, and modifications of the method that decrease the data flow have been implemented and analysed.


Contents

Abstract
1 Introduction
    1.1 Previous work
        1.1.1 Related work
2 Algoryx Simulation
3 Problem Description
    3.1 Problem statement
    3.2 Goals
    3.3 Purposes
    3.4 Methods
4 Theory
    4.1 Multibody dynamics simulation
    4.2 Kinematic constraints
    4.3 Iterative Methods
    4.4 The Gauss-Seidel Method
        4.4.1 Projected Gauss-Seidel
        4.4.2 Parallel Gauss-Seidel
    4.5 Spatial partitioning
    4.6 Algorithm description
    4.7 Hardware: The Intel Xeon Phi Coprocessor
    4.8 Current implementation
        4.8.1 Cache prefetching
        4.8.2 Vectorization
5 Results
    5.1 Initial implementation
    5.2 Improvements
        5.2.1 Task Scheduling
        5.2.2 Vectorization
6 Discussion
    6.1 Results in relation to previous work
    6.2 Iterative methods in relation to hardware
    6.3 Convergence of the projected Gauss-Seidel method
    6.4 Examined methods not giving any improvements
7 Conclusions
    7.0.1 Building for the Intel Xeon Phi Architecture
    7.1 Future work
8 Acknowledgements
References


List of Tables

4.1 Explanation of variables used in the multibody dynamics equation presented in (4.2).
4.2 Specification of the Intel Xeon Phi Coprocessor architecture compared to the host architecture.


List of Figures

3.1 A rotating drum containing 500 thousand particles. This is the example used throughout for tests and benchmarks.
4.1 Two objects, B1 and B2, at distance x1 + r1 − x2 − r2 from each other.
4.2 Spatial partitioning of a scene to be able to solve the motions of particles in parallel, where each region corresponds to a job for one thread.
4.3 Showing a bipartite graph where objects in U are connected through constraints in V.
4.4 A tree of dependence for calculations to be done within each box generated by spatial partitioning.
4.5 Overview of the Intel Xeon Phi architecture.
5.1 Wall clock time graph of the initial implementation of the GS solver run on the Intel Xeon Phi coprocessor.
5.2 Speedup graph of the initial implementation of the GS solver run on the Intel Xeon Phi coprocessor.
5.3 Wall clock time graph of the GS solver run on the Intel Xeon Phi Coprocessor, modified to use Intel Threading Building Blocks' distributed task scheduler.
5.4 Speedup graph of the GS solver run on the Intel Xeon Phi Coprocessor, modified to use Intel Threading Building Blocks' distributed task scheduler.
5.5 Wall clock time graph of the GS solver run on the Intel Xeon Phi Coprocessor, modified to use Intel Threading Building Blocks' distributed task scheduler and vectorization instructions.
5.6 Speedup graph of the GS solver run on the Intel Xeon Phi Coprocessor, modified to use Intel Threading Building Blocks' distributed task scheduler and vectorization instructions.
6.1 Convergence rate for the Gauss-Seidel method.


Introduction

Granular materials are ubiquitous in nature. A few examples of this type of matter are gravel in a bucket, coffee beans in a container, and even icebergs. Because of its frequent occurrence, it is also the most common material in industry after water [20]. This is one reason why it is of great interest, both to industry and academia, to be able to simulate granular matter: for research on how to improve industrial tools and methods, and to gain further knowledge of its behaviour. Technically, this kind of matter is a collection of distinct macroscopic particles behaving as a solid when compact and more like a liquid or gas otherwise. The particles are always in frictional contact, and friction is one of the most important contributions to the overall properties.

To simulate a granular material it is required to solve a large system of linear equations. To increase the speed of processing and make it possible to simulate a greater number of particles, a parallelizable method is a necessity. In this thesis an initial implementation of a method called Parallel Projected Gauss-Seidel (PPGS), developed at Umeå University [12] and Algoryx Simulation AB [17], is analysed and improved. More information and motivation for why PPGS is the best choice for solving very large systems of linear equations is given in Section 4.1. Improvements are sought in the sense of a general speed increase of the algorithm and also performance improvements on a specific piece of hardware: the Intel Xeon Phi coprocessor. Compared to a common x86 processor, the coprocessor consists of many cores and utilizes high bandwidth and vector instructions for high parallelization capability. This combination of properties has not been seen before, which is why it is relevant to find out if this type of processing unit is of interest for physics simulation software.

In the simulation software as a whole, there are several tasks to be performed, some sequentially and others in parallel. This thesis focuses on the most time consuming step in the simulation algorithm, where the motions are calculated. To get as efficient a solution as possible, studies have been made of the hardware, the theory of the PPGS method, and how to combine those factors into an optimal product.

The improvements primarily result from using Intel Threading Building Blocks [3] for distributed job scheduling and from using assembly-coded functions for vectorized operations written specifically for the coprocessor. For performance measurements, simulations are done on a rotating drum containing a large number of particles¹, see Figure 3.1. The final results for the solver show linear speedup using more than 40 threads, compared to linear speedup up to only 20 threads with the initial implementation.

¹ Scenes with up to 500 thousand particles.


1.1 Previous work

In this section, previous work on simulation of granular matter is presented with focus on the scalability of the solver used. There exist several methods that can be used to simulate granular matter, and it is therefore of interest to investigate what has been done before and which results were accomplished. In a simulation, there are many parameters that influence the computational performance, such as the physical properties of the simulated matter, which directly affect the mathematical problem that needs to be solved, and how the data structures are stored in memory. Considering the parallelization aspect of the algorithms in the simulation, even though some algorithms are easier to parallelize than others, they might not perform better overall. Also, the properties of the platform – such as ordinary clusters, GPUs, or the Intel Xeon Phi – make some types of calculations easier than others. All of these topics are currently active fields of research. Below, a selection of three papers addressing the same problems as this thesis will be discussed briefly. It is beyond the scope of this thesis to present a comprehensive review of the literature.

Focus was put on porting an existing method to the Intel Xeon Phi coprocessor.

In [19], M. Renouf, F. Dubois and P. Alart investigated a parallelization approach of the Non Smooth Contact Dynamics method for granular matter simulation, where OpenMP was used for thread communication. The simulations were run on up to 6 processors and, according to the speedup graph presented, the relative speedup stayed above 90% of linearity.

In another investigation [18], M. Renouf and P. Alart used Conjugate Gradient (CG) type algorithms to solve frictional multi-contact problems applied to granular matter.

It was concluded that the number of iterations a CG algorithm needed to reach reasonable accuracy was only a third of that of a comparable Gauss-Seidel solver. In a multi-threaded implementation, though, there were problems with ill-conditioning which degraded the performance. The CG methods are easier to parallelize than others but, as discussed in the paper, to date their convergence is not very good.

A recent paper [23] by V. Visseq, P. Alart and D. Dureisseix presented good results for granular simulations where a domain decomposition approach was used. To solve the contact problems, Gauss-Seidel iterations were applied and OpenMP was the means of communication between threads. The results included simulations of 200k particles where linear speedup was observed for up to around 20 threads, and satisfactory performance after that too.

Performance can always be better, and investigating new platforms is of interest to explore new possibilities for better performance and results in general. In this thesis, the performance of the relatively new Xeon Phi platform will be investigated. It is a many-core architecture, which is why it might be suitable for highly parallelizable large-scale granular matter simulations. Compared to usual clusters it is cheaper, which is attractive, and even though it is slower in its current state it has better performance per watt.

1.1.1 Related work

A prototype solution for granular matter simulation using PPGS has been developed at Umeå University [12] and further improved at Algoryx Simulation AB [17].


Algoryx Simulation

Algoryx Simulation is the company at which I, the author, have been situated during the work of this thesis. The company is a leading provider of software and services for visual and interactive physics-based simulation. The products they currently provide are AgX Dynamics and Dynamics for SpaceClaim for the professional market, and Algodoo for the education market. Algoryx Simulation collaborates closely with UMIT Research Lab at Umeå University, which in turn provided the opportunity and task of writing this thesis.


Problem Description

Given a large number of particles interacting almost exclusively through frictional contact forces, by which means can the Parallel Projected Gauss-Seidel method (PPGS) be implemented to get good scalability while still keeping the physical correctness? This is the main question of this thesis, which is answered with the Intel Xeon Phi coprocessor taken into account. Other questions that this thesis will try to clarify are how to parallelize the Gauss-Seidel algorithm for best performance, where the bottlenecks are, and how these can be taken care of.

Taken into account in this work is that the PPGS method is not naturally parallelizable and is also communication intensive.

3.1 Problem statement

In fast and efficient simulations of large systems of granular matter with millions of particles, the biggest challenge is to cope with all the involved computations within a given time constraint. Thus, the aim is to decrease the time needed for calculations. Algorithm optimization for increased speed is mainly what is desired, taking both useful properties and limitations of the Intel Xeon Phi coprocessor into consideration.

3.2 Goals

The goals of this thesis are to evaluate and find bottlenecks in an existing implementation of the PPGS method for solving systems of particle interactions, especially considering the hardware properties of the Intel Xeon Phi coprocessor. Further, the goal is to implement proposed improvements for increased performance when using the Intel Xeon Phi coprocessor.

3.3 Purposes

The purpose of this thesis is to evaluate and improve the PPGS method used in simulations of massive numbers of particles. Improvements are sought in the sense of increased parallelism and computational speed. As in many other areas, solving a task faster is advantageous because it speeds up the iteration loop of new research and development. On top of this, the architecture of the Intel Xeon Phi is of essential interest, being many-core, x86-compatible, and having much higher memory bandwidth than common workstation processors [10].


Figure 3.1: A rotating drum containing 500 thousand particles. This is the example used throughout for tests and benchmarks.

A performance evaluation of a non-'embarrassingly parallel'¹ problem such as granular simulation is of interest to get indications of how well the architecture solves such problems. Whatever the result of the evaluation is, it will clearly state the difference from the standard fewer-core architecture and point out bottlenecks in performance, which might be useful for future implementations, both software- and hardware-related.

3.4 Methods

To be able to improve the PPGS method, test simulations must be performed and evaluated.

As an initial test, scenes consisting of a rotating drum containing a varying number of particles will be used. This is a scene where the involved particles interact as in common everyday scenarios, for example gravel in a bucket. The picture in Figure 3.1 illustrates the drum and the contained granular material. The number of particles within the scenes will range between 50,000 and 500,000. This is more challenging than cases with only resting contacts because the neighbour lists change dynamically, and both stiction and sliding friction are present. When executing the simulations, a fixed number of time steps will be made, using 1 up to 228 threads for every scene. These tests will result in time measurements which will be analyzed to get indications of where the bottlenecks are and what is causing them. In the parts of the implementation where modifications may provide improvements, such as the task management and the Gauss-Seidel algorithm, proposed changes will be implemented and another iteration of tests will be made.

¹ A problem that requires no or little effort to separate into a number of parallel tasks.


Theory

When performing multibody dynamics simulations, one of the most time consuming steps is to calculate the forces between interacting objects. The non-smooth method we use requires the solution of linear systems of equations, which is done every time step. Basically, what is sought is the solution to the linear equation

\[ Ax = b, \tag{4.1} \]

where A is an n × n matrix, b is a vector of length n, and x = (x_1, x_2, . . . , x_n) is the vector of unknowns. One way of doing this is to use a direct method (e.g., Gaussian elimination), which computes the exact solution in finite time. The time complexity of this method is O(n³), which for large systems becomes very time consuming, and since it also has a high memory footprint it is not well suited for solving large systems of equations [22]. When thousands of interactions have to be solved for every time step under a time constraint, using a direct method is accordingly not possible. Hence, another type of method must be used: an iterative method. Using an iterative method it is possible to get a good enough approximation within a set time constraint. Methods of this type iteratively improve the approximation, either until a target precision is met or a time limit is reached [13].

4.1 Multibody dynamics simulation

When simulating interacting objects, the interaction between the objects can be described by the linear system

\[
\begin{bmatrix} M & -G_k^T \\ G_k & \Sigma \end{bmatrix}
\begin{bmatrix} v^{k+1} \\ \lambda \end{bmatrix}
=
\begin{bmatrix} M v^k + h f^k \\ -\tfrac{4}{h}\Upsilon g^k + \Upsilon G_k v^k \end{bmatrix}
\tag{4.2}
\]

which is solved with regard to v and λ. Explanations of the variables shown in Equation 4.2 are summarized in Table 4.1. In Equation 4.2, each column of M and G represents an interacting object and each row of G is a connection between interacting objects.

4.2 Kinematic constraints

When simulating mechanical phenomena, there is a need to mathematically describe the different constraints acting upon the objects in a scene.

M — Mass matrix: a block diagonal n × n matrix, where each block is a 3 × 3 (in 3D) mass matrix of an interacting object.
G — Jacobian matrix: a sparse matrix containing gradients of the functions describing kinematic constraints between objects in the scene (such as collision constraints). This is described in detail in Section 4.2.
Σ — Diagonal matrix containing regularization parameters, Σ = (4/h²) diag(ε_1/(1 + 4τ_1/h), ε_2/(1 + 4τ_2/h), . . . , ε_m/(1 + 4τ_m/h)) [14, p. 100].
Υ — Diagonal matrix of regularization parameters, Υ = diag(1/(1 + 4τ_1/h), 1/(1 + 4τ_2/h), . . . , 1/(1 + 4τ_m/h)) [14, p. 100].
v — Velocity.
λ — Lagrange multiplier.
f — Force.
k — Current time step.
h — Time step size.
g — Constraint violation.

Table 4.1: Explanation of variables used in the multibody dynamics equation presented in (4.2).

A constraint could, for example, prevent objects from moving through each other during collision, act as friction between objects, or lock one object to another in a specified way such as with a hinge or joint. What follows is an example of how a contact constraint is derived and transformed into a useful expression for the rest of the simulation. This is the type of constraint used in granular matter simulations.

Consider two objects at positions x1 and x2 in world space, as seen in Figure 4.1. The points of collision for B1 and B2 are

\[ p_1 = x_1 + r_1 \quad \text{and} \quad p_2 = x_2 + r_2, \tag{4.3} \]

where the surface normals at the location of contact are n1 and n2 for objects B1 and B2, respectively. From this, the penetration depth can be computed as

\[ C = (p_1 - p_2) \cdot n_1 = (x_1 + r_1 - x_2 - r_2) \cdot n_1. \tag{4.4} \]

C is the position based penetration constraint for the contact between B1 and B2, which is satisfied as long as the two bodies are separated, C ≥ 0. In the computations done in the simulations of this thesis, the constraints are velocity based.

Figure 4.1: Two objects, B1 and B2, at distance x1 + r1 − x2 − r2 from each other.

To form such a constraint C, it must be derived with respect to time:

\[
\begin{aligned}
\dot{C} &= \frac{d}{dt}\big((x_1 + r_1 - x_2 - r_2) \cdot n_1\big) \\
        &= \frac{d}{dt}(x_1 + r_1 - x_2 - r_2) \cdot n_1 + (x_1 + r_1 - x_2 - r_2) \cdot \frac{d}{dt}(n_1) \\
        &= (v_1 + \omega_1 \times r_1 - v_2 - \omega_2 \times r_2) \cdot n_1 + (x_1 + r_1 - x_2 - r_2) \cdot \frac{d}{dt}(n_1) \\
        &\approx (v_1 + \omega_1 \times r_1 - v_2 - \omega_2 \times r_2) \cdot n_1 \\
        &= v_1 \cdot n_1 + \omega_1 \cdot (r_1 \times n_1) - v_2 \cdot n_1 - \omega_2 \cdot (r_2 \times n_2) \\
        &= \underbrace{\begin{bmatrix} n_1^T & (r_1 \times n_1)^T & -n_1^T & -(r_2 \times n_2)^T \end{bmatrix}}_{J}
           \underbrace{\begin{bmatrix} v_1 \\ \omega_1 \\ v_2 \\ \omega_2 \end{bmatrix}}_{v}
\end{aligned}
\tag{4.5}
\]

The approximation on line four in Equation 4.5 can be made since the term added on line three is zero when the constraint is satisfied. The J part of \(\dot{C}\) is called a Jacobian matrix and is what the G matrix in Equation 4.2 consists of: one Jacobian for every pair of objects in contact [1].

4.3 Iterative Methods

There exist a large number of iterative methods for solving systems of linear equations. Many of these methods (e.g. Jacobi, Gauss-Seidel, successive over-relaxation) have the form

\[ M x^{(k+1)} = N x^{(k)} + b \tag{4.6} \]

where A = M − N is a splitting of the matrix A. For a method to be practical, it must be computationally cheap to solve a linear system with M as the matrix. Whether (4.6) converges to x = A^{-1}b depends upon the eigenvalues of M^{-1}N; specifically, on the spectral radius of an n × n matrix G, defined by

\[ \rho(G) = \max\{|\lambda| : \lambda \in \sigma(G)\} \tag{4.7} \]

where σ(G) is the set of all eigenvalues of G. In this case G = M^{-1}N, and its spectral radius determines whether the iteration converges, as the following theorem states.


Theorem 4.3.1. [7] Suppose b ∈ R^n and A = M − N ∈ R^{n×n} is non-singular. If M is non-singular and the spectral radius of M^{-1}N satisfies the inequality ρ(M^{-1}N) < 1, then the iterates x^{(k)} defined by M x^{(k+1)} = N x^{(k)} + b converge to x = A^{-1}b for any starting vector x^{(0)}.

The consequence of Theorem 4.3.1 is that, for the system of linear equations in Equation 4.6, ρ(M^{-1}N) < 1 must hold for convergence to be guaranteed. It should be noted that even if this property is fulfilled for M^{-1}N, the convergence rate may be very slow if the spectral radius is close to 1. Also, every iterative method of the type mentioned above uses its own particular way of splitting the matrix A. This results in differing properties of the matrix operated upon, which in turn influences the convergence rate. One condition that guarantees convergence for these methods is strict diagonal dominance of the matrix A, meaning

\[ |a_{ii}| > \sum_{j \neq i} |a_{ij}|, \quad \text{for all } i \tag{4.8} \]

where a_{ij} denotes the entry in row i and column j of the matrix [9, p. 511]. Unfortunately, this property is not fulfilled for the matrices involved in computing contact forces.

4.4 The Gauss-Seidel Method

The iterative method used in this thesis to solve the linear system of equations (4.1) is the Gauss-Seidel method [7]. In this section, the method is introduced and the motivation for why this method is chosen is presented.

For the Gauss-Seidel method, the splitting (4.6) of the matrix A is

\[ M = D + L, \qquad N = -U \tag{4.9} \]

where L, D, and U represent the strictly lower triangular, diagonal, and strictly upper triangular parts of A, respectively [9]. In matrix form, the Gauss-Seidel method is expressed as

\[ x^{(k)} = (D + L)^{-1}(-U x^{(k-1)} + b), \tag{4.10} \]

where x^{(k)} denotes x at the kth iteration. The equations are solved one at a time, using newly computed values of x^{(k)} as soon as they are available. At iteration k, the next value of x_i^{(k)}, element i of x in the sequence of updates, is obtained from the equation

\[ x_i^{(k)} = \frac{b_i - \sum_{j<i} a_{ij} x_j^{(k)} - \sum_{j>i} a_{ij} x_j^{(k-1)}}{a_{ii}}. \tag{4.11} \]

For Gauss-Seidel, there is an additional property besides strict diagonal dominance which guarantees convergence, which the following theorem states.

Theorem 4.4.1. If A ∈ R^{n×n} is symmetric and positive definite (SPD), then the Gauss-Seidel iteration converges for any x^{(0)}. [7]

The matrix to solve in multibody dynamics simulation, given in (4.2), is positive definite but not symmetric, and can thus not be processed by Gauss-Seidel. To solve this problem, the Schur complement matrix is used instead, defined as

\[ S = G M^{-1} G^T + \Sigma, \tag{4.12} \]


with the corresponding reduced system

\[ S \lambda = q - G M^{-1} p, \tag{4.13} \]

where p and q denote the upper and lower blocks of the right-hand side of (4.2). Solving (4.2) is equivalent to solving (4.13) and then substituting the result. This means that, after λ has been computed from (4.13), v can be obtained as

\[ v = M^{-1}(G^T \lambda + p). \tag{4.14} \]

Using Gauss-Seidel iterations, it is possible to obtain λ and v approximately without going through the time consuming task of computing S. For this, the needed computations are of the form

\[ y \leftarrow G x, \qquad z \leftarrow M^{-1} G^T w. \tag{4.15} \]

Here, G and M^{-1}G^T do not have to be computed explicitly either. Instead, these operations are carried out by traversing the constraints and using a packed format for the entries of the matrices. Using the Schur complement, the main stepping equation can be written as v = v^{(0)} + h M^{-1} f^{(0)} + M^{-1} G^T λ, which is interpreted as

\[
\begin{aligned}
v^{(1)} &= v^{(0)} + h M^{-1} f^{(0)} \\
f^{(c)} &= G^T \lambda \\
v &= v^{(1)} + M^{-1} f^{(c)}
\end{aligned}
\tag{4.16}
\]

where v^{(1)} is the predicted final velocity if no constraint forces are active, f^{(c)} is the impulse caused by constraints, and v is the final velocity. The main linear problem to solve is thus

\[ S \lambda = q - G v^{(1)}. \tag{4.17} \]

Gauss-Seidel is one out of several methods which may be used to solve such systems of equations; Jacobi, Conjugate Gradient and Successive Over-Relaxation are some other methods, to name a few. These methods improve the solution further with each iteration – not delivering the perfect answer, but rather a sufficiently good approximation within given time constraints. The arguments for using Gauss-Seidel as the iterative method are as follows. Because the solution of Gauss-Seidel improves smoothly for every iteration run, it is possible to terminate the solver when time is up, and it is also easy to implement in software. Even though other methods can converge faster in theory, PGS is the only known method which gives satisfactory results given the current techniques. For a thorough discussion and analysis, see [14, p. 318].

4.4.1 Projected Gauss-Seidel

Projected Gauss-Seidel is a modified version of the original method, which solves linear complementarity problems (LCP), or rather mixed linear complementarity problems (MLCP), a generalization of LCP that includes free variables. In the context of this thesis, the MLCP is connected to physical properties of the scene, as explained below.

An LCP is used to solve for discontinuous relationships between two variables, such as those arising during contacts between granules in the simulations done in this thesis. In the following, the LCP is stated; inequalities between vectors are understood to hold


for each element in the vectors. Precisely, u = (u_1, u_2, . . . , u_N)^T ≥ (v_1, v_2, . . . , v_N)^T = v holds if and only if u_i ≥ v_i for 1 ≤ i ≤ N. Given a symmetric matrix A ∈ R^{n×n} and a vector b ∈ R^n, find x ∈ R^n and w ∈ R^n such that

\[ w = Ax - b \geq 0, \tag{4.18} \]
\[ x \geq 0, \tag{4.19} \]
\[ x^T w = 0. \tag{4.20} \]

The idea for reaching a solution is to use the splitting mentioned in (4.9) together with a projection operation. Given a vector x ∈ R^n, a vector of lower limits x_lo ≤ 0 and a vector of upper limits x_hi ≥ 0, where x_lo, x_hi ∈ R^n, the projection operation on x is denoted (x)^+ and means that for each j = 0 to n − 1

\[ x_j^+ = \max(\min(x_j, x_{hi,j}), x_{lo,j}), \tag{4.21} \]

as shown in [8, p. 629].
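The projection (4.21) is a per-element clamp, which the following minimal sketch illustrates; the generic lower and upper limit vectors are assumed, whereas for the contact constraints in this thesis the lower limit is typically zero.

#include <algorithm>
#include <cstddef>
#include <vector>

// Projection (4.21): clamp each element of x into the interval [xlo_j, xhi_j].
void projectOntoBounds(std::vector<double>& x,
                       const std::vector<double>& xlo,
                       const std::vector<double>& xhi)
{
    for (std::size_t j = 0; j < x.size(); ++j) {
        x[j] = std::max(std::min(x[j], xhi[j]), xlo[j]);
    }
}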

4.4.2 Parallel Gauss-Seidel

The Parallel Gauss-Seidel method is an extended version of the standard Gauss-Seidel method with the capability to be run in parallel, thus solving problems faster.

For many mechanical systems, and especially for systems of granules, the Jacobians in G (4.2) have a simple block structure

\[ G^T = \begin{bmatrix} G_1^T & G_2^T & \dots & G_{n_c}^T \end{bmatrix} \tag{4.22} \]

where each block row G_i^T is of size n_i × n, with Σ_i n_i = m, and n_c is the total number of constraints. In every block row G_i^T there are only a few non-zero column blocks – usually in the range [1, 2], as is common for contact constraints. Most commonly, each block column corresponds to a single physical object, which makes the system parallelizable over unrelated objects and constraints. Figure 4.3 gives a descriptive view of this: a bipartite graph containing nodes which consist of either objects or constraints, where groups of nodes not connected through edges are independent of each other. This is what makes it possible to bypass the computation of the Schur complement in (4.17) mentioned above [13].

4.5 Spatial partitioning

When simulating many thousands of granules, as in our simulations, there is a need for improvements beyond identifying independent groups of constrained objects as described above. The following describes how the simulation software divides the simulation volume spatially into boxes, which ensures that constraints in one box only reach neighbouring boxes, establishing a method for parallel execution. Consider the right picture in Figure 4.2, which depicts a spatial partitioning of a volume containing particles, where the main partitioning generates equally sized cubes. The constraints between particles within each box are solved locally by different threads, resulting in the ability of parallel execution. For places where particles share a constraint between boxes, the volume is split up in a more fine-grained way. Doing this creates new boxes, also containing constraints independent of information outside their enclosures. When calculations are carried out, the resulting data from each box not directly adjacent to another can be computed in parallel.


Figure 4.2: Spatial partitioning of a scene to be able to solve the motions of particles in parallel, where each region corresponds to a job for one thread.

Each of the boxes in the picture is eight particle-diameters wide (eight units for a simulation of particles with unit diameter), which allows for optimization of the partitioning algorithm, knowing that only one contact constraint can exist along a straight path between two blue sub-boxes. Having this box size restricts the number of contact constraints which fit in some dimension(s) of the inner sub-boxes, depending on their type, which grants opportunities for optimization. The left picture in Figure 4.2 shows the partitioning in 2D, giving a clearer view than the 3D view to the right. It should be noted that the splitting in 3D differs from 2D in that it contains an additional type of sub-box. Resulting from this splitting is a tree of dependences where calculations from some boxes need to be done before others can be handled. This dependency graph is displayed in Figure 4.4, where calculations from each type of box are done in stages. Relating the partitioning to the software, each box of calculations represents an independent task which is enqueued in the task scheduler, which has the responsibility to hand out work to unoccupied threads.
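To give an idea of the first, coarse step of such a partitioning, the following is a hypothetical sketch of binning particles into cubic boxes; the struct names, the key packing and the omission of the finer edge/corner sub-boxes are all simplifications not taken from the thesis' implementation.

#include <cmath>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Bin each particle into a cubic box of side boxWidth (eight particle
// diameters in the text). Each resulting box later becomes one solver task.
struct Vec3 { double x, y, z; };

std::unordered_map<std::int64_t, std::vector<std::size_t>>
binParticles(const std::vector<Vec3>& positions, double boxWidth)
{
    std::unordered_map<std::int64_t, std::vector<std::size_t>> boxes;
    for (std::size_t i = 0; i < positions.size(); ++i) {
        const std::int64_t ix = static_cast<std::int64_t>(std::floor(positions[i].x / boxWidth));
        const std::int64_t iy = static_cast<std::int64_t>(std::floor(positions[i].y / boxWidth));
        const std::int64_t iz = static_cast<std::int64_t>(std::floor(positions[i].z / boxWidth));
        // Pack the three grid coordinates into one hash key (21 bits each).
        const std::int64_t key = ((ix & 0x1FFFFF) << 42) | ((iy & 0x1FFFFF) << 21) | (iz & 0x1FFFFF);
        boxes[key].push_back(i);
    }
    return boxes;
}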

4.6 Algorithm description

When performing physics simulations, there are a number of tasks which must be performed for every object in contact with another in order to resolve the motion. These tasks are given in Algorithm 1. The variables used in the algorithm are the ones described in Table 4.1, where the values of v_0 and λ_0 are those of v and λ from the previous time step, respectively.

4.7 Hardware: The Intel Xeon Phi Coprocessor

The targeted hardware for the computations is the Intel Xeon Phi coprocessor [4, 5], which is a so-called 'Many Integrated Core' architecture product for large-scale parallel processing.


Figure 4.3: Showing a bipartite graph where objects in U are connected through constraints in V.


Figure 4.4: A tree of dependence for calculations to be done within each box generated by spatial partitioning.


Given b, M, G, Σ̃, h, f_0
Initialize v ← v^(0) + h M^{-1} f_0 and λ ← λ_0.
Compute the diagonal blocks d_ii ← Σ_b G_{kb} M_{bb}^{-1} G_{kb}^T, for k = 1, 2, . . . , n_c.
repeat
    for i = 1, 2, . . . , n_c do
        r ← −b_i + n_i × v_{b1} + Σ̃ λ_i^{(ν)}
        if b2 ≠ 0 then
            r ← r − n_i × v_{b2}
        end
        z ← max(0, −r / d_ii + λ_i^{(ν)})
        Δλ_i^{(ν+1)} ← z − λ_i^{(ν)}
        λ_i^{(ν+1)} ← z
        v_{b1} ← v_{b1} + M_{b1 b1}^{-1} G_{i b1}^T Δλ_i
        if b2 ≠ 0 then
            v_{b2} ← v_{b2} + M_{b2 b2}^{-1} G_{i b2}^T Δλ_i
        end
    end
until time is up

Algorithm 1: Algorithm used when simulating particle interactions. Running this algorithm is equivalent to performing GS iterations on the Schur complement system stated in (4.17).
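As a rough illustration of how Algorithm 1 might look in code, the following is a hypothetical, single-threaded C++ sketch for one-dimensional normal contact constraints; the data layout, the field names and the omission of the regularization term are simplifications and do not reflect the production solver, which works on packed Jacobian blocks and runs one such loop per spatial box.

#include <algorithm>
#include <cstddef>
#include <vector>

struct Body    { double invMass; double v[3]; };
struct Contact { std::size_t b1, b2; double n[3]; double b; double dii; };

// One sweep over all contacts, updating multipliers and body velocities.
void pgsSweep(std::vector<Body>& bodies,
              const std::vector<Contact>& contacts,
              std::vector<double>& lambda)
{
    for (std::size_t i = 0; i < contacts.size(); ++i) {
        const Contact& c = contacts[i];
        Body& A = bodies[c.b1];
        Body& B = bodies[c.b2];

        // Residual r = -b_i + G_i v (relative normal velocity).
        double r = -c.b;
        for (int k = 0; k < 3; ++k)
            r += c.n[k] * (A.v[k] - B.v[k]);

        // Projected update: the candidate multiplier is clamped at zero.
        const double z       = std::max(0.0, lambda[i] - r / c.dii);
        const double dLambda = z - lambda[i];
        lambda[i] = z;

        // Apply the impulse G_i^T * dLambda to both bodies.
        for (int k = 0; k < 3; ++k) {
            A.v[k] += A.invMass * c.n[k] * dLambda;
            B.v[k] -= B.invMass * c.n[k] * dLambda;
        }
    }
}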

The coprocessor comes in the form of a PCI Express (PCIe) add-on card. In Table 4.2, a summary of the main properties of the Intel Xeon Phi coprocessor compared to the properties of the Intel Xeon host processor is presented. This hardware is aimed at achieving high throughput performance where the available space and power are constrained. An advantage of this hardware is that the programming is done very similarly to the standard x86 architecture, while still having the ability of high parallelization. Standard C, C++, and Fortran source code can, without any modifications, be compiled and run on the coprocessor. This is in contrast to other architectures such as various GPUs, where existing implementations need to be totally rewritten because of significant differences in how the programming is done. The coprocessor is also supported by a rich development environment which includes compilers, numerous libraries for tasks such as threading and math computation, and tools for performance tuning and debugging. There is, however, no real IDE available on the coprocessor, which puts constraints on code development.

The coprocessor is suitable for highly parallel applications where the computation to data access ratio is high. Applications can either be run natively on the coprocessor or, by using instructions in the source code, run some parts on the host processor and offload the highly parallelizable parts to the coprocessor. The gain of doing this is faster execution in total, since the clock frequency of the host processor is more than two times higher than that of the coprocessor.

The Xeon Phi coprocessor combines 57 cores on a single chip, where the cores are connected to each other through the bidirectional Interprocessor Network (IPN) ring. Each core has 512-bit registers for Single Instruction, Multiple Data (SIMD) instructions, and each core is bidirectionally connected to the ring, with a 512 kB L2 cache shared by


every group of four hardware threads residing in every core, but the L2 memory as a whole is kept coherent among all cores. This means that the total L2 cache size sums up to almost 30 MB. If threads in different cores perform work on the same data, each core will have its own local copy of the data. The cores have dual in-order execution pipelines, which are simpler than out-of-order pipelines; as a result, threads stall when data is fetched from memory, whereas the out-of-order host cores can execute other instructions while waiting [15].

Intel Xeon processor:         Coprocessor (model: 3120)    Host (model: E5-2670)
Core frequency                1.1 GHz                      2.6 GHz
Number of cores               57                           8
Hardware threads per core     4                            2
L1 cache size                 64 + 64 kB per core          256 + 256 kB
L2 cache size                 31 MB (512 kB × 62)          2 MB
Max. memory bandwidth         240 GB/s                     51 GB/s

Table 4.2: Specification of the Intel Xeon Phi Coprocessor architecture compared to the host architecture [11].


Figure 4.5: Overview of the Intel Xeon Phi architecture.

Figure 4.5 shows a graphical description of the coprocessor architecture and how it is related to the rest of the computer system. To the left is the Intel Xeon host processor, which the coprocessor is connected to through a PCIe bus. A lightweight Linux distribution is installed on the card and is accessed through Secure Shell (SSH) from the host processor. There is also a TCP/IP stack implemented over the PCIe bus, which allows users to access the coprocessor as a network node. To the right, the main parts of the coprocessor can be seen: processing cores, caches, GDDR memory controllers, and a very high bandwidth, bidirectional ring which interconnects the system. The memory controllers and the PCIe client logic provide direct access to the memory and the PCIe bus, respectively, without any intervention from the host processor. Each core possesses its own L2 cache, which is kept coherent by a globally distributed tag directory (TD). The bidirectional ring used for communication actually consists of five independent rings in each direction. The widest and most expensive is the data block ring, which is 64 bytes wide to support the high bandwidth requirements of the many cores. Of the remaining rings,


two are used for sending read/write commands and memory addresses, and the other two for acknowledgement messages. For the cores to process work as fast as possible, it is required to work around the problem of cache delays when new data is fetched. In the coprocessor, this is done by allowing four hardware threads to run simultaneously, which reduces the wait time [6].

4.8 Current implementation

Using the Xeon Phi architecture, parts of an implementation can be run on a host chip with better single-thread performance, while other parts that are more suitable for parallelization can be executed on the coprocessor. The main factors influencing performance are scalability of the algorithms, vectorization of instructions and memory utilization. These properties are of particular interest in the following analysis and discussion regarding the initial implementation and future improvements of the PPGS method used for particle simulations. What follows are explanations of different areas within the context of the PPGS algorithm where proposals for improvements have been stated.

4.8.1 Cache prefetching

There are possibilities to improve performance by applying cache prefetching for certain data in the code. This hides the latencies arising when data is read from memory. Prefetching means that explicit instructions are given to the processor to fetch data from main memory into the cache just before it is needed. Higher performance may result from this because threads are not forced to stall due to memory delays. By default, hardware prefetching is used, meaning that mechanisms in the hardware are responsible for prefetching. This functionality is triggered when the software tries to access data not found in the cache, in which case multiple cache lines are fetched from main memory. Instead of letting the hardware do the prefetching automatically, instructions in the software can be used to base prefetching on knowledge of future memory accesses instead of on cache misses.
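The following is a minimal sketch of what software prefetching can look like, using the generic _mm_prefetch intrinsic; the prefetch distance of 8 elements, the header and even the availability of this particular intrinsic on the coprocessor toolchain are assumptions for illustration only, not a description of the solver's actual prefetch code.

#include <cstddef>
#include <xmmintrin.h>   // _mm_prefetch, _MM_HINT_T0

// While element i is being processed, the cache line holding element
// i + distance is requested, hiding part of the memory latency.
double sumWithPrefetch(const double* data, std::size_t n)
{
    const std::size_t distance = 8;   // illustrative prefetch distance
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + distance < n)
            _mm_prefetch(reinterpret_cast<const char*>(&data[i + distance]), _MM_HINT_T0);
        sum += data[i];
    }
    return sum;
}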

4.8.2 Vectorization

To get good performance out of the Intel Xeon Phi coprocessor it is necessary to use vectorization instructions. This means taking advantage of the 512-bit wide SIMD registers residing in the cores of the coprocessor. Ways of doing this range from using optimized library functions and writing assembly code to calling intrinsic functions that mimic assembly – of which the last method is used in this work. In the code, this corresponds to calling functions which are executed on specialized vector hardware instead of using the usual mathematical operators such as ∗ and +. Input arguments to these functions are 512 bits of aligned data, which corresponds to 8 doubles or 16 floats. Operations are then carried out pairwise on the values of the input, i.e., for the arguments A = {a_1, . . . , a_n} and B = {b_1, . . . , b_n} and the return value C = {c_1, . . . , c_n}, the n operations a_i ∗ b_i = c_i are done simultaneously, where n is 8 for doubles or 16 for floats. Since these operations are executed on specialized hardware, fewer clock cycles are required, which in this case is equal to faster execution. To give the reader understanding and insight into what intrinsics are, a few examples out of many are given below.


Listing 4.1: Multiply instruction
__m512d _mm512_mul_pd(__m512d a, __m512d b);

Listing 4.2: Multiply-add instruction
__m512d _mm512_fmadd_pd(__m512d a, __m512d b, __m512d c);

Listing 4.3: Reduce-add instruction
double _mm512_reduce_add_pd(__m512d a);

The functions in Listings 4.1 to 4.3 operate on packed double-precision vectors taken as arguments. In Listing 4.1, the function multiplies the elements of a and b and returns the result. The function in Listing 4.2 multiplies elements of a and b, adds the elements of c to the product and returns the result. In Listing 4.3, the function sums all the elements of a and returns the result.
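To show how such intrinsics combine in practice, the following is a small sketch of a dot product written with the fused multiply-add and reduce-add intrinsics of Listings 4.2 and 4.3; it assumes 64-byte-aligned arrays, a length that is a multiple of 8, and a compiler that provides _mm512_reduce_add_pd (the Intel compiler does). It is not taken from the thesis' solver code.

#include <cstddef>
#include <immintrin.h>

// Dot product of two 64-byte-aligned double arrays; eight doubles are
// multiplied and accumulated per iteration, n must be a multiple of 8.
double dotProduct512(const double* a, const double* b, std::size_t n)
{
    __m512d acc = _mm512_setzero_pd();
    for (std::size_t i = 0; i < n; i += 8) {
        const __m512d va = _mm512_load_pd(a + i);   // aligned load of 8 doubles
        const __m512d vb = _mm512_load_pd(b + i);
        acc = _mm512_fmadd_pd(va, vb, acc);         // acc += va * vb (Listing 4.2)
    }
    return _mm512_reduce_add_pd(acc);               // horizontal sum (Listing 4.3)
}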


Results

What follows in this chapter are the results of what has been accomplished during the work of the thesis. A discussion about the results, the Gauss-Seidel method and iterative methods follows. For analysis and testing of the modifications of the Gauss-Seidel iteration code, a reference scene consisting of a drum containing 50,000–500,000 particles has been used.

Performance graphs are presented with both time and speedup measurements. For speedup, Equation 5.1 is used, where S is the speedup and T_s and T_p are the times it takes to run the simulation on the coprocessor on a single thread and in parallel, respectively.

\[ S = \frac{T_s}{T_p} \tag{5.1} \]

5.1 Initial implementation

The initial implementation of the granular simulation was already done by Algoryx Simulation.

For later comparison with the improvements made in this thesis, the initial implementation was first run on the coprocessor without any modifications. In Figure 5.1 the initial time measurements can be seen for simulations containing various numbers of particles. As seen, the coprocessor gets saturated at 30 threads for the scene containing 500k particles and at 25 threads for the other scenes. Figure 5.2 shows the speedup graph for the initial simulation.

For all three test scenes, linear speedup is present up to 25 threads, but then it decreases. A speedup graph like this was expected, and similar results are seen in previous work by R. Meyer [16].

Most likely, the reason why no improvement is seen with an increased number of threads is the increased number of cache misses and the excessive amount of simultaneous data transfers.

5.2 Improvements

This section presents an examination of the improvements suggested in Section 4.8, some of which led to great results, while others did not give any significant improvement.


Figure 5.1: Wall clock time graph of the initial implementation of the GS solver run on the Intel Xeon Phi coprocessor.


Figure 5.2: Speedup graph of the initial implementation of the GS solver run on the Intel Xeon Phi coprocessor.


Figure 5.3: Wall clock time graph of the GS solver run on the Intel Xeon Phi Coprocessor, modified to use Intel Threading Building Blocks’ distributed task scheduler.

5.2.1 Task Scheduling

The initial implementation of the granular matter simulation used a centralized task scheduler which, on request from any of the many worker threads, sends additional work to be computed by that thread. During the analysis of the simulation process it was discovered that this approach created a bottleneck, because many threads requested work at a centralized point, causing lock contention. By utilizing the Threading Building Blocks library developed by Intel, the problem could be solved. This was done by promoting all the threads to task schedulers. In this manner, as soon as a thread has finished all its work, a global request for more work is made and any of the other threads having excess work is able to respond with new tasks. As seen in Figures 5.3 and 5.4, this change gave great improvements. The speedup graph in Figure 5.4 shows almost linear speedup up to 40 threads, but then a clear degradation. It is likely that the reason for this is that the maximum throughput between the processing cores is reached.
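As a minimal sketch of handing independent per-box jobs to TBB's work-stealing scheduler, the following uses tbb::parallel_for over an index range; the BoxJob type, the trivial solveBox body and the flat loop structure are illustrative assumptions, whereas the real solver enqueues the dependency stages of Figure 4.4.

#include <cstddef>
#include <vector>
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

// Each entry of 'boxes' stands for one independent spatial box whose
// constraints can be solved in isolation.
struct BoxJob { std::vector<double> lambdas; };

void solveBox(BoxJob& job)
{
    // Placeholder for the per-box PGS iterations.
    for (double& l : job.lambdas) l *= 0.5;
}

// tbb::parallel_for splits the index range into tasks for the work-stealing
// scheduler, so an idle thread steals work from busy ones instead of waiting
// on a central queue.
void solveAllBoxes(std::vector<BoxJob>& boxes)
{
    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, boxes.size()),
                      [&boxes](const tbb::blocked_range<std::size_t>& r) {
                          for (std::size_t i = r.begin(); i != r.end(); ++i)
                              solveBox(boxes[i]);
                      });
}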

5.2.2 Vectorization

In Figures 5.5 and 5.6, measurements of the performance using vectorization instructions are presented. Comparing these results with the previous ones in Figures 5.3 and 5.4 shows almost no improvement. This is explained by the fact that too few of the instructions were vectorized.


Figure 5.4: Speedup graph of the GS solver run on the Intel Xeon Phi Coprocessor, modified to use Intel Threading Building Blocks’ distributed task scheduler.


Figure 5.5: Wall clock time graph of the GS solver run on the Intel Xeon Phi Coprocessor, modified to use Intel Threading Building Blocks’ distributed task scheduler and vectorization instructions.


Figure 5.6: Speedup graph of the GS solver run on the Intel Xeon Phi Coprocessor, modified to use Intel Threading Building Blocks’ distributed task scheduler and vectorization instructions.


Discussion

6.1 Results in relation to previous work

In the results it can be seen that the simulation follows linear speedup up to around 40 threads when simulating 500k particles. Compared to the work in [19] mentioned in Section 1.1, where, on tests using friction coefficients, the speedup deviated from linearity before 5 processors were used, the results here are good. Of course, it is not possible to blindly compare the speedup graphs of the two results like this because of the big differences in the initial conditions. The simulations in [19] used only 1,000 physical objects compared to 5,000,000 in this work, meaning that the amount of computation in relation to communication is a lot smaller with fewer objects.

6.2 Iterative methods in relation to hardware

In previous work [21], it has been shown that the bottleneck occurring when solving non-trivially parallel tasks on the Xeon Phi architecture is caused by the constraints on the memory bandwidth. The problem occurs because a large amount of data must be transferred between the cores and the storage space, while the operations performed on the data take little time in comparison. This is especially true when running PGS, where solutions of tasks continuously must be passed from one thread to other threads performing related tasks.

When designing solutions to problems, great care must be taken to balance the processing to data access ratio, to have data set up in a Structure of Arrays (SoA) pattern, and preferably to have data accesses aligned; all of these concepts should be fulfilled to get as high performance as possible. Further, the data access bottleneck implies that the system could afford a more computationally intensive solver given the same data access rate.

6.3 Convergence of the projected Gauss-Seidel method

A modification of the GS method that increases the computational work and lessens the communication between threads would fit the Intel Xeon Phi coprocessor hardware better.

More research has to be done, but in the best case this may also improve the convergence rate; currently, that is just speculation. Figure 6.1 shows the convergence rate of the currently used GS method, measured for the scene containing 500k particles. The x- and y-axes indicate the number of iterations and the residual of the contact forces, respectively.


Figure 6.1: Convergence rate for the Gauss-Seidel method (500k particles): normal, tangent U, and tangent V residuals plotted against the number of iterations.

It is clear that a stagnation of the convergence rate has occurred. If ways are found to increase the rate of convergence, more accurate solutions for the simulations can be calculated within the given time constraints.

6.4 Examined methods not giving any improvements

Presented here are investigated ways to improve the simulation software that either did not give any significant improvement or gave only sporadic performance increases without any clear explanation.

– Cache prefetching

Some simulation tests were made where cache prefetching was applied to the particle properties (velocity, mass) and to the constraints (Jacobians). Tests were made where all of the mentioned values were prefetched and also where only some of them were. Speed increases of up to 5% were seen in some tests, but no consistency could be derived because a combination of prefetches caused speedup in some cases but not in others, which is why no results of this are presented.

– Huge Page Size

According to [2], performance may be increased by using huge page sizes for memory allocations on the coprocessor. When an operating system allocates memory, a page is the smallest unit of data which may be allocated. The page size is usually determined by the architecture. For the x86-64 architectures, the standard page size is 4 KB and a huge page size is 2 MB. By using huge page sizes, performance may be better since variables and buffers are sometimes handled more efficiently. Some simulation tests have been run to determine whether using huge page sizes gives any improvement, but no difference in performance was observed.
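For readers unfamiliar with how huge pages can be requested, the following is a minimal sketch of an explicit 2 MB huge-page allocation on Linux through mmap with MAP_HUGETLB; this mechanism, and the assumption that the huge-page pool is configured, are illustrative only and may differ from how the tests referred to above were actually set up (e.g. via [2]'s recommended approach or a preloaded allocator).

#include <cstddef>
#include <cstdio>
#include <sys/mman.h>

// Request a buffer backed by 2 MB huge pages; 'bytes' should be a multiple
// of the huge page size. Falls back to normal pages if the request fails.
void* allocateHugePageBuffer(std::size_t bytes)
{
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        std::perror("mmap with MAP_HUGETLB failed, falling back to normal pages");
        p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    }
    return (p == MAP_FAILED) ? nullptr : p;
}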


Conclusions

This project has required a lot of work and effort, without reaching as far as was planned in the beginning. For applications to run natively on the coprocessor they have to be compiled from source, including all their dependencies. In the beginning, just getting the simulation software to run on the coprocessor was time consuming and took away much of the time which could instead have been spent making actual progress and getting results. In the end, performance improvements to the application have been made by optimizing the code, both at the operation level and higher up, algorithmically. However, there was no time to investigate the algorithms further and implement new methods which would have suited the Xeon Phi architecture better.

7.0.1 Building for the Intel Xeon Phi Architecture

In regard to the amount of work which has been spent in the process of compiling software for the coprocessor, it deserves its own section. There has been a lot of difficulty compiling the applications used in the work of this thesis. When building applications for common architectures such as x86-64, there is usually no problem, as a build script automatically solves most of the issues which may arise. In contrast, building for this new architecture has taken more time and effort than could have been accounted for. The reasons for these troubles have mainly been non-existent documentation, lack of knowledge of the inner workings of build systems, and no previous experience building and troubleshooting systems such as the AGX multiphysics application.

7.1 Future work

The Intel Xeon Phi coprocessor (code-named Knights Corner) used in the work of this thesis is the first version of the Intel Many Integrated Core architecture (Intel MIC). This thesis has shown what to expect performance-wise from such hardware, and there will be others akin to it coming in the future. Later this year, 2014, the second generation (code-named Knights Landing) will be released. The technical specification of Knights Landing is to date not publicly known, but regardless there is a sequel and only the future will tell its possibilities. At the time of writing, the 'Top 500' list (which ranks the 500 most powerful supercomputers in the world) contains 17 systems running the Xeon Phi, including the most powerful computer. This indicates that the Xeon Phi architecture really stands at the top in performance.


Going into the computation of the collision constraints, more research needs to be done to find improvements. There may be other ways to split the scene into tasks than the method currently in use, lowering the amount of data that needs to be communicated between threads. Regarding the Gauss-Seidel iterations, modifications may be made to the method to better conform to the hardware, i.e., to perform more computations without an increased requirement for data communication. These ideas are just speculations at this time, yet they are areas of research to be looked into. For any kind of computational problem there will always be a challenge to make the problem fit the hardware in terms of data communication and processing, which indeed changes with new generations of computers.


Acknowledgements

Thanks for the opportunity to do this work, and thank you for your great knowledge and the time spent teaching me.


References

[1] Erin Catto. Iterative Dynamics with Temporal Coherence. 5th June 2005. URL: https://box2d.googlecode.com/files/GDC2005_ErinCatto.zip (visited on 27/04/2014).
[2] Intel Corporation. How to use Huge Pages to improve application performance on Intel Xeon Phi coprocessor. URL: https://software.intel.com/sites/default/files/Large_pages_mic_0.pdf (visited on 09/08/2014).
[3] Intel Corporation. Intel Threading Building Blocks. URL: https://software.intel.com/en-us/intel-tbb (visited on 29/05/2014).
[4] Intel Corporation. Intel Xeon Phi Coprocessor. URL: https://software.intel.com/en-us/mic-developel (visited on 17/08/2014).
[5] Intel Corporation. Intel Xeon Phi Coprocessor: System Software Developers Guide. URL: http://www.intel.com/content/www/us/en/processors/xeon/xeon-phi-coprocessor-system-software-developers-guide.html (visited on 17/08/2014).
[6] Intel Corporation. Intel Xeon Phi Coprocessor, the Architecture. URL: https://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-codename-knights-corner (visited on 03/08/2014).
[7] James W. Demmel. Applied Numerical Linear Algebra. SIAM, 1997. ISBN: 978-0898713893.
[8] Erleben et al. Physics-Based Animation. Charles River Media, Inc., 2005.
[9] Gene H. Golub and Charles F. Van Loan. Matrix Computations (3rd ed.). Baltimore, MD, USA: Johns Hopkins University Press, 1996. ISBN: 0-8018-5414-8.
[10] Intel. Intel Xeon Phi Coprocessor 3120P. URL: http://ark.intel.com/products/75798/Intel-Xeon-Phi-Coprocessor-3120P-6GB-1_100-GHz-57-core (visited on 02/04/2014).
[11] Intel. Intel Xeon Phi product family performance. 30th Dec. 2013. URL: http://www.intel.com/content/dam/www/public/us/en/documents/performance-briefs/xeon-phi-product-family-performance-brief.pdf (visited on 29/03/2014).
[12] Bo Kågström et al., eds. Applied Parallel Computing. State of the Art in Scientific Computing. 8th International Workshop, PARA 2006, Umeå, Sweden, June 18–21, 2006, Revised Selected Papers. Vol. 4699. Lecture Notes in Computer Science. Springer, 2007, pp. 956–965. ISBN: 978-3-540-75754-2.
[13] Claude Lacoursière. A Parallel Block Iterative Method for Interactive Contacting Rigid Multibody Simulations on Multicore PCs. Nov. 2013. URL: http://www.tp.umu.se/~nylen/PARA06/springer/46990956.pdf (visited on 27/11/2013).
[14] Claude Lacoursière. 'Ghosts and machines: regularized variational methods for interactive simulations of multibodies with dry frictional contacts'. PhD thesis. Umeå University, Computing Science, 2007, p. 444.
[15] David Mackay. Optimization and Performance Tuning for Intel Xeon Phi Coprocessors. Nov. 2012. URL: http://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-1-optimization (visited on 09/12/2013).
[16] R. Meyer. 'Efficient parallelization of short-range molecular dynamics simulations on many-core systems'. In: Phys. Rev. E 88.5 (Nov. 2013), p. 053309. DOI: 10.1103/PhysRevE.88.053309. URL: http://link.aps.org/doi/10.1103/PhysRevE.88.053309.
[17] C. Lacoursière, M. Servin and A. Backman, eds. Fast and stable simulation of granular matter and machines. DEM5, The Fifth International Conference on Discrete Element Methods. 25th–26th Aug. 2010.
[18] Mathieu Renouf and Pierre Alart. 'Conjugate gradient type algorithms for frictional multi-contact problems: applications to granular materials'. In: Computer Methods in Applied Mechanics and Engineering 194.18–20 (2005), pp. 2019–2041. ISSN: 0045-7825. DOI: 10.1016/j.cma.2004.07.009. URL: http://www.sciencedirect.com/science/article/pii/S0045782504003652.
[19] Mathieu Renouf, Frédéric Dubois and Pierre Alart. 'A parallel version of the non smooth contact dynamics algorithm applied to the simulation of granular media'. In: Journal of Computational and Applied Mathematics 168.1–2 (2004). Selected Papers from the Second International Conference on Advanced Computational Methods in Engineering (ACOMEN 2002), pp. 375–382. ISSN: 0377-0427. DOI: 10.1016/j.cam.2003.05.019. URL: http://www.sciencedirect.com/science/article/pii/S0377042703010021.
[20] Patrick Richard. Slow relaxation and compaction of granular systems. Oct. 2013. URL: http://perso.univ-rennes1.fr/patrick.richard/nmat_compaction.pdf (visited on 07/10/2013).
[21] C. Rosales. Porting to the Intel Xeon Phi: Opportunities and Challenges. URL: http://data1.gfdl.noaa.gov/multi-core/CRosales.pdf (visited on 13/08/2014).
[22] Solving sparse integer linear systems. ISSAC '06. Genova, Italy, July 2006, pp. 63–70.
[23] Vincent Visseq, Pierre Alart and David Dureisseix. 'High performance computing of discrete nonsmooth contact dynamics with domain decomposition'. In: International Journal for Numerical Methods in Engineering 96.9 (2013), pp. 584–598. ISSN: 1097-0207. DOI: 10.1002/nme.4578. URL: http://dx.doi.org/10.1002/nme.4578.
