Parallelization of dynamic algorithms for electronic structure calculations

ACTA UNIVERSITATIS UPSALIENSIS
UPPSALA

Digital Comprehensive Summaries of Uppsala Dissertations
from the Faculty of Science and Technology 1992

Parallelization of dynamic algorithms for electronic structure calculations

ANTON G. ARTEMOV

ISSN 1651-6214 ISBN 978-91-513-1080-0


Dissertation presented at Uppsala University to be publicly examined in 2446 ITC, Lägerhyddsvägen 2, Uppsala, Thursday, 21 January 2021 at 13:15 for the degree of Doctor of Philosophy. The examination will be conducted in English. Faculty examiner: Professor Bo Kågström (Umeå University, Department of Computing Science).

Abstract

Artemov, A. G. 2021. Parallelization of dynamic algorithms for electronic structure calculations. Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology 1992. 57 pp. Uppsala: Acta Universitatis Upsaliensis. ISBN 978-91-513-1080-0.

The aim of electronic structure calculations is to simulate the behavior of complex materials by resolving the interactions between electrons and nuclei in atoms at the level of quantum mechanics. Progress in the field makes it possible to reduce the computational complexity of the solution methods to linear, so that the computational time scales proportionally to the size of the physical system. To solve large-scale problems one uses parallel computers and scalable codes. Often the scalability is limited by the data distribution.

This thesis focuses on a number of problems arising in electronic structure calculations, such as inverse factorization of Hermitian positive definite matrices, approximate sparse matrix multiplication, and density matrix purification methods. No assumptions are made about the data distribution; instead, it is explored dynamically.

The thesis consists of an introduction and five papers. In particular, in Paper I we present a new theoretical framework for localized matrices with exponential decay of elements. We describe a new localized method for inverse factorization of Hermitian positive definite matrices. We show that it has reduced communication costs compared to other widely used parallel methods. In Paper II we present a parallel implementation of the method within the Chunks and Tasks programming model and perform a scalability analysis based on critical path length estimation.

We focus on the density matrix purification technique and its core operation, sparse matrix-matrix multiplication, in Papers III and IV. We analyze the sparse approximate matrix multiplication algorithm with the proposed localization framework, add a prior truncation step, and derive the asymptotic behavior of the Frobenius norm of the error. We employ the sparse approximate multiplication algorithm in the density matrix purification process and propose a method to control the error norm by choosing the right truncation threshold value.

We present a new version of the Chunks and Tasks matrix library in Paper V. The library functionality and architecture are described and discussed. The efficiency of the library is demonstrated in a few computational experiments.

Keywords: parallelization, task-based programming, matrix algorithms, sparse matrices, inverse factorization, localized computations, density matrix methods, electronic structure calculations

Anton G. Artemov, Department of Information Technology, Division of Scientific Computing, Box 337, Uppsala University, SE-751 05 Uppsala, Sweden.

© Anton G. Artemov 2021 ISSN 1651-6214

ISBN 978-91-513-1080-0


Dedicated to all hard-working doctoral students at Uppsala University


List of papers

This thesis is based on the following papers, which are referred to in the text by their Roman numerals.

I Emanuel H. Rubensson, Anton G. Artemov, Anastasia Kruchinina, and Elias Rudberg. "Localized inverse factorization." IMA Journal of Numerical Analysis (2020).

II Anton G. Artemov, Elias Rudberg, and Emanuel H. Rubensson. "Parallelization and scalability analysis of inverse factorization using the chunks and tasks programming model." Parallel Computing 89 (2019): 102548.

III Anton G. Artemov. "Approximate multiplication of nearly sparse matrices with decay in a fully recursive distributed task-based parallel framework." Submitted manuscript.

IV Anton G. Artemov and Emanuel H. Rubensson. "Sparse approximate matrix-matrix multiplication for density matrix purification with error control". Submitted manuscript.

V Emanuel H. Rubensson, Elias Rudberg, Anastasia Kruchinina and Anton G. Artemov. "The Chunks and Tasks matrix library 2.0". Preprint.


Contents

Part I: Introductory chapters . . . 9

1 Introduction . . . 11
1.1 Thesis outline . . . 12

2 Electronic structure calculations in brief . . . 13
2.1 The Schrödinger equation . . . 13
2.2 The Slater determinant . . . 14
2.3 Basis functions . . . 14
2.4 Computational approaches . . . 15
2.4.1 The restricted Hartree–Fock method . . . 15
2.4.2 Kohn–Sham Density Functional Theory . . . 16
2.5 The self-consistent field procedure . . . 16
2.6 A self-consistent field program . . . 17

3 Inverse factorization problem . . . 19
3.1 Congruence transformation . . . 19
3.2 Choice of inverse factor . . . 20
3.3 Inverse Cholesky factorization . . . 20
3.4 Iterative refinement with a scaled identity . . . 20
3.5 Localized inverse factorization . . . 21

4 Density matrix methods . . . 23
4.1 Problem setting . . . 23
4.2 Overview of methods . . . 24
4.2.1 Polynomial expansions . . . 24
4.2.2 Energy minimization . . . 25
4.3 Purification schemes . . . 25
4.4 Error control . . . 26

5 Properties and representation of matrices . . . 27
5.1 Decay of elements . . . 27
5.2 Hierarchical representation . . . 29
5.3 Leaf level matrix representation . . . 30
5.3.1 Possible formats . . . 30

6 Sparse approximate multiplication . . . 33
6.1 The SpAMM algorithm . . . 33
6.2 Error analysis . . . 34
6.3 Behavior in a distributed setting . . . 35
6.4 Error control . . . 35

7 Parallelization aspects . . . 36
7.1 Parallel programming models . . . 36
7.2 Some issues of parallel programming . . . 38
7.3 The Chunks and Tasks programming model . . . 40
7.4 Dealing with parallelism in a task-based setting . . . 41
7.5 A note on performance . . . 42

8 Software development aspects . . . 44
8.1 How to write code in a good way . . . 44
8.2 Hardware . . . 45
8.3 Notes on 3rd party software . . . 46

9 Sammanfattning på svenska . . . 48

Comments on the author's contributions to the papers . . . 50

Acknowledgements . . . 51

Part I: Introductory chapters


1. Introduction

In recent decades, thanks to advances in computer hardware, electronic structure calculations have played a significant role in different fields of science, such as materials science, chemistry and biology. What one aims to achieve with this computational machinery is to simulate complex structures of tissues or materials with millions of atoms at the quantum mechanics level.

The problem is that the governing equations are too complex to be solved analytically, and therefore approximations are needed. The two main kinds are computational approximations, which simplify the solution procedure and reduce the amount of computation, and approximations of the model, which give simplified equations valid under certain assumptions. The Hartree–Fock method [30], hereinafter referred to as HF, and Kohn–Sham density functional theory [45], hereinafter referred to as KS-DFT, are examples of model approximations.

From a computational point of view, HF and KS-DFT are similar. Traditionally used numerical methods for solving the Hartree–Fock or Kohn–Sham equations have cubic complexity. Serious progress has been made in this field over the last 30 years, and in many cases it is possible to reduce the complexity of the methods to linear, which means that the solution time grows proportionally to the size of the considered physical system [11, 35, 79].

The reduction of the complexity makes it possible to apply the methods to relatively large systems, namely a few thousand atoms, perhaps tens of thousands, which might be enough for some real problems. To solve large-scale problems, i.e. problems too large to be accommodated on a conventional computer, one needs scalable computational codes. Often codes are limited to shared memory architectures [80, 88], and the transition to distributed memory machines can be hindered by increased complexity and development time. In most cases the transition is done using the Message Passing Interface (MPI) [28, 34, 91], but not necessarily [87]. With MPI, which provides a large set of basic routines, it is relatively easy to end up with a deadlock or a race condition in a complicated code, whereas other tools can minimize or completely eliminate such problems and simplify the transition.

1 The material of the introductory chapters is partially based on the author's licentiate thesis: Anton G. Artemov. "Inverse factorization in electronic structure theory: Analysis and parallelization." Uppsala University, 2019, http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-381333


This thesis focuses on the most important matrix operations of large scale electronic structure calculations. In particular, we consider the problem of computing an inverse factor of a Hermitian positive-definite matrix, which is an essential step in both HF and KS-DFT. We propose a new localization framework and a localized inverse factorization method. The term "localized" means that only small parts of matrices are used in computations. We compare the proposed algorithm with a few popular alternatives from a parallel perspective and demonstrate that localization properties lead to a reduction of the amount of data sent between the nodes of the computational system.

We also consider another crucial step in HF and KS-DFT, namely the computation of the density matrix with the purification technique [61]. We analyze the sparse approximate multiplication algorithm [10] with the proposed localization framework and make use of this algorithm in the purification process. Moreover, we propose an approach to control the error in the sparse approximate multiplication.

All methods we consider are based on multiplication of sparse matrices, which is an operation that is challenging to parallelize for computers with distributed memory. Implementations often rely on a static distribution of data, which implies a static distribution of work [13, 20]. We do not make assumptions about the distribution of data or work and explore it dynamically. In other words, computations happen only where needed: if there is something to compute, it is computed.

In order to efficiently utilize available supercomputer resources, which are now counted in millions of cores, petaflops of performance and exabytes of memory, one also has to choose the right algorithm for the problem. As we show in this thesis, using a well-known and popular algorithm in a distributed parallel environment might not be as efficient as one would expect.

1.1 Thesis outline

The thesis is organized as follows. Chapter 2 gives an introduction to electronic structure calculations and the Hartree–Fock method. Chapter 3 focuses on the problem of inverse factorization. Chapter 4 gives an overview of density matrix methods. Chapter 5 describes the hierarchical representation of matrices used in the thesis and their localization properties. Chapter 6 concerns the approximate multiplication of sparse matrices with decay and the error control in the approximate multiplication. Chapter 7 gives an overview of parallel programming issues and models, in particular the Chunks and Tasks model. Chapter 8 contains notes on the software development process, which was a significant part of the author's work during the PhD.


2. Electronic structure calculations in brief

The material covered in this chapter is a summary of the main concepts in electronic structure calculations. A detailed overview can be found, for instance, in [77] and [84].

2.1 The Schrödinger equation

The underlying physical system consists of N identical charged particles which can move (electrons) and M point charges (nuclei) which have fixed positions. The task is to find the positions of the charged particles such that the total energy of the system is minimized. This is modeled by the N-electron Schrödinger equation:

Ĥψ = Eψ,   (2.1)

where Ĥ is the Hamiltonian operator, ψ is a wave function and E is an eigenvalue. As one can see, this equation is an eigenvalue problem. The eigenfunction ψ corresponding to the lowest energy E is the objective. The Hamiltonian operator is defined as

Ĥ = −(1/2) Σ_{i=1}^{N} ∇_i² − Σ_{i=1}^{N} Σ_{nuc=1}^{M} Z_nuc / r_{i,nuc} + Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} 1 / r_{i,j},   (2.2)

where ∇_i² is the Laplace operator, Z_nuc is the charge of nucleus nuc, r_{i,nuc} is the distance between the i-th electron and the nuc-th nucleus, and r_{i,j} is the distance between the i-th and the j-th electrons.

The wave function ψ = ψ(x_1, x_2, . . . , x_N) is a function of the electron coordinates x_i, where x_i = (r_i, ω_i), r_i = (x_i, y_i, z_i) are the electron spatial positions and ω_i are their spins. The wave function does not have any physical meaning itself, but its square is used to express the probability of finding electrons within some given volume, which leads to the electron charge density, expressed as

ρ(r) = N ∫ · · · ∫ |ψ(x_1, x_2, . . . , x_N)|² dx_2 · · · dx_N.   (2.3)

Formally, the electron charge density determines the properties of the underlying physical system.

According to the Pauli exclusion principle, the wave function ψ must be antisymmetric with respect to exchange of any two electrons, i.e.

ψ(x1, x2, . . . , xi, . . . , xj, . . . , xN) = −ψ(x1, x2, . . . , xj, . . . , xi, . . . , xN). (2.4)

2.2 The Slater determinant

One way to construct a wave function which satisfies condition (2.4) is to build it from one-electron functions. For instance, if only two electrons are present, then the wave function can be constructed as

ψ(x_1, x_2) = φ_1(x_1) φ_2(x_2) − φ_1(x_2) φ_2(x_1) =
    | φ_1(x_1)  φ_2(x_1) |
    | φ_1(x_2)  φ_2(x_2) | ,   (2.5)

where φ_i(x) is a one-electron function of its coordinates and spin. As one can see, ψ(x_1, x_2) = −ψ(x_2, x_1). This approach can be extended to any number of electrons as follows:

ψ(x_1, x_2, . . . , x_N) =
    | φ_1(x_1)   φ_2(x_1)   . . .   φ_N(x_1) |
    | φ_1(x_2)   φ_2(x_2)   . . .   φ_N(x_2) |
    |    ...        ...     . . .      ...   |
    | φ_1(x_N)   φ_2(x_N)   . . .   φ_N(x_N) | ,   (2.6)

which is known as the Slater determinant [83]. Any linear combination of Slater determinants for any choice of one-electron functions φ_i(x) also satisfies condition (2.4).

2.3 Basis functions

The one-electron functions used in (2.6) are usually written as a product of spatial and spin parts:

φ_i(x) = ϕ_i(r) γ_i(ω),   (2.7)

where the spin part γ_i(ω) is described by two functions α(ω) and β(ω). The spatial part, also known as an orbital, is represented as a linear combination of functions b_j(r):

ϕ_i(r) = Σ_{j=1}^{n} c_{i,j} b_j(r),   (2.8)

where the set of functions {b_j(r)} forms a basis. In quantum chemistry, a common choice is a Gaussian type linear combination of atomic orbitals (GT-LCAO) centered at the nuclei positions. The exponential character of the Gaussian functions leads to an important localization property: at a certain distance from the center, the function becomes negligibly small. In practice one can assume that each function is zero outside a sphere of a certain radius centered at the corresponding nucleus position.

2.4 Computational approaches

2.4.1 The restricted Hartree–Fock method

In the Hartree–Fock method one tries to solve the Schrödinger equation (2.1) by varying the one-electron functions φ_i(x) in a single Slater determinant (2.6).

The approximation of the wave function in the Hartree–Fock method can be treated in various ways. If one assumes that the electrons are paired, then one uses the restricted Hartree–Fock method, whereas the unrestricted method is used when the electrons are not paired. We limit ourselves to a brief description of the restricted version only. A detailed description of both methods can be found, for instance, in [84].

The one-electron function φ_i(x) characterizes how a pair of electrons with different spins moves, and this function depends only on the position vector r. Assuming that the system consists of N electrons and that n basis functions are used, one handles three important matrices: the basis set overlap matrix S, the density matrix D and the Fock matrix F.

The overlap matrix S is constructed as

S_{i,j} = ∫ b_i(r) b_j(r) dr.   (2.9)

The density matrix D is related to the coefficients c_{i,j} from equation (2.8):

D_{i,j} = 2 Σ_{k=1}^{N_occ} c_{k,i} c_{k,j},   (2.10)

where c_{i,j} is the j-th coefficient of the decomposition of the spatial part of φ_i(x) in the basis set. There are N electrons, but only the N_occ = N/2 orbitals with the lowest energy are occupied by electron pairs and take part in assembling the Slater determinant (2.6). The total electron density ρ(r) can then be written as

ρ(r) = Σ_{i,j} D_{i,j} b_i(r) b_j(r).   (2.11)
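To make equation (2.10) concrete, the following is a minimal dense sketch (our own illustration, not code from the thesis or its libraries) that assembles D from the coefficients of the N_occ occupied orbitals; the function name and the nested-vector storage are assumptions made for brevity.

#include <cstddef>
#include <vector>

// Builds D_{i,j} = 2 * sum_{k=1}^{Nocc} c_{k,i} c_{k,j} as in equation (2.10).
// c[k][i] holds the coefficient of basis function i in occupied orbital k.
std::vector<std::vector<double>>
density_matrix(const std::vector<std::vector<double>>& c, std::size_t n_occ) {
    const std::size_t n = c.empty() ? 0 : c[0].size(); // number of basis functions
    std::vector<std::vector<double>> D(n, std::vector<double>(n, 0.0));
    for (std::size_t k = 0; k < n_occ; ++k)
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j)
                D[i][j] += 2.0 * c[k][i] * c[k][j];
    return D;
}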

The third ingredient, the Fock matrix F, is the sum of three terms,

F = T + V + G,   (2.12)

where T is the kinetic energy term, V is the electron-nuclear attraction term and G is the two-electron part. For simplicity, we omit the formulas for these terms. The terms T and V depend only on the basis set and on the positions of the nuclei, but G depends on the density matrix D. Each of the terms T, V, G corresponds to one of the sums in the expression for the Hamiltonian operator (2.2). T and V together form the one-electron Hamiltonian H_1, i.e. H_1 = T + V.

The total energy of the system can be written as

E = Tr(D H_1) + (1/2) Tr(D F).   (2.13)

As one can see, the electron density and the total energy both depend on the density matrix D, and the aim is to find the minimal energy and the corresponding electron density. There are infinitely many candidates, and one has to use some iterative scheme which finds the minimizing matrix and takes into account the coupling between the Fock and the density matrices. This procedure is referred to as the self-consistent field procedure.

2.4.2 Kohn–Sham Density Functional Theory

Kohn–Sham DFT [45] rests on different theoretical grounds and shows that knowledge of the electron density is sufficient to describe a system of N electrons. Similarly to the Hartree–Fock method, it assumes the wave function in the form of a single Slater determinant. KS-DFT requires computing a so-called exchange-correlation matrix in every iteration, which is an extra ingredient compared to HF. Otherwise, the methods are very similar from a computational point of view.

2.5 The self-consistent field procedure

In quantum mechanics a self-consistent field is an effective field created by the particles of a system, which is used to describe the interactions between the particles approximately. Instead of considering many particle-particle interactions, one treats every single particle as if it were affected only by the effective field.

The field depends on the state of the system, which in turn is determined by the field. This means that the field should be consistent with the solutions of the equations that depend on the field, and so on; hence the term "self-consistency".

There are two main steps, which are repeated until the density matrix D no longer changes:

1. Construction of the Fock matrix F for a given density matrix D, often referred to as (D → F). This step uses equation (2.12) to build the Fock matrix from the density matrix.


2. Construction of a new density matrix D for the computed Fock matrix F, referred to as (F → D). This step requires solving the generalized eigenvalue problem

F C = S C Λ,   (2.14)

and once the matrix of eigenvectors C is known, it can be used in equation (2.10). This approach is expensive, namely it has O(N³) complexity, and cannot be used if we aim at overall linear complexity of the computations. Another way to compute the density matrix D for a given Fock matrix F is to use a different definition of the density matrix,

D = θ(µI − F),   (2.15)

where θ(x) is the Heaviside step function, and to approximate the latter using polynomials. The two most popular approaches are the recursive application of low-order polynomials, also known as density matrix purification [61], and approximation using Chebyshev polynomials [36, 59]. In Chapter 4 we focus on the density matrix methods.

The convergence of the self-consistent field procedure depends on the properties of the system, in particular on the gap between the eigenvalue corresponding to the highest occupied molecular orbital (HOMO) and the eigenvalue corresponding to the lowest unoccupied molecular orbital (LUMO). If this gap is vanishingly small, then the convergence of the procedure is slow, or convergence may not be reached at all [77].

There are techniques to accelerate the convergence of the self-consistent field procedure [32]. Some of the most common methods construct the new Fock matrix as a linear combination of Fock matrices from previous steps. Examples of methods which utilize this strategy are the so-called Direct Inversion in the Iterative Subspace (DIIS) [67] and energy-DIIS [50] methods.

2.6 A self-consistent field program

The main steps of the solution of problem (2.1) with the Hartree–Fock method are combined in a computer program as shown in Algorithm 1. The pseudo-code is adapted from [76].

Lines 3 and 12 of the pseudo-code are written in bold. The following chapters of the thesis focus on these two steps of the computational procedure.


Algorithm 1 Hartree–Fock self-consistent field program

1: Read molecule and basis set information from input files.
2: Compute overlap matrix S.
3: Compute inverse factor Z: Z∗SZ = I.
4: Compute one-electron Hamiltonian matrix H_1 = T + V.
5: Generate starting guess density matrix D.
6: for i = 1, 2, . . . do
7:   Compute two-electron part G = G(D).
8:   Compute new Fock matrix F = H_1 + G.
9:   Compute energy E = Tr(D H_1) + (1/2) Tr(D F).
10:  Update F as a linear combination of the new and previous Fock matrices.
11:  Do the congruence transformation F̂ = Z∗FZ.
12:  Compute new density matrix D̂ from F̂.
13:  Compute D = Z D̂ Z∗.
14: end for


3. Inverse factorization problem

An important part of the self-consistent field procedure for the Hartree–Fock method, as discussed in section 2.5, is the computation of the density matrix D for a given Fock matrix F. If the basis set is orthonormal, this is accomplished by solving the eigenvalue problem

F C = C Λ  ⇒  D = CC∗.   (3.1)

In most cases the basis set is not orthonormal, and the corresponding eigenvalue problem becomes generalized,

F C = S C Λ,   (3.2)

where S is the symmetric positive-definite basis set overlap matrix defined in (2.9). Derivations of this equation, also known as the Roothaan–Hall equation, for the non-orthonormal case are given in [68, 40].

3.1 Congruence transformation

The generalized eigenvalue problem (3.2) can be transformed to standard form,

F C = S C Λ  ⇒  F̂ Ĉ = Ĉ Λ,   (3.3)

by a congruence transformation

F̂ = Z∗ F Z,   (3.4)

where Z is such that Z Z∗ = S⁻¹, i.e. some inverse factor of S, C = Z Ĉ, and Ĉ is the matrix of orthonormal eigenvectors.

The transformation (3.4) is applied in every iteration of the self-consistent field procedure (see line 11 of Algorithm 1). As mentioned above, in most cases the basis set is not orthonormal. The transformation changes it to an orthonormal one, where one can use established density matrix methods with controlled forward error [74] and precise stopping criteria for the density matrix purification procedure [47]. The inverse factor Z is computed once, before the start of the main loop of the program (line 3 of Algorithm 1). The computation of the inverse factor might not seem to be an important part of the machinery, because it usually occupies a minor part of the total computational time. However, as demonstrated in this thesis, for sufficiently large systems some popular algorithms may become a bottleneck in a distributed setting.


3.2 Choice of inverse factor

The inverse factor is not unique. In principle, any factor Z such that S⁻¹ = ZZ∗ can be used. Usually, in related studies, the inverse square root (Löwdin) factor [43, 54] or the inverse Cholesky factor [16] is used. These factors possess an important feature: their matrix elements decay with increasing atomic separation [8]. However, classical methods to compute those factors exhibit cubic computational complexity.

In this work we aim not only to compute an inverse factor with decaying elements, but also to do so in time directly proportional to the number of basis functions, and to obtain an inverse factor sparse enough for computations to be performed efficiently. One would also like to be able to compute the inverse factor in parallel, and the amount of available parallelism plays an important role when choosing the algorithm.

3.3 Inverse Cholesky factorization

The Cholesky factorization is a representation of a Hermitian positive-definite matrix A as the product of a lower triangular matrix L with its conjugate transpose, i.e. A = LL∗. The inverse Cholesky factorization is a representation of the inverse of a Hermitian positive-definite matrix A in a similar form, i.e. A⁻¹ = ZZ∗. In this case, Z is called the inverse Cholesky factor of A.

The inverse Cholesky factor computed with the AINV algorithm [6] is used in electronic structure calculations [16, 58]. There are several variants of the algorithm, including, for instance, a recursive [72] and a blocked [7] one.

For small and medium-size problems, i.e. problems which fit on a conventional computer, the AINV algorithm is fast, but, as shown in Paper II, it has very limited parallelism and thus cannot be used efficiently in large scale calculations in a distributed environment.

We include the recursive implementation of the inverse Cholesky factorization algorithm [72], parallelized with the Chunks and Tasks programming model (see section 7.3), in the Chunks and Tasks matrix library, see Paper V.

3.4 Iterative refinement with a scaled identity

Iterative refinement with a scaled identity matrix as the starting guess is another approach which has been popular over the past decades. It computes the inverse square root of the matrix, i.e. the Hermitian positive-definite matrix Z such that Z² = A⁻¹, using the procedure of iterative refinement [62].

The iterative refinement procedure takes an initial guess Z_0 such that the factorization error δ_0 = I − Z_0∗ S Z_0 satisfies ‖δ_0‖_2 < 1, and constructs a sequence of matrices Z_i such that δ_i = I − Z_i∗ S Z_i → 0 as i → ∞. The procedure can be derived from the Newton–Schulz iteration for the matrix sign function [42].
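As an illustration of the procedure, here is a minimal dense sketch of first-order iterative refinement, assuming the update Z_{k+1} = Z_k(I + δ_k/2) with δ_k = I − Z_kᵀ S Z_k for a real symmetric S (the lowest-order member of the family derived from the Newton–Schulz sign iteration). The function names and the flat row-major storage are ours; this is not the localized or distributed implementation of Papers I and II.

#include <cmath>
#include <cstddef>
#include <vector>

using Mat = std::vector<double>; // n-by-n matrix, row-major

static Mat matmul(const Mat& A, const Mat& B, std::size_t n) {
    Mat C(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
    return C;
}

static Mat transpose(const Mat& A, std::size_t n) {
    Mat T(n * n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            T[j * n + i] = A[i * n + j];
    return T;
}

static double frobenius(const Mat& A) {
    double s = 0.0;
    for (double v : A) s += v * v;
    return std::sqrt(s);
}

// Refines a starting guess Z0 (e.g. a scaled identity) so that Z^T S Z -> I,
// i.e. Z becomes an inverse factor of the symmetric positive definite S.
Mat refine_inverse_factor(const Mat& S, Mat Z, std::size_t n,
                          double tol, int max_iter) {
    for (int it = 0; it < max_iter; ++it) {
        // delta = I - Z^T S Z, the factorization error
        Mat delta = matmul(matmul(transpose(Z, n), S, n), Z, n);
        for (std::size_t i = 0; i < n * n; ++i) delta[i] = -delta[i];
        for (std::size_t i = 0; i < n; ++i) delta[i * n + i] += 1.0;
        if (frobenius(delta) < tol) break;
        // Z <- Z (I + delta / 2), which drives delta towards zero quadratically
        Mat update = delta;
        for (std::size_t i = 0; i < n * n; ++i) update[i] *= 0.5;
        for (std::size_t i = 0; i < n; ++i) update[i * n + i] += 1.0;
        Z = matmul(Z, update, n);
    }
    return Z;
}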

The choice of the starting guess Z_0 plays an important role. Often a scaled identity matrix is used as the starting guess. This approach is utilized by many researchers in the area [43, 92]. In Paper II we show that in a distributed setting this method requires a large amount of data to be communicated, and its performance is limited compared to other methods.

In Paper I we use a localized refinement approach, which can be employed if one knows that only a small part of the matrix elements has to be updated, so that computational costs can be reduced. We show that the localized iterative refinement procedure computes an inverse factor which, depending on the starting guess, has the property of decay of matrix elements with respect to the distance between atoms.

We include the implementation of the localized iterative refinement algorithm with the Chunks and Tasks programming model (see section 7.3) in the Chunks and Tasks matrix library, see Paper V.

3.5 Localized inverse factorization

The localized inverse factorization proposed in Paper I is a divide-and-conquer method based on a combination of localized iterative refinement and a recursive decomposition of the matrix [73].

The index set of the matrix is split into two parts, giving a binary principal submatrix decomposition. Inverse factors of the two submatrices lying on the main diagonal are computed and combined together as a starting guess for localized refinement. This procedure is applied recursively at all levels except the leaf level, where the matrices are small enough to apply a direct method. The localized refinement procedure and the localized construction of starting guesses give the method its name. Figure 3.1 demonstrates how the localized inverse factorization works.

In Paper I we show that the cost of combining inverse factors of the submatrices is negligible compared to the overall cost for systems of sufficiently large size. It is also shown that the resulting inverse factor does have the decay property of the matrix elements, similarly to the iterative refinement procedure.

One of the most important features of the algorithm (both the regular version in [73] and the localized version in Paper I) is that the subproblems are completely independent on each level of the hierarchy and thus embarrassingly parallel. As demonstrated in Paper II, the localized inverse factorization algorithm has an advantage over localized iterative refinement with a scaled identity as the starting guess in the amount of data sent during the computations.


Figure 3.1. The localized approach to inverse factorization: inverse factors Z_A and Z_C are combined to get a starting guess for the localized refinement procedure, which is used to compute a localized correction to the inverse factor (shown on the right-hand side).

This is crucial since the movement of data is considered to be expensive, and thus the reduction of data movement results in performance gain.

The localized inverse factorization implementation with the Chunks and Tasks parallel programming model is included in the Chunks and Tasks matrix library, see Paper V.


4. Density matrix methods

In this chapter we cover existing methods to compute the density matrix D from the Fock matrix F (line 12 of Algorithm 1 in section 2.5). More details can be found, for instance, in [11, 46].

4.1 Problem setting

Assume that the eigenvalues of F are sorted in ascending order,

λ_1 ≤ . . . ≤ λ_{N_occ} < λ_{N_occ+1} ≤ . . . ≤ λ_N,   (4.1)

and that there is a non-zero gap between λ_{N_occ} and λ_{N_occ+1}, where N_occ is a given problem-dependent number. The eigenvalue λ_HOMO = λ_{N_occ} corresponds to the highest occupied molecular orbital, whereas λ_LUMO = λ_{N_occ+1} corresponds to the lowest unoccupied molecular orbital. These two eigenvalues are particularly important.

In an orthonormal basis set, at which one arrives after applying the congruence transformation (3.4) with an inverse factor Z of the overlap matrix S, the Fock matrix F̂ becomes part of the standard eigenvalue problem

F̂ Ĉ = Ĉ Λ   (4.2)

with Ĉ = [ĉ_1, . . . , ĉ_N] being a matrix of orthonormal eigenvectors, and the density matrix can be written as

D̂ = Σ_{i=1}^{N_occ} ĉ_i ĉ_i∗.   (4.3)

In equation (2.10) for the density matrix there is a constant 2 in front of the sum, which is due to the restricted Hartree–Fock method, whereas equation (4.3) is for the general case. Once the density matrix is computed, the same inverse factor Z is used to go back to the original basis set, and the density matrix in that basis set is obtained as

D = Z D̂ Z∗.

The matrix D̂ has a few important properties, which follow from (4.2) and (4.3):


1. the eigenvalues of D̂ are either 0 or 1, thus the matrix is idempotent, i.e. D̂² = D̂;

2. if ĉ_i is an eigenvector of F̂, then it is also an eigenvector of D̂, hence F̂ and D̂ share a common set of eigenvectors;

3. the N_occ eigenvectors of F̂ which correspond to the smallest eigenvalues are eigenvectors of D̂ corresponding to eigenvalue 1; the rest correspond to eigenvalue 0. Consequently, Tr(D̂) = N_occ.

The aim is to compute D̂ for a given matrix F̂. One can do that using equation (4.3). However, this equation requires explicit calculation of the eigenvectors, which is expensive.

A closer look reveals that the matrices D̂ and F̂ differ only in their eigenvalues: what one has to do to obtain D̂ is to move the N_occ smallest eigenvalues of F̂ to 1 and the rest to 0.

4.2 Overview of methods

4.2.1 Polynomial expansions

This family of methods is based on a different representation of the density matrix, namely

D̂ = θ(µI − F̂) = Ĉ θ(µI − Λ) Ĉ∗,   (4.4)

where θ(x) is the Heaviside step function and µ is the chemical potential, located inside the homo-lumo gap. The idea behind the name of this family of methods is to approximate the non-smooth step function with a polynomial.

Recursive expansion methods

In this group of methods [61] an approximation is built by recursive application of low-order polynomials, i.e.

θ(µ − x) ≈ p_n(. . . p_1(p_0(x)) . . .),

where p_0(x) maps the spectrum of F̂ to [0, 1] in reversed order. The polynomials p_1(x), . . . , p_n(x) move the eigenvalues towards the boundaries of the interval: eigenvalues corresponding to occupied orbitals travel to 1, the rest to 0. A strong point of this approach is that a very high overall polynomial degree can be reached within a few iterations. As shown in [61, 78], the number of matrix-matrix multiplications scales as O(log₂(w/ξ)), where w is the spectral width of F̂ and ξ is the size of the homo-lumo gap.

Chebyshev-based approximations

Another option is to use Chebyshev polynomials [56]. On a non-smooth function, which is the case here, Chebyshev polynomials exhibit the so-called Gibbs phenomenon, which leads to errors over the whole interval [3, 5]. One solution is to replace the non-smooth Heaviside function θ(µ − x) with a smooth function varying from 0 to 1 in the homo-lumo gap. Another solution is to use a special technique which damps the Gibbs oscillations [82]. It has been shown in [3] that the degree of the Chebyshev polynomial necessary to reach a desired accuracy is proportional to the inverse gap w/ξ, and the minimal number of matrix-matrix multiplications is O(√(w/ξ)) [65].

4.2.2 Energy minimization

Another family of methods takes into account the energy minimization property of the correct density matrix D̂, i.e. it minimizes the functional

Ω(D̂) = Tr(D̂ F̂).   (4.5)

As noted in the problem setting section, the matrix D̂ is idempotent and Tr(D̂) = N_occ, which means that the minimization should be performed subject to these constraints.

Unconstrained minimization is applied in [19, 53] to a modified functional which incorporates the idempotency constraint by using the McWeeny polynomial p(x) = 3x² − 2x³. This method requires knowledge of the chemical potential µ, and there is a certain limitation on the eigenvalue distribution. An alternative suggested in [58] uses a modified version of the functional from [53] and does not require knowledge of the chemical potential µ. Instead, µ changes in every step so that the gradient of the functional is equal to zero.

A different view on treating the idempotency of the density matrix is taken in [26, 81]. One starts with a density matrix D̂_i which already satisfies the idempotency and trace conditions and refines it by computing an antisymmetric parameter matrix X such that the functional Ω(X) = Tr(e^X D̂_i e^{−X} F̂) is minimized. A new iterand is calculated using the Baker–Campbell–Hausdorff expansion [41].

4.3 Purification schemes

Density matrix purification is another name for the recursive polynomial expansion technique described in section 4.2.1. Generally, purification means a process of getting something clean and pure.

There are many approaches to density matrix purification. Examples are the canonical purification [64], based on the McWeeny polynomial, and the trace-resetting purification [63], which uses a parametrized combination of two polynomials, where the parameter is chosen so that the trace is correct. Among the others are the trace-correcting purification [61], which relies on spectral projection polynomials, and accelerated expansions [69], which use information about the homo-lumo gap to accelerate convergence of the scheme based on spectral projection polynomials of second order.

With certain simplifications, all purification schemes can be written as in Algorithm 2. As one can observe, the dominant operation is the evaluation of a matrix polynomial, i.e. multiplication of sparse matrices. In Papers III and IV we exploit the decay properties of the matrices and replace the multiplication with the approximate multiplication described in Chapter 6 and in [10].

Algorithm 2 General purification scheme

1: procedure PURIFICATION(F, p_i, n)
2:   X_0 = p_0(F)
3:   for i = 1, . . . , n do
4:     X_i = p_i(X_{i−1})
5:   end for
6: end procedure
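For concreteness, below is a minimal dense sketch that instantiates Algorithm 2 with the McWeeny polynomial p(x) = 3x² − 2x³ at every step; the initial map p_0 is taken here as a plain linear rescaling of the spectrum of F into [0, 1] in reversed order, using given eigenvalue bounds. Real purification schemes choose the polynomials adaptively (e.g. to correct the trace) and truncate small elements; the names and dense storage here are our own illustration, not the thesis implementation.

#include <cstddef>
#include <vector>

using Mat = std::vector<double>; // n-by-n matrix, row-major

static Mat matmul(const Mat& A, const Mat& B, std::size_t n) {
    Mat C(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
    return C;
}

// lmin and lmax are bounds on the spectrum of F (e.g. from Gershgorin disks).
Mat purify(const Mat& F, std::size_t n, double lmin, double lmax, int iters) {
    // X0 = (lmax*I - F) / (lmax - lmin): spectrum mapped into [0, 1] in
    // reversed order, so that the occupied (low) eigenvalues end up near 1.
    Mat X(n * n);
    for (std::size_t i = 0; i < n * n; ++i) X[i] = -F[i] / (lmax - lmin);
    for (std::size_t i = 0; i < n; ++i) X[i * n + i] += lmax / (lmax - lmin);
    for (int it = 0; it < iters; ++it) {
        // X <- 3 X^2 - 2 X^3 pushes eigenvalues towards 0 or 1
        Mat X2 = matmul(X, X, n);
        Mat X3 = matmul(X2, X, n);
        for (std::size_t i = 0; i < n * n; ++i) X[i] = 3.0 * X2[i] - 2.0 * X3[i];
    }
    return X; // approximately idempotent density matrix
}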

4.4 Error control

In order to achieve the desired scaling properties of the algorithm and to maintain sparsity in the iterands X_i in Algorithm 2, small matrix elements are removed after (or before) each iteration. A typical approach to control the error in the final matrix is to bound the norm of the matrix of removed elements in each iteration. It is shown in [74, 48] that with a proper choice of tolerances in each iteration, one can guarantee that the norm of the error in the final density matrix is below a predefined value.

A drawback of the described approach is that small elements have to be computed and then removed. In Paper IV we demonstrate that it is possible to avoid calculating unnecessary small elements in the matrices X_i using the SpAMM algorithm [10] and, at the same time, to control the error introduced by the approximate multiplication. We also consider the combination of SpAMM with regular truncation and show that it reduces memory usage and improves computational time.


5. Properties and representation of matrices

5.1 Decay of elements

Linear scaling of the computational time with system size can be accomplished if matrices are sparse (or nearly sparse) and of large size. The sparsity is brought about by truncation of small matrix blocks or single elements to zero in a way that the error introduced by truncation remains below a predefined value [53, 75].

For linear scaling, the number of non-zero elements remaining after truncation should grow at most as fast as the matrix size. The important feature of matrices in electronic structure calculations which makes it possible to truncate them in the desired way is the decay of their elements. We exploit this feature in Papers I-IV.

Let I = {1, . . . , n} be the index set of a matrix A and let d(i, j) be a pseudo-metric on I. A pseudo-metric is different from a metric in the sense that it can attain the value 0 on distinct elements. The matrix A is said to have decay of elements with respect to d(i, j) of algebraic character with constants c > 0, λ ≥ 1 if

|a_{i,j}| ≤ c (d(i, j)^λ + 1)^{−1}   (5.1)

and exponential decay with constants α > 0, c > 0, λ > 1 if

|a_{i,j}| ≤ c λ^{−α d(i,j)}.   (5.2)

The distance function d(i, j) corresponds to the physical distance between the centers of the basis functions b_j(r) in equation (2.8). One way to look at a physical system is as a complete spatial graph, where the vertices are basis functions and the distance function is defined on the edges. If the centers of two basis functions are distant enough from each other, then the corresponding matrix element becomes negligible. Figure 5.1 shows two examples of exponential decay of matrix element magnitudes with respect to two distinct distance functions.

Exponential decay with respect to a distance function is not sufficient to make the number of non-negligible elements grow linearly with the matrix size. For instance, with d(i, j) = |i − j| the non-zero elements are located along the main diagonal (see Figure 5.1, left panel), and their magnitudes decay away from that diagonal. If one fixes a truncation threshold and removes from the matrix all elements smaller than the threshold in magnitude, the rest of the elements form a band which preserves its width if the matrix size is increased and the threshold does not change. However, with d(i, j) = |log(i/j)| (see Figure 5.1, right panel) this is no longer true: the number of elements greater than a fixed threshold grows almost quadratically with the matrix size.

Figure 5.1. Exponential decay of matrix element magnitudes with respect to a distance function, c = 1, α = 1. Left panel: d(i, j) = |i − j|. Right panel: d(i, j) = |log(i/j)|.
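As a small numerical illustration of the difference between the two panels of Figure 5.1, the snippet below counts, for a fixed threshold, how many elements of a middle row exceed it under the exponential decay bound |a_{i,j}| ≤ λ^{−d(i,j)} (taking λ = e, i.e. c = 1, α = 1). With d(i, j) = |i − j| the count stays constant as n grows, whereas with d(i, j) = |log(i/j)| it grows with n. This is our own toy example, not data from the thesis.

#include <cmath>
#include <cstdio>

// Counts elements in row i whose magnitude bound exp(-d(i,j)) exceeds threshold.
int count_row(int n, int i, double threshold, bool banded_metric) {
    int count = 0;
    for (int j = 1; j <= n; ++j) {
        double d = banded_metric ? std::fabs(double(i - j))
                                 : std::fabs(std::log(double(i) / double(j)));
        if (std::exp(-d) > threshold) ++count;
    }
    return count;
}

int main() {
    for (int n : {1000, 10000, 100000}) {
        int i = n / 2; // a row in the middle of the matrix
        std::printf("n = %6d: |i-j| metric: %d elements, |log(i/j)| metric: %d elements\n",
                    n, count_row(n, i, 1e-5, true), count_row(n, i, 1e-5, false));
    }
    return 0;
}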

One way to guarantee that the number of non-negligible matrix elements grows linearly with the matrix size is to restrict the possible sparsity patterns. Assume that a sequence of matrices of increasing size n is considered. Let {d_n(i, j)} be a sequence of distance functions, where each particular function corresponds to matrix size n, and let N_{d_n}(i, R) be the set of indices within distance R from index i. In Paper I we use a restriction which holds for the underlying physical systems:

|N_{d_n}(i, R)| ≤ γ R^β,  ∀i ∈ {1, . . . , n},   (5.3)

where γ and β are positive constants independent of n and |N_{d_n}(i, R)| is the cardinality of the set N_{d_n}(i, R).

A slightly different approach to localization is used in [8]. The distance function is the geodesic distance (the shortest path) on an undirected graph G_n which has the indices {1, . . . , n} as vertices. For each n a corresponding matrix A_n is constructed. The restriction on the sparsity is imposed by bounding the maximum degree of any vertex in the graph:

|N_{g_n}(i, 2)| ≤ p,  ∀i ∈ {1, . . . , n},   (5.4)

where p > 0 is independent of n and g_n is the geodesic distance. If one sets d_n = g_n, then condition (5.4) is less restrictive than (5.3).

In Paper I we show that if a matrix sequence has the exponential decay property with respect to the distance functions d_n(i, j) and condition (5.3) is satisfied, then the number of non-negligible elements in each row of a matrix is bounded by a constant independent of the matrix size. Similar results are shown in [8] for the geodesic distance. However, the approach developed in Paper I is more general in the sense that it works for any pseudo-metric d_n(i, j).


5.2 Hierarchical representation

What is a sparse matrix? There are several definitions [66]. One can say that a matrix is sparse if it has a few non-zero elements per row. This is probably the most widely used definition. Another definition says that a matrix is sparse if the number of non-zero elements is such that it makes sense to take advantage of it for a particular algorithm.

In linearly scaling electronic structure calculations matrices are not sparse in the most widely used sense, since the number of non-zero elements per row can be up to several thousand. This implies that popular sparse matrix formats like compressed sparse rows (or columns), which were developed for "standard" sparse matrices, are not suitable, because their strong point, i.e. compression, might not work. If a format favors rows or columns, then it becomes difficult to perform operations with transposed matrices.

A natural choice, successfully used by researchers [10, 72, 71], is the hierarchical matrix representation. This means that a matrix is viewed as a matrix of other matrices forming a hierarchical structure. In our work the simplest representation is used, where each level of the hierarchy, except the last one (the leaf level), contains 4 matrices in a square array. The leaf level can be kept completely dense or use another sparse representation; see the next section for a discussion. An example of the hierarchical representation is shown in Figure 5.2.

Figure 5.2. An example of the hierarchical matrix representation. Leaf matrices have size 2 × 2; the symbol N stands for null and, depending on the implementation, it might be a null pointer or some other empty object. Matrices marked with the symbol N are not kept in memory.

The large size of the matrices implies that one has to operate on them in a distributed setting. The hierarchical representation can be utilized in this case: the user has a handle to the root of the hierarchy, and the nodes of the quad-tree reside on different processes, which can in turn reside on different nodes of a computational system. An advantage of this approach is that the processes do not have to be organized in any particular topological structure to perform multiplication, as opposed to, for instance, Cannon's algorithm [52].
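A minimal sketch of such a quad-tree node in C++ is given below. It is our own simplified, single-process illustration: in the Chunks and Tasks matrix library the children are registered as chunks in the runtime rather than held by pointers, and the names here are invented. It also shows how a norm can be accumulated bottom-up over the hierarchy, a property used later in Chapter 6.

#include <array>
#include <cstddef>
#include <memory>
#include <vector>

struct HierMatrix {
    std::size_t n = 0;                        // dimension of this square block
    std::size_t leaf_size = 64;               // switch to dense storage at or below this size
    std::vector<double> leaf;                 // dense data, non-empty only at leaves
    std::array<std::unique_ptr<HierMatrix>, 4> child; // 2x2 children; nullptr marks a zero block

    bool is_leaf() const { return n <= leaf_size; }
};

// Squared Frobenius norm computed bottom-up over the hierarchy, cf. equation (6.2).
double frobenius_squared(const HierMatrix* A) {
    if (A == nullptr) return 0.0;             // an absent branch represents a zero block
    if (A->is_leaf()) {
        double s = 0.0;
        for (double v : A->leaf) s += v * v;
        return s;
    }
    double s = 0.0;
    for (const auto& c : A->child) s += frobenius_squared(c.get());
    return s;
}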


One should distinguish between the hierarchical matrix representation used here and the so-called hierarchical matrices (H-matrices) by Hackbusch [39]. H-matrices are data-sparse approximations of matrices which are not necessarily sparse by construction. The data-sparsity is introduced by low-rank approximations of some matrix blocks (admissible blocks), whereas the rest of the matrix (inadmissible blocks) is kept dense. The blocks are organized in a tree-like structure giving rise to hierarchies, hence the name. Low-rank approximations of blocks become possible after applying some approximation technique to the underlying mathematical problem. So, in this sense, this class of methods belongs to the same cohort as the Hartree–Fock and Kohn–Sham DFT methods.

The key feature of H-matrices is that the admissible blocks are kept and manipulated in a factorized format. This makes it possible to perform matrix operations like addition, multiplication and even inversion at close to optimal complexity. However, the construction of H-matrices relies heavily on the singular value decomposition, which is expensive. By choosing a different number of singular values, one can vary the accuracy of an H-matrix approximation.

The major difference between the hierarchical matrix representation used here and H-matrices is that the former is a data format suitable for distributed environments, not an approximation technique.

5.3 Leaf level matrix representation

The leaf level of the matrix hierarchy is where the numbers reside and where the heavy calculations are performed. It is important how the matrix is represented at this level. Leaf matrices can have very different sparsity patterns or even be dense, depending on which part of the large matrix they keep.

Two main considerations are to be taken into account when choosing the leaf matrix format:

1. it should be possible to take advantage of sparsity;

2. it should be possible to represent a dense matrix without huge extra costs.

A number of formats have been developed for the needs of iterative solution methods for linear systems of equations, where matrix-vector multiplication is the dominant operation. These formats can be used for problems where matrix-matrix multiplication dominates, but, as discussed below, there are certain limitations.

5.3.1 Possible formats

Let us use nnz to denote the number of non-zero elements in the matrix and n for the matrix size.


Dense matrix

In this format the matrix is kept completely dense, i.e. all zeros are stored even if the matrix contains just one non-zero value. The whole matrix is stored as a contiguous block in memory. It is a feasible format for problems where only dense matrices are involved, but it is not suitable for keeping sparse matrices.

Coordinates

In this format the matrix is represented as an array of triplets, where each triplet describes a single element of the matrix and contains its row and column indices as well as the value. This data structure requires keeping 2 × nnz extra items. It is an option for sufficiently sparse matrices, but, as mentioned above, at the leaf level one can have completely dense matrices, and in this case the coordinate format becomes expensive in terms of memory.

Compressed Sparse Rows

The CSR format [27], also known as the Yale format, is the de facto standard in the sparse matrix community. The matrix is stored in three one-dimensional arrays: the first array, of size nnz, keeps the non-zero values; the second array, of size nnz, keeps the column indices corresponding to the elements in the first array; and the third array keeps, at position i, the number of non-zero elements in rows with indices up to i − 1. The CSR format needs nnz + n extra items and, as one can observe, if nnz is large, then the format does not give compression. The CSR format provides efficient access to rows, but not to columns, and operations with the matrix transpose are difficult to perform.
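As a tiny concrete example (ours, for illustration only), the 3 × 3 matrix below is stored in CSR with the three arrays just described; row i occupies the half-open range [row_ptr[i], row_ptr[i+1]) of the value and column-index arrays.

#include <vector>

//     [ 5 0 0 ]
// A = [ 0 8 3 ]
//     [ 0 0 6 ]
std::vector<double> values  = {5.0, 8.0, 3.0, 6.0}; // non-zero values, row by row
std::vector<int>    col_idx = {0,   1,   2,   2  }; // column index of each value
std::vector<int>    row_ptr = {0, 1, 3, 4};         // row i is values[row_ptr[i] .. row_ptr[i+1])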

Compressed Sparse Columns

The CSC format inherits all pros and cons of the CSR format, but favors columns instead of rows.

Compressed Sparse Blocks

This format [14] does not favor columns or rows. Compared to CSR, which stores non-zero elements contiguously in the memory, the CSB format stores blocks of the matrix contiguously. Each block is stored in the coordinate format, but indices are local to the block, therefore they require fewer bits. Pointers to blocks are stored in a separate array. The format shows good performance for matrix-vector operations, but inherits a weak point from the coordinate format: when the matrix is dense, it becomes expensive in terms of memory.

Block-Sparse format

In this case the matrix is split into small dense blocks, and pointers to the blocks are kept in a 2D array. If a block contains no non-zero element, then a null pointer is kept instead. No favor is given to rows or columns, and little extra information is kept alongside the matrix. Multiplication of two matrices with matching block sizes can be done efficiently in a batched way [71], but the sparse approximate multiplication (see Chapter 6) does not map naturally onto the format because of its flat structure. This format is used in [48] and in Papers I and II. The code of the library which implements this format is included in the release of the Chunks and Tasks matrix library, see Paper V.

Hierarchical Block-Sparse format

In this format the matrix is again viewed as a quad-tree. Very little information is kept at each level, namely five pointers and a couple of integers; blocks containing non-zero elements are kept in a dense format at the lowest level of the hierarchy. Neither rows nor columns are favored.

This format naturally extends the representation at the distributed level and has a few strong points: when multiplying, it is possible to skip larger zero regions of a matrix compared to the block-sparse format, and any hierarchical algorithm extends naturally to this format. The sparse approximate matrix multiplication algorithm (see Chapter 6) becomes feasible.

We implement this format in the form of a C++ template library and utilize it in Papers III and IV. Our implementation uses bidirectional connections between parent and child nodes in the quad-tree, takes advantage of seeing the whole hierarchy at once, and avoids allocations of empty matrices when multiplying, which otherwise happens for certain sparsity patterns of the child matrices.

The library can perform matrix multiplication in two different ways. The first option is to allocate the branches of the resulting matrix quad-tree on the fly while multiplying. The second is to pre-allocate the whole resulting matrix quad-tree and then populate it with the results of block multiplications done in a batched way. This approach utilizes the pointer from a child to its parent. The position of each block at the lowest level is coded with a sequence of digits representing the path from the particular block to the root of the hierarchy. This code is obtained using the pointer to the parent. An element of a multiplication batch consists of three such codes: two for the multiplicands and one for the product. These elements are placed in an unordered multimap container, where the key of the multimap is the code of the product block. Then the whole multimap can be processed at once, and the results of the multiplications are written to the locations in the product matrix indicated by the key. Different keys can be processed independently in parallel.
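The following is a rough sketch (our own, with invented names, not the library's actual interface) of the bookkeeping just described: lowest-level blocks are identified by path codes from the block up to the root, contributions to the same product block are grouped in an unordered multimap keyed by the product code, and each group can then be processed independently.

#include <string>
#include <unordered_map>

// One batched block product: multiply the blocks identified by a_code and
// b_code and accumulate the result into the product block given by the key.
struct BlockTask {
    std::string a_code; // path code of the left multiplicand block
    std::string b_code; // path code of the right multiplicand block
};

using Batch = std::unordered_multimap<std::string, BlockTask>;

void add_contribution(Batch& batch, const std::string& c_code,
                      const std::string& a_code, const std::string& b_code) {
    batch.emplace(c_code, BlockTask{a_code, b_code});
}

// Processes the whole batch; all tasks sharing a key write into the same
// pre-allocated product block, so distinct keys are independent and could be
// handled in parallel.
template <typename BlockMultiplyFn>
void process_batch(const Batch& batch, BlockMultiplyFn multiply_into) {
    for (auto it = batch.begin(); it != batch.end(); ) {
        auto range = batch.equal_range(it->first);
        for (auto t = range.first; t != range.second; ++t)
            multiply_into(t->first, t->second.a_code, t->second.b_code);
        it = range.second;
    }
}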

The code of the hierarchical block-sparse leaf library is included in the release of the Chunks and Tasks matrix library, see Paper V.


6. Sparse approximate multiplication

Multiplication of sparse matrices is a basic operation in electronic structure calculations, as shown in Chapters 3 and 4. Multiplication is a challenging problem, especially in parallel. One has to consider, for example, how to store the matrices, how to multiply them so as to achieve even utilization of the computational resources, and how to move as little data as possible. Many ways to address these issues have been suggested, including special data formats [17], space-filling curves [17, 89] and random permutations of rows and columns [13] for load balancing, as well as communication-avoiding [4] and locality-aware [71] multiplication algorithms.

As discussed in Chapter 5, linear scaling is achieved by truncation of small matrix elements. In a density matrix purification scheme the calculations are performed in a loop (see Algorithm 2), and the result of the previous iteration is used in the current one. Truncation, which is applied either at the end of the previous iteration or at the beginning of the current one, results in an approximate matrix. But before the truncation is applied, the exact matrix product of the (possibly truncated) matrices is computed, so some useless work is performed. One can avoid doing this useless work by using the algorithm described in the following section.

6.1 The SpAMM algorithm

The sparse approximate matrix multiplication algorithm (or SpAMM) [10] is in some sense similar to the fast multipole method [37]. It exploits the decay properties of matrix elements and the representation of matrices as quad-trees (see Chapter 5) to skip multiplications of sub-matrices which would result in small contributions to the product matrix.

Let the matrices A and B be represented as quad-trees:

A^t = [ A^{t+1}_{0,0}  A^{t+1}_{0,1} ; A^{t+1}_{1,0}  A^{t+1}_{1,1} ],   B^t = [ B^{t+1}_{0,0}  B^{t+1}_{0,1} ; B^{t+1}_{1,0}  B^{t+1}_{1,1} ],   (6.1)

where t is the number of the level in the hierarchy, t ∈ [1, ⌈log₂(n)⌉], and n is the matrix size. Level ⌈log₂(n)⌉ + 1 contains single matrix elements.


Algorithm 3 SpAMM

1: procedure SPAMM(A, B, τ)
2:   if ‖A‖_F ‖B‖_F < τ then
3:     return C = 0
4:   end if
5:   if lowest level then
6:     return C = AB
7:   end if
8:   for i = 0, 1 do
9:     for j = 0, 1 do
10:      T_0 = SpAMM(A_{i,0}, B_{0,j}, τ)
11:      T_1 = SpAMM(A_{i,1}, B_{1,j}, τ)
12:      C_{i,j} = T_0 + T_1
13:    end for
14:  end for
15:  return C
16: end procedure

The Frobenius norm at each level can be conveniently computed from the underlying level as

‖A^t‖_F = ( Σ_{i=0}^{1} Σ_{j=0}^{1} ‖A^{t+1}_{i,j}‖_F² )^{1/2}.   (6.2)

With these notations, the SpAMM algorithm in the recursive form can be written as in Algorithm 3. The parameter τ controls truncation of sub-matrix products.
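Below is a compact single-process sketch of the recursion in Algorithm 3, written against a small quad-tree type defined on the spot (the type and function names are ours; the actual implementation in Papers III-V operates on the distributed Chunks and Tasks representation). A missing child stands for a zero block, and each input node stores its squared Frobenius norm as in equation (6.2), so the skip test costs only a comparison.

#include <array>
#include <cmath>
#include <cstddef>
#include <memory>
#include <vector>

struct QMat {
    std::size_t n = 0;                       // dimension of this square block
    double norm2 = 0.0;                      // squared Frobenius norm (precomputed for inputs)
    std::vector<double> leaf;                // dense data at the leaf level
    std::array<std::unique_ptr<QMat>, 4> c;  // children: c[0]=X_00, c[1]=X_01, c[2]=X_10, c[3]=X_11
    bool is_leaf() const { return !leaf.empty(); }
};

// Dense multiply-accumulate at the leaf level: C += A * B.
static void leaf_multiply_add(const QMat& A, const QMat& B, QMat& C) {
    const std::size_t n = A.n;
    if (C.leaf.empty()) C.leaf.assign(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                C.leaf[i * n + j] += A.leaf[i * n + k] * B.leaf[k * n + j];
}

// Accumulates the SpAMM approximation of A*B into C, skipping every
// sub-product whose norm bound ||A_sub||_F * ||B_sub||_F falls below tau.
void spamm_add(const QMat* A, const QMat* B, double tau, QMat& C) {
    if (A == nullptr || B == nullptr) return;           // zero block, nothing to add
    if (std::sqrt(A->norm2 * B->norm2) < tau) return;   // small contribution, skip
    C.n = A->n;
    if (A->is_leaf()) { leaf_multiply_add(*A, *B, C); return; }
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j) {
            auto& Cij = C.c[2 * i + j];
            if (!Cij) Cij = std::make_unique<QMat>();
            // C_ij += A_i0 * B_0j + A_i1 * B_1j
            spamm_add(A->c[2 * i + 0].get(), B->c[0 + j].get(), tau, *Cij);
            spamm_add(A->c[2 * i + 1].get(), B->c[2 + j].get(), tau, *Cij);
        }
}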

6.2 Error analysis

In Paper III we consider two sequences of matrices of increasing size with exponential decay of elements with respect to common associated distance functions. We apply the localization framework developed in Paper I to the SpAMM algorithm and perform an asymptotic error analysis with respect to the matrix size n and the parameter τ. We show that the Frobenius norm of the error between the true product matrix and the approximate one behaves as

‖E_n‖_F = O(n^{1/2}) as n → ∞, and ‖E_n‖_F = O(τ^{p/2}) for all p < 2 as τ → 0.   (6.3)

We also analyze, within the same localization framework, the standard approach to multiplication of sparse matrices with truncation of the input matrices, and we find that the error norm behaves in the same way as in (6.3). This leads to the idea of combining the two approaches into one which uses truncation of both the input matrices and the matrix sub-products. If the truncations are done with linearly dependent parameters τ_1 and τ_2, then the asymptotic behavior of the Frobenius norm of the error is preserved.

6.3 Behavior in a distributed setting

Since the Frobenius norm of the error behaves in the same way for the SpAMM algorithm, for multiplication of truncated matrices and for their combination, it may seem that it does not matter which algorithm one uses. However, the algorithms may have different pre-factors in the error terms and different computational costs. When going to large scale, and consequently to a distributed setting, the methods also differ in terms of communication.

In Paper III we show that in a distributed setting it makes sense to use the combination of methods, because it has the lowest average amount of data to be moved, which, in turn, helps it to outperform its competitors. The SpAMM algorithm shows the worst timings due to significant differences between processes in the amounts of data to send, which is a result of handling non-truncated matrices.

6.4 Error control

In Paper III we perform asymptotic error analyses, which are mainly of theoretical interest and are not constructive. In Paper IV we develop these ideas further and propose a method for selecting a threshold τ for approximate multiplication with SpAMM so that the Frobenius norm of the error does not exceed a predefined value, without computing the exact product and without performing the approximate multiplication. We exploit the submultiplicativity of the Frobenius norm and do a parameter sweep for τ. We also demonstrate that this algorithm, when used in a density matrix purification loop (see Algorithm 2), efficiently avoids the calculation of unnecessary elements and brings down the number of non-zeros per row at all stages of the process, compared to the previous approach where error control relies on bounding the Frobenius norm of the matrix of elements to be truncated.
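As an illustration of how such a threshold could be selected, the sketch below, which builds on the QuadNode type from the SpAMM sketch in Section 6.1, evaluates an upper bound on the error for a candidate τ using only the cached norms: by the triangle inequality and the submultiplicativity of the Frobenius norm, the SpAMM error is bounded by the sum of \|A_sub\|_F \|B_sub\|_F over all skipped sub-products. Sweeping a list of candidate values and picking the largest τ whose bound stays below the tolerance is one possible realization of the parameter sweep; the exact procedure of Paper IV may differ.

// Error bound and threshold sweep; builds on QuadNode/NodePtr from the SpAMM
// sketch above.  The candidate list and the bound used here are assumptions of
// this example, not necessarily the procedure of Paper IV.
#include <vector>

// Accumulate an upper bound on ||E||_F for a given tau by traversing only the
// cached norms, mirroring the SpAMM recursion without forming any products.
static double skipped_norm_bound(const NodePtr& A, const NodePtr& B, double tau) {
    if (!A || !B) return 0.0;                   // zero operand: nothing is lost
    const double prod = A->norm * B->norm;
    if (prod < tau) return prod;                // this sub-product is skipped
    if (A->leaf) return 0.0;                    // computed exactly, no error
    double bound = 0.0;
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j) {
            bound += skipped_norm_bound(A->child[2 * i + 0], B->child[0 + j], tau);
            bound += skipped_norm_bound(A->child[2 * i + 1], B->child[2 + j], tau);
        }
    return bound;
}

// Return the largest candidate tau whose error bound does not exceed eps.
static double select_tau(const NodePtr& A, const NodePtr& B,
                         const std::vector<double>& candidates,  // sorted large to small
                         double eps) {
    for (double tau : candidates)
        if (skipped_norm_bound(A, B, tau) <= eps) return tau;
    return 0.0;                                 // fall back to the exact product
}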

Code

Both the SpAMM implementation and the error control functionality, implemented with the Chunks and Tasks parallel programming model discussed in Section 7.3, are included in the Chunks and Tasks matrix library, see Paper V.


7. Parallelization aspects

The methods used in electronic structure calculations have achieved linear scaling with respect to system size, i.e. the computational load increases in direct proportion to the size of the considered physical system. However, the methods must also be parallelizable in order to explore large systems on supercomputers.

The development history of parallel computers first resulted in so-called shared memory machines, where a multi-core CPU or several CPUs have access to the same memory. Later, such machines came to be used as nodes, i.e. building blocks for larger machines. At this level the memory is no longer globally shared, but information can be sent from one node and received at another. The trend of the last decade is to augment machines with accelerator boards such as Nvidia GPUs or Intel Xeon Phi co-processors. These devices are also parallel machines, but with another paradigm: they use a large number of relatively weak processing units to perform similar operations on data arrays. As the reader might guess, programming a modern parallel computer, which consists of nodes and accelerators, is difficult.

Parallel programming has a long development history, similar to conventional programming, which started with machine languages that were replaced by assembly languages because of code complexity. Later, high-level languages started to play the dominant role. It is important to remember that high-level languages are translated to assembly code, which in turn is translated to machine code. The reason is simple: machines understand only machine code.

7.1 Parallel programming models

First of all, let us discuss why parallel programming models are necessary and then what a parallel programming model is.

From our point of view, the need for such models is not difficult to explain: they are required to simplify the creation of code, to increase the performance of both the code and the programmer, and to make the code easier to understand. The notion of a parallel programming model itself is a bit tricky. Are they models of parallel programming or models for it? We believe that the latter is correct, and that a model should represent real-world concurrency and parallelism. Historically, the first and best-known concept of concurrency is the thread. The similar concept of a process is used in distributed programming. The well-known definition of a process is "a program when running", and a thread is "a lightweight process". These concepts do not resemble real-world concurrency and they are rich sources of nondeterminism [51]. When working with threads or processes, a variety of specialized tools such as mutexes and semaphores is used to reduce the amount of nondeterminism.

The need for a better representation of concurrency and for alleviating the programmer's responsibilities brings us to other concepts and to models based on them.

MPI

The Message Passing Interface [90] is not a parallel programming model or a language, but it relies on the concept of processes. It is a standardized set of routines for passing messages between concurrent processes. It is worth mentioning in this context because it has become something similar to an assembly language in conventional programming: many implementations of other models rely internally on MPI.
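As an illustration of the level of abstraction MPI provides, the small program below passes one message between two processes using explicit point-to-point calls; the buffer size and the message tag are arbitrary choices for this example.

// Minimal MPI example: process 0 sends an array to process 1.
// Compile with an MPI wrapper, e.g. mpicxx, and run with mpirun -np 2.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[4] = {0.0, 0.0, 0.0, 0.0};
    if (rank == 0) {
        for (int i = 0; i < 4; ++i) buf[i] = i + 1.0;
        // Explicit send: 4 doubles to destination rank 1 with tag 0.
        MPI_Send(buf, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("rank 1 received %.1f %.1f %.1f %.1f\n",
                    buf[0], buf[1], buf[2], buf[3]);
    }

    MPI_Finalize();
    return 0;
}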

Partitioned global address space models (PGAS)

This approach tries to extend the concept of shared memory to distributed machines by creating a global address space. Each process handles a part of that space, and communication between processes is accomplished by updating the global memory. Examples of models which utilize this concept are Linda [15], Global Arrays [60], Hierarchical Place Trees [94], X10 [18] and Unified Parallel C [49]. To a certain extent PGAS is also supported in Charm++ [44] and OmpSs [25].

The idea behind the PGAS concept is simple: to the user the memory looks like one large contiguous storage, but in fact the runtime system decides where to place particular pieces of data, when and how to deliver them to the process that requests them, and how to prevent race conditions. This brings a certain overhead. Moreover, an implementation of a PGAS model might rely on a central node for managing the global memory. Such a node is a weak point in the design.

Domain-specific models

In TensorFlow [1] a computation is represented as a directed graph, where the nodes are operations and the information flowing along the edges consists of tensors. Most tensors live only between two nodes, but the user can also utilize a mutable tensor, which has a longer lifetime. The main purpose of this model and its implementation is machine learning.

Another example of a domain-specific model is MapReduce [21], which is designed for processing large amounts of data. In this model, a set of worker machines perform tasks of two types: map (compute intermediate key/value pairs) or reduce (merge values with the same key into a smaller set). The master node keeps track of the status of the workers' tasks and redistributes the work in case of failure.
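To illustrate the two task types, the toy example below counts words in a single process: the map step emits (word, 1) pairs and the reduce step merges the values sharing the same key. In the actual model the framework would distribute many such map and reduce tasks over worker machines.

// Toy, single-process illustration of the map and reduce phases.
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Map: emit one (word, 1) pair per word in a piece of input.
static std::vector<std::pair<std::string, int>> map_words(const std::string& text) {
    std::vector<std::pair<std::string, int>> pairs;
    std::istringstream in(text);
    std::string word;
    while (in >> word) pairs.emplace_back(word, 1);
    return pairs;
}

// Reduce: merge all values that share the same key into one count per key.
static std::map<std::string, int> reduce(
        const std::vector<std::pair<std::string, int>>& pairs) {
    std::map<std::string, int> counts;
    for (const auto& kv : pairs) counts[kv.first] += kv.second;
    return counts;
}

int main() {
    for (const auto& kv : reduce(map_words("to be or not to be")))
        std::cout << kv.first << " " << kv.second << "\n";
    return 0;
}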


Task-based models

A task can be defined in different ways. In the real world, it can be defined as a piece of work. In programming, it can be defined as a sufficiently large computation.

Tasks represent real-world concurrency better than processes or threads, because tasks in programming have a real-life counterpart, whereas processes and threads do not.

Historically, one of the first task-based models was Cilk [9]. Although it used the notion of threads rather than tasks, conceptually it is a task-based model. It is worth mentioning that Cilk was developed for shared memory machines. The computation is represented as a dynamically unfolding directed acyclic graph. The authors of the Cilk model also suggest a model for the execution time in such environments, which is discussed further in section 7.4 and which is directly related to the concept of the critical path used in Paper II. Another milestone is that the load balancing mechanism based on work stealing became very popular after it had been used in Cilk.

Later, other task-based models were developed. Examples are Concurrent Collections [12], DuctTeip [95], Hierarchical Place Trees [94], StarPU [2], Sequoia [29], OmpSs [25], XKaapi [33] and SuperGlue [86]. Not all of these models target distributed memory (for instance SuperGlue, XKaapi and Cilk do not), and not all of them support dynamic parallelism, i.e. running tasks inside tasks. Nevertheless, these models represent the concurrency of the real world better than threads or processes.

7.2 Some issues of parallel programming

Apart from the problem of choosing the right abstractions, there are other issues in parallel programming which are not always obvious to anyone familiar only with conventional (sequential) programming.

Load balancing

Load balancing is an important problem to solve when running code on supercomputers. It matters not only for saving the user's time, but also economically, since minimizing the idle time of the available machines reduces costs.

In PGAS models the load balancing implicitly follows the data distribution. In most task-based models the concept of work stealing is utilized: an idle worker tries to steal a task from another worker's queue of tasks. Most models do not consider locality issues in this approach; XKaapi [33], however, does. In most domain-specific models a central node is responsible for load balancing.


Determinism

The notion of determinism means that a program always gives the same output provided that the input does not change. It is extremely important for scientific computing, since numerical experiments should be reproducible and programs should be deterministic. Lee [51] identifies nondeterminism (i.e. the lack of determinism) as the main problem of concurrent programming.

Nondeterminism usually comes from a race condition, i.e. simultaneous modification of shared memory. The typical cure is to use a mutex mechanism. In distributed memory it is usually caused by messages arriving in the wrong order. With conventional tools like OpenMP and MPI it is very easy to end up with nondeterministic code.
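A minimal C++ example of the typical cure: two threads update a shared counter, and the mutex serializes the updates so that the result is always the same; removing the lock makes the program nondeterministic.

// Two threads increment a shared counter; the mutex makes the result deterministic.
#include <iostream>
#include <mutex>
#include <thread>

int main() {
    long counter = 0;
    std::mutex m;

    auto work = [&]() {
        for (int i = 0; i < 100000; ++i) {
            std::lock_guard<std::mutex> lock(m);  // protect the shared update
            ++counter;
        }
    };

    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    std::cout << counter << std::endl;  // always 200000 with the mutex in place
    return 0;
}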

The models discussed above handle this issue in different ways. Some of them are provably deterministic, like Concurrent Collections [12]; some are deterministic provided that the tasks themselves are deterministic, see for instance StarPU [2] or MapReduce [21]; and the rest have at least one source of nondeterminism, such as user-defined access types or overlapping memory regions.

Deadlocks

A deadlock can be defined as a state of a program in which no process or worker can proceed further because it is blocked by others. A rich source of deadlocks is global synchronization.

Only a few of the models mentioned above are deadlock-free, for instance MapReduce [21] and Sequoia [29]. Other models mentioned in the previous paragraphs are not deadlock-free due to global synchronization and possible out-of-order message handling.

Fault tolerance

Fault tolerance is a rather broad term; it refers to the robustness of a running program with respect to certain external conditions. These conditions include, for example, hardware problems such as short circuits or physical defects on disks, and software issues caused by mistakes or bugs. Network issues fall into both groups.

Typical ways to tolerate hardware faults lie on the hardware side and mostly rely on redundancy and/or replication. From a software perspective, one of the most widely used mechanisms is checkpointing. This means that the process state is saved from time to time so that the process can be restarted should anything happen. From a parallel programming perspective, a failure of one or several processes or nodes should not result in a failure of the whole program. Literally, this means that the failed process or node should be restarted on the fly and continue without any implications for the others.

This issue is addressed explicitly by a few models. In the MapReduce model [21], if a worker fails, its task is completely re-executed by another worker, and the master node saves its own state at checkpoints. In TensorFlow [1], the complete computation graph is re-executed in case of failure. Charm++ [44] has a built-in checkpoint mechanism.

Scalability

If larger problems can be solved by the program when additional computational resources are added, i.e. if the Gustafson law [38] holds, then the program is scalable. Obviously, programs which use models for shared memory do not scale beyond a single node. Programs using PGAS models are more scalable, but the overhead of managing the global memory limits the scalability. Models for distributed memory should, in theory, provide good scalability. Some of those do have a master node, like the MapReduce model [21], but, due to the nature of the model, its influence is minimized.
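For reference, Gustafson's law states that the scaled speedup on N processors is

S(N) = s + (1 - s)\,N = N - s\,(N - 1),

where s is the fraction of the execution time spent in the serial part of the program; the notation used here is ours, not that of [38].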

Ease of use

The last, but not least, issue is how easy it is to write programs using a particular model. Programmers tend to use the tools they are used to, and most of the models require a paradigm shift. However, if this shift does not move the programmer far from well-known tools or languages, then the model can be adopted successfully.

From this perspective, MPI is the ultimate tool, similar to assembly languages: powerful, but complicated. PGAS models are close to the popular OpenMP, so the shift is not a problem. The task-based models require another way of thinking and thus a certain effort from the programmer.

7.3 The Chunks and Tasks programming model

The Chunks and Tasks model, suggested by Rubensson and Rudberg [70], is defined by a C++ application programming interface.

The two basic abstractions, which give the model its name, are chunks and tasks. A piece of data is encapsulated as a chunk and a piece of work as a task. This makes it possible to take advantage of parallelism in two dimensions, both in data and in work.

The user is responsible for preparing a chunk and registering it with the runtime library. Once it is registered, the user obtains the chunk's identifier and the chunk becomes immutable, similarly to Concurrent Collections [12]. Chunk identifiers can be part of other chunks, creating a hierarchical structure, or can be provided as input to tasks.

The work is split into tasks. A task is defined by a number of input chunk types, the work to be performed and a single output chunk type. Eventually all tasks are converted into chunks. Chunk content can only be accessed when the corresponding chunk identifier is provided as an input to a task, except in the main program. It is not possible to run a task without having all the input chunks computed. Dynamic parallelism is supported.
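The fragment below is a hypothetical, self-contained toy written in the spirit of the model: pieces of data are registered as immutable chunks identified by handles, and work is expressed as tasks that take chunk identifiers as input and produce a single output chunk. All names (ToyRuntime, ChunkId, registerChunk, runAddVectorsTask) are invented for this illustration and the toy runs sequentially; the real Chunks and Tasks API and its distributed runtime are described in [70] and Paper V.

// Hypothetical toy in the spirit of the chunks-and-tasks style; not the real API.
#include <cstddef>
#include <iostream>
#include <map>
#include <utility>
#include <vector>

using ChunkId = std::size_t;
using Vec = std::vector<double>;

// A toy sequential "runtime": it stores registered chunks and runs tasks directly.
class ToyRuntime {
public:
    // Registering a piece of data yields an identifier; the chunk is immutable afterwards.
    ChunkId registerChunk(Vec data) {
        chunks_[next_] = std::move(data);
        return next_++;
    }
    const Vec& getChunk(ChunkId id) const { return chunks_.at(id); }

    // A task: fixed input chunk types (two vectors) and a single output chunk (their sum).
    ChunkId runAddVectorsTask(ChunkId a, ChunkId b) {
        const Vec& x = getChunk(a);
        const Vec& y = getChunk(b);
        Vec z(x.size());
        for (std::size_t i = 0; i < x.size(); ++i) z[i] = x[i] + y[i];
        return registerChunk(std::move(z));   // the task's result becomes a new chunk
    }

private:
    std::map<ChunkId, Vec> chunks_;
    ChunkId next_ = 0;
};

int main() {
    ToyRuntime rt;
    ChunkId a = rt.registerChunk({1.0, 2.0, 3.0});
    ChunkId b = rt.registerChunk({4.0, 5.0, 6.0});
    ChunkId c = rt.runAddVectorsTask(a, b);   // data is referred to only via identifiers
    for (double v : rt.getChunk(c)) std::cout << v << " ";
    std::cout << "\n";
    return 0;
}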

