
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2020

Parallelization of a thermal elastohydrodynamic lubricated contacts simulation using OpenMP


Parallelization of a thermal elastohydrodynamic lubricated contacts simulation using OpenMP

Ghassan Alrheis

KTH Royal Institute of Technology

29th of May 2019

Industry supervisor: Erland Nordin
Academic supervisor: Carl-Magnus Everitt

Examiner: Erwin Laure

Sammanfattning

Multi-core computers that share a common memory (SMP) have become the norm since Moore's law stopped holding. To exploit the performance that multiple cores offer, the software engineer needs to write programs so that they explicitly use several cores. In smaller projects this is easily overlooked, which produces programs that use only a single core. In such cases there are therefore large gains to be made by parallelizing the code. This degree project has improved the performance of a computationally heavy simulation program, written to use only one core, by finding regions of the code that are suitable for parallelization. These regions were identified with Intel's VTune Amplifier and parallelized with OpenMP. The work also replaced a particular calculation routine that was especially demanding, particularly for larger problems. The end result is a program that produces the same results as the original but considerably faster and with fewer computing resources. The program will be used in future research projects.

Abstract

Multi-core Shared Memory Parallel (SMP) systems became the norm ever since the performance trend prophesied by Moore's law ended. Correctly utilizing the performance benefits these systems offer usually requires a conscious effort from the software developer's side to enforce concurrency in the program. This is easy to disregard in small software projects and can lead to great amounts of unused potential parallelism in the produced code. This thesis attempted to improve the performance of a computationally demanding Thermal Elastohydrodynamic Lubrication (TEHL) simulation written in Fortran by finding such parallelism. The parallelization effort focused on the most demanding parts of the program identified using Intel's VTune Amplifier and was implemented using OpenMP. The thesis also documents an algorithm change that led to further improvements in terms of execution time and scalability with respect to problem size. The end result is a faster, lighter and more efficient TEHL simulator that can further support the research in its domain.


Nomenclature

API      Application Programmer Interface
ARB      Architecture Review Board
cc-NUMA  Cache Coherent NUMA
DMP      Distributed Memory Parallel
DSM      Distributed Shared Memory Parallel
DS       Direct Summation
EHL      Elastohydrodynamic Lubrication
FFT      Fast Fourier Transform
IDE      Integrated Development Environment
ifort    Intel Fortran Compiler
MKL      Intel's Math Kernel Library
MPI      Message Passing Interface
NUMA     Non-Uniform Memory Access
OpenMP   Open Multi-Processing
SMP      Shared Memory Parallel
SMT      Simultaneous Multi-Threading
TEHL     Thermal Elastohydrodynamic Lubrication
UMA      Uniform Memory Access


Contents

1 Introduction
  1.1 Background
  1.2 Scope
  1.3 Outline
2 Theory
  2.1 Thermal Elastohydrodynamic Lubrication Simulation
    2.1.1 Modeling EHL
    2.1.2 Numerical setup
    2.1.3 Thermal model
    2.1.4 Overview of the simulation
  2.2 Parallelization and OpenMP
    2.2.1 Memory models
    2.2.2 OpenMP's programming model
    2.2.3 OpenMP's work-sharing constructs
    2.2.4 Potential performance
    2.2.5 Related work
3 Methodology
  3.1 Tools
    3.1.1 Hardware
    3.1.2 Software
  3.2 Performance data collection
    3.2.1 Single-step mode
    3.2.2 Simulation Test Cases
    3.2.3 Reference Step
4 Execution
  4.1 Initial performance
  4.2 Parallelizing VI
  4.3 Parallelizing TEMP_CALC_METAL
  4.4 Addressing libm_powf_l9
  4.5 Finalizing the parallelization
  4.6 FFT VI
    4.6.1 Motivation
    4.6.2 Background
    4.6.3 Implementation
    4.6.4 Performance
  4.7 Final remarks and reflection
5 Results and discussion
  5.1 Concurrency scaling
    5.1.1 Using 1 x Intel Core i7-7800X
    5.1.2 Using 2 x Intel Xeon E5-2690 V2
  5.2 Computational scaling of the elastic deformation subroutines
  5.3 Summary and final remarks
6 Conclusions and future work
  6.1 Summary and conclusion


Tables

1  Specifications highlights of the two computers used. Obtained from [3] and [5].
2  Software versions used on the two machines.
3  Summary of the initial code analysis.
4  Summary of the analysis of the code with parallel VI.
5  Summary of the analysis of the code with parallel TEMP_CALC_METAL.
6  Caller/callee report for libm_powf_l9.
7  Summary of the analysis of the code with concurrent usage of libm_powf_l9.
8  Summary of the analysis of the final threaded code.
9  Total CPU time and CPU hotspots of the final threaded version of the code.
10 Summary of the analysis of the threaded code using the FFT method for elastic deformation.

Figures

1  Conformal (a) versus non-conformal (b) contacts.
2  Illustration of the discretization of the contact metals and lubricant film.
3  Illustration of the execution of the simulation.
4  Illustration of SMP, DMP and DSM.
5  OpenMP's fork-join model.
6  The effects of Amdahl's law illustrated for different fractions of parallelized code.
7  Illustration of the data dependency between time steps.
8  Single step process flow illustration.
9  The nested loop in TEMP_CALC_METAL.
10 The nested loop in TEMP_CALC_METAL post parallelization.
11 The calls to the demanding subroutines in LUBRICATION_TEMP.
12 The parallelization of the nested loop that calls TEMP_CALC_IJ.
13 One of the segments of NEWTONIAN parallelized.
14 Illustration of the circular convolution in the deformation calculations.
15 Illustration of deformation calculations with expanded pressure.
16 Execution time changes on the PC.
17 Obtained speedups on the PC.
18 Threading efficiency on the PC.
19 Illustration of the NUMA node's motherboard.
20 Execution time changes on the NUMA node.
21 Speedups on the NUMA node.
22 Threading efficiency on the NUMA node.


1 Introduction

This section provides the reader with a brief historical context for multi-core parallel computing. The scope of the thesis is also described here. Finally, a brief outline of the thesis is presented for the convenience of the reader.

1.1 Background

According to the popular and loosely defined Moore's law, the number of transistors that can fit on a single chip should double every year or two. This prophecy was used, for a long time, to predict the improvement in the processing power of general purpose processors [16]. Transistor count and performance were linked through the attainable clock frequencies of said processors, which corresponded with the size of the transistors. This relationship held for a long time, but around the early 2000s the clock frequencies stagnated. This was not due to a slowdown in the improvements to transistor sizes and the scale of integration possible on a single die, but rather due to power and heat dissipation issues. Maintaining the trend of performance increase required using this possible increase in chip complexity in a different way. For that reason, hardware manufacturers moved to multi-core architectures which included several processors on a single chip [2]. This would lead to a potential increase in performance, but only for software developed with that hardware in mind, meaning threaded software which can distribute its workload over multiple processors in an effective manner.

Automatic parallelization of code is still not widely adopted, which may be due to the focus of research on parallelization APIs and their implementations [15]. This implies that introducing concurrency to software requires a conscious effort from the developer. This, alongside the new concepts introduced by a concurrent programming model, can inhibit the development of parallel code when performance is not being prioritized. In turn, this leads to software with great amounts of potential parallelism that can be exploited in order to improve the performance.

In its initial state, the Thermal-Elastohydrodynamic Lubrication (TEHL) simulation software written by Carl-Magnus Everitt, a PhD candidate at KTH and one of the supervisors of this project, is entirely sequential and very computationally demanding. It can require more than a week of run time in order to complete some of the more complex cases. Concurrency was never explored during the development of the code and that represents the main reason why this thesis was proposed.

1.2 Scope

This thesis explored the parallelization of the TEHL simulation. Threading was done using OpenMP in an incremental manner: OpenMP directives were used to parallelize a demanding segment of the program, resulting in a version which was then used to identify other demanding segments. When the remaining demanding segments had small execution times compared to the overhead parallelization would add, the parallelization was deemed complete. These demanding segments are later referred to as hotspots and were obtained from Intel's VTune Amplifier, a parallel code profiler described further in Section 3.1.2.

This incremental approach produced many intermediate versions of the code with increasingly better performance. The decision to parallelize a new segment of the code is always justified by referring to the profiler's results, and the observed effect on performance is then explained. The final version of the code is run with different numbers of threads, with and without Hyper-threading, in order to examine the efficiency of Hyper-threading and how the attained performance scales with the amount of hardware used.

In addition to traditional parallelization, some optimizations to the original sequential code were done. Also, a more efficient elastic deformation calculation algorithm was implemented to speed up the final code and make it less computationally heavy. This approach was obtained from experts in this area and is considered a tangent to the original aim of the thesis. However, it is mentioned in this thesis because it led to significant improvements to the program.


1.3 Outline

The report starts with Section 2 which offers the reader an overview of the theory relevant to this work as well as some related prior work found by surveying published literature. Section 3 presents the hardware and software tools used throughout this thesis along with the general approach taken to collect representative performance data from the program. Following that, Section 4 documents the investigation carried out to determine how the code must be threaded and improved. The results of that effort were then analyzed and discussed in Section 5. Finally, the thesis is concluded with a summary and a few recommended areas of future work in Section 6.


2 Theory

This section provides the reader with the needed theoretical knowledge to understand the rest of the work. This mainly concerns the mathematical models used to numerically estimate the parameters of interest. Also, it surveys what has been done before within the area of parallelization using OpenMP. This information is used to identify the range of speed-up that is attainable when parallelizing the application to a certain number of threads. Alongside this, the literature was used to develop a suitable style to present the work in this thesis.

2.1 Thermal Elastohydrodynamic Lubrication Simulation

Elastohydrodynamic lubrication (EHL) is a type of hydrodynamic lubrication which occurs in concentrated contacts between solid materials. This concentrated nature of the interaction leads to high pressures which in turn cause significant elastic deformations to the solids. Concentrated contacts are called non-conformal, as opposed to conformal contacts in which the contact area is much larger and thus the pressure profile, over the surface of the solid within the contact area, is more uniform. For example, the contact between a plain bearing and its sleeve is conformal, while the contact between gears or between cam followers and cams is not [20]. Figure 1 illustrates the two types.

Figure 1: Conformal (a) versus non-conformal (b) contacts.

An EHL contact is described by the pressure profile over the interacting surfaces. Accurately approximating this can require the inclusion of many complex aspects of the contact. To limit the discussion to what is relevant to this thesis, this section will focus on the aspects of EHL described in Reference [8] which represent the model used in the subject simulation.

The rest of this section is organized as follows: The mathematical model of EHL as described in the simulation is presented; The discretization of the EHL model is shown; The thermal model added later is described; Finally, an overview of the code is given.

2.1.1 Modeling EHL

Carl-Magnus Everitt in Reference [8] describes EHL contacts with a system consisting of five equations. The first of these is Reynolds' equation, which can be used to obtain the pressure profile over the contact area. The equation relates the pressure to the thickness of the lubricant film, its density and its viscosity, and it is formulated as follows:

\[
\frac{\partial}{\partial x}\!\left(\frac{\rho h^3}{12\eta}\frac{\partial p}{\partial x}\right) + \frac{\partial}{\partial y}\!\left(\frac{\rho h^3}{12\eta}\frac{\partial p}{\partial y}\right) - u_m\frac{\partial (\rho h)}{\partial x} - \frac{\partial (\rho h)}{\partial t} = 0, \qquad (1)
\]

where t is the time, x and y are the surface coordinates, p represents the pressure, h, \eta and \rho represent the thickness, viscosity and density of the lubricant respectively, and u_m represents the average speed at which the lubricant enters the contact area. This speed is called the "entrainment speed" and in most cases is the average speed of the two surfaces.

Correct pressure profiles should satisfy the load balance equation, which is the second equation in this EHL model:

\[
\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} p \,\mathrm{d}x\,\mathrm{d}y = f, \qquad (2)
\]

where f is the applied load normal to the contact surface. The thickness of the film is defined by the third equation of the system as

\[
h(x,y) = h_0 - a_{sh}(x,y,t) + \frac{x^2}{2 r_x} + \frac{2}{\pi E'}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \frac{p \,\mathrm{d}x'\,\mathrm{d}y'}{\sqrt{(x-x')^2 + (y-y')^2}}, \qquad (3)
\]

where h_0 is the lubrication offset height, a_{sh}(x,y,t) is the expression defining the shape of the surface roughness or asperities on the rolling surface, r_x is the equivalent radius of the rolling cylinder and E' is the equivalent elastic modulus of the surfaces. Equivalent parameters are used here since the rolling contact problems are modeled as contact between a rolling cylinder or ball and a rigid flat surface. The equivalent radius and elastic modulus are calculated by the following:

\[
r_x = \frac{r_1 r_2}{r_1 + r_2}, \qquad (4)
\]
\[
E' = \frac{2 E_1 E_2}{(1-\nu_2^2)E_1 + (1-\nu_1^2)E_2}, \qquad (5)
\]

where r_1 and r_2 are the radii of curvature of the original surfaces, E_1 and E_2 are Young's moduli and \nu_1 and \nu_2 are Poisson's ratios for the surfaces. The elastic deformation of the solid is superimposed over the thickness of the film as the last term in Equation (3). The fourth equation of the system calculates the viscosity of the lubricant. It is called the Roelands equation and is given as

\[
\eta = \eta_0(\Gamma)\exp\!\left\{\left[\ln\big(\eta_0(\Gamma)\big) + 9.67\right]\left[-1 + \left(1 + 5.1\cdot 10^{-9}\,p\right)^{Z_R(\Gamma)}\right]\right\}, \qquad (6)
\]

where \Gamma is the temperature and \eta_0(\Gamma) is the dynamic viscosity at atmospheric pressure, given as

\[
\log\big(\eta_0(\Gamma)\big) = -4.2 + G_0\left(1 + \frac{\Gamma}{135}\right)^{-S_0}, \qquad (7)
\]

and Z_R(\Gamma) is the temperature exponent, given as

\[
Z_R(\Gamma) = D_z + C_z \log\!\left(1 + \frac{\Gamma}{135}\right). \qquad (8)
\]

The coefficients C_z, D_z, G_0 and S_0 are constants and depend on the lubricant. Finally, the last equation in this model is the pressure-density relationship given by the Dowson and Higginson relation, stated as

\[
\rho = \rho_0\left(\frac{1 + A_1 p}{1 + A_2 p}\right)\left[1 - \alpha(\Gamma - \Gamma_{40})\right], \qquad (9)
\]

where \rho_0 is the density of the lubricant at the reference temperature and \Gamma_{40} represents the reference temperature of 40 °C.
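To make the roles of Equations (6)-(9) concrete, the following is a minimal Fortran sketch, not the thesis code, of how the lubricant viscosity and density could be evaluated at a single node. The subroutine name, the constant values and the use of base-10 logarithms in Equations (7) and (8) are assumptions made purely for illustration.

SUBROUTINE LUBRICANT_PROPERTIES(P, GAMMA, ETA, RHO)
   ! Illustrative only: evaluates Eqs. (6)-(9) for one pressure/temperature pair.
   IMPLICIT NONE
   REAL, INTENT(IN)  :: P          ! pressure [Pa]
   REAL, INTENT(IN)  :: GAMMA      ! temperature [degrees C]
   REAL, INTENT(OUT) :: ETA, RHO
   REAL, PARAMETER :: G0 = 4.0, S0 = 1.3, CZ = 0.5, DZ = 0.4   ! placeholder lubricant constants
   REAL, PARAMETER :: RHO0 = 870.0, A1 = 2.3E-9, A2 = 1.7E-9   ! placeholder values
   REAL, PARAMETER :: ALPHA = 6.5E-4                           ! placeholder thermal coefficient
   REAL :: ETA0, ZR
   ETA0 = 10.0**(-4.2 + G0*(1.0 + GAMMA/135.0)**(-S0))               ! Eq. (7), assuming log10
   ZR   = DZ + CZ*LOG10(1.0 + GAMMA/135.0)                           ! Eq. (8), assuming log10
   ETA  = ETA0*EXP((LOG(ETA0) + 9.67)*(-1.0 + (1.0 + 5.1E-9*P)**ZR)) ! Eq. (6), Roelands
   RHO  = RHO0*((1.0 + A1*P)/(1.0 + A2*P))*(1.0 - ALPHA*(GAMMA - 40.0)) ! Eq. (9), Dowson-Higginson
END SUBROUTINE LUBRICANT_PROPERTIES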

The system is solved numerically in most cases due to its non-linearity. The numeric model of the system is presented in the following section.

2.1.2 Numerical setup

The numerical setup is built on the approach presented in Reference [13], which uses the Finite Difference Method (FDM) to discretize the differential equations in the system. The approach uses dimensionless parameters, some of which are given below.

\[
P = \frac{p}{p_{Hertz}}, \quad \bar{\rho} = \frac{\rho}{\rho_0}, \quad H = \frac{h r_x}{a^2}, \quad A_{sh} = \frac{a_{sh} r_x}{a^2}, \quad X = \frac{x}{a}, \quad Y = \frac{y}{a}, \quad T = \frac{t u_m}{a}, \quad \Delta T = \frac{u_m (X_e - X_0)}{N_t u_c}, \qquad (10)
\]

where X_0 and X_e are the normalized x-coordinates of the beginning and end points of the simulated [...] beginning of the simulation based on the input parameters, and a and p_Hertz are the Hertz contact half width and Hertzian pressure respectively, given as

\[
a = \sqrt{\frac{8 f r_x}{\pi E'}}, \qquad p_{Hertz} = \frac{2f}{\pi a}. \qquad (11)
\]

Also, in order to simplify the discretized expressions of the equations, two more parameters are introduced and defined as

\[
\lambda = \frac{12 u_m r_x^2}{a^3 p_{Hertz}}, \qquad \epsilon = \frac{\bar{\rho} H^3}{\bar{\eta} \lambda}. \qquad (12)
\]

With this, Reynolds' equation (Equation (1)), without the time derivative term, is discretized in space as

\[
\frac{\partial}{\partial X}\left(\epsilon_{i,j}\frac{\partial P_{i,j}}{\partial X}\right) + \frac{\partial}{\partial Y}\left(\epsilon_{i,j}\frac{\partial P_{i,j}}{\partial Y}\right) - \frac{\partial \bar{\rho}_{i,j} H_{i,j}}{\partial X} = 0. \qquad (13)
\]

This is used to obtain the pressure profile of the time independent solution without the asperities in the contact area. The initial film thickness (h_0 in Equation (3)) is also obtained during this phase. The derivative terms in Equation (13) are defined as follows:

\[
\frac{\partial}{\partial X}\left(\epsilon_{i,j}\frac{\partial P_{i,j}}{\partial X}\right) = \frac{(\epsilon_{i+1,j}+\epsilon_{i,j})P_{i+1,j} - (2\epsilon_{i,j}+\epsilon_{i+1,j}+\epsilon_{i-1,j})P_{i,j} + (\epsilon_{i,j}+\epsilon_{i-1,j})P_{i-1,j}}{2\Delta X^2}
\]
\[
\frac{\partial \bar{\rho}_{i,j} H_{i,j}}{\partial X} = \bar{\rho}_{i,j}\,\frac{3H_{i,j} - 4H_{i-1,j} + H_{i-2,j}}{\Delta X} + H_{i,j}\,\frac{\partial \bar{\rho}_{i,j}}{\partial P_{i,j}}\,\frac{3P_{i,j} - 4P_{i-1,j} + P_{i-2,j}}{\Delta X} \qquad (14)
\]

The partial derivative with respect to Y is analogous to the X partial derivative. This discretization in space implies that the simulated contact area is split into nodes arranged on a 2D grid. This says nothing about the shape of the grid, but Reference [8] makes it a square for simplicity of representation.

To obtain the time dependent solution, the full expression of Reynolds' equation is discretized using the Crank-Nicolson discretization scheme, which produces the following:

\[
\left[\frac{\partial}{\partial X}\left(\epsilon_{i,j}\frac{\partial P_{i,j}}{\partial X}\right)\right]_{t_{n+1}} + \left[\frac{\partial}{\partial Y}\left(\epsilon_{i,j}\frac{\partial P_{i,j}}{\partial Y}\right)\right]_{t_{n+1}} - \left[\frac{\partial \bar{\rho}_{i,j} H_{i,j}}{\partial X}\right]_{t_{n+1}} + \left[\frac{\partial}{\partial X}\left(\epsilon_{i,j}\frac{\partial P_{i,j}}{\partial X}\right)\right]_{t_n} + \left[\frac{\partial}{\partial Y}\left(\epsilon_{i,j}\frac{\partial P_{i,j}}{\partial Y}\right)\right]_{t_n} - \left[\frac{\partial \bar{\rho}_{i,j} H_{i,j}}{\partial X}\right]_{t_n} - 2\,\frac{(\bar{\rho}_{i,j} H_{i,j})_{t_{n+1}} - (\bar{\rho}_{i,j} H_{i,j})_{t_n}}{\Delta T} = 0 \qquad (15)
\]

The other equation that needs discretization is Equation (3), which models the thickness of the lubricant film. The focus of the discretization is the elastic deformation term, which includes a surface integral. The discretization is given as follows:

\[
H_{i,j} = H_0 + X_i^2 - A_{sh}(i,j,t) + \frac{2}{\pi E'}\sum_{k=1}^{N_x}\;\sum_{l=j-N_x}^{j+N_x} P(k,l)\,(A_D + B_D + C_D + D_D) \qquad (16)
\]

The terms A_D to D_D are added to simplify the expression and, due to their size, they are kept out of this section. They can be found in Equation 22 in [8]. The load balance equation is another surface integral over the x-y plane but with a simpler expression to integrate. Reference [8] does not show this discretization and thus this section will not either. Finally, the viscosity and density expressions do not include any continuous calculus and are therefore not discretized. They do, however, use the unit-less parameters shown above.
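As a concrete illustration of the stencils above, the following Fortran function is a sketch, under assumed array names and not taken from the thesis code, that evaluates the left-hand side of Equation (13) at one interior node using the finite differences of Equation (14).

FUNCTION REYNOLDS_RESIDUAL(I, J, N, P, EPS, RO, H, DRODP, DX) RESULT(RES)
   ! Illustrative only: EPS holds epsilon, P the dimensionless pressure, RO the
   ! dimensionless density, H the film thickness, DRODP the density-pressure
   ! derivative and DX the (uniform) grid spacing. Valid for interior nodes
   ! with I >= 3 because of the backward differences.
   IMPLICIT NONE
   INTEGER, INTENT(IN) :: I, J, N
   REAL,    INTENT(IN) :: P(N,N), EPS(N,N), RO(N,N), H(N,N), DRODP(N,N), DX
   REAL :: RES, DPX, DPY, DRHX
   DPX = ((EPS(I+1,J)+EPS(I,J))*P(I+1,J)                                  &
        - (2.0*EPS(I,J)+EPS(I+1,J)+EPS(I-1,J))*P(I,J)                     &
        + (EPS(I,J)+EPS(I-1,J))*P(I-1,J)) / (2.0*DX*DX)
   DPY = ((EPS(I,J+1)+EPS(I,J))*P(I,J+1)                                  &
        - (2.0*EPS(I,J)+EPS(I,J+1)+EPS(I,J-1))*P(I,J)                     &
        + (EPS(I,J)+EPS(I,J-1))*P(I,J-1)) / (2.0*DX*DX)
   DRHX = RO(I,J)*(3.0*H(I,J) - 4.0*H(I-1,J) + H(I-2,J))/DX               &
        + H(I,J)*DRODP(I,J)*(3.0*P(I,J) - 4.0*P(I-1,J) + P(I-2,J))/DX
   RES = DPX + DPY - DRHX            ! left-hand side of Eq. (13) at node (I,J)
END FUNCTION REYNOLDS_RESIDUAL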

2.1.3 Thermal model

In his later work, Carl-Magnus Everitt added a model of the thermal changes that can occur in a contact in order to more accurately simulate the lubricant film thickness. The model with this added detail is called Thermal Elastohydrodynamic Lubrication (TEHL) [9].


The main assumption made in this model is that the temperature is constant throughout the thickness of the lubricant film. This means that the fluid temperature term is calculated over a 2D grid. The same assumption is not made for the solids in the model, however, and therefore every metal in the contact has its own 3D grid. The number of nodes into the metals is set to 40 nodes for each node on the 2D grid of the lubricant. The discretization is shown in Figure 2 below.

Figure 2: Illustration of the discretization of the contact metals and lubricant film. The film consists of a single layer of nodes while the metals contain multiple layers.

Heat transfer is computed between every node and its direct neighbors, as illustrated by the arrows in the figure, in every iteration of the algorithm. The algorithm concludes when equilibrium is reached.
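The following Fortran fragment is a generic explicit relaxation sketch of this idea and not the thermal model used in the thesis: each interior node exchanges heat with its six direct neighbours until the largest update falls below a tolerance. The subroutine name, the array layout, the coefficient ALPHA and the tolerance are placeholders.

SUBROUTINE HEAT_RELAX_DEMO(T, NX, NY, NZ)
   ! Illustrative only: simple Jacobi-style update on a 3D temperature grid.
   IMPLICIT NONE
   INTEGER, INTENT(IN)    :: NX, NY, NZ
   REAL,    INTENT(INOUT) :: T(NX,NY,NZ)
   REAL, PARAMETER :: ALPHA = 0.1, TOL = 1.0E-6   ! placeholder coefficient and tolerance
   REAL    :: TNEW(NX,NY,NZ), CHANGE
   INTEGER :: I, J, K
   CHANGE = 2.0*TOL
   DO WHILE (CHANGE > TOL)
      CHANGE = 0.0
      TNEW = T
      DO K = 2, NZ-1
         DO J = 2, NY-1
            DO I = 2, NX-1
               TNEW(I,J,K) = T(I,J,K) + ALPHA*(T(I+1,J,K) + T(I-1,J,K)    &
                           + T(I,J+1,K) + T(I,J-1,K) + T(I,J,K+1)         &
                           + T(I,J,K-1) - 6.0*T(I,J,K))
               CHANGE = MAX(CHANGE, ABS(TNEW(I,J,K) - T(I,J,K)))
            END DO
         END DO
      END DO
      T = TNEW
   END DO
END SUBROUTINE HEAT_RELAX_DEMO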

2.1.4 Overview of the simulation

An overview of the steps taken to solve for the pressure at every time step is shown in Figure 3 below. This description is based on [13] which describes the model on which this simulation was built. Also, some of the details were obtained from examining the source code. This does not serve as a comprehensive description of the code and it is only shown to illustrate how the system is solved.

2.2 Parallelization and OpenMP

The main scope of this work is to get the most out of a single shared memory parallel computing platform. Another important aspect of this thesis is to avoid reducing the portability of the original code by making it hardware specific. With those conditions in mind, the tool chosen to parallelize the code was OpenMP.

OpenMP stands for Open Multi-Processing, and it is an Application-Programmer Interface (API) which facilitates the creation of multi-threaded code. The API consists of compiler directives, environment variables and run time functions. It is developed by the OpenMP Architecture Review Board (ARB) which contains representatives from the biggest semiconductor vendors. The board represents a joint effort by the vendors to create a common tool to facilitate multi-core parallelism on a wide array of computing platforms [2]. The ARB produces the specification of the API which is then implemented by compiler vendors or developers.

Figure 3: Illustration of the execution of the simulation.

This section presents a survey of what is found in the literature regarding parallelization using OpenMP. The rest of the section is organized as follows: different memory models are disambiguated; OpenMP's programming model is discussed; the main work-sharing methods in OpenMP are briefly described; Amdahl's law is introduced to illustrate the potential speedups attainable theoretically; finally, published literature about parallelizing using OpenMP is briefly described.

2.2.1 Memory models

The first memory model discussed is the SMP model. SMP architectures are the most common type of parallel computers used today where almost every modern personal computer contains two physical cores or more. In these architectures every processor has access to a single physical shared memory. This access can either be symmetric or asymmetric for every processor in the system. The former type of architectures is called Uniform Memory Access (UMA) SMP while the latter is called Non Uniform Memory Access (NUMA) SMP. Asymmetrical access can be due to the memory being physically closer to some processors in the system than others. Therefore, most big multi-processor architectures are NUMA-SMP. The communication mechanisms used in these types of systems usually ensure cache coherence (cc) across processors and these types of architectures are then called cc-NUMA-SMP. OpenMP specifications assume SMP architectures and while using it the programmer could usually ignore the differences between the subcategories in this model [2].

The second memory model is the Distributed Memory Parallel (DMP) model. These systems are made up of multiple independent computers connected by a network. These machines could then be used to collaboratively execute programs. In this case, every individual machine has its own memory and its portion of the data needed for the program. The machines can communicate and exchange data by passing messages over the network. The most common tool to use with these types of systems is the Message Passing Interface (MPI). If the system is made up of SMP machines then OpenMP can be used locally alongside MPI in order to exploit the potential computation power of each machine in the system [2].

The third memory model is derived from the first two and is called Distributed Shared Memory Parallel (DSM). These are distributed systems which allow every node to access the local memory of any other machine in the system. In this case the distributed memory is modeled as one big shared memory which is accessed non-uniformly. The communication scheme across the network can also ensure cache memory (or local memory) coherence so that changes to the same data are recorded in every machine that has a copy of said data. Also, heterogeneous architectures which consist of SMP systems augmented with co-processors or accelerators are categorized under this model since multiple address spaces can be in use in these systems [11]. OpenMP can be used with DSM architectures if the memory sharing is guaranteed by the system. Moreover, OpenMP specification 4.0 (and later) provides explicit support for heterogeneous systems due to their popularity [18].
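As an aside, the device-offload support mentioned above looks as follows in Fortran. Offloading was not used in this thesis; the subroutine and arrays here are hypothetical and only illustrate the OpenMP 4.x target construct.

SUBROUTINE TARGET_DEMO(A, B, C, N)
   ! Illustrative only: offloads a simple vector addition to an available device.
   IMPLICIT NONE
   INTEGER, INTENT(IN)  :: N
   REAL,    INTENT(IN)  :: A(N), B(N)
   REAL,    INTENT(OUT) :: C(N)
   INTEGER :: I
!$OMP TARGET TEAMS DISTRIBUTE PARALLEL DO MAP(TO: A, B) MAP(FROM: C)
   DO I = 1, N
      C(I) = A(I) + B(I)
   END DO
!$OMP END TARGET TEAMS DISTRIBUTE PARALLEL DO
END SUBROUTINE TARGET_DEMO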

The three memory models are illustrated in Figure 4 below.

Figure 4: Illustration of the three memory models. (a) depicts an SMP model, (b) is a DMP model and (c) is a DSM model. The figure was adapted from Figure 1.2 in [2].

2.2.2 OpenMP’s programming model

The OpenMP specification has changed significantly over the years with new features, that support different architectures, being constantly added. That being said, the core programming model is kept constant in order to allow for backward compatibility and easy portability. This model, along with the general terminology used with OpenMP applications and a brief introduction to some of its directives, is described below.

The programming model OpenMP uses is characterized by shared memory parallelism which is assumed in OpenMP applications. The model enables the creation and high-level control of threads which are streams of instructions assigned to processors. In OpenMP threads are created and destroyed when needed using what is called the Fork-Join model illustrated in Figure 5 below. In this model the program is always executed by a single thread initially and at the end. This thread is called the initial thread. More threads are created, and work is distributed among them, when a compiler directive indicating the beginning of a parallel region is encountered. This is referred to as the fork. Also, within the parallel region the initial thread is called the master thread. When the corresponding directive indicating the end of the parallel region is encountered, all the threads but the initial/master thread terminate. This is referred to as the join. The threads share the common address space but can, and usually do, have private data.

Figure 5: The fork-join model used in OpenMP. Fortran compiler directives that indicate the beginning and end of a parallel region are shown on the right.

In this process OpenMP allows the user to:

• Create threads
• Define which of the variables in the scope are shared among threads and which should be private for each thread
• Synchronize threads

All of these are done mainly through compiler directives. The one discussed further here is work sharing, since the rest are either intuitive or not necessarily done explicitly by the user (as in the case of thread synchronization).
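A minimal, self-contained fork-join example, separate from the thesis code, is shown below: the initial thread forks a team at the PARALLEL directive and the team joins again at END PARALLEL.

PROGRAM FORK_JOIN_DEMO
   ! Illustrative only: prints one line per thread inside the parallel region.
   USE OMP_LIB
   IMPLICIT NONE
   PRINT *, 'Before the parallel region: one (initial) thread'
!$OMP PARALLEL
   PRINT *, 'Hello from thread', OMP_GET_THREAD_NUM(), 'of', OMP_GET_NUM_THREADS()
!$OMP END PARALLEL
   PRINT *, 'After the join: one thread again'
END PROGRAM FORK_JOIN_DEMO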

2.2.3 OpenMP’s work-sharing constructs

Work sharing is described within parallel regions, right before the block to be parallelized, through directives called work-sharing constructs. These include the loop construct, the sections construct, the single construct and the workshare construct in Fortran. The loop construct tells the compiler to distribute the iterations of the loop over the threads. The user can further guide how this should be done by specifying the method using optional clauses with the constructs. Different work distribution methods can lead to a reduced imbalance between the threads at run time. The sections construct can be used to specify blocks that can be done concurrently by multiple threads. This construct can be used to easily pipeline segments of a program. The single construct, as the name entails, is used inside parallel regions to limit the number of threads that can execute a block to one. Finally, the Fortran workshare construct is used with Fortran array operations to parallelize array manipulations.
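The following self-contained sketch, with hypothetical arrays and bounds rather than thesis code, shows the loop, sections, single and workshare constructs side by side inside one parallel region.

SUBROUTINE WORKSHARE_DEMO(A, B, C, N)
   ! Illustrative only: demonstrates the four work-sharing constructs.
   IMPLICIT NONE
   INTEGER, INTENT(IN)    :: N
   REAL,    INTENT(INOUT) :: A(N), B(N), C(N)
   INTEGER :: I
!$OMP PARALLEL SHARED(A, B, C, N) PRIVATE(I)
!$OMP DO SCHEDULE(STATIC)        ! loop construct: iterations split over the threads
   DO I = 1, N
      A(I) = 2.0*B(I)
   END DO
!$OMP END DO
!$OMP SECTIONS                   ! sections construct: independent blocks run concurrently
!$OMP SECTION
   A(1) = 0.0
!$OMP SECTION
   B(1) = 0.0
!$OMP END SECTIONS
!$OMP SINGLE                     ! single construct: executed by exactly one thread
   PRINT *, 'Loop and sections finished'
!$OMP END SINGLE
!$OMP WORKSHARE                  ! Fortran workshare construct: array syntax operations
   C = A + B
!$OMP END WORKSHARE
!$OMP END PARALLEL
END SUBROUTINE WORKSHARE_DEMO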

2.2.4 Potential performance

In general when N processors are used the execution time should reduce to 1/N of its original length. This represents the ideal situation which is usually not achieved due to the overhead OpenMP adds to the execution of the program. Another reason is Amdahl’s law which states that the sequential part of a program dominates the execution time after a certain number of threads are used to execute the parallel part. This partitioning of the program is assumed because every program has parts that can be done concurrently and parts that must be done sequentially. Also, in big programs it can be difficult to identify every parallelizable part. Amdahl’s law can be formulated as follows:

\[
S = \frac{1}{f_{par}/P + (1 - f_{par})}, \qquad (17)
\]

where f_par is the fraction of the code which has been parallelized, P is the number of processors used and S is the expected speedup. This, therefore, can create a limit to the linear speedup obtained when increasing the number of processors executing the code, as shown in Figure 6.
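As an illustrative worked example (the numbers are chosen for illustration and were not measured in this work): with 95% of the run time parallelized and six cores, Equation (17) gives

\[
S = \frac{1}{0.95/6 + 0.05} \approx 4.8,
\]

so even a highly parallel code falls well short of the ideal six-fold speedup.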


Figure 6: The effects of Amdahl's law illustrated for different fractions of parallelized code.

That being said, the speedup can also be affected by other factors such as the aggregate cache memory and the problem size. The first is caused by using multiple processors, each with their own cache memory, which lets a bigger portion of the program's data reside in the faster cache memory. The second is due to increasing the time a single processor spends on the parallelized region of the code, which then leads to a significantly smaller increase in the execution time of the same code when it is done by a team of threads. With these two factors, super-linear speedups (where the speedup is more than the number of processors used) can be encountered.

2.2.5 Related work

Many examples from published literature examine the performance due to parallelization using OpenMP in different areas of science. These articles were looked at to create a base for the approach to take in this thesis. This section briefly summarizes the interesting points from the articles considered.

Reference [17] reports an attempt to parallelize FEAP, an open source finite element analysis program written by members of the Department of Civil and Environmental Engineering at the University of California, Berkeley. The authors targeted the subroutine which required the most execution time. Parallelization was done by splitting up the iterations of the main loop over a number of threads. The performance was analyzed with respect to the problem size (in this case the number of elements) and the number of threads. Increasing the problem size in this case led to an overall slower execution time for every number of threads. The trend in the speedup observed was linear, and equal to the number of threads used, up to eight threads and sub-linear for 16 and 24 threads.

Reference [10] attempts to improve the performance of ClamAV, an open source anti-virus software, by partially parallelizing it using OpenMP. This work targeted the string search functions which represented 52.86% of the execution time as reported in the paper. The functions were called from within loop nests to cover a collection of files and thus the loop work-sharing construct was used. The analysis looked at the effect of using a different number of threads and at varying the method used to distribute the iterations of the loops over the threads. The speedups observed were always sub-linear and less than the number of threads used. The best speed-up was around 2.6 and occurred when using four threads.

Reference [12] explored the parallelization of the Modular Transport 3 Dimensional Multispecies Transport Model (MT3DMS) using OpenMP. Like the previous examples, this one also attempted to parallelize the most time consuming parts of the software. The segment modified represented 96% of the execution time. The analysis looked at the relationship between the number of threads and the speed-up obtained. The speedup was sub-linear for every thread count exceeding one and the maximum speed-up obtained was 4.15 at eight threads.

Reference [19] is about parallelizing a DNA sequence search algorithm. This reference examines different tiling methods to work around the data dependencies in the model and exploit possible parallelism. The work does not look at the effects of varying the problem size or thread number. Also, it is not clear if four or eight threads were used in this work since it does not mention if the Intel platform used supports hyper-threading. That being said the speedup obtained decreased for bigger tile sizes and its maximum value was 7.5.


3 Methodology

This section of the thesis describes the tools used to achieve the results presented in the later sections. This information can be used, if access to the original code is possible, to recreate these results. This section also documents the approach taken to identify samples of performance data considered representative of the worst-case total execution time of the program. Terms such as "Simulation Test Cases" and "Reference Time Steps" are related to the aforementioned approach and are defined and explained here as well.

3.1 Tools

The platforms used to collect the performance data of the TEHL simulation are documented under the “Hardware” header. The threading of the code was also aided by several software tools and those are described under the “Software” header.

It should be noted that in this section, and the rest of the thesis for that matter, the term “Processor” refers to the complete single chip die while the term “Core” refers to a single processing unit inside the chip. Therefore, a processor can contain multiple cores but not vice versa.

3.1.1 Hardware

This project used two different machines to assess the performance of the code. The first of which was an Intel based personal desktop with an i7-7800X processor which has 6 physical cores running at a base frequency of 3.5 GHz. The second was a single node of a supercomputer on Scania’s premises. The node is Intel based with two E5-2690V2 processors which have 10 physical cores each and run at a base frequency of 3.0 GHz. Both processors used support Hyper-threading, which means each physical core could be used as two logical cores. This means 12 and 40 logical cores on the desktop and the supercomputer node respectively. The specifications of the two machines are summarized in Table 1 below.

Property                    Desktop computer           Supercomputer node
Processor(s)                1 x Intel Core i7-7800X    2 x Intel Xeon E5-2690 V2
Number of physical cores    1 x 6                      2 x 10
Base frequency              3.5 GHz                    3.0 GHz
Maximum Turbo frequency     4.0 GHz                    3.6 GHz
Hyper-threading             Supported                  Supported
L3 cache memory size        8.25 MB                    25 MB

Table 1: Specifications highlights of the two computers used. Obtained from [3] and [5].

The logical cores added by hyper-threading can lead to inaccurate performance results if one is not careful with how the threads are distributed over the hardware. This was dealt with by using the affinity interface offered by the Intel run time library which is automatically linked by the compilers used [4, 6]. More information about this can be found in Section 5 which compares the performance obtained when using 1 thread per core and 2 threads per core.

The approach taken to ensure that the clock frequencies did not fluctuate significantly during a slightly periodic workload was to set the computer to high-performance mode. It was assumed that this way the clock frequency was maintained around the highest attainable value for the system, which is usually somewhere between the base frequency and the "Maximum Turbo frequency". Again, this was easy to change on the personal computer but not on the node. Luckily, however, the supercomputer node is always set to high-performance mode.

3.1.2 Software

The software tools used in this project consist of compilers, APIs and a code profiler. The compiler used was Intel's Fortran Compiler (ifort) which was available for both Windows and Linux. The compiler implements OpenMP which in turn is one of the APIs used in this project. The second API used was Intel's Math Kernel Library (MKL) which provides easy to use and highly optimized implementations of complex mathematical operations and algorithms. MKL was used to implement the Fast Fourier Transform (FFT) approach used to calculate the elastic deformation. Finally, the code profiler used was Intel's VTune Amplifier.
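For orientation, the following is a minimal sketch of what an MKL FFT call looks like through the DFTI interface. It is illustrative only and is not the elastic deformation routine developed in this thesis (that implementation is described in Section 4.6); the transform length is an arbitrary placeholder.

PROGRAM MKL_FFT_DEMO
   ! Illustrative only: in-place forward and backward complex FFT of length N.
   ! Requires compiling MKL's mkl_dfti.f90 module and linking against MKL.
   USE MKL_DFTI
   IMPLICIT NONE
   INTEGER, PARAMETER :: N = 1024
   COMPLEX(KIND=8) :: X(N)
   TYPE(DFTI_DESCRIPTOR), POINTER :: HANDLE
   INTEGER :: STATUS
   X = (0.0D0, 0.0D0)
   STATUS = DftiCreateDescriptor(HANDLE, DFTI_DOUBLE, DFTI_COMPLEX, 1, N)
   STATUS = DftiCommitDescriptor(HANDLE)
   STATUS = DftiComputeForward(HANDLE, X)      ! forward transform, in place
   STATUS = DftiComputeBackward(HANDLE, X)     ! inverse transform, in place
   STATUS = DftiFreeDescriptor(HANDLE)
END PROGRAM MKL_FFT_DEMO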

Since two computers were used, different versions of the tools were used as necessary. For instance, different versions of ifort were used. This is due to having access to different licenses on each machine. Also, the two compilers implement different OpenMP standards and therefore they differ in the version of the OpenMP API used. VTune Amplifier was not installed on the supercomputer node since it was not included in the license package of the tools used on that machine. The main software differences between the two machines are summarized in Table 2. This table also serves to highlight the specific versions of the main tools used.

Property            Desktop computer               Supercomputer node
Operating system    Windows 10 Enterprise          Red Hat Enterprise Linux release 6.4
Fortran compiler    ifort 19.0                     ifort 12.0
OpenMP standard     OpenMP 4.5 and 5 (partially)   OpenMP 3.1
VTune Amplifier     2019 update 3                  Not installed
Intel's MKL         2019 update 3                  2019 update 3

Table 2: Software versions used on the two machines.

The differing compilers meant that some of the newer Fortran language features could not be included in the project, so as not to affect the portability of the source code between the two machines. Therefore this was consciously avoided while making changes to the program. Similarly with OpenMP, most of the features added between the 3.1 and the 4.5 standards had to be ignored. None of the new OpenMP features seemed necessary for this project. However, it could be interesting to explore explicit vectorization, which was introduced in OpenMP 4.5.

Intel's VTune Amplifier was selected because it is easy to use and provides a comprehensive analysis of the code in terms of hotspots, hardware utilization, threading efficiency and more. The hotspots and threading efficiency analyses were used the most during this project. The hotspots analysis identifies areas worth parallelizing, and the threading efficiency analysis enables fine-tuning the work distribution among threads. Most of the performance data presented in the following sections was collected using VTune Amplifier.

Finally, and for the sake of comprehensiveness, the Integrated Development Environment (IDE) used to debug and edit the code was Microsoft’s Visual Studio 2017. Again, it was mainly selected due to its accessibility and since it integrates easily with ifort 19.0 which was the compiler used during the development of the code.

3.2 Performance data collection

Since a single simulation can take more than a week of execution time on a single thread, it was decided that the performance of smaller parts of the simulation should be looked at instead. The selected parts had to accurately represent the performance characteristics of the entire simulation so that improvements to the execution time of these parts would directly correspond to that of the simulation.

In the case of this thesis it was intuitive to break the simulation into smaller parts since the “Time dependent solution” part of the simulation calculates a set of solutions that correspond to a set of time steps as shown in Figure 3. Running the program to solve for a single time step required some changes to the code. This is described briefly under the “Single-step mode” header. The “Simulation Test Cases” subsection explains how the input file can affect the coverage of the code significantly. Finally, this section ends with describing the selected time step and the reasoning behind this selection under the “Reference Step” header.

3.2.1 Single-step mode

Creating the single time-step solver (referred to henceforth as "Single-step mode") required understanding the data dependency between one time step and another. In this program a time step uses data calculated in the previous and the second to previous time steps, in addition to the lubrication offset height calculated in the time independent part of the simulation. A simplified example of this is presented in Figure 7 below, which shows the "Time dependent solution" part of Figure 3.

Figure 7: Illustration of the data dependency between time steps.

In the figure, the subscripts "1" and "2" refer to data calculated in the previous and second to previous time steps respectively, while the subscript "0" refers to data calculated in the time independent part of the simulation. The data from the second to previous time step is used to calculate the right hand side residuals of Reynolds' equation (Equation (1)) when using a different time dependence method than the one shown in Equation 1.

The figure does not represent the approach taken to determine all the dependency between the time steps and serves as an illustration only. In reality the paths that can be taken through the code during a single step were examined and all the uninitialized data used (hence data normally written in a different time step) was listed. Therefore, a good understanding of the underlying mathematical model was not necessary to achieve this.

Collecting the required data was the next step in making this mode. This was straightforward and involved outputting all the identified data to files once each time step finishes executing. This created another mode for the program which can be called “data collection” mode. The entire simulation had to run completely in data collection mode before single-step mode could be used. To collect the needed data quickly a partially parallelized version of the program was used. This version is highlighted later in Section 4.2.

Finally, after identifying and obtaining the dependency data, the single-step mode was as simple as reading the input parameters, reading the dependency data, jumping to the beginning of a time dependent time-step solution and letting the program continue until the step is completed. At that point a flag is checked to determine if single-step mode was selected, and in that case the program terminates. This is illustrated in Figure 8.

Figure 8: Illustration of the code path taken when single-step mode is selected. The "Time independent solution" section is minimized except for the "Start" and "Read input setup file" processes.

3.2.2 Simulation Test Cases

As shown in the first illustration of the simulation in Figure 3, as well as in Figure 8, an input file is read initially. This file contains around a hundred input parameters which allow the user to customize the analysis to some extent. For example, through these parameters, the resolution of the finite grid is defined as well as the speed the asperity moves at inside the contact area. These two properties directly affect the execution time since the first controls the size of the model and the second controls the number of time steps in the time dependent solution. Some of the other input parameters affect which parts of the code are visited in a given simulation. This can indirectly affect execution time since some of the more costly subroutines are optional or have multiple possible paths through them with varying lengths. For example, the entire dynamic thermal model is optional and is only enabled if the respective selector input parameter is set to a specific value. Another example is shown in Listing 1, where the parameter Geom determines whether the first path or the second path is taken through the subroutine, with the latter having a higher potential execution time since it contains triple the amount of nested DO loops. The main workload in the loops is truncated for the sake of clarity, but it should be noted that at the innermost level all the nested loops have similar workloads.

IF(Geom .EQ. 2 .OR. Geom .EQ. 3 .OR. Geom .EQ. 6) THEN
   DO J=1,NN,SS              ! ====== Path 1 ======
      DO I=1,NX,SS
         ...
         DO L=1,NYs,SS
            ...
            DO K=1,NX,SS
               ...
            END DO
         END DO
         ...
      END DO
   END DO                    ! ====== End of Path 1 ======
ELSE
   DO J=1,NN,SS              ! ====== Path 2 ======
      DO I=1,NX,SS
         ...
         DO L=-NX+J-1,1-SS,SS
            ...
            DO K=1,NX,SS
               ...
            END DO
         END DO
         DO L=NYs+SS,NX+J,SS
            ...
            DO K=1,NX,SS
               ...
            END DO
         END DO
         ...
      END DO
   END DO
   DO J=1,NN,SS
      DO I=1,NX,SS
         ...
         DO L=-(J-1),NYs-J,SS
            ...
            DO K=1,NX,SS
               ...
            END DO
         END DO
         ...
      END DO
   END DO
END IF

Listing 1: A shortened segment of the original elastic deformation calculation subroutine demonstrating the potential influence of input parameters on the execution time.

A “Simulation Test Case” is completely defined by an input file. The test cases were provided by the author of the code and they are simulation set-ups used by him to either recreate results found in the literature or to create original results for his research. Since blindly testing every possible combination of inputs is not feasible, the tests in this thesis focused on some of these test cases. Initially, one that does not use the thermal model was profiled and then used to create the initial parallel version of the code. Later, a test case which uses the thermal model, and which was being used by the author of the code to write the recently published paper (at the time of writing this report) cited as Reference [9] in this thesis, was taken as a comprehensive and representative test case and was used to create the final threaded version of the code. The selected test case is described more in the following subsection.

3.2.3 Reference Step

As mentioned in the introduction to Section 3.2, the simulation was split into smaller parts of a single time step each. Then some of these time steps were studied as samples with behaviour representative to that of the full simulation. Not every time step is interesting to study since some of them have a very short execution time due to various reasons. For example, the first few time steps of simulations with small localized asperities seem to have short execution time since during those steps the asperity would still be behind the high pressure contact area and close to the edge of the simulated surface. This means that the input pressure would significantly resemble a correct output since no surface disturbance is introduced yet.

A good reference time step should ideally have the worst possible execution time in a simulation. This way an upper bound on the execution time of the simulation can be placed given the total number of time steps in the simulation. Also improvements to the execution of a long time step will be easier to observe and measure especially since speed-ups due to multi-core parallelism can be severely capped by Amdahl’s law when the parallelized segment of the program is small as shown in Figure 6.

Given the aforementioned reasons, the selected time step was a step that does not lead to a convergent solution. This means the program depletes all the allowed iterations attempting to find a solution that satisfies the convergence conditions without succeeding. However, the result is usually close enough to what is expected and the divergence is usually due to the very conservative tolerance limits set for the entire simulation. This leads to having time steps with the longest execution times in the simulation since all the parts of the program that contribute to finding a solution will be executed more often. Therefore, the way this selection was made was intuitive since a failed time step is very close to the ideal reference time step described above. The selected time step was the 423rd out of 514 of a test case which looks at the effect of a single asperity surface detail on the pressure, temperature and lubrication thickness of a TEHL contact area. This information is not intended to be comprehensive since describing the step and the test case properly would require going into technicalities which are out of the scope of this paper. However, this can still help any result recreation effort if the original code along with the test case setup file were obtained.


4 Execution

This section attempts to capture the exploratory nature of this study, introduced by the incremental parallelization of the code. The performance of the code is presented before, during and after the parallelization effort. The last part of this section documents the FFT elastic deformation implementation and presents the performance improvement which resulted from it. The code is segmented into subroutines, which are the smallest parts referred to in the analyses shown below. Unless stated otherwise, all of the following performance data is collected using Intel's VTune Amplifier from the reference time step described in Section 3.2.3 running on the "Desktop computer" machine described in Section 3.1.1. It should also be noted that all sequential optimizations carried out on the code, except for the FFT approach, are included in the initial performance analysis and not documented any further in this report. This is because those were deemed unfitting to the scope of the study.

4.1 Initial performance

The initial code was profiled to determine the starting performance, in terms of execution time, and the biggest hotspots in the code. The summary of the analysis is presented in Table 3. The format of this table is used throughout this section to document the analyses conducted on the code. The left segment of the table displays execution time information while the right segment illustrates the concurrency of the code at the current parallelization step. The execution time information is split into “total wall time”, which represents the total elapsed time from the beginning of the execution till its conclusion, and the parts of this total spent on the five biggest serial hotspots in the program at the current parallelization step. The serial hotspots represent the demanding parts of the program that are executed using a single core only. Therefore, adding up the wall time spent in the serial hotspots will give the amount of wall time spent executing the sequential parts of the program and not the total execution time. This difference is not apparent now but will become so in the later sections.

Serial hotspots      Wall time
VI                    908.563 s
TEMP_CALC_METAL       414.924 s
libm_powf_l9          104.725 s
TEMP_CALC_IJ           43.319 s
LUBRICATION_TEMP       41.749 s
Total wall time      1787.076 s

Table 3: Summary of the initial code analysis. [The CPU usage histogram column of the original table is not reproduced.]

As shown in the table, the subroutine VI represented around 51% of the execution time initially. This subroutine calculates the elastic deformation, caused in the contact area by the high pressure, using the direct summation method. The rightmost term in Equation 16 shows the calculation method described. For more information about the direct summation method, for calculating the elastic deformation, the reader is referred to Reference [1].

Following VI, the 2nd, 4th and 5th highest execution times are caused by temperature model subroutines. These subroutines are optional and only used when the dynamic temperature model is selected. This means that VI would represent a significantly larger majority of the execution time in the test cases that do not use this model. Therefore, VI was the starting point of the parallelization effort.

Finally, it should be noted that libm_powf_l9 is one of the implementations of the floating point power calculation intrinsic function used in Fortran and included in a standard math library linked automatically by ifort. The source of this implementation is not accessible and for that reason the approach taken to address this was slightly different.


4.2 Parallelizing VI

Following the initial analysis results, VI was identified as the most significant hotspot in the program. To parallelize this subroutine, its source code had to be examined. Listing 1 in the previous section showed most of the subroutine, shortened for readability. Listing 2 below highlights one of the biggest nested loops in the subroutine with the workload in the loops shown. While this loop is not necessarily executed, it is enough to demonstrate how this subroutine was parallelized.

IF(Geom .EQ. 2 .OR. Geom .EQ. 3 .OR. Geom .EQ. 6) THEN
   ...
ELSE
   DO J=1,NN,SS              ! ====== Path 2 ======
      DO I=1,NX,SS
         H0=0.0
         DO L=-NX+J-1,1-SS,SS
            LL=abs((J-L)/SS)
            DO K=1,NX,SS
               IK=IABS(I-K)/SS
               H0=H0+AK(IK,LL)*(P_line(K))
            END DO
         END DO
         DO L=NYs+SS,NX+J,SS
            LL=abs((L-J)/SS)
            DO K=1,NX,SS
               IK=IABS(I-K)/SS
               H0=H0+AK(IK,LL)*(P_line(K))
            END DO
         END DO
         Wside(i,j)=H0
      END DO
   END DO
   ...
END IF

Listing 2: Shortened segment of the original elastic deformation calculation subroutine showing one of the nested loops.

Using OpenMP to parallelize such a segment is trivial since every iteration of the topmost loop is independent from all other iterations. This means that, when multiple threads are used, each thread can execute a different iteration of the topmost loop concurrently. In OpenMP terms, this is easily done by creating a parallel region and using the loop work-sharing construct. This is demonstrated in Listing 3 below.

IF(Geom .EQ. 2 .OR. Geom .EQ. 3 .OR. Geom .EQ. 6) THEN
   ...
ELSE
!$OMP PARALLEL DO &
!$OMP & IF(use_multiple_cores) &
!$OMP & PRIVATE(J,I,L,LL,K,IK,H0) &
!$OMP & SHARED(NN,SS,NX,NYs,P_line,Wside,AK)
   DO J=1,NN,SS
      DO I=1,NX,SS
         H0=0.0
         DO L=-NX+J-1,1-SS,SS
            LL=abs((J-L)/SS)
            DO K=1,NX,SS
               IK=IABS(I-K)/SS
               H0=H0+AK(IK,LL)*(P_line(K))
            END DO
         END DO
         DO L=NYs+SS,NX+J,SS
            LL=abs((L-J)/SS)
            DO K=1,NX,SS
               IK=IABS(I-K)/SS
               H0=H0+AK(IK,LL)*(P_line(K))
            END DO
         END DO
         Wside(i,j)=H0
      END DO
   END DO
!$OMP END PARALLEL DO
   ...
END IF


OpenMP’s syntax has not been formally described previously since it was deemed unnecessary for the reader to have a comprehensive overview of it. Instead, OpenMP constructs are described whenever they are encountered in this section. The first thing worth noting here is that OpenMP directives are marked by “!$OMP” at the beginning of the line. This symbol is what the compiler looks for if it implements OpenMP, otherwise, they are treated as comments and ignored. Next, the directive “PARALLEL DO” is called a combined parallel work-sharing construct and both starts a parallel region and defines how the work should be shared among threads. The IF clause controls whether the parallel section is enabled or disabled. The latter means no threads are created to execute the section and, therefore, the section is executed sequentially by the original master thread. The “Private” and “Shared” clauses define which of the variables in the following parallel section are private to each thread or shared among threads. As shown, the loop indices are private while the arrays used are shared in this case. OpenMP guarantees that the ranges of indices, of the topmost loop, assigned to the threads are not overlapping. This way each thread has exclusive access to different elements of the shared arrays since they are accessed by the loop indices. Finally, the end of the parallel region is marked by an “END” clause which marks the point at which the threads synchronize and join again [2].

The VI subroutine contains more nested loops, but all of them have the same general structure. The parallelization illustrated in Listing 3 was applied to each of the other loops, which created the first parallel version of the code. For the reader's reference, this also represents the partially parallelized version of the code mentioned in Section 3.2.1 above and used with the data-collection mode.

The analysis following this change is summarized in Table 4. As this is the first analysis that uses multiple cores, it should be mentioned that the maximum number of threads that can be used during the execution is kept at the default value defined by the OpenMP implementation, which in this case is the total number of physical cores in the system. As shown, this simple change cut around 750 seconds from the execution time. VI now runs concurrently and takes 136.9 seconds to complete, which is equal to the amount of time during which the program uses 6 cores simultaneously. The other serial hotspots still have execution times similar to those shown initially; however, TEMP_CALC_METAL now represents around 40% of the execution time, compared to 23% initially. Therefore, the next logical step is to parallelize this subroutine.

Serial hotspots       Wall time
TEMP_CALC_METAL       410.993s
libm_powf_l9          104.197s
TEMP_CALC_IJ           45.327s
LUBRICATION_TEMP       43.150s
MET_TEMP_UPD           37.295s
Total wall time      1038.618s

Table 4: Summary of the analysis of the code with parallel VI.

4.3 Parallelizing TEMP_CALC_METAL

Similar to VI above, most of the time spent in TEMP_CALC_METAL is within nested loops. The structure of the large nested loop that represents the majority of the subroutine is illustrated in Figure 9 below. The exact do-statements are included in the figure to show the weight of the different parts of the loop. It should be noted that the variables NNn and fini are proportional to the size of the problem, which is defined by the resolution of the contact-area discretization. n_met is the number of nodes within the metal of one of the surfaces and is usually hard-coded to 39. SS is the step size used by the multigrid numerical method during the initial steps of the simulation.

The l-, Jm-, and K-index loops are all easily parallelizable since their iterations can be executed out of order and still lead to the same results: every iteration is independent of the others and only uses data calculated before the beginning of the loop or within that iteration. Using the loop work-sharing construct can yield different results depending on the level selected for the parallelization. The two options in this case are either parallelizing the l-index loop or parallelizing both the Jm- and K-index loops. The former would split the block of code across a maximum of two threads only, since the l-index loop contains two iterations and the loop work-sharing construct distributes whole loop iterations to threads. The latter can produce work for more threads, since the resolution of the discretization grid was large enough to require more than 10 iterations in every test case examined.

Figure 9: The nested loop in TEMP_CALC_METAL.

A different approach which uses nested parallelism is also possible and done by parallelizing both the l-index loop and the Jm and K-index loops as well. This way two threads are created to execute the l-index loop and each of them then creates more threads to execute their share of the Jm and K-index loops. This can be inefficient due to the repeated starting and stopping of parallel regions which are accompanied by the execution overhead introduced when forking and joining threads. This was quickly implemented and the resulting execution time was around 20% more than the one obtained through the parallelization method described below.
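For reference, a bare-bones sketch of what this rejected nested approach would look like is given below. The loop bounds and bodies are placeholders, and the use of NUM_THREADS(2) is an assumption made only to mirror the two-thread description above; nested parallelism would also have to be enabled explicitly, e.g. via the OMP_NESTED environment variable or omp_set_nested. The inner parallel region is entered once per l-iteration, and each entry forks and later joins a new team of threads, which is where the extra overhead comes from:

!$OMP PARALLEL DO NUM_THREADS(2) PRIVATE(l)
      DO l=1,2                              ! one iteration per surface
!$OMP PARALLEL DO PRIVATE(Jm)               ! nested region, forked for every l-iteration
         DO Jm=1,NNn,SS
            ! placeholder: work for this (l, Jm) pair goes here
         END DO
!$OMP END PARALLEL DO
      END DO
!$OMP END PARALLEL DO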

The code was parallelized as illustrated in Figure 10 below. The parameters of the "Private" and "Shared" clauses were removed from the figure for clarity. This step introduces two clauses not used before: the "Firstprivate" clause and the "Reduction" clause. The first is used to initialize private data, since according to the OpenMP standard private variables are not defined upon entry to a parallel region [2]. This was necessary because the original value of K_oil, determined outside the loop, was used within the loop in some cases and rewritten in others. The "Reduction" clause is used with commutative and associative mathematical operations that occur recurrently within a loop [2]. In this case the sequential code guaranteed that dt_lim was minimized by checking whether a new minimum had been calculated at the end of every iteration. This is done in the parallel version using the OpenMP-defined "min" reduction operation, which compares the final private values of dt_lim from every thread and obtains the minimum. The performance of the code after this parallelization step is shown in Table 5 below.

Serial hotspots       Wall time
libm_powf_l9           98.407s
TEMP_CALC_IJ           41.931s
LUBRICATION_TEMP       39.756s
MET_TEMP_UPD           36.502s
EPSILON_DERIVATIVE     33.298s
Total wall time       712.782s

Table 5: Summary of the analysis of the code with parallel TEMP_CALC_METAL.

This step reduced the execution time by about 300 seconds. Now the program runs sequentially for around 60% of the execution time down from around 80% at the previous parallelization step.


Figure 10: The nested loop in TEMP_CALC_METAL post parallelization.
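A minimal sketch of the directives just described is given below, assuming the Jm- and K-index level is the one parallelized as discussed above. The loop bounds, the inner loop over the metal nodes and the variable new_dt_candidate are placeholders rather than the actual contents of TEMP_CALC_METAL:

!$OMP PARALLEL DO                    &
!$OMP & IF(use_multiple_cores)       &
!$OMP & PRIVATE(Jm,m)                &
!$OMP & FIRSTPRIVATE(K_oil)          &
!$OMP & REDUCTION(min:dt_lim)
      DO Jm=1,NNn,SS                 ! iteration count grows with the grid resolution
         DO m=1,n_met                ! nodes through the metal, typically 39
            ! placeholder temperature update; K_oil is read here and in some
            ! branches overwritten, which is why FIRSTPRIVATE is needed
            dt_lim = MIN(dt_lim, new_dt_candidate)   ! placeholder candidate value
         END DO
      END DO
!$OMP END PARALLEL DO

The K-index loop receives the same treatment, and after the region dt_lim holds the minimum over all threads' private copies, matching what the sequential minimization produced.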

The costliest serial subroutine now is the floating-point power function from the standard math library, which cannot be parallelized in the usual way.

4.4 Addressing libm_powf_l9

Since this subroutine cannot be threaded using OpenMP directives, the approach taken was to run the segments of the code that use this function in parallel. This means that independent power calculations, which would normally occur sequentially one after the other, are done concurrently on multiple cores. So instead of speeding up the subroutine itself, multiple instances of it are run in parallel to reduce its weight on the execution time of the program. To achieve this, the parts of the code that use libm_powf_l9 had to be identified. This was done with the assistance of VTune Amplifier, since it provides a detailed caller/callee report as part of the code analysis. Following this, the source files of the caller subroutines were examined to identify possible parallelization approaches.

The results of the caller/callee report that concern libm_powf_l9 are summarized in Table 6 below. The caller/callee relationship is shown as a tree where the child nodes are the callers and the parent nodes are the callees of a given node. The root node is libm_powf_l9, and the tree is expanded until common callers are reached on every branch. The branches that are not followed represent an insignificant fraction of the subroutine's execution time. The interesting thing to note here is that all significant branches eventually lead to LUBRICATION_TEMP. For that reason LUBRICATION_TEMP's code had to be examined more closely.

The interesting part of LUBRICATION_TEMP is the location of the calls to the subroutines that contain (or in turn call subroutines that contain) floating-point power calculations. These are TEMP_CALC_IJ, EDA_CALC and NEWTONIAN. This is shown in Figure 11, which highlights that TEMP_CALC_IJ is called from within a nested do-loop while the other two subroutines are called once, outside the loops. Intuitively, this means that the nested loop should be parallelized in order to perform some of the floating-point power calculations concurrently, which should cut back around a third of the total execution time of libm_powf_l9. For the other two thirds, EDA_CALC and NEWTONIAN must be examined and parallelized accordingly.

Callee and its callers (indented by depth)   Percentage   Wall time
libm_powf_l9                                 100.0%       98.407s
   EPSILON_DERIVATIVE                         37%         36.410s
      TEMP_CALC_IJ                            36.5%       35.948s
         LUBRICATION_TEMP                     36.5%       35.948s
      POISEUILLE_INCREMENT                     0.5%        0.462s
   EDA_CALC                                   31.8%       31.309s
      LUBRICATION_TEMP                        31.8%       31.309s
   NEWTONIAN                                  31.2%       30.688s
      LUBRICATION_TEMP                        30.8%       30.337s
      HREE                                     0.4%        0.351s

Table 6: Caller/callee report for libm_powf_l9.

Figure 11: The calls to the demanding subroutines in LUBRICATION_TEMP.
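A schematic of this call placement is sketched below. The loop bounds, the bypass condition node_can_be_skipped and the argument lists are placeholders, and whether EDA_CALC and NEWTONIAN come before or after the loop nest is not stated in the text, so their position here is purely illustrative:

      ! Inside LUBRICATION_TEMP (schematic only)
      DO J=1,NYs,SS
         DO I=1,NX,SS
            IF (node_can_be_skipped) CYCLE   ! placeholder for the bypassing if-statement
            CALL TEMP_CALC_IJ(I,J)           ! calls EPSILON_DERIVATIVE, which uses libm_powf_l9
         END DO
      END DO
      CALL EDA_CALC                          ! called once, outside the loops
      CALL NEWTONIAN                         ! called once, outside the loops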

The outer iterations in this nested loop were independent from one another, similar to every nested loop described so far, so parallelizing it was straightforward. What sets this loop apart, however, is the potentially varying workload of the iterations of the inner loop. This is due to the if-statement, which can bypass most of the body of the loop if its predicate evaluates to true. OpenMP provides some tools that can be used in this case to ensure a better balance of the workload between threads. These are the schedule types, which are set using the "Schedule" clause. OpenMP provides three different types of schedules, called Static, Dynamic and Guided. As its name implies, the first splits the loop into chunks of iterations that are as equal in size as possible and assigns each chunk to a thread. The other two schedule types assign smaller chunks of iterations to each thread initially and hand out more of the remaining iterations to the threads that finish their initial shares, on a first-come, first-served basis. The last two can reduce work imbalance, but they come with more execution overhead due to assigning work at run-time. This overhead is worth it in some cases, since severe imbalance between threads can lead to accidental serialization of parallel regions, which in turn leads to execution times higher than the pre-parallel code due to OpenMP's thread creation overhead [2]. Using a non-static schedule was deemed necessary in this case and the OpenMP directives used are shown in Figure 12. The second parameter in the "Schedule" clause is the size of the initial chunks assigned to each thread, which in this case was set to a value proportional to the number of threads used, stored in the variable cores. The last interesting thing to point out here is that this parallelization required three reduction operations: a summation, a maximization and a minimization. As shown, multiple reductions are expressed using multiple "Reduction" clauses.

Figure 12: The parallelization of the nested loop that calls TEMP_CALC_IJ.
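A minimal sketch of such a directive is shown below. The chunk-size formula, the reduction variables (q_total, t_max, dt_lim) and the statements feeding them are hypothetical; only the combination of a dynamic schedule with several "Reduction" clauses mirrors what Figure 12 describes:

      q_total = 0.0
      t_max   = -HUGE(t_max)
      dt_lim  =  HUGE(dt_lim)
      chunk   = 4*cores                     ! hypothetical: chunk size proportional to the thread count
!$OMP PARALLEL DO SCHEDULE(DYNAMIC,chunk)   &
!$OMP & IF(use_multiple_cores)              &
!$OMP & PRIVATE(I,J)                        &
!$OMP & REDUCTION(+:q_total)                &
!$OMP & REDUCTION(max:t_max)                &
!$OMP & REDUCTION(min:dt_lim)
      DO J=1,NYs,SS
         DO I=1,NX,SS
            IF (node_can_be_skipped) CYCLE  ! cheap iterations are what cause the imbalance
            CALL TEMP_CALC_IJ(I,J)          ! the expensive, power-heavy work
            q_total = q_total + local_heat          ! placeholder summation
            t_max   = MAX(t_max, local_temperature) ! placeholder maximization
            dt_lim  = MIN(dt_lim, local_dt)         ! placeholder minimization
         END DO
      END DO
!$OMP END PARALLEL DO

With a dynamic schedule, a thread that happens to receive mostly bypassed iterations simply requests another chunk instead of idling while the others finish their heavier shares.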

Moving on to NEWTONIAN, a quick inspection of its code shows that it is segmented by if-statements into similar parts, each consisting of one big nested do-loop. The purpose of this arrangement is to calculate certain simulation parameters using different physical models described in the literature. The if-statements check the value of lub_param, which is an input parameter provided to the program by the user through the input file described earlier. Therefore, only one part of the subroutine is used during a given simulation run. The approach to parallelizing this subroutine was to parallelize every nested loop in every segment of the code. This was straightforward since, once again, the iterations of the outer loops were not dependent on each other. One of the shorter segments is illustrated in Figure 13 below along with the OpenMP directives used with it. The two statements shown inside the loop are the ones that use the power operator "**".

Now only EDA_CALC remains. It is a short subroutine whose main body is a nested do-loop, so parallelizing the subroutine again meant parallelizing that loop. The loop did not require any clauses not encountered before, and nothing else in the body of the subroutine is worth noting, so it is not documented further.

The performance after parallelizing these subroutines is shown in Table 7 below. This parallelization step led to a significant boost in performance because it not only reduces the time spent in libm_powf_l9, but also reduces the execution times of the three parallelized subroutines themselves. The time spent executing the program serially is now around 30% of the total execution time, down from around 60% in the previous step. What is left is to finalize the parallelization by addressing the parts of the program that now appear on the top-five list. Note that ReadFile and for_get_s are not part of the EHL code; they are implementations of Fortran intrinsics.


Figure 13: One of the segments of NEWTONIAN parallelized.
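As a rough illustration of such a segment, consider the sketch below. The selector value, the array names (EDA, TAU, P) and the rheological expressions are made up; only the structure of an if-guarded, parallelized nested loop containing two "**" statements follows the description above:

      IF (lub_param .EQ. 1) THEN            ! hypothetical selector value for one physical model
!$OMP PARALLEL DO                    &
!$OMP & IF(use_multiple_cores)       &
!$OMP & PRIVATE(I,J)
         DO J=1,NYs,SS
            DO I=1,NX,SS
               ! two statements using the power operator, with placeholder formulas
               EDA(I,J) = EDA0*(1.0+C1*P(I,J))**Z1
               TAU(I,J) = TAU0*(EDA(I,J)/EDA0)**Z2
            END DO
         END DO
!$OMP END PARALLEL DO
      ELSE IF (lub_param .EQ. 2) THEN
         ! the remaining segments have the same structure for the other models
      END IF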

Lastly, it can be noted that the usage of between 1 and 6 simultaneous cores is more significant now, which is due to some imbalance in the workload between the threads.

Serial hotspots       Wall time
MET_TEMP_UPD           37.799s
LUBRICATION_TEMP       20.032s
CP_CALC                15.019s
ReadFile                2.337s
for_get_s               1.830s
Total wall time       521.778s

Table 7: Summary of the analysis of the code with concurrent usage of libm_powf_l9.

4.5 Finalizing the parallelization

This section briefly documents how the remaining hotspots were addressed. The first subroutine to be tackled is MET_TEMP_UPD. Looking at its source, the subroutine consists of two alternative simple nested loops that are selected by an if-statement based on the number of the time step currently being solved. Neither loop contains data dependencies between its outer iterations, and their parallelization required no new OpenMP clauses. The second subroutine, CP_CALC, consists of a single nested do-loop that executes some calculations. Again, parallelizing this only required the "Parallel Do" work-sharing construct without any auxiliary clauses.

The last hotspot on the list is LUBRICATION_TEMP again, or at least its remaining serial parts. These were not addressed in the previous parallelization step because they have nothing to do with libm_powf_l9. Not shown in Figure 11 are three other nested do-loops in the subroutine, which only compute data without calling any external functions. As with every loop parallelized so far, the outer iterations are independent in all three loops. Parallelizing one of them required a couple of reduction clauses, but both have been encountered before, so no new figure is introduced to illustrate this.

References
