
An evaluation of the system performance of a beowulf cluster

http://www.nsc.liu.se/grendel/

by

Karl-Johan Andersson, Daniel Aronsson and Patrick Karlsson

Internal Report No. 2001:4


Abstract

To make accurate predictions of which types of computational problems are suitable to solve on a beowulf cluster, one needs to measure the performance of the system. A beowulf cluster can be a very good alternative to a specialized supercomputer, especially when it comes to the price/performance ratio. A beowulf cluster is a number of commodity off-the-shelf PCs connected through some sort of network. All basic software needed to configure a cluster, and for compiling and running parallel applications on it, can be found free of charge on the Internet.

We have configured and tested a 16-node beowulf cluster for the Department of Scientific Computing at Uppsala University and the National Supercomputer Centre at Linköping, both in Sweden. We have tested both hardware and software, for comparison with other systems and for identifying the characteristics of the system. Our results show that the system is well suited for the type of computational jobs that the Department of Scientific Computing intends to run on it. The tests also showed that if the system is upgraded in the future, it should be done in the following order: larger main memories on the nodes, faster interconnect, and finally, if the computational power is too low, faster processors.

Sammanfattning

In order to predict which types of computational problems can be successfully solved on a beowulf cluster, one must measure the performance of the system. A beowulf cluster can be a very suitable alternative to a supercomputer, especially when considering the ratio between performance and price. A beowulf cluster is a number of ordinary personal computers connected in a network. All software needed to configure the cluster and to compile and run parallel applications on it is freely available on the Internet.

We have configured and tested a beowulf cluster consisting of sixteen processors for the Department of Scientific Computing at Uppsala University and the National Supercomputer Centre at Linköping University. We have tested both software and hardware in order to establish the characteristics of the system and to enable comparisons with other systems. The results show that the system is well suited for solving the types of problems that typically occur at the Department of Scientific Computing. The tests also show that any future upgrades of the system should be made in the following order: larger memory capacity on the nodes, faster network and finally, if more computational power is needed, faster processors.


1 Introduction

1.1 What is a beowulf cluster?

Generally, a beowulf cluster is a set of regular PC workstations, commonly interconnected through an ethernet. It operates as a parallel computer but differs from other parallel computers in that it consists of mass-produced off-the-shelf hardware. Usually, a parallel computer is built of highly specialized hardware, and the architecture is chosen depending on the needs. This makes it optimal for solving certain problems, but it also makes it very expensive, and since it is often more or less custom built, technical support is exclusive. By constructing a beowulf cluster these issues are solved. The penalty of going with a beowulf cluster is reduced communication capacity between the processors, since an ethernet is much slower than a custom-built interconnect hardwired to a motherboard.

Recent years have shown an immense increase in the use of beowulf clusters. This is mainly due to two reasons. First, the size of the PC market has allowed PC prices to decrease while sustaining dramatic performance increases. Second, the Linux community has produced a vast base of free software for these kinds of applications. Beowulf clusters emphasize [RBM97]

• no custom components

• dedicated processors

• a private system area network

• a freely available software base.

The name 'Beowulf' originates from England's oldest known epic, dating back to about A.D. 1000. It tells the story of the hero Beowulf and his battle with the monster Grendel. Grendel is also the name of the beowulf cluster at the Department of Scientific Computing, and Beowulf was the name of the very first beowulf cluster at NASA. In this article, we present some measurements performed on Grendel, along with an implementation of a typical application at the Department of Scientific Computing.

1.2 A brief history

The history of beowulf cluster computers began in 1994, when Thomas Sterling and Donald Becker at the Center of Excellence in Space Data and Information Sciences (CESDIS) were commissioned to investigate whether clustered PCs could perform heavy computational tasks with greater capability than contemporary workstations, but at the same cost. CESDIS, which is sponsored by the NASA HPCC Earth and Space Sciences project, is often faced with tasks involving large data sets. The first PC cluster, named Beowulf, was built to address problems associated with these large data sets. It consisted of 16 DX4 processors connected by a 10 Mbps ethernet. Since the communication performance was too low to match the computational performance, Becker rewrote the ethernet drivers and built a "channel bonded" ethernet where the network traffic was striped across two or more ethernets [BEO].

Beowulf was an instant success. The idea of using cheap and easy-to-get equipment quickly spread into the academic and research communities. In October of 1996, a beowulf cluster exceeded one gigaflops sustained performance on a space science application for a total system cost of under $50,000 [RBM97]. The cluster that this article concerns has a peak performance of just over 11 gigaflops for a cost of about $15,000.

1.3 System specifications

Grendel is a beowulf cluster built from 17 separate standard PCs. Every computer consists of commodity off-the-shelf products. These computers are connected together with a fast ethernet network. The head of the cluster, the front-end, is a separate computer that is connected to the cluster and to the Internet. All jobs are submitted through this computer, which takes care of scheduling and monitoring. This computer also hosts a shared file area used by the other PCs, the so-called nodes.

The other 16 nodes all have exactly the same configuration, both hardware and software. The nodes have their own hard drives. Each node runs its own operating system, and accesses a common file area through the front-end. The only difference in hardware between the front-end and the nodes is that the front-end has a second network interface card (NIC) and a slightly larger hard drive (60 GB).

All of the installed software is free and public except for the compilers (Fortran and C/C++). The operating system used for all computers is RedHat Linux.

The individual computers were assembled by Advanced Computer Technology AB in Linköping. The cluster was then put together on 26-27 March 2001 at the Department of Scientific Computing, Uppsala University, by system technicians from the National Supercomputer Centre, Linköping University, with assistance from us.

1.3.1 Data

General data
Hostname : grendel.it.uu.se
IP-address : 130.238.17.47
Layout : one front-end, 16 nodes
OS : RedHat Linux 6.2
Kernel version : Linux 2.2.18
Network : Fast ethernet (100 Mbps)
Topology : Switched ethernet, twisted-pair cables

Node data
Case : Enlight 7230
CPU : Athlon (Thunderbird 1 GHz)(133 MHz FSB)
CPU family : i686
MHz : 1007
L1 cache size : 64 KB (code)/64 KB (data)
L2 cache size : 256 KB
Motherboard : ASUS A7V133
Main memory : 256 MB of PC133 SDRAM
Secondary memory : 10 GB ATA (Fujitsu MPG3102AT)
Swap memory : 517 MB
NIC-driver : eepro100.c v1.09j-t rev1.20.2.10
OS : RedHat Linux 6.2
Kernel version : Linux 2.2.18
Filesystem : ext2, NFS

CPU data [AMD00]
CPU : Athlon (Thunderbird 1 GHz)(133 MHz FSB)
CPU family : i686
MHz : 1007
L1 cache size : 64 KB (code)/64 KB (data)
L2 cache size : 256 KB
L1 code cache : 64 KB, two-way set-associative
L1 data cache : 64 KB, two-way set-associative
L2 cache : 16-way set-associative (on-die, full-speed)
TLB : 512 entries (multi-level, split)

Network
Type : Fast ethernet (100 Mbps)
Topology : Switched ethernet, single switch
Interconnect : Twisted-pair (RJ45)
Network switch : HP ProCurve 2424M
NIC : Intel PCI EtherExpress Pro/100+ i82557
NIC-driver : eepro100.c v1.09j-t rev1.20.2.10
Local IP-subnet : 192.168.1.0/255.255.255.0

2 Installed software

2.1 Operating system

The operating system installed is Linux. The distribution used is RedHat 6.2 with additional updates. The currently running Linux kernel is version 2.2.18. The kernel is recompiled to match our needs. There are also some extra kernel modules compiled for hardware monitoring.

2.2 Programming environment

The programming languages intended for use are Fortran and C.

As parallel processing has matured, two programming paradigms have been developed: shared memory and message passing. These paradigms have their origin in different hardware architectures. Shared-memory programming is mainly intended for high-end computers which actually have a shared memory; each processor has access to all memory, or some partition thereof. Message passing, on the other hand, originates from distributed memory machines, where every processor has its own memory area and communicates with the others by sending messages.

Not surprisingly, message passing has become the most popular technique for implementing parallel applications on beowulf clusters; currently there are no shared-memory libraries installed on Grendel. The best-known APIs using the message passing paradigm in parallel computation are MPI and PVM. Both of these were installed on Grendel by NSC, but PVM is not tested since it is not used at the Department of Scientific Computing.

There are two different sets of compilers installed: EGCS 2.91.66, which is installed with the RedHat Linux distribution, and the Portland Group Workstation compilers (PG 3.2-3).

EGCS
C : gcc, cc (/usr/bin/gcc)
C++ : g++ (/usr/bin/g++)
Fortran-77 : f77 (/usr/bin/f77)

Portland Group
C : pgcc (/usr/local/pgi/linux86/bin/pgcc)
C++ : pgCC (/usr/local/pgi/linux86/bin/pgCC)
Fortran-77 : pgf77 (/usr/local/pgi/linux86/bin/pgf77)
Fortran-90 : pgf90 (/usr/local/pgi/linux86/bin/pgf90)
HPF : pghpf (/usr/local/pgi/linux86/bin/pghpf)

The recommended compilers for high-performance applications are the Portland Group compilers. These are commercial products intended for high-performance computing.

2.3 MPI libraries

The Message Passing Interface (MPI) is a standard for writing applications using the message-passing paradigm. The standard is supervised by the MPI Forum, which comprises high performance computing professionals from over 40 organizations [MPI95]. The goal of the MPI Forum is to form a standard for message-passing applications which meets the needs of the majority of users. MPI provides a framework for vendors to create efficient implementations. This ensures that a program written for MPI compiles with all implementations, though efficiency may differ.

By far the two most popular choices for clusters are MPICH and LAM/MPI. Both are free implementations of the MPI standard. Both cover MPI version 1.1 completely (MPICH fully implements 1.2) and parts of version 2.0.

The MPI Chameleon (MPICH) was developed at Argonne National Laboratory as a research project to provide features that would make MPI implementation simple on different architectures. To do this, MPICH implements a middle layer called the Abstract Device Interface (ADI). The ADI has a smaller interface, making it easier to implement on different hardware, but it may also decrease efficiency.

Local Area Multicomputer (LAM) was originally developed at the Ohio Supercomputer Center but is now maintained by the Laboratory of Scientific Computing at Notre Dame. LAM is built to be more "cluster friendly" by using small user-level daemons to achieve fast process control. MPICH, on the other hand, uses system daemons to control processes.

As we will see in Section 4.2.2, LAM seems to be the best choice for writing parallel applications for this cluster. All further tests and benchmarks in this report use LAM for message passing.

Both implementations use TCP/IP as the underlying protocol and are thus limited by it. Both are installed on Grendel and available for use.

LAM and MPICH use two different strategies for running parallel programs. Both are run through the command mpirun, which spawns the processes on the different processors. LAM uses a user-level daemon that controls communication between the processors. For this to work, one first needs to start this daemon through the command lamboot on each node. In our case there is also a scheduler that assigns the right number of processors to the application.

The daemon runs in user-mode so there is a need to run this program for each user that submits jobs to the cluster. LAM also implements the MPI 2.0 MPI Spawn call that allows tasks to be spawned from within the application, as opposed to running a program like mpirun.

NSC in Linköping has created a version of mpirun which first runs lamboot and then calls the real mpirun. After finishing, it calls lamhalt on each node to clean up. Thus there are never two instances of the daemon running on any node at the same time, assuring that an application has exclusive use of the requested processors.

MPICH on the other hand attempts to start remote processes by connecting to a default system level daemon, or by using remote shell. This means that you don’t have to run a specific daemon for each user.

Both LAM and MPICH use TCP/IP as the underlying communications layer. LAM communicates mainly using UDP packets. It can be configured to use TCP connections instead (using the c2c option for lamboot). All necessary connections are established between the daemons at startup by lamboot and are closed when calling lamhalt.

We have not done any tests of TCP versus UDP traffic, but according to [CLMR00] UDP is superior to TCP in the case of MPI. We have done some raw benchmark tests for TCP (see 3.2.2 and 3.4.1), but no comparative tests of TCP versus UDP using MPI. Since this network is a dedicated local area network, where in fact every node is connected to a single full-duplex switch, there is no need for the flow control and congestion control provided by TCP. TCP also has some drawbacks for this type of traffic, including the slow-start feature that probes the congestion level of the communication network. According to [CLMR00] this feature slows down performance considerably. It can be turned off by modifying the TCP options in the Linux kernel.

Applications pass messages through standard UNIX stream sockets to the LAM-daemon, which then communicates with the other daemons on the other nodes (using UDP, see above). The daemon on the remote machine then passes the message to the application through a UNIX stream socket.


2.3.1 Latencies in MPI

We want to establish a simple model for the latency of the different steps in the communication procedure using MPI. How much overhead does MPI introduce, and which message size is optimal?

To find this out we use a simple ping-pong program that sends a message back and forth between two nodes and measures the network traffic. See Appendix D for the source code.

The program sends messages of different sizes and measures the average time to send and receive the message, as well as the number of bytes and packets sent and received. The information about the number of bytes and packets was gathered from the Linux kernel and shows the actual number of bytes sent and received on a specific network interface card. This includes IP and UDP headers and additional MPI overhead. The number of packets is the total number of ethernet frames sent and received, so if a single packet is fragmented into several frames it is the resulting frames that are counted, not the original number of packets.
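As an illustration (this is not the Appendix D program), a minimal MPI ping-pong timer in C might look as follows; the message size and repetition count are arbitrary choices, and the program is run with two processes, e.g. through mpirun -np 2.

/* Minimal MPI ping-pong sketch: rank 0 sends a message to rank 1 and
 * waits for it to come back; the average round-trip time is roughly
 * twice the one-way MPI latency. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, i, reps = 1000, nbytes = 1024;   /* illustrative values */
    char buf[4096];
    double t0, t1;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    t0 = MPI_Wtime();
    for (i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
        } else if (rank == 1) {
            MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("message size %d bytes, average RTT %.1f us\n",
               nbytes, (t1 - t0) / reps * 1e6);

    MPI_Finalize();
    return 0;
}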

Figure 1: Bytes per packet for different message sizes.

Figure 1 shows the number of bytes per packet for different message sizes in MPI. We can clearly see the MTU (Maximum Transfer Unit) limit at 1500 bytes introduced by IP. When packets reach the 1500-byte limit they are split up into two packets. We can see in the graph that when we send an MPI message of 1600 bytes it results in two IP packets of 850 bytes each, a total overhead of 100 bytes (including UDP and IP headers). Below 1500 bytes there is an overhead of 89 bytes.

This shows that the MPI daemon is unaware of the underlying network limits.

However, this does not seem to be a substantial drawback for communication performance. Figure 2 shows the number of packets transmitted per second over the same interval. The smaller the packets, the more packets can be transmitted per second; thus the total bandwidth increases with larger messages. The effect of the packet limitation on bandwidth seems small.

Timings from this program give an RTT (Round Trip Time) of 130 µs for sending and receiving one empty packet. This results in approximately 65 µs latency for one MPI-message.

To do further timing, and to get better results, we used the benchmark program netpipe from Scalable Computing Lab [SMG96]. This program is similar to the ping-pong program but can be set up to run over different transports, such as pure TCP/IP, MPI and PVM. We tested it on pure IP traffic and on MPI (using LAM). Netpipe was compiled using pgcc with no additional compiler flags. It was run using the commands

NPtcp -t -h node -o output.txt -P
mpirun -np 2 NPmpi -o output.txt -P

Figure 2: Packets per second for different message sizes.

Figure 3 shows a signature graph for TCP/IP and MPI communication, i.e. bandwidth against message size. The theoretical maximum bandwidth for fast ethernet is 100 Mbps. As we can see, the bandwidth for TCP/IP increases up to 90 Mbps and then stays there (at least for messages up to 20 MB). MPI follows that curve up to 80 kB, where it suddenly drops down to about 80 Mbps. This is probably because LAM changes communication strategy. From this it is possible to calculate the latencies for sending empty packets for both IP and MPI. We got 52 µs for IP and 62 µs for MPI traffic. The variance in these measurements was below 1% according to netpipe.

To test the congestion level in the switch we set up this test program at eight nodes at the same time, each node sending and receiving 90 Mbps. There was no noticeable decrease in network performance.

To find out which part of the time is due to actual communication and which is due to the TCP/IP implementation, we ran the same program locally on one computer. You can force LAM to run two processes on the same processor by giving the -np 2 option to mpirun but only making one node available through lamboot. To time local TCP/IP traffic you simply run both the server and the client on the same computer.

Figure 4 shows a signature graph for this local configuration. The bandwidth itself is not important, since this configuration is only intended for testing purposes (you will never run a parallel MPI program on one processor anyway). One can note that MPI now falls far behind TCP/IP, by about 1 Gbps. The latency for local TCP/IP transmission is now 9 µs and for MPI 17 µs. The MPI overhead is still about 10 µs.

Note that we assume that the only difference between local traffic and remote traffic is overhead because of transport (network drivers, network interface cards, switch, transport media). Since we are running the tests on a single computer there is also overhead introduced by context switching and hardware interrupts that is not present in a two-way communication. These are hard to predict and are therefore neglected.

This model also assumes small message sizes. The timings are extrapolated values for zero-sized messages from many different message sizes.

Figure 3: Network bandwidth for different message sizes (TCP/IP and MPI).

MPI latencies (approx.), from the application down
MPI : ~10 µs
IP stack : ~10 µs
Transport layer : ~40 µs

These timings can be compared to the beowulf cluster Ingvar at NSC, Linköping University [St1]. Ingvar has a high-speed SCI network in a ring topology and specially designed MPI libraries (SCAMPI) for maximum performance; the communication is not based on TCP/IP. Getting from the application down through MPI takes 5.0 µs, and the SCI network transports the message in 0.1 µs. The total time of 5.1 µs compares to our ~60 µs.

2.4 Linear algebra packages

LAPACK (Linear Algebra PACKage) is a free subroutine library for applying well-known linear algebra operations to matrices and vectors. Operations include multiplication, factorization, inversion, solution of simultaneous linear equations and finding eigenvalues and eigenvectors.

Almost all computations inside LAPACK are performed by calls to BLAS (Basic Linear Algebra Subprograms). BLAS contains the simple linear algebra routines that form the core of LAPACK; an optimized BLAS results in an optimized LAPACK.

Supercomputer vendors often have high-performance versions of these packages for their architectures. Since beowulf is not a single architecture, it is impossible to write a single linear algebra package that is optimal on all beowulf clusters.

To attain an optimized BLAS a program called ATLAS (Automatically Tuned Linear Algebra Software) has been developed. There are a number of architectural details to consider, including

• type of cache (n-way set-associative, direct mapped, . . . )

• number of cache levels

• number and type of registers

• type of pipelining (combined multiply/add or not)

Figure 4: Local bandwidth for different message sizes (TCP/IP and MPI).

ATLAS tries to compile an optimized BLAS package by fine-tuning a number of parameters to match the hardware. Performance is measured in Mflops (millions of floating point operations per second).

ATLAS and LAPACK are installed for use on Grendel. We have not used these packages in any of our tests.

ATLAS version 3.2.1 was downloaded in source code¹ and unpacked to /disk/global/src/ATLAS. Three bugs were corrected according to the latest ATLAS errata: "Floating point errors in output matrices propagate inappropriately", "Performance bug in complex TRSM" and "Error in architectural defaults for ATHLON".

Athlon processors have an extension of the standard Pentium instruction set called 3DNow. These extra instructions allow two single precision (32 bit) floating point operations to be executed simultaneously. ATLAS has the option to use the 3DNow instruction set, and the speedup is significant: about a factor of two, but only in the case of matrix multiply [SAR01]. The drawback is that 3DNow arithmetic is not fully IEEE compliant; it does not handle NaNs or infinities at all. Therefore 3DNow is not included.
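To illustrate the kind of BLAS call that ATLAS tunes, here is a minimal C sketch of a dgemm matrix multiply through the CBLAS interface; the header name, link line and matrix sizes are assumptions for illustration and were not part of our tests.

/* Minimal sketch: C = alpha*A*B + beta*C with the BLAS routine dgemm,
 * called through the CBLAS interface that ATLAS provides (assumes
 * cblas.h is installed; link with e.g. -lcblas -latlas). */
#include <stdio.h>
#include <cblas.h>

int main(void)
{
    /* Row-major matrices: A is 2x3, B is 3x2, C is 2x2. */
    double A[2*3] = {1, 2, 3,
                     4, 5, 6};
    double B[3*2] = {7,  8,
                     9, 10,
                    11, 12};
    double C[2*2] = {0, 0, 0, 0};

    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 3,          /* M, N, K       */
                1.0, A, 3,        /* alpha, A, lda */
                B, 2,             /* B, ldb        */
                0.0, C, 2);       /* beta, C, ldc  */

    printf("C = [%g %g; %g %g]\n", C[0], C[1], C[2], C[3]);
    return 0;
}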

3 Hardware Benchmarks

3.1 Introduction

The performance of a single workstation depends on how well the hardware works and how well we can utilize the hardware through software. We examine the hardware performance by executing several test programs, so called benchmarks. The results of these tests are shown below.

3.2 LMbench 2.0 Benchmark [MS96]

LMbench is a set of small benchmarks designed to measure the performance of several components and parts crucial for efficient system performance. The intent is to produce real-application figures achievable by normal applications, instead of marketing performance figures. Latency, bandwidth or a combination of the two are the main performance bottlenecks of current systems, and thus LMbench focuses on measuring a system's ability to transfer data between processor, cache, memory, network and disk. It does not measure graphics throughput, computational speed or any multiprocessor features.

¹ http://www.netlib.org/atlas/atlas3.2.1.tgz

3.2.1 Implementation

LMbench is highly portable and should run as is with gcc as default compiler. For the cluster that would be the GNU project C compiler (egcs-1.1.2). The basic system parameters are described below.

Basic system parameters
Host : grendel.it.uu.se
CPU : Athlon (Thunderbird)(133 MHz FSB) (×17)
CPU family : i686
MHz : 1007
L1 cache size : 64 KB (code)/64 KB (data)
L2 cache size : 256 KB
Motherboard : ASUS A7V133
Main memory : 256 MB of PC133 SDRAM
Secondary memory : 10 GB ATA
OS kernel : Linux 2.2.18
Network : Intel PRO/100+ NIC
Network switch : HP ProCurve 2424M

The benchmark tests six different aspects of the system:

• Processor and processes

• Context switching

• Communication latencies

• File and virtual memory system latencies

• Communication bandwidths

• Memory latencies

3.2.2 Results

The results are an average of ten independent runs of LMbench 2.0 to ensure accuracy. We also include an error estimate in the result, based on one standard deviation.

Processor, processes (µs) - smaller is better
null call : 0.27 ± 0.000
null I/O : 0.38 ± 0.035
stat : 3.72 ± 0.167
open/close : 4.63 ± 0.149
select : 26.3 ± 10.56
signal install : 0.77 ± 0.003
signal catch : 0.95 ± 0.000
fork proc : 110.1 ± 2.47
exec proc : 706.2 ± 25.93
shell proc : 3605.3 ± 35.99

Null system call The time it takes to do getppid. This is useful as a lower bound cost on anything that has to interact with the operating system.

null I/O The time it takes to write one byte to /dev/null.


stat Measures how long it takes to stat a file (i.e. examine a file's characteristics).

open/close The time it takes to first open a file and then close it.

Simple entry into the operating system The time it takes to run select on a number of file descriptors.

Signal handling latencies The time it takes to install or catch signals.

Creates a process through fork+exit The purpose of the three last benchmarks is to time the creation of a basic thread of control. It measures the time it takes to split a process into two copies, but it is not very useful since the processes perform the same thing.

Creates a process through fork+execve The time it takes to create a new process and have that process perform a new task.

Creates a process through fork+/bin/sh -c The time it takes to create a new process and have the new process running a program by asking the shell to find that program and run it.

Context switching (µs) - smaller is better

2p/0K : 0.870 ± 0.1803

2p/16K : 1.6200 ± 0.18880

2p/64K : 15.8 ± 0.42

8p/16K : 5.4410 ± 0.37353

8p/64K : 117.7 ± 0.48

16p/16K : 15.4 ± 1.35

16p/64K : 117.7 ± 0.48

Context switching The time it takes for n processes of size s (i.e. np/sK) to switch context. The processes are connected in a ring of UNIX pipes.

Local communication latencies (µs) - smaller is better

pipe : 4.021 ± 0.3027

AF UNIX : 8.34 ± 0.833

UDP : 11.5 ± 0.53

RPC/UDP : 26.4 ± 0.84

TCP : 16.4 ± 1.35

RPC/TCP : 39.1 ± 0.74

Interprocess communication latency through pipes Measures the interprocess communi- cation latencies between two processes communicating through a UNIX pipe. The context switching overhead is included and the result is per round trip.

Interprocess communication latency through UNIX sockets Measures the time it takes to send a token back and forth between two processes using a UNIX socket.

Interprocess communication latency via UDP/IP The benchmark measures the time it takes to pass a token back and forth between a client/server. No work is done in the processes.

Interprocess communication latency through SUN RPC via UDP The time it takes to perform the benchmark above using SUN RPC instead of standard UDP sockets.

Interprocess communication latency via TCP/IP The benchmark measures the time it takes to pass a token back and forth between a client/server. No work is done in the processes.

Interprocess communication latency through SUN RPC via TCP The time it takes to perform the benchmark above using SUN RPC instead of standard TCP sockets.


File & VM system latencies (µs) - smaller is better
create 0K file : 4.0323 ± 0.33987
delete 0K file : 0.8759 ± 0.03664
create 10K file : 9.6717 ± 0.13760
delete 10K file : 1.7550 ± 0.04307
Mmap latency : 9813.1 ± 95.32

Prot fault : 0.578 ± 0.0039

Page fault : 361.6 ± 13.2514

File system create/delete performance The time it takes to create/delete small files in the current working directory.

Memory mapping and un-memory mapping files The time it takes a mapping to be made and unmade. Useful for processes using shared libraries, where the libraries are mapped at start up time and unmapped at process exit.

Signal handling latency The time it takes to handle a memory protection fault.

Pagefaulting pages from a file Measures the time it takes for a page from a file to be faulted in. The file is first flushed from memory and then accessed.

Local communication bandwidths (MB/s) - bigger is better

pipe : 790.7 ± 27.41

AF UNIX : 516.3 ± 34.66

file reread : 332.9 ± 16.21

Mmap reread : 462.0 ± 0.00

Bcopy (libc) : 300.6 ± 113.64

Bcopy (hand) : 264.1 ± 0.74

mem read : 481.7 ± 9.62

mem write : 361.6 ± 13.2514

Data movement through pipes Creates a UNIX pipe between two processes and measures the throughput when moving 50MB through the pipe in 64KB blocks.

Data movement through UNIX stream sockets Measures the throughput when moving 10MB in 64KB blocks through a UNIX stream between two processes.

Reading and summing of a file Measures how fast data is read when reading a file in 64KB blocks. Each block is summed up as a series of 4-byte integers in an unrolled loop. The benchmark is intended to be used on a file that is in memory (i.e. it is a reread benchmark).

Moving a file Measures how fast it can create a memory mapping to a file and then read the mapping similarly to the above benchmark.

Memory copy speed Measures how fast memory can be allocated and then copied with the libc bcopy.

Memory copy speed on unrolled loops The measured data transfer speed when performing an unrolled and unaligned Bcopy.

Memory read rate (with overhead) Measures data transfer speed when the program allo- cates a specified amount of memory and zeros it. It then times the reading of that memory as a series of integer loads and adds.

Memory write rate (with overhead) Measures the data transfer when the program allocates a specified amount of memory and zeros it. It then times the writing of that memory as a series of integer stores and increments.

Memory latencies (ns) - smaller is better
L1 cache : 2.279 ± 0.0005
L2 cache : 19.0 ± 0.00
Main memory : 151.0 ± 0.00

Memory read latencies Measures the time it takes to read memory with varying memory sizes and strides. The entire memory hierarchy is measured: onboard and external caches, main memory and TLB miss latency. It does not measure the instruction cache.
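As an illustration of this kind of measurement (this is not the LMbench code), a minimal C sketch that times a dependent pointer chase over an array with a given stride might look as follows; the working-set size, stride and iteration count are arbitrary choices.

/* Minimal sketch of measuring memory read latency: build a circular
 * chain of pointers with a given stride and time a dependent pointer
 * chase through it. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

static double seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + 1e-6 * tv.tv_usec;
}

int main(void)
{
    size_t size = 8UL * 1024 * 1024;       /* 8 MB working set        */
    size_t stride = 128;                   /* bytes between loads     */
    size_t n = size / sizeof(void *);
    size_t step = stride / sizeof(void *);
    long iters = 20 * 1000 * 1000;
    void **buf = malloc(n * sizeof(void *));
    void **p;
    size_t i;
    long k;
    double t0, t1;

    /* Each entry points to the entry one stride further on (wrapping). */
    for (i = 0; i < n; i += step)
        buf[i] = (void *) &buf[(i + step) % n];

    p = (void **) buf[0];
    t0 = seconds();
    for (k = 0; k < iters; k++)            /* every load depends on the last */
        p = (void **) *p;
    t1 = seconds();

    /* Printing p keeps the chase from being optimized away. */
    printf("average load latency: %.1f ns (p=%p)\n",
           (t1 - t0) / iters * 1e9, (void *) p);
    free(buf);
    return 0;
}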

Figure 5: The measured memory latencies (ns) for different strides (16 KB to 1024 KB) and array sizes. We can clearly make out the L1 cache, but the L2 border is harder to distinguish. There seems to be a practical limit of 224 KB on the L2 cache.

3.3 Stream Benchmark

The Stream benchmark is a simple benchmark program that measures sustainable memory bandwidth (in MB/s) and the corresponding computation rate for simple vector kernels. It is implemented in a very straightforward way without sophisticated optimizations. Stream thereby produces results that correspond to the memory bandwidth expected from an ordinary user application.

Stream has been run several times for many different problem sizes to ensure that the throughput is not dependent on the problem size (note that Stream only handles problem sizes much larger than the cache sizes; small data sets that fit in the caches are not considered here). However, when handling data sets whose sizes approach the total RAM size, the kernel will start to swap.

Figure 6: Main memory throughput shown by Streams: memory bandwidth (MB/s) versus problem size (MB) for the Copy, Scale, Add and Triad kernels.

This produces an extreme decrease in performance. As can be seen in Figure 6, this happens at a problem size of 240 MB; larger data sets than this cannot be used. The four different graphs represent different types of operations. It may not be obvious what Triad does: it combines the earlier operations by both multiplying and adding vector elements, writing the result to a new vector (a sketch of the four kernels follows the table below). The output produced by Stream is presented in the following table:

Streams output
Function   Rate (MB/s)
Copy       580
Scale      580
Add        690
Triad      690
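For reference, the four kernels are essentially the following loops. This is a minimal C sketch rather than the Stream benchmark itself, and the array length N is an arbitrary choice that must be much larger than the caches.

/* Minimal sketch of the four Stream kernels (Copy, Scale, Add, Triad). */
#include <stdio.h>
#include <stdlib.h>

#define N (4 * 1024 * 1024)   /* about 32 MB per double array */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    double scalar = 3.0;
    long j;

    for (j = 0; j < N; j++) { a[j] = 1.0; b[j] = 2.0; c[j] = 0.0; }

    for (j = 0; j < N; j++) c[j] = a[j];                  /* Copy  */
    for (j = 0; j < N; j++) b[j] = scalar * c[j];         /* Scale */
    for (j = 0; j < N; j++) c[j] = a[j] + b[j];           /* Add   */
    for (j = 0; j < N; j++) a[j] = b[j] + scalar * c[j];  /* Triad */

    printf("a[0] = %g\n", a[0]);   /* keep results live */
    free(a); free(b); free(c);
    return 0;
}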

3.4 NETPerf Network Performance Benchmark [NET96]

Netperf is a benchmark aimed at measuring various aspects of a network. Its focus is on performance using TCP or UDP via the Berkeley sockets interface (BSD sockets). It is based on a client/server model, with two executables: one server, netserver, and one client, netperf. Netperf can be used with a wide variety of control sequences, but we use it for measuring normal bulk data transfer performance.

3.4.1 Results

The client application netperf was executed with

netperf -P 0 -l 10 -H host TCP_STREAM -i 10,2 -I 95,5 -- -m size -s 65534 -S 65534

where size was unevenly sampled from the range [1, 65536] of message sizes. The result is presented in Figure 7.

Figure 7: TCP stream measurement: bandwidth (Mb/s) versus packet size between the hosts g16 and g1.

For efficiency reasons, the obvious conclusion is to never send less than 32 bytes over the ethernet.

4 Parallel Benchmarks

4.1 Introduction [KGGK94]

The performance of a sequential program is usually measured in execution time and/or operations per second, expressed as a function of the problem size. The performance of a parallel program depends not only on the problem size but also on the architecture and the number of processors. There are mainly two reasons for using parallel benchmarks: to establish an upper performance limit for the parallel system and compare that limit with other systems, and to investigate the performance of parallel applications and algorithms on a specific system. For the latter we need to measure speedup and efficiency. Given a sequential application, one is often interested in the performance gain achieved by parallelizing the algorithm over N processors. The speedup is defined as the ratio of the time it takes to solve a problem on a single processor to the time it takes to solve the problem on N processors. It is assumed that the single-processor problem is solved with the best known sequential solution with respect to time.

$$S_N = \frac{T(1)}{T(N)} \qquad (1)$$

The serial run time can be divided into two parts: the serial run time $T_s$ and the parallel run time $T_p$. The parallel run time is subject to parallelization, hence Equation (1) can be written as

$$S_N = \frac{T_s + T_p}{T_s + T_p/N}. \qquad (2)$$

Equation (2) is known as Amdahl's law and is usually expressed as an inequality giving the upper bound on the achievable speedup. It is well suited for applying to serial algorithms to determine their degree of parallelism. The next measure is the efficiency, defined as the fraction of time for which a processor is usefully employed:

$$E_N = \frac{S_N}{N} \qquad (3)$$
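As a worked numerical example of Equations (1) and (3) (the numbers are hypothetical, not measured on Grendel): if a job takes $T(1) = 400$ s on one processor and $T(16) = 40$ s on sixteen processors, then

$$S_{16} = \frac{400}{40} = 10, \qquad E_{16} = \frac{S_{16}}{16} \approx 0.63,$$

i.e. each processor is usefully employed about 63% of the time.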

On an ideal parallel system the speedup is equal to N and the efficiency is equal to one. A benchmark that tests portable parallel implementations instead of determining the upper performance limit is the NAS Parallel Benchmark.

4.2 NAS Parallel Benchmark 2.3 (NPB) [BHS+95, BBB+94]

The Numerical Aerodynamic Simulation (NAS) program at NASA Ames Research Center provides a set of benchmarks derived from computational fluid dynamics (CFD) codes, which have ". . . wide acceptance as a standard indicator of supercomputer performance." NPB 2.3 is a set of eight benchmarks based on Fortran 77 (with a few common extensions that are also part of Fortran 90) and the MPI message passing standard. They are intended to run with little or no tuning, approximating the performance of a portable parallel program on a distributed memory computer. The benchmarks are not intended to test only MPI, but to measure the overall system performance.

4.2.1 Implementation

From the original set of eight benchmarks we have selected six to measure the performance of the cluster. They are divided into two groups depending on their utilization of CPU, memory and network: kernel benchmarks and application benchmarks. The kernel benchmarks are intended to put pressure on the Linux kernel with its implementation of the TCP/IP stack. The application benchmarks concentrate more on CPU and memory utilization.

Multigrid (MG) MG uses a multigrid method to compute the solution of the three-dimensional scalar Poisson equation. It partitions the grid by successively dividing it in two, starting with the z dimension, then the y and x dimensions, until all processors are assigned.

Conjugate Gradient (CG) CG is used to compute an approximation to the smallest eigenvalue of a large, sparse, symmetric positive definite matrix. It represents typical unstructured grid computations with its test of irregular long distance communication using unstructured matrix vector multiplication.

3-D FFT PDE (FT) FT contains the computational kernel of a three-dimensional FFT-based spectral method. It performs 1-D FFTs in the x and y dimensions on a distributed 3-D array, which is done entirely within each processor, and then continues with an array transposition which requires an all-to-all communication. The final FFT is then performed.

LU solver (LU) LU simulates a CFD application which uses symmetric successive over-relaxation (SSOR) to solve a block lower-block upper triangular system of equations, derived from an unfactored implicit finite-difference discretization of the Navier–Stokes equations in three dimensions.

Pentadiagonal solver (SP) SP simulates a CFD application that solves uncoupled systems of equations resulting from an implicit finite-difference discretization of the Navier–Stokes equations. It solves scalar pentadiagonal systems from a full diagonalization of the above scheme.

Block tridiagonal solver (BT) BT originates from the same problem as SP, but instead of solving scalar systems, it solves block tridiagonal systems of 5×5 blocks.

MG, CG, FT and LU run on a power-of-two number of processors, whereas SP and BT require a square number of processors. FT, MG and CG are kernel benchmarks and the rest are application benchmarks. To appropriately test supercomputers of different sizes, NPB 2.3 contains four different classes of problem sizes: W(orkstation), A, B and C.

Benchmark code   Class W    Class A     Class B     Class C
MG               64³        256³        256³        512³
CG               7000       14000       75000       150000
FT               128²×32    256²×128    512×256²    512³
LU               33³        64³         102³        162³
SP               33³        64³         102³        162³
BT               33³        64³         102³        162³

We choose to measure the performance of class W and class B. To create the binaries we edited /config/make.def to look like this:

MPI = pgf90

FLINK = pgf90

FFLAGS = -fast -Nlam -Mvect=prefetch
FLINKFLAGS = -Nlam

MPICC = pgcc

CLINK = pgcc

CC = cc -g

BINDIR = ../bin

RAND = randdp

The compilers are Portland Group’s Fortran 90 compiler pgf90 (version 3.2-3) and C compiler pgcc (version 3.2-3). The -Nlam flag above is a shorthand for:

/usr/local/lib/nscmpi_lam.o -L/usr/local/lam-6.5.1-pgi/lib -llamf77mpi -llammpi++ -lmpi -llam

and the -fast flag defaults to -O2.

4.2.2 Results

The cluster supports two implementations of the Message Passing Interface (MPI): LAM and MPICH. The first objective is to determine which one performs best on our cluster. To do this we run the class B size problems using both LAM and MPICH, and then calculate an average performance difference. The following two tables compare the LAM and MPICH implementations. The LAM implementation performs better, giving on average 4.26% more Mop/s (millions of operations per second) than MPICH, and hence it is used in all further testing².

² The '-' sign in the tables signifies problem sizes that would not execute on the cluster.

NPB 2.0 (PGI Compiler, MPICH) - Problem class B (Mop/s total)

Procs    FT        MG        LU        CG        SP        BT
 1       -         1.31      163.16    29.51     -         -
 2       -         255.31    324.73    78.92
 4       5.62      402.73    606.55    109.48    360.99    -
 8       57.50     749.16    1148.17   218.40
 9                                               544.77    1000.15
16       285.39    1135.15   2057.80   266.52    821.25    1580.00

NPB 2.0 (PGI Compiler, LAM) - Problem class B (Mop/s total)

Procs    FT        MG        LU        CG        SP        BT
 1       -         1.29      163.02    29.77     -         -
 2       -         268.47    321.53    82.32
 4       6.69      397.52    598.43    122.55    406.85    -
 8       53.68     594.58    1173.31   244.20
 9                                               651.96    1089.82
16       776.19    1238.19   2182.58   307.89    858.46    1705.98

The two tables above show that the binaries compiled against LAM provide 4.26% more Mop/s than the ones compiled against MPICH. This difference is due to faster communication over the ethernet, thus increasing the total number of operations performed per second. Hence we use LAM in all further parallel testing.

The next area of investigation is what type of problem runs well on our cluster. We choose the W(orkstation) problem size and run the six benchmarks again (see Figure 8). This provides us with a thumbprint of Grendel.

We expect that the kernel benchmarks, especially FT, will perform poorly on our architecture because of the large communication overhead. This large communication overhead is due to the 100 Mbps ethernet network and the TCP/IP stack in the Linux kernel. The network would be the obvious choice of upgrade for boosting the performance of highly parallel applications. Figure 8 suggests that the performance is poor for the kernel benchmarks, but let us analyze the results further. We use the speedup and efficiency measurements from Section 4.1. In Figure 9 we can see the results from running the six benchmarks with the class W problem size.

4.3 Summary and conclusion of the benchmark results

The first objective of our benchmarks was to find out which implementation of MPI is most useful.

Both LAM and MPICH implement the same MPI standard. The LAM approach of a user-level daemon controlling message communication, together with the use of UDP packets only, gives the best performance. The larger the messages we send with MPI, the higher the bandwidth we get; peak performance is at about 80 kB of data (10 000 doubles) per message.

We also conclude that high network traffic between nodes does not affect overall network performance.

Looking at the speedup (Figure 9) we find that the application benchmarks perform better than the kernel benchmarks, and this becomes evident when we look at the efficiency. FT is approaching zero efficiency already at sixteen processors, and MG and CG are not performing much better. In Section 6.1 we will see that a typical application at the Department of Scientific Computing, the so-called adveq problem, may be seen as an application benchmark and runs very well on Grendel. We can now deduce that a mix between computation and communication, where communication is kept to a minimum, is the key to achieving high performance for parallel applications.

Figure 8: The performance using the W(orkstation) problem size: total Mop/s versus number of processors for FT, MG, LU, CG, SP and BT (LAM parallel API).

5 Theoretical speedup model

5.1 Introduction

In 1967 a researcher at IBM, Gene Amdahl, wrapped up some then newly discovered thoughts on how to do work in parallel. This conclusion was named Amdahl’s law and refers to limits placed on the amount of speedup one can expect from a parallelized computational job. We want to refine this law and use it as a model for simulating the speedup behaviour of our parallel system.

A naive expectation of doing work in parallel would be to think that splitting a computational job among N processors results in completion in 1/N of the time, in other words an N-fold increase in computational power. To formulate Amdahl's law we must first recognize that every parallelized job can contain a serial part (i.e. work that must be done by a single processor) and a parallel part, which is the part that can be subjected to parallelization.

5.2 Theoretical model [Bro00]

5.2.1 Defining speed

The objective of a parallel computational job is to get as much work done as possible in the shortest possible time; hence we must define the speed of a program. We start by stating that the average speed of a program is equal to the work done divided by the time it took to perform this work, $\mathrm{Speed} = \mathrm{Work}/\mathrm{Time}$.

Using earlier statements we rewrite $\mathrm{Time} = T_s + T_p$ as the sum of the time it took to perform the serial work, $T_s$, and the time for performing the parallel work, $T_p$. We then have $\mathrm{Speed}_1 = \mathrm{Work}/(T_s + T_p)$, where the subscript 1 denotes the number of processors performing the work. The speed for doing the same amount of work on $N$ processors would then be $\mathrm{Speed}_N = \mathrm{Work}/(T_s + T_p/N)$.

Figure 9: Speedup and efficiency for NPB (Class W) with the LAM parallel API, illustrating the performance gain in terms of speedup and efficiency.

5.2.2 Defining Speedup

Defining the speedup S as the ratio between the speed of performing a job on one processor and doing the same job on N processors, we arrive at Amdahl’s law.

$$S_N = \frac{T_s + T_p}{T_s + T_p/N}$$

Amdahl's law applied to computational jobs immediately rules out a great number of jobs as candidates for parallelization. If the time it takes to perform the serial part is relatively large compared to the parallel part, we will achieve little or no speedup by parallelizing the job. Hence Amdahl's law gives the best possible speedup one may achieve³.

5.2.3 Refining Amdahl’s law

Although useful in its current expression, Amdahl's law is still too optimistic, since it completely ignores the overhead introduced by parallelization. We arrive at a more fine-grained description of the speedup if we introduce two new elements in the formulation.

³ Amdahl's law is usually expressed with an inequality.

$T_{is}$ : The average serial time that is spent on communication in various ways. This time probably depends on the number of processors in some way; a suitable first approximation is that it is proportional to the number of processors.

$T_{ip}$ : The average parallel time (which could be just idle time) spent on communication.

Using these definitions we end up with a better⁴ estimate of the speedup achieved through parallelization:

$$S_N = \frac{T_s + T_p}{T_s + N\,T_{is} + T_p/N + T_{ip}}$$

We still need to determine how to calculate $T_s$, $T_{is}$, $T_p$ and $T_{ip}$, and we do this by rewriting the variables as:

$T_s = OP_s \times t_{op}$, where $OP_s$ is the number of arithmetic operations performed in the serial part, and $t_{op}$ is the average time it takes to perform an arithmetic operation.

$T_{is} = MPI_s \times t_{mpi}$, where $MPI_s$ is the number of doubles sent in the serial part. The variable $t_{mpi}$ is then the average time to send a double with the MPI interface.

$T_p = OP_p \times t_{op}$, where $OP_p$ is the number of arithmetic operations performed in the parallel part.

$T_{ip} = MPI_p \times t_{mpi}$, where finally $MPI_p$ is the number of doubles sent with the MPI interface in the parallel part of the program.

The speedup model

$$S_N = \frac{(OP_s + OP_p)\,t_{op}}{OP_s t_{op} + N(MPI_s t_{mpi}) + (OP_p t_{op})/N + MPI_p t_{mpi}} \qquad (4)$$

is the model used in our simulations.
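A minimal C sketch of evaluating this model is given below; the operation and message counts in main are purely illustrative assumptions, while the values of t_op and t_mpi are taken from the adveq estimates in Section 6.1.

/* Minimal sketch of the speedup model in Equation (4). */
#include <stdio.h>

/* Speedup predicted by Eq. (4) for N processors. */
double speedup(int N, double OPs, double OPp, double MPIs, double MPIp,
               double t_op, double t_mpi)
{
    double serial   = (OPs + OPp) * t_op;                  /* one-processor time */
    double parallel = OPs * t_op + N * (MPIs * t_mpi)
                    + (OPp * t_op) / N + MPIp * t_mpi;     /* N-processor time   */
    return serial / parallel;
}

int main(void)
{
    /* Hypothetical job: mostly parallel work, some communication. */
    double OPs = 1e6, OPp = 1e9, MPIs = 1e3, MPIp = 1e5;
    double t_op = 430e-9, t_mpi = 750e-9;   /* roughly t_op and t_w from Section 6.1 */
    int N;

    for (N = 1; N <= 16; N *= 2)
        printf("N = %2d  S_N = %.2f\n",
               N, speedup(N, OPs, OPp, MPIs, MPIp, t_op, t_mpi));
    return 0;
}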

5.3 Model verification

We verify the model by simulating a real application and comparing the results to measured data. We choose the adveq application presented in Section 6.1.

Adveq needs certain input parameters, which are fully explained in Section 6.1. Figure 10 corresponds to the parameter values q = 20, nnx = 1024 and nny = 1024. We notice a constant ten-second difference between the estimated time and the actual execution time, which has a negative impact on the estimated speedup. The error is clearly visible in the right-hand side of the figure, where the constant error is of the same order of magnitude as the total execution time (see Section 6.1 for an in-depth explanation).

6 Adapting problems to fit Grendel

6.1 Adveq

The adveq problem represents a common type of algorithm used at the Department of Scientific Computing. The original code is presented in Appendix A. We start by stating the problem:

We want to solve the hyperbolic PDE problem

$$\begin{aligned}
u_t + u_x + u_y &= f(x, y), && 0 \le x \le 1,\ 0 \le y \le 1 \\
u(t, 0, y) &= h(y - 2t) + u_p(0, y), && 0 \le y \le 1 \\
u(t, x, 0) &= h(x - 2t) + u_p(x, 0), && 0 \le x \le 1 \\
u(0, x, y) &= h(x + y) + u_p(x, y), && 0 \le x \le 1,\ 0 \le y \le 1
\end{aligned} \qquad (5)$$

⁴ It is still a very simple model.

Figure 10: A verification of the model using the adveq code presented in Section 6.1: speedup and execution time versus the number of processors, for the model and for measured adveq runs.

where

$$\begin{aligned}
f(x, y) &= 2e^{x+y} + 3x^2 + 6y^2 + \sin\!\left(\frac{x+y}{2}\right) + \cos\!\left(\frac{x+y}{2}\right) \\
u_p(x, y) &= e^{x+y} + x^3 + 2y^3 + \sin\!\left(\frac{x+y}{2}\right) - \cos\!\left(\frac{x+y}{2}\right) \\
h(z) &= \sin(2\pi z)
\end{aligned}$$

The PDE problem (5) has the solution

$$u(t, x, y) = h(x + y - 2t) + u_p(x, y) \qquad (6)$$

Here, we will solve the problem numerically by introducing the leap-frog scheme

$$u^{n+1}_{i,j} = u^{n-1}_{i,j} + 2\Delta t\,\big(f_{i,j} - D_{0x}u^n_{i,j} - D_{0y}u^n_{i,j}\big)$$

where

$$D_{0x}u^n_{i,j} = \frac{u^n_{i+1,j} - u^n_{i-1,j}}{2\Delta x}, \qquad D_{0y}u^n_{i,j} = \frac{u^n_{i,j+1} - u^n_{i,j-1}}{2\Delta y}.$$

For simplicity, we'll use the analytical solution (6) on the boundaries.

The computational area is divided into an nnx by nny grid. Due to stability restrictions, the total number of timesteps is set to Nt = (nnx - 1) + (nny - 1). The grid is cut in the column dimension so that each processor gets a stripe of (approximately) the same width as the others.

MPI is used for the message passing procedure. To make it possible to experiment with the communication-computation ratio, we have a parameter q which decides how many of a stripe's outermost columns are to be sent to the adjacent stripes. Sending data in larger chunks saves time spent on communication start-ups. On the other hand, increasing q also increases the computational work, since the same data is sometimes calculated on two processors. Note also that increasing q does not affect the total amount of data that is to be sent.

Initially the size of the problem and the parameter q are distributed to all processors. The first two timesteps are calculated from the analytical solution. This is due to the finite difference stencil, which needs two layers⁵ to compute a third. After this, work is done on each processor in primarily three steps (a minimal sketch of the exchange in step 1 follows the list):

repeat Nt/q times

1. Send the q outermost columns on each side of the stripe to the adjacent processors, respectively. This is done for the newest and the middle layer.

   repeat steps 2 and 3 q times

2. Phase out the oldest layer, then set the middle layer as the oldest. Then set the newest layer as the middle, leaving space for the new layer.

3. Calculate the new layer. For each turn in the inner loop, the calculated layer will be thinner and thinner, since there is no communication going on. After q turns we will end up with a stripe of the initial width. Go back to step 1 and widen the stripe.
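As an illustration of step 1, a minimal C sketch of the column exchange is shown below. It uses MPI_Sendrecv instead of the eight separate even/odd send and receive steps of the real Fortran code, and the array layout, function name and parameters are assumptions for illustration only.

/* Minimal sketch of exchanging the q outermost columns with the left and
 * right neighbour stripes. Assumes a column-major layout with nnx rows
 * per column; u points to the local stripe with q ghost columns on each
 * side, and local_cols is the number of owned columns. */
#include <mpi.h>

void exchange_columns(double *u, int nnx, int local_cols, int q,
                      int rank, int size, MPI_Comm comm)
{
    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;
    int count = q * nnx;              /* q whole columns */
    MPI_Status status;

    /* Send my leftmost q owned columns left, receive ghosts from the right. */
    MPI_Sendrecv(&u[q * nnx],                count, MPI_DOUBLE, left,  0,
                 &u[(q + local_cols) * nnx], count, MPI_DOUBLE, right, 0,
                 comm, &status);

    /* Send my rightmost q owned columns right, receive ghosts from the left. */
    MPI_Sendrecv(&u[local_cols * nnx],       count, MPI_DOUBLE, right, 1,
                 &u[0],                      count, MPI_DOUBLE, left,  1,
                 comm, &status);
}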

To estimate the time consumption we count which and how many operations are performed in each step:

1. q columns are sent in eight steps (even-numbered and odd-numbered processors, send and receive, u and unew). This is done Nt/q times. Time consumption: $Nt/q \cdot 8(t_s + q \cdot nnx \cdot t_w)$.

2. Pointers are easily shifted. We approximate this with zero work.

3. Since the work is somewhat unbalanced (the difference is however minor) we consider a middle stripe. The outer loop is performed Nt/q times, the inner loop q times. In the first turn of the inner loop, the overhead is 2(q-1) columns (q-1 on each side). The next turn, the overhead is 2(q-2) columns, and so on. In total there are $2 \cdot 0.5 \cdot q(q-1)$ overhead columns in the inner loop. Add to that the $q \cdot nny/size$ columns in the original stripe. In total we have on each node

$$\text{Number of elements} = \frac{Nt}{q} \cdot nnx \cdot \big(2 \cdot 0.5 \cdot q(q-1) + q \cdot nny/size\big) = Nt \cdot nnx \big((q-1) + nny/size\big)$$

where size is the number of processors used.

How many floating point operations are required on each element is not easily estimated, especially since there are exponential and trigonometric function calls. There are also main memory fetches involved. The easiest approach is to make a test program that measures the time to operate on one element; this is done in Appendix C. The test program yields that operating on one element consumes approximately 430 ns, so let $t_{op} = 430$ ns. For $t_s$ and $t_w$ we use the results from our ping-pong program (see Appendix D), $t_s = 62\ \mu\mathrm{s}$ and $t_w = 750$ ns. We now have

$$\text{Total time} = Nt\big(4(t_s/q + nnx \cdot t_w) + nnx((q-1) + nny/size)\,t_{op}\big) \qquad (7)$$

if the number of processors is three or greater. In the case of one or two processors the total time is reduced to

$$\text{Total time for one processor} = Nt \cdot nnx \cdot nny \cdot t_{op}$$

$$\text{Total time for two processors} = Nt\big(4(t_s/q + nnx \cdot t_w) + nnx(0.5(q-1) + nny/size)\,t_{op}\big)$$

Figure 11: Execution time versus number of processors for adveq on a 1024x1024 grid, experimental and theoretical values for q = 1 and q = 20. Sending data in larger chunks does not make up for the increase in computational work.

A naïve MPI implementation of the adveq problem is presented in Appendix B. By using immediate (non-blocking) sends and receives and using vector notation in the calculation, shorter execution times can be achieved. However, such an implementation is hard to model and would not correspond to the original code.

Simulating the time consumption and validating the model against experimental data will however reveal, as can be seen by taking a closer look at Equation (7), that increasing q is not motivated. We choose a problem grid of 1024 × 1024 elements. Both experimental and theoretical values for q = 1 and q = 20 are presented in Figure 11. Efficiency and speedup for the case q = 20 are shown in Figure 10.

When we use several processors, a constant time difference of 10 seconds between the model and the experimental data appears. This difference is caused by the instructions in the code that concern neither the computation nor the communication. The size of such overhead is hard to estimate and is handled only in very advanced models. Usually, as in this case, the purpose of a model is to give an estimate of the maximum achievable performance.

We examine the q dependence further by analyzing how the execution time depends on q. As shown in Figure 12, the appearance of the graph depends more on the grid size than on the number of processors used. Only a very small problem size could motivate using a value of q greater than 1, and even in those cases the gain is only fractions of a second.

It is clear that the communication overhead produced by the parallelization is too small in the adveq problem to motivate an increased computational load. This also shows that typical algorithms at the Department of Scientific Computing, such as adveq, run very well on Grendel. Hence the bottleneck that the communication constitutes is not a big issue.

⁵ By layer we here refer to the data set that one timestep constitutes.

Figure 12: The execution time depending on q, for four cases: (a) 2500x2500 elements, 4 processors; (b) 2500x2500 elements, 16 processors; (c) 200x200 elements, 4 processors; (d) 200x200 elements, 16 processors. A large number of processors on a small data set gives rise to an increased ratio between the time spent on communication and the total execution time. In (a), the communication time per total time ratio at the extreme point is about one percent, while in (d) it is as large as about 40 percent (note however that this is an extreme case).

A The original Adveq code

!======================================================================

!

! ---

! Routine : Main

! Purpose : Solve Ut+Ux+Uy=F(x,y) with Leap-Frog

! Author : Jarmo Rantakokko

! Date : 990614

!

!======================================================================

program adveq
  implicit none

  integer, parameter :: DP=kind(0.0D0)

  !-- Variables.
  integer :: Nx,Ny,Nt,i,j,k,nthreads
  real(kind=DP) :: dt,dx,dy,norm,T,x,y,v,ti
  real(kind=DP) :: ttime,ctime,walltime
  real(kind=DP),pointer,dimension(:,:) :: uold,u,unew,temp;
  real(kind=DP) :: F,up,h
  integer omp_get_max_threads
  integer,parameter :: disk=10
  character(len=*),parameter :: input='/home/da/adveq/original/params.dat'
  !---

  ttime=walltime()

  ! Set up the problem
  namelist /problemsize/ Nx,Ny,nthreads
  open(unit=disk,file=input)
  read(disk,problemsize)
  close(disk)
  ! call omp_set_num_threads(nthreads)
  dx=1.0_DP/Nx; dy=1.0_DP/Ny; dt=1.0_DP/(Nx+Ny); T=1.0; Nt=nint(T/dt);
  allocate(uold(0:Nx,0:Ny),u(0:Nx,0:Ny),unew(0:Nx,0:Ny))
  write(*,*) '============================================'
  write(*,*) 'Version : Fortran 90'
  ! write(*,'(A,I8)') ' Number of threads:',omp_get_max_threads()
  write(*,'(A,3I8)') ' Problem Size :',Nx+1,Ny+1,Nt
  write(*,*) '============================================'
  write(*,*) 'Computing...'

  ! Initial conditions
  !$OMP PARALLEL DO PRIVATE(i,j,x,y)
  do j=0,Ny
     do i=0,Nx
        x=real(i,kind=DP)/Nx; y=real(j,kind=DP)/Ny
        u(i,j)=h(x+y)+up(x,y);
        unew(i,j)=h(x+y-2*dt)+up(x,y);
     end do
  end do
  !$OMP END PARALLEL DO

  ! Integrate the solution in time
  ctime=walltime()
  do k=2,Nt

     ! Swap pointers
     temp=>uold; uold=>u; u=>unew; unew=>temp;

     ! Leap Frog
     !$OMP PARALLEL
     !$OMP DO private(i,x,y)
     do j=1,Ny-1
        do i=1,Nx-1
           x=real(i,kind=DP)/Nx; y=real(j,kind=DP)/Ny
           unew(i,j)=uold(i,j)+2*dt*(F(x,y)- &
                ((u(i+1,j)-u(i-1,j))/2.0_DP*Nx+ &
                 (u(i,j+1)-u(i,j-1))/2.0_DP*Ny))
        end do
     end do
     !$OMP END DO NOWAIT

     ! Boundary conditions
     ti=k*dt;
     !$OMP DO private(y)
     do j=0,Ny
        y=real(j,kind=DP)*dy
        unew(0,j)=h(y-2*ti)+up(0.0_DP,y)
     end do
     !$OMP END DO NOWAIT
     !$OMP DO private(x)
     do i=1,Nx
        x=real(i,kind=DP)*dx
        unew(i,0)=h(x-2*ti)+up(x,0.0_DP)
     end do
     !$OMP END DO NOWAIT

     ! Exact boundary conditions
     x=1.0_DP
     !$OMP DO private(y)
     do j=1,Ny
        y=real(j,kind=DP)*dy
        unew(Nx,j)=up(x,y)+h(x+y-2*ti)
     end do
     !$OMP END DO NOWAIT
     y=1.0_DP
     !$OMP DO private(x)
     do i=1,Nx-1
        x=real(i,kind=DP)*dx
        unew(i,Ny)=up(x,y)+h(x+y-2*ti)
     end do
     !$OMP END DO NOWAIT
     !$OMP END PARALLEL
  end do
  ctime=walltime()-ctime

  ! Residual norm || u_new-(h(x+y-2*t)+up(x,y)) ||
  norm=0.0
  !$OMP PARALLEL DO PRIVATE(i,x,y,v) REDUCTION(+:norm)
  do j=0,Ny
     do i=0,Nx
        x=real(i,kind=DP)/Nx; y=real(j,kind=DP)/Ny
        v=h(x+y-2*Nt*dt)+up(x,y)
        norm=norm+(unew(i,j)-v)*(unew(i,j)-v);
     end do
  end do
  !$OMP END PARALLEL DO
  ttime=walltime()-ttime

  ! Display results
  write(*,*) '---'
  write(*,'(A,F9.4,A)') ' Total time : ',ttime,' sec'
  write(*,'(A,F9.4,A)') ' Compute time : ',ctime,' sec'
  write(*,'(A,E14.6)') ' Error norm : ',norm/sqrt(real(Nx*Ny))
  write(*,*) '---'

end program adveq

B The MPI implementation of the Adveq code

program adveq
  implicit none
  include 'mpif.h'

  integer, parameter :: DP=kind(0.0D0)
  integer :: Nx,Ny,Nt,i,j,k,nthreads,q,q1
  integer :: nnx,nny,nx1,nx2,ny1,ny2,rest
  real(kind=DP) :: dt,dx,dy,mynorm,norm,T,x,y,v,ti
  real(kind=DP) :: ttime,ctime,walltime
  real(kind=DP),pointer,dimension(:,:) :: uold,u,unew,temp
  real(kind=DP) :: F,up,h ! Functions
  integer,parameter :: disk=10
  character(len=*),parameter :: input='/home/da/adveq/p1/params.dat'
  integer :: rank,size,ierror
  integer, dimension(3) :: tmpbuf

  ttime = walltime()

  ! Initialize MPI, find out my rank and how many procs are used
  call MPI_INIT(ierror)
  call MPI_COMM_SIZE(MPI_COMM_WORLD,size,ierror)
  call MPI_COMM_RANK(MPI_COMM_WORLD,rank,ierror)

  ! Distribute problem dimensions
  if(rank .eq. 0) then
     namelist /problemsize/ Nx,Ny,q
     open(unit=disk,file=input)
     read(disk,problemsize)
     close(disk)
     tmpbuf(1) = Nx
     tmpbuf(2) = Ny
     tmpbuf(3) = q

References
