Introduction to the HPC cluster


Mikica Kocic

2020-11-24, Fysikum

Outline

Part I

A brief history of high-performance computing
HPC cluster @ Fysikum and the cluster usage policy
Other collaboration tools at Fysikum

Introduction to computer architecture and performance
HPC services and nodes overview

How to apply for an HPC account

Documentation website and contacting the support


Outline

Part II

Login (ssh, kerberos, X11, VPN)

Where are my files? (storage and file systems)

Working on the head nodes (editing, compiling, file transfer)

Building software (environment modules, miniconda, singularity containers, nix packages, easybuild, charliecloud, ocr)

Running jobs (sbatch, salloc, srun, prun/mpirun)

Monitoring jobs (squeue, sinfo, hpc-moni, hpc-ac)

Outline

Part III

Performance engineering
Debugging and profiling tools


A brief history of HPC

A brief history of high-performance computing

[Figure: the HPC food chain before the mid-1990s, with PC, Workstation, Vector Supercomputer and MPP Supercomputer]

Vector = Single Instruction, Multiple Data (SIMD)
MPP = Massively Parallel Processing

Rajkumar Buyya, High Performance Cluster Computing, 1998

A brief history of high-performance computing

[Figure: the HPC food chain after the mid-1990s, now including the Commodity Cluster]

Rajkumar Buyya, High Performance Cluster Computing, 1998

Evolution of supercomputer architecture

[Plot: the performance share of architectures on the TOP500 list, 6/1993 to 6/2014 (Single Processor, Constellations, SMP, Cluster, MPP, SIMD), share from 0% to 100%]

Today nearly 90% of the TOP500 supercomputers are clusters!

https://top500.org/resources/top-systems/

Diagram from: B. Li and P. Lu, The Evolution of Supercomputer Architecture: A Historical Perspective, in W. Xu et al. (Eds.): NCCET 2015, CCIS 592, pp. 145–153, 2016.


HPC @ Fysikum

Technical Division / HPC support @ Fysikum

Since September 2019, the Technical Division has provided support for a high-performance computing (HPC) cluster.

Fysikum’s HPC cluster is a common resource which is available to all research groups at Fysikum.

Basic infrastructure is provided by Fysikum:

rack space, power, operating system infrastructure, hardware and software installation & maintenance, basic login nodes, storage, and interconnect, as well as some general computing nodes.

Focus on the continuous development of the cluster

(decommissioning the old equipment, scaling the cluster up & out).


Cluster Usage Policy

Available compute time is shared equally between Fysikum users (managed by the common queuing system).

Excessive use of storage requires a user contribution to the cluster.

Additional compute nodes are funded by projects/research groups.

Research groups have priority on the resources they funded.

If the resources are idle they are available for other users.

External collaborators can get access to the cluster after approval by their host. They get access on the same terms as the host’s group members.

Collaboration tools @ Fysikum

Note the other collaboration tools at Fysikum:

NextCloud

https://nextcloud.fysik.su.se

GitLab

https://gitlab.fysik.su.se

Indico

https://indico.fysik.su.se

Newdle

https://newdle.fysik.su.se

CoCalc (experimental)

https://cocalc.fysik.su.se


Background: Computer Architecture & Performance

Computer Architecture 101

[Diagram: the basic building blocks of a computer (processor, memory, I/O system, interconnect)]

Quantifying Performance

[Diagram: the dimensions of HPC performance (processor core count, flops/core, memory capacity, memory bandwidth, network latency, network bandwidth, I/O performance), comparing HPL vs data analytics]

IBM RedBook REDP-5478, Networking Design for HPC and AI on IBM Power Systems, 2018

[Figures: performance requirements of Computer Aided Engineering and Life Sciences workloads]

Srini Chari, HPC and HPDA for the Cognitive Journey with OpenPOWER, 2016


How to scale?

Shared Memory Multiprocessing (SMP)

[Diagram: several processors, each with one or more levels of private cache, connected through a shared cache to the main memory and the I/O system]

Hennessy & Patterson, Computer Architecture, 2019

Distributed Shared Memory

[Diagram: multiple compute nodes, each a multicore multiprocessor with its own memory and I/O, connected by an interconnection network]

Hennessy & Patterson, Computer Architecture, 2019

Fysikum’s HPC Cluster


Cluster elements

IBM RedBook SG24-8280, Implementing an HPC on S822LC, 2016

Fysikum’s HPC Cluster

Head nodes: sol-login, sol-nix, cocalc (plus a management node)

Compute nodes:
solar partition: c01n01 ... c01n08, 8 nodes Dell R416 II, 2 x 6 cores/node, AMD Opteron 4238, 3.3 GHz, RAM 32 GB
fermi partition: c02n01 ... c02n10, 10 nodes HP DL160 G6, 2 x 4 cores/node, Intel Xeon L5500, 2.27 GHz, RAM 24 GB
cops partition: c03n01 ... c03n11, 11 nodes Dell R6525, 2 x 32 cores/node, AMD EPYC2 7502, 2.5 GHz, RAM 512 GB, 3200 MT/s

Networks: InfiniBand EDR 100 Gbps & DDR 20 Gbps; Ethernet 10 GbE & 1 GbE; separate management network

Storage nodes:
/cfs/home: NFS / ZFS, 110 TiB, 2 x (5+2) x 14 TB, ≈ 700 MB/s
/cfs/data: Lustre filesystem, 314 TiB, 9 x (3+1) x 14 TB, 6 to 10 GB/s

Accessing the cluster

Before you start using the cluster, you will need to open an account.

We do not use passwords to access the system.

You have to use Kerberos or provide your public SSH key.

To apply for an HPC account, complete the following form
(here you can also enter your public SSH key):

https://it.fysik.su.se/hpc-reg

Documentation & Support

User’s guide

https://it.fysik.su.se/hpc

System internals

https://it.fysik.su.se/hpc-sys

Mailing list

hpc@fysik.su.se

Support mail

hpc-support@fysik.su.se

Issue tracker

https://gitlab.fysik.su.se/hpc/support


Part II

Login via SSH

We have two login nodes: sol-login and sol-nix.

Allowed authentications: SSH public key or Kerberos.

If you use Kerberos, you need to have a valid ticket first:

kinit -f username@SU.SE

To log in, issue (the port must be given outside the SU network):

ssh -p <port> username@sol-login.fysik.su.se

To enable X11 forwarding, use -X:

ssh -X -p <port> username@sol-login.fysik.su.se

When troubleshooting, use -v, -vv or -vvv:

ssh -vvv -p <port> username@sol-login.fysik.su.se

Configuring ssh on the client

The SSH private and public keys are generated using ssh-keygen and stored in ~/.ssh/id_* and ~/.ssh/id_*.pub, respectively.

ssh obtains configuration data from:

1. command-line options
2. user’s configuration file (~/.ssh/config)
3. system-wide configuration file (/etc/ssh/ssh_config)

You can use ~/.ssh/config to configure options on a per-host basis:

~/.ssh/config example (see man ssh_config)

Host sol-*.fysik.su.se

Port <port>

User <user>

ForwardX11 yes

Wireguard VPN

Fysikum has started setting up its own VPN infrastructure based on Wireguard. The service is still experimental and is intended for those who need remote access to lab infrastructure, file servers and internal license servers.

After installation you need to generate a private/public key pair.

Please send the public key to holger.motzkau at fysik.su.se together with a request describing what you want to access, and you will receive a configuration file.

Stay tuned. More info:

https://www.fysik.su.se/english/staff/it-and-telephony/vpn


Where to store the files?

The cluster offers two storage systems, /cfs/home and /cfs/data.

Using them efficiently requires knowing when to use which.

/cfs/home 110 TiB

used for the home directories
backup via ZFS snapshots (every 10 minutes)
keep your programs, source files and the final results here
mounted via NFS, backend is ZFS
not fast: max write throughput ≈ 700 MB/s

/cfs/data 314 TiB

used for large data storage
no backup (!)
Lustre file system (with ZFS backend)
very fast: max write throughput from 6 to 10 GB/s
POSIX compatible access control lists (ACL)

Transferring files

Working with the HPC cluster can involve transferring data back and forth between your local machine and the cluster.

File transfers between the computers can be done using scp or rsync.

Files from the Internet can be transferred using curl or wget.

sol-login and c03n11 have 10 GbE connections to the Internet.

If you wish to transfer large amounts of data, submit a Slurm job to c03n11 (not to choke the Internet connection on the login node).
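
A hedged sketch of such transfers (the port, user name and paths are placeholders; adapt them to your account):

# copy a single archive from your local machine to the Lustre data area
scp -P <port> results.tar.gz username@sol-login.fysik.su.se:/cfs/data/username/

# synchronize a whole directory, resuming interrupted transfers
rsync -avP -e "ssh -p <port>" ./mydata/ username@sol-login.fysik.su.se:/cfs/data/username/mydata/

# fetch a file from the Internet directly on the cluster (placeholder URL)
wget https://example.org/dataset.tar.gz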


Lustre file system

Use ls -l only where absolutely necessary.

Use lls, /bin/ls or lfs find -D 0 * instead
(lls is a system-wide alias for /bin/ls -U).

Lustre file system commands: lfs

Search the directory tree: lfs find
Check your own disk usage: lfs quota /cfs/data
Check available disk space: lfs df -h
Check/modify file stripes: lfs getstripe, lfs setstripe
Check/modify access control lists: getfacl and setfacl

Access Control Lists

setfacl and getfacl – utilities to set and get Access Control Lists (ACLs) of files and directories.

Example: give the rwx rights to some users for a directory in /cfs/data

DIR=/cfs/data/username/directory

mkdir $DIR
chmod go= $DIR

USERS="user1 user2 user3"

for user in $USERS; do
    setfacl -R -m u:$user:rwx $DIR
    setfacl -R -m d:u:$user:rwx $DIR
done

getfacl $DIR


Storage performance

The relatively slow rate of I/O operations can create bottlenecks.

Pay attention to how your programs are doing I/O as that can have a huge impact on the run time of your jobs.

Things to remember:

Minimize I/O operations.

Larger I/O operations are more efficient than small ones.

If possible aggregate reads/writes into larger blocks.

Avoid creating too many files.

Post-processing a large number of files can be very hard.

Avoid creating directories with very large numbers of files.

Create directory hierarchies instead (also improves interactiveness).
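
For example (a hedged sketch; the paths are placeholders), many small result files can be packed into a single archive before they are moved to /cfs/data:

# pack a run directory with many small files into one archive on /cfs/data
cd /cfs/home/username/runs
tar czf /cfs/data/username/run42_results.tar.gz run42/

# list the archive contents later without creating thousands of files again
tar tzf /cfs/data/username/run42_results.tar.gz | head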

Working on the login nodes

When you log in to the cluster with ssh, you land on a designated login node, in your home directory.

Available editors: vim, nano and emacs

Available terminal multiplexers: screen and tmux

Things to remember:

Do not run parallel jobs on the login nodes.

The login nodes are not fast and have a limited memory.

They are shared among the users.

Sometimes compiling longer programs can be faster on the compute nodes.
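
For example (a hedged sketch, not a recipe from the slides), a longer build can be moved off the login node by running it inside a short Slurm allocation:

# allocate one task with 8 cores for 30 minutes and run the build there
salloc -n 1 -c 8 -t 30
srun make -j 8
<Ctrl-D>   # release the allocation when the build is done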


Installing software on a multi-user HPC system

Typical questions from a user new in the HPC environment:

Can I do sudo? I need to install some software.

What is the root password?

More serious portability and reproducibility questions:

My software does not compile. Some libraries are missing.

After compiling, my software does not work

(it works on my laptop/it worked on another cluster).


Cluster is a specific multi-user system

A multi-user HPC cluster is very different from a single-user’s laptop.

It facilitates a broad spectrum of users with varying requirements.

Performance of the built software is very important.

There are side-by-side multiple software versions and variants.

Software installations should remain available ‘indefinitely’.

On Fysikum’s HPC cluster, the base OS installation is kept minimal.

The installed software is kept in various package managers.

Available package managers

Lmod environment module files
Miniconda (a small, bootstrap version of Anaconda)
Nix package manager
Singularity containers

also available:

EasyBuild (a build and installation framework for HPC systems)
Spack (a package manager supporting multiple versions, configurations, platforms, and compilers)
CharlieCloud (lightweight user-defined software stacks for high-performance computing)
OCR (open community runtime for shared memory)


Working with modules

The environment module files (‘modules’) allow installed software packages to be dynamically added to and removed from the running environment.

Displaying modules

module list

module available [<name>]

module show <name>

Loading, swapping and unloading modules

module load <names>

module swap <name1> <name2>

module unload <names>

You can also use the alias ml

ml
ml swap openmpi3 openmpi4
ml Mathematica/12.1
ml nix

Working with Conda

Miniconda is a small, bootstrap version of Anaconda that includes only conda, Python, the packages they depend on, and a small number of other useful packages, including pip, zlib and a few others.

Using the system-wide conda module

ml conda
which python

Installing conda locally in ~/.local2/bin

mkdir -p ~/miniconda
cd ~/miniconda
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
bash miniconda.sh -b -p ~/.local2
export PATH=~/.local2/bin:$PATH
conda init bash
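
Once conda is on your PATH, a typical (hedged) workflow is to keep software in named environments; the environment name and packages below are only examples:

# create and activate an isolated environment with the packages you need
conda create -n myproject python=3.9 numpy scipy
conda activate myproject

# add further packages to the active environment later
conda install matplotlib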


Working with Nix package manager

Nix is a powerful package manager for Linux and other Unix systems that makes package management reliable and reproducible.

Activating Nix package manager

ml nix

nix-env -q

Nix is the default environment on the login node sol-nix

ssh username@sol-nix.fysik.su.se

ml
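
A hedged sketch of installing a package into your own Nix profile (htop is only an example; the nixpkgs attribute path depends on how the channels are set up on the cluster):

# search for a package and install it into your per-user profile
nix-env -qaP htop
nix-env -iA nixpkgs.htop

# list and remove packages from the profile
nix-env -q
nix-env -e htop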

Singularity – full control of your environment

Singularity containers can be used to package entire scientific workflows, software and libraries, and even data. This means that you don’t have to ask your cluster admin to install anything for you - you can put it in a Singularity container and run it.

Benefits:

Escape “dependency hell”

Local and remote code works identically every time.

Package software and dependencies in one file.

One file contains everything and can be moved anywhere.

Use the same container on different SNIC clusters.

Negligible performance decrease.
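
A hedged sketch of a typical container workflow (the Docker image and script are placeholders; pull the image once, then run it inside Slurm jobs):

# build a local image file (.sif) from a public Docker image
singularity pull docker://python:3.10-slim

# run a command inside the container on the current node
singularity exec python_3.10-slim.sif python3 --version

# run the same container inside a Slurm job
srun -n 1 singularity exec python_3.10-slim.sif python3 myscript.py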


SLURM (Simple Linux Utility for Resource Management)

Slurm – “an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters“

allocates resources (‘compute nodes’) for some duration of time
framework for starting, executing, and monitoring work on the allocated nodes

arbitrates contention for resources by managing a queue

Commands to know:

sbatch, scancel, squeue, salloc, srun, prun/mpirun, xvfb-run

Running batch jobs

To submit a job, use the command sbatch

sbatch <script>

The script can contain directives that specify the requested resources:

Example: run myprog for 1 hour on 4 nodes with 12 cores per node

#!/bin/bash -l
#SBATCH -J jobname
#SBATCH -t 1:00:00
#SBATCH -p solar
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=12
# (alternative: --ntasks=48)
cat $0
prun -n 48 ./myprog

If you change your mind, use scancel to cancel the job.


Running interactive jobs

Use salloc to obtain a job allocation and then issue commands.

Example: allocate 24 cores, run a command, then relinquish the job

salloc -n 24

hostname

prun hostname

<Ctrl-D>

You can also use srun to run an interactive bash shell.

Example: allocate 2 nodes and run bash interactively on the master node

srun -N 2 --pty bash

hostname

prun hostname

<Ctrl-D>

Monitoring

Slurm commands:

squeue, qtop, scancel, sinfo

Display running jobs that belong to the user

squeue -u $USER # or use: qtop

Stop a running job or remove a pending one from the queue

scancel <jobnumber>

Real-time monitoring tools and accounting info:

HPC-moni

https://it.fysik.su.se/hpc-moni

HPC-ac

https://it.fysik.su.se/hpc-ac


Part III

Available debugging and profiling tools

valgrind, scalasca, scorep, pdtoolkit, tau

Performance engineering

Example: Tuning the memory performance

STREAM – the standard benchmark for memory performance
Website: https://www.cs.virginia.edu/stream/

The benchmark contains 4 tests (Copy, Scale, Sum, Triad).

An illustration of the triad test (Fortran):

do i = 1, n
   a(i) = b(i) + s*c(i)
end do
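
A hedged sketch of how numbers like those on the following slides might be produced (the compiler module name, array size and the presence of stream.c from the STREAM website are assumptions, not taken from the slides):

# build stream.c at several optimization levels, with OpenMP enabled
ml gcc                       # assumed name of a compiler module
for opt in 0 1 2 3; do
    gcc -O${opt} -fopenmp -DSTREAM_ARRAY_SIZE=100000000 stream.c -o stream_O${opt}
done

# run the OpenMP version on one cops node with 64 pinned threads
OMP_NUM_THREADS=64 OMP_PROC_BIND=true srun -p cops -N 1 -n 1 -c 64 ./stream_O3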

Vector processing and SIMD instructions

The CPU instruction sets supported by different partitions:

Instruction set   solar   fermi   cops   qcmd
sse                 •       •      •      •
sse2                •       •      •      •
sse3                •       •      •      •
sse4_1              •       •      •      •
sse4_2              •       •      •      •
sse4a               •              •      •
fma4                •
fma                                •      •
avx                 •              •      •
avx2                               •      •
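
To see what a node actually supports, you can inspect its CPU flags directly (a hedged sketch):

# print the instruction-set flags of the node you are logged in to
grep -m 1 '^flags' /proc/cpuinfo | tr ' ' '\n' | grep -E 'sse|avx|fma'

# or query one node of a given partition through Slurm
srun -p cops -N 1 -n 1 grep -m 1 '^flags' /proc/cpuinfo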

STREAM Triad performance for different array sizes

[Plot: cops, STREAM Triad, memory bandwidth (GB/s) vs. number of cores for large array sizes; MPI and OMP (pinned) runs at optimization levels 0 to 3]

[Plot: cops, STREAM Triad, single task, different optimization levels; memory bandwidth (GB/s) vs. array size from 10^3 to 10^9]

[Plot: cops, STREAM Triad, MPI ranks = 64, different optimization levels; memory bandwidth (GB/s) vs. array size from 10^3 to 10^9]

[Plot: cops, STREAM Triad, OMP (pinned) threads = 64, different optimization levels; memory bandwidth (GB/s) vs. array size from 10^3 to 10^9]

[Plot: cops, STREAM Triad, using OMP (pinned) with 1 to 64 threads, optimization level = 0; memory bandwidth (GB/s) vs. array size from 10^3 to 10^9]

[Plot: cops, STREAM Triad, using MPI with 1 to 64 ranks, optimization level = 0; memory bandwidth (GB/s) vs. array size from 10^3 to 10^9]

[Plot: cops, STREAM Triad, using OMP (pinned) with 1 to 64 threads, optimization level = 1; memory bandwidth (GB/s) vs. array size from 10^3 to 10^9]

[Plot: cops, STREAM Triad, using MPI with 1 to 64 ranks, optimization level = 1; memory bandwidth (GB/s) vs. array size from 10^3 to 10^9]
