Introduction to the HPC cluster
Mikica Kocic
2020-11-24, Fysikum
Outline
Part I
A brief history of high-performance computing
HPC cluster @ Fysikum and the cluster usage policy
Other collaboration tools at Fysikum
Introduction to computer architecture and performance
HPC services and nodes overview
How to apply for an HPC account
Documentation website and contacting the support
Outline
Part II
Login (ssh, kerberos, X11, VPN)
Where are my files? (storage and file systems)
Working on the head nodes (editing, compiling, file transfer)
Building software (environment modules, miniconda, singularity containers, nix packages, easybuild, charliecloud, ocr)
Running jobs (sbatch, salloc, srun, prun/mpirun)
Monitoring jobs (squeue, sinfo, hpc-moni, hpc-ac)
Outline
Part III
Performance engineering
Debugging and profiling tools
A brief history of HPC
A brief history of high-performance computing
The HPC food chain before the mid-1990s: PC, workstation, vector supercomputer, MPP supercomputer
Vector = Single Instruction, Multiple Data (SIMD)
MPP = Massively Parallel Processing
Rajkumar Buyya, High Performance Cluster Computing, 1998
A brief history of high-performance computing
The HPC food chain after the mid-1990s: the commodity cluster
Rajkumar Buyya, High Performance Cluster Computing, 1998
Evolution of supercomputer architecture
[Chart: the performance share of architectures on the TOP500 list, June 1993 to June 2014; architectures: Single Processor, Constellations, SMP, Cluster, MPP, SIMD]
Today nearly 90% of the TOP500 supercomputers are clusters!
https://top500.org/resources/top-systems/
Diagram from: B. Li and P. Lu, The Evolution of Supercomputer Architecture: A Historical Perspective, in W. Xu et al. (Eds.): NCCET 2015, CCIS 592, pp. 145–153, 2016.
HPC @ Fysikum
Technical Division / HPC support @ Fysikum
As of September 2019, the Technical Division provides support for a high-performance computing (HPC) cluster.
Fysikum’s HPC cluster is a common resource which is available to all research groups at Fysikum.
Basic infrastructure is provided by Fysikum:
rack space, power, operating system infrastructure, hardware and software installation & maintenance, basic login nodes, storage, and interconnect, as well as some general computing nodes.
Focus on the continuous development of the cluster
(decommissioning the old equipment, scaling the cluster up & out).
Cluster Usage Policy
Available compute time is shared equally between Fysikum users (managed by the common queuing system).
Excessive use of storage requires a user contribution to the cluster.
Additional compute nodes are funded by projects/research groups.
Research groups have priority on the resources they funded.
If the resources are idle they are available for other users.
External collaborators can get access to the cluster after approval by the host. They get access to the cluster on the same terms as the host’s group members.
Collaboration tools @ Fysikum
Note the other collaboration tools at Fysikum:
NextCloud
https://nextcloud.fysik.su.se
GitLab
https://gitlab.fysik.su.se
Indico
https://indico.fysik.su.se
Newdle
https://newdle.fysik.su.se
CoCalc (experimental)
https://cocalc.fysik.su.se
Background: Computer Architecture & Performance
Computer Architecture 101
[Diagram: processor, memory and I/O system connected by an interconnect]
Quantifying Performance
Performance dimensions:
Processor core count and flops per core
Memory capacity and memory bandwidth
Network latency and network bandwidth
I/O performance
HPL vs Data Analytics
IBM RedBook REDP-5478, Networking Design for HPC and AI on IBM Power Systems, 2018
Computer Aided Engineering
Srini Chari, HPC and HPDA for the Cognitive Journey with OpenPOWER, 2016
Life Sciences
Srini Chari, HPC and HPDA for the Cognitive Journey with OpenPOWER, 2016
How to scale the processor, memory, I/O system and interconnect?
Shared Memory Multiprocessing (SMP)
[Diagram: multiple processors, each with one or more levels of private cache, a shared cache, main memory and the I/O system]
Hennessy & Patterson, Computer Architecture, 2019
Distributed Shared Memory
[Diagram: compute nodes, each a multicore multiprocessor with its own memory and I/O, joined by an interconnection network]
Hennessy & Patterson, Computer Architecture, 2019
Fysikum’s HPC Cluster
Cluster elements
IBM RedBook SG24-8280, Implementing an HPC on S822LC, 2016
Fysikum’s HPC Cluster
Head nodes: sol-login, sol-nix, cocalc; plus a management node.
Compute nodes:
solar partition: c01n01 ... c01n08 - 8 nodes Dell R416 II, 2 x 6 cores/node, CPU: AMD Opteron 4238, 3.3 GHz, RAM 32 GB
fermi partition: c02n01 ... c02n10 - 10 nodes HP DL160 G6, 2 x 4 cores/node, CPU: Intel Xeon L5500, 2.27 GHz, RAM 24 GB
cops partition: c03n01 ... c03n11 - 11 nodes Dell R6525, 2 x 32 cores/node, CPU: AMD EPYC2 7502, 2.5 GHz, RAM 512 GB, 3200 MT/s
Interconnect: Infiniband EDR 100 Gbps & DDR 20 Gbps; public and management networks on 10 GbE & 1 GbE.
Storage nodes:
/cfs/data - Lustre filesystem, 314 TiB, 9 x (3+1) x 14 TB, 6-10 GB/s
/cfs/home - NFS / ZFS, 110 TiB, 2 x (5+2) x 14 TB, 700 MB/s
Accessing the cluster
Before you start using the cluster, you need to open an account.
We do not use passwords to access the system.
You have to use Kerberos or provide your public SSH key.
To apply for an HPC account, complete the following form
(where you can also enter your public SSH key):
https://it.fysik.su.se/hpc-reg
Documentation & Support
User’s guide: https://it.fysik.su.se/hpc
System internals: https://it.fysik.su.se/hpc-sys
Mailing list: hpc@fysik.su.se
Support mail: hpc-support@fysik.su.se
Issue tracker: https://gitlab.fysik.su.se/hpc/support
Part II
Login via SSH
We have two login nodes: sol-login and sol-nix.
Allowed authentications: SSH public key or Kerberos.
If you use Kerberos, you need to have a valid ticket first:
kinit -f username@SU.SE
To log in, issue (the port must be given outside the SU network):
ssh -p <port> username@sol-login.fysik.su.se
To enable X11 forwarding, use -X:
ssh -X -p <port> username@sol-login.fysik.su.se
When troubleshooting, use -v, -vv or -vvv:
ssh -vvv -p <port> username@sol-login.fysik.su.se
Configuring ssh on the client
The SSH private and public keys are generated using ssh-keygen and stored in ~/.ssh/id_* and ~/.ssh/id_*.pub, respectively.
ssh obtains configuration data from:
1. command-line options
2. user’s configuration file (~/.ssh/config)
3. system-wide configuration file (/etc/ssh/ssh_config)
You can use ~/.ssh/config to configure options on a per-host basis.
~/.ssh/config example (see man ssh_config):
Host sol-*.fysik.su.se
Port <port>
User <user>
ForwardX11 yes
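With such an entry in place, the port and user no longer need to be given on the command line:
ssh sol-login.fysik.su.se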
Wireguard VPN
Fysikum has started setting up its own VPN infrastructure based on WireGuard. The service is still experimental and intended for those who need remote access to lab infrastructure, file servers and internal license servers.
After installation you need to generate a private-public key pair.
Please send the public key to holger.motzkau at fysik.su.se together with a request describing what you want to access; you will then receive a configuration file.
Stay tuned. More info:
https://www.fysik.su.se/english/staff/it-and-telephony/vpn
Where to store the files?
The cluster offers two storage systems /cfs/home and /cfs/data.
An efficient usage of these requires knowing when to use what.
/cfs/home 110 TiB
used for the home directories
backup via ZFS snapshots (every 10 minutes)
keep your programs, source files and the final results here
mounted via NFS, backend is ZFS
not fast: max write throughput ≈ 700 MB/s
/cfs/data 314 TiB
used for large data storage
no backup (!)
Lustre file system (with ZFS backend)
very fast: max write throughput from 6 to 10 GB/s
POSIX-compatible access control lists (ACLs)
Transferring files
Working with the HPC cluster can involve transferring data back and forth between your local machine.
File transfers between the computers can be done using scp or rsync.
Files from the Internet can be transferred using curl or wget.
sol-login and c03n11 have 10 GbE connections to the Internet.
If you wish to transfer large amounts of data, submit a Slurm job to c03n11 (so as not to choke the Internet connection on the login node).
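For example (a sketch; the paths are placeholders and the non-default port is only needed from outside the SU network):
# copy a local file to your cluster home directory
scp -P <port> results.tar.gz username@sol-login.fysik.su.se:~/
# synchronize a local directory into /cfs/data (rsync can resume interrupted transfers)
rsync -av -e "ssh -p <port>" ./mydata/ username@sol-login.fysik.su.se:/cfs/data/username/mydata/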
Lustre file system
Use ls -l only where absolutely necessary.
Use lls, /bin/ls or lfs find -D 0 * instead
(lls is a system-wide alias for /bin/ls -U).
Lustre file system commands: lfs
Search the directory tree: lfs find
Check your own disk usage: lfs quota /cfs/data
Check available disk space: lfs df -h
Check/modify file stripes: lfs getstripe, lfs setstripe
Check/modify access control lists: getfacl and setfacl
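For large files, striping a directory over several OSTs can improve throughput. A sketch (the directory name is a placeholder; suitable stripe counts and sizes depend on your I/O pattern):
lfs getstripe /cfs/data/username/bigfiles        # show the current striping
lfs setstripe -c 4 -S 4M /cfs/data/username/bigfiles   # stripe new files over 4 OSTs with 4 MiB stripes
lfs quota -h /cfs/data                           # check your quota
lfs df -h /cfs/data                              # check free space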
Access Control Lists
setfacl and getfacl – utilities to set and get the Access Control Lists (ACLs) of files and directories.
Example: give the rwx rights to some users for a directory in /cfs/data
DIR=/cfs/data/username/directory
mkdir $DIR
chmod go= $DIR
USERS="user1 user2 user3"
for user in $USERS; do
    setfacl -R -m u:$user:rwx $DIR
    setfacl -R -m d:u:$user:rwx $DIR
done
getfacl $DIR
Storage performance
The relatively slow rate of I/O operations can create bottlenecks.
Pay attention to how your programs are doing I/O as that can have a huge impact on the run time of your jobs.
Things to remember:
Minimize I/O operations.
Larger I/O operations are more efficient than small ones.
If possible aggregate reads/writes into larger blocks.
Avoid creating too many files.
Post-processing a large number of files can be very hard.
Avoid creating directories with very large numbers of files.
Create directory hierarchies instead (also improves interactivity); see the sketch below.
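A small illustration of the last points (a sketch; the file names and layout are only examples):
# bundle many small result files into one archive instead of keeping thousands of files on /cfs/data
tar -czf results.tar.gz results/
# spread files over a two-level directory hierarchy instead of one flat directory
f=abcdef.dat
mkdir -p results/"${f:0:2}"
mv "$f" results/"${f:0:2}"/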
Working on the login nodes
When you log in to the cluster with ssh, you will land on a designated login node, in your home directory.
Available editors: vim, nano and emacs
Available terminal multiplexers: screen and tmux
Things to remember:
Do not run parallel jobs on the login nodes.
The login nodes are not fast and have a limited memory.
They are shared among the users.
Compiling larger programs can sometimes be faster on the compute nodes.
Installing software on a multi-user HPC system
Typical questions from a user new in the HPC environment:
Can I do sudo? I need to install some software.
What is the root password?
More serious portability and reproducibility questions:
My software does not compile. Some libraries are missing.
After compiling, my software does not work
(it works on my laptop/it worked on another cluster).
Cluster is a specific multi-user system
A multi-user HPC cluster is very different from a single-user’s laptop.
It serves a broad spectrum of users with varying requirements.
Performance of the built software is very important.
Multiple software versions and variants are installed side by side.
Software installations should remain available ‘indefinitely’.
On Fysikum’s HPC cluster, the base OS installation is kept minimal.
The installed software is kept in various package managers.
Available package managers
Lmod environment module files
Miniconda (a small, bootstrap version of Anaconda)
Nix package manager
Singularity containers
also available:
EasyBuild (a build and installation framework for HPC systems)
Spack (a package manager supporting multiple versions, configurations, platforms, and compilers)
CharlieCloud (lightweight user-defined software stacks for high-performance computing)
OCR (open community runtime for shared memory)
Working with modules
The environment module files (‘modules’) allow for dynamic add/remove of installed software packages to the running environment.
Displaying modules
module list
module avail [<name>]
module show <name>
Loading, swapping and unloading modules
module load <names>
module swap <name1> <name2>
module unload <names>
You can also use the alias ml:
ml
ml swap openmpi3 openmpi4
ml Mathematica/12.1
ml nix
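A typical build session might look like this (a sketch; the exact module names, here gcc and openmpi4, depend on what is installed):
ml avail openmpi        # list the available OpenMPI modules
ml gcc openmpi4         # load a compiler and an MPI stack
ml list                 # check what is loaded
mpicc -O2 hello.c -o hello
ml purge                # unload everything when done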
Working with Conda
Miniconda is a small, bootstrap version of Anaconda that includes only conda, Python, the packages they depend on, and a small number of other useful packages, including pip, zlib and a few others.
Using the system-wide conda module
ml conda
which python
Installing conda locally in ~/.local2/bin
mkdir -p ~/miniconda
cd ~/miniconda
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
bash miniconda.sh -b -p ~/.local2
export PATH=~/.local2/bin:$PATH
conda init bash
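Once conda is available, isolated environments can be created per project (a sketch; the environment name and package list are only examples):
conda create -n myenv python=3.8 numpy scipy
conda activate myenv
python -c "import numpy; print(numpy.__version__)"
conda deactivate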
Working with Nix package manager
Nix is a powerful package manager for Linux and other Unix systems that makes package management reliable and reproducible.
Activating Nix package manager
ml nix
nix-env -q
Nix is the default environment on the login node sol-nix
ssh username@sol-nix.fysik.su.se
ml
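For example, packages can be queried and installed into your per-user profile (a sketch; the nixpkgs.hello attribute and the channel name are assumptions):
nix-env -qaP 'hello'          # search the available packages
nix-env -iA nixpkgs.hello     # install into your user profile
hello                         # run the installed program
nix-env -e hello              # remove it again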
Singularity – the full control of your environment
Singularity containers can be used to package entire scientific
workflows, software and libraries, and even data. This means that you don’t have to ask your cluster admin to install anything for you - you can put it in a Singularity container and run it.
Benefits:
Escape “dependency hell”
Local and remote code works identically every time.
Package software and dependencies in one file.
One file contains everything and can be moved anywhere.
Use the same container on different SNIC clusters.
Negligible performance decrease.
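A minimal sketch of the usual workflow (the image is only an example; images can also be built elsewhere and copied over as a single .sif file):
singularity pull docker://ubuntu:20.04                   # creates ubuntu_20.04.sif
singularity exec ubuntu_20.04.sif cat /etc/os-release    # run one command inside the container
singularity shell ubuntu_20.04.sif                       # interactive shell inside the container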
SLURM (Simple Linux Utility for Resource Management)
Slurm – “an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters”
allocates resources (‘compute nodes’) for some duration of time
provides a framework for starting, executing, and monitoring work on the allocated nodes
arbitrates contention for resources by managing a queue
Commands to know:
sbatch, scancel, squeue, salloc, srun, prun/mpirun, xvfb-run
Running batch jobs
To submit a job, use the command sbatch
sbatch <script>
The script can contain data to identify the requested resources:
Example: run myprog for 1 hour on 4 nodes with 12 cores each
#!/bin/bash -l
#SBATCH -J jobname
#SBATCH -t 1:00:00
#SBATCH -p solar
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=12
# (alternative: --ntasks=48)
cat $0
prun -n 48 ./myprog
If you change your mind, use scancel to cancel the job.
Running interactive jobs
Use salloc to obtain a job allocation and then issue commands.
Example: allocate 24 cores, run a command, then relinquish the job
salloc -n 24
hostname
prun hostname
<Ctrl-D>
You can also use srun to run an interactive bash shell.
Example: allocate 2 nodes and run bash interactively on the master node
srun -N 2 --pty bash
hostname
prun hostname
<Ctrl-D>
Monitoring
Slurm commands:
squeue, qtop, scancel, sinfo
Display running jobs that belong to the user
squeue -u $USER # or use: qtop
Stop a running job or remove a pending one from the queue
scancel <jobnumber>
Real-time monitoring tools and accounting info:
HPC-moni: https://it.fysik.su.se/hpc-moni
HPC-ac: https://it.fysik.su.se/hpc-ac
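A few more Slurm one-liners that can be handy (scontrol and sacct are standard Slurm tools; which accounting fields are recorded depends on the site configuration):
sinfo -p solar                          # node states in the solar partition
scontrol show job <jobnumber>           # detailed information about one job
sacct -j <jobnumber> --format=JobID,Elapsed,MaxRSS,State   # accounting summary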
Part III
Available debugging and profiling tools
valgrind, scalasca, scorep, pdtoolkit, tau
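For instance, a serial program might be checked for memory errors like this (a sketch; myprog is a placeholder):
ml valgrind                          # if valgrind is provided as a module
valgrind --leak-check=full ./myprog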
Performance engineering
Example: Tuning the memory performance
STREAM – the standard benchmark for memory performance
Website: https://www.cs.virginia.edu/stream/
The benchmark contains 4 tests (Copy, Scale, Sum, Triad).
An illustration of the triad test (Fortran):
do i = 1, n
   a(i) = b(i) + s*c(i)
end do
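For reference, the benchmark can be built and run roughly like this (a sketch; gcc, the flags and the array size are assumptions, and OMP_PROC_BIND pins the threads as in the ‘OMP (pinned)’ plots below):
wget https://www.cs.virginia.edu/stream/FTP/Code/stream.c
gcc -O3 -fopenmp -DSTREAM_ARRAY_SIZE=80000000 stream.c -o stream
OMP_NUM_THREADS=8 OMP_PROC_BIND=true srun -p cops -n 1 -c 8 ./stream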
Vector processing and SIMD instructions
The CPU instruction sets supported by different partitions:
Instruction set   solar   fermi   cops   qcmd
sse                 •       •      •      •
sse2                •       •      •      •
sse3                •       •      •      •
sse4_1              •       •      •      •
sse4_2              •       •      •      •
sse4a               •              •      •
fma4                •
fma                                •      •
avx                 •              •      •
avx2                               •      •
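You can verify which flags a given node actually reports with standard Linux tools (a sketch):
grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | grep -E '^(sse|avx|fma)' | sort -u
srun -p cops -n 1 lscpu | grep -i flags     # list all CPU flags on a cops compute node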
STREAM Triad performance for different array sizes
[The original slides show Mathematica plots of STREAM Triad memory bandwidth measured on the cops partition; only the plot captions, legends and axes are summarized here:]
cops, STREAM Triad, Memory bandwidth for large array sizes: bandwidth (GB/s) vs number of cores (0-60), MPI vs OMP (pinned), optimization levels 0-3
cops, STREAM Triad, Single task, different optimization levels: bandwidth (GB/s) vs array size (10^3 to 10^9)
cops, STREAM Triad, MPI ranks = 64, different optimization levels: bandwidth (GB/s) vs array size
cops, STREAM Triad, OMP (pinned) threads = 64, different optimization levels: bandwidth (GB/s) vs array size
cops, STREAM Triad, Using OMP (pinned), optimization level = 0: bandwidth (GB/s) vs array size for 1 to 64 threads
cops, STREAM Triad, Using MPI, optimization level = 0: bandwidth (GB/s) vs array size for 1 to 64 ranks
cops, STREAM Triad, Using OMP (pinned), optimization level = 1: bandwidth (GB/s) vs array size for 1 to 64 threads
cops, STREAM Triad, Using MPI, optimization level = 1: bandwidth (GB/s) vs array size for 1 to 64 ranks