Parallel Stream Computing on Many-Core
Architectures
Nicolas Melot
Linköping University
Dept. of Computer and Inf. Science
Linköping, Sweden
Outline
1. Introduction
2. Crown Scheduling
3. Drake
High performance computing (p. 1)
Constant struggle for performance: the Hollerith census machine
Census every 10 years
Tabulating the 1880 census took 8 years
With Hollerith's machine, the 1890 census took 1 year
Picture: “HollerithMachine.CHM” by Adam Schuster, Flickr: Proto IBM. Licensed under CC BY 2.0 via Wikimedia Commons.
http://commons.wikimedia.org/wiki/File:HollerithMachine.CHM.jpg#/media/File:HollerithMachine.CHM.jpg
High performance computing (p. 1)
Constant struggle for performance: big data
Social media
Internet of things
Applications:
Scientific computing
Marketing
Intelligence
Picture: “View inside detector at the CMS cavern LHC CERN” by Tighef, own work. Licensed under CC BY-SA 3.0 via Wikimedia Commons.
http://commons.wikimedia.org/wiki/File:View_inside_detector_at_the_CMS_cavern_LHC_CERN.jpg#/media/File:View_inside_detector_at_the_CMS_cavern_LHC_CERN.jpg
Accelerate computation (pp. 1–5)
How to improve performance?
Miniaturize: end of Moore’s law?
Increase frequency: too high energy consumption
Parallel programming: better energy efficiency, but very challenging
Instruction-level parallelism wall
Scalability issues for consistent shared memory
Von Neumann bottleneck
Von Neumann Bottleneck
Access to main memory is expensive (Drepper [2007])
Figure: memory hierarchy of a multicore chip: per-core registers (R), private L1 and L2 SRAM caches, a shared L3 SRAM cache, and 16 GB of off-chip DRAM.
Stream programming (pp. 11–15)
Organize computation in nodes and channels (Kahn [1974])
Easier to reason about
Pipeline parallelism
No need for shared memory consistency
Opportunities to reduce main memory accesses
On-chip pipelining
Applications
Signal or graphics processing
Large data sets (“big data”)
Online computation
Figure: a streaming application of 5 nodes and 6 channels.
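As a minimal illustration of nodes and channels (a sketch over POSIX threads with assumed names and sizes, not Drake's API): one producer node streams tokens to one consumer node through a bounded FIFO channel.

/* Minimal sketch of one stream channel between two nodes: a producer
 * writes tokens into a bounded FIFO, a consumer reads them. Names and
 * sizes are illustrative assumptions. */
#include <pthread.h>
#include <stdio.h>

#define CAPACITY 8   /* channel buffer size (tokens) */
#define TOKENS   32  /* tokens to stream in this example */

static int buf[CAPACITY];
static int head, tail, count;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t not_full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;

static void channel_push(int token)
{
	pthread_mutex_lock(&lock);
	while (count == CAPACITY)        /* block while the channel is full */
		pthread_cond_wait(&not_full, &lock);
	buf[tail] = token;
	tail = (tail + 1) % CAPACITY;
	count++;
	pthread_cond_signal(&not_empty);
	pthread_mutex_unlock(&lock);
}

static int channel_pop(void)
{
	pthread_mutex_lock(&lock);
	while (count == 0)               /* block while the channel is empty */
		pthread_cond_wait(&not_empty, &lock);
	int token = buf[head];
	head = (head + 1) % CAPACITY;
	count--;
	pthread_cond_signal(&not_full);
	pthread_mutex_unlock(&lock);
	return token;
}

static void *producer_node(void *arg)
{
	for (int i = 0; i < TOKENS; i++)
		channel_push(i * i);         /* the node's computation */
	return NULL;
}

static void *consumer_node(void *arg)
{
	for (int i = 0; i < TOKENS; i++)
		printf("%d\n", channel_pop());
	return NULL;
}

int main(void)
{
	pthread_t prod, cons;
	pthread_create(&prod, NULL, producer_node, NULL);
	pthread_create(&cons, NULL, consumer_node, NULL);
	pthread_join(prod, NULL);
	pthread_join(cons, NULL);
	return 0;
}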
Streaming
Straightforward: communication through shared off-chip main memory
Easy to implement
Main memory bandwidth becomes the performance bottleneck
On-chip pipelining: communication through small on-chip memories (Keller et al. [2012])
Trade core-to-core communication for reduced off-chip memory accesses
Figure: communication through shared off-chip main memory versus on-chip pipelining between cores.
Many-core architectures (pp. 15–24)
Embed more and more cores into a single chip
High throughput rather than low latency
Relaxed memory consistency models
Explicit use of on-chip network
Small, low-latency on-chip memories
Big, high-latency off-chip memories
Discrete voltage and frequency scaling
Sleeping states
Examples: SCC, MPPA, Tilera, Xeon Phi, Epiphany.
Figure: SCC die: a mesh of tiles, each tile holding two P54C cores (16 KiB L1 and 256 KiB L2 per core), a 16 KiB message-passing buffer (MPB), a traffic generator and a mesh interface with router (R); memory controllers (MC) connect the mesh to off-chip DIMMs.
Contribution summary
Early work: microbenchmarking and comparison (pp. 28–57)
Distance between cores matters (Avdic et al. [2011]).
Distance to main memory controllers matters (Melot et al. [2011]).
Speedup from on-chip pipelined mergesort over classic mergesort on SCC (Avdic et al. [2011]).
In this seminar:
Crown Scheduling for parallel streaming applications on many-core (Kessler et al. [2013], Melot et al. [2015b]) (pp. 58–132).
Drake: a retargetable C streaming framework for many-core
Streaming computation (pp. 59–61)
Streaming
Software pipelining
Tasks execute in parallel in steady state
Static scheduling
Moldable tasks
Steady state
Throughput constraint
Optimize energy
Streaming Task Collection
Independent tasks
Balance workload
No communication cost
Figure: streaming taskgraph.
Figure: steady state of the streaming pipeline.
Platform model (p. 62)
Platform
p uniform processors
Discrete frequency set F
Applied to individual cores
Voltage co-scales automatically with frequency
Can change dynamically at any time
Power model
Dynamic power as a function of frequency
Analytic function or measurements
No restriction on its shape
Replaceable (see the sketch below)
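A minimal, replaceable sketch of such a power model (the constants and the cubic shape are assumptions for illustration, not the thesis' calibrated model):

/* Illustrative, replaceable power model: constant static power while
 * the core is on, plus a dynamic part growing cubically with frequency. */
#include <stdio.h>

static double core_power(double f_ghz)
{
	const double p_static = 0.5;  /* W: assumed static power        */
	const double kappa    = 1.2;  /* W/GHz^3: assumed dynamic factor */
	return p_static + kappa * f_ghz * f_ghz * f_ghz;
}

int main(void)
{
	for (double f = 0.5; f <= 2.0; f += 0.5)
		printf("P(%.1f GHz) = %.2f W\n", f, core_power(f));
	return 0;
}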
Voltage & Frequency co-Scaling (pp. 16–17)
Voltage and Frequency Co-scaling
Must increase voltage when increasing frequency
Energy penalty when maintaining a high voltage at a low frequency
Figure: voltage (V) and frequency (Hz) over time: the minimum voltage for the current frequency, and the maximal frequency for the current voltage.
Task model (pp. 62–63)
Moldable task j
Fixed work τ_j
Allocation: run on w_j ≥ 1 cores
Maximum width W_j: w_j ≤ W_j
Arbitrary efficiency function: 0 < e_j(q) ≤ 1 for 1 ≤ q ≤ W_j
No convexity, monotonicity or continuity constraint
Running time proportional to work and inversely proportional to parallel degree and frequency: t_j = τ_j / (q · e_j(q) · f), where q = w_j (see the sketch below)
Figure: an arbitrary efficiency function e_j(q) for 1 ≤ q ≤ p, with values in (0, 1].
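As a sketch in code (illustrative names, directly transcribing the formula above):

/* Running time of moldable task j: work tau_j divided by parallel
 * degree q = w_j, parallel efficiency e_j(q) and frequency f. */
double task_time(double tau_j, int q, double e_jq, double f)
{
	return tau_j / (q * e_jq * f);
}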
Problem formulation (p. 61)
3 static problems:
Resource allocation
Find w_j ≤ min(p, W_j) for each task j
Defines the execution time of task j
Task mapping to cores
Assign tasks to a subset of cores 1..p
Discrete frequency scaling
Assign each task a frequency level in F
Respect a makespan constraint M
Repeated execution of a task sequence
Tasks whose data is not ready are delayed to the next round
All steps influence each other (see the constraint sketch below)
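A sketch of the makespan constraint that couples the three subproblems (notation from the task model; $f_j$ is the frequency level assigned to task $j$):

\[
\forall i \in \{1,\dots,p\}: \qquad \sum_{j\ \text{mapped to core}\ i} \frac{\tau_j}{w_j \, e_j(w_j) \, f_j} \;\le\; M
\]

Allocation fixes $w_j$, mapping decides which tasks contribute to each core's sum, and scaling fixes $f_j$, so no step can be decided in isolation.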
Crown scheduling (pp. 62–65)
Restrict allocation and mapping to O(p) processor subsets (groups)
Figure: binary crown over p = 8 cores P1..P8: 15 groups G1..G15, from the root group G1 (all 8 cores) through the intermediate groups G2..G7 down to the single-core groups G8..G15.
Tasks must be allocated as many cores as the size of a group (see the sketch below).
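The groups of a balanced binary crown can be enumerated mechanically. A minimal sketch (assuming the numbering of the figure: G1 is the root group over all p cores and G_p..G_{2p-1} are the single-core groups):

/* Sketch: enumerate the groups of a balanced binary crown for p = 2^k
 * cores, assuming G1 = all cores, G2/G3 its halves, ..., and
 * G_p..G_{2p-1} = single cores. */
#include <stdio.h>

int main(void)
{
	const int p = 8;                  /* number of cores (a power of 2) */
	for (int g = 1; g < 2 * p; g++) {
		int level = 0;                /* depth of group g in the tree */
		while ((1 << (level + 1)) <= g)
			level++;
		int size  = p >> level;       /* cores per group at this level */
		int first = (g - (1 << level)) * size;
		printf("G%-2d = cores P%d..P%d (%d cores)\n",
		       g, first + 1, first + size, size);
	}
	return 0;
}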
Crown scheduling (pp. 62–66)
Computing a crown schedule
Separated or integrated phases
Crown
configuration
Crown
allocation
Crown
mapping
Crown
scaling
Dynamic
Crown rescaling
Integrated Crown Scheduling
ILP formulations for each step and for integrated scheduler
by
(Kessler et al. [2013])
Phase separation prevents compromises
Phase integrated constrained by the crown structure
Slow and limited in input problem size
Crown Extensions (pp. 76–77, 88–92)
Adapt to realistic processors: p ≠ 2^i
Constraints: frequency islands
Crown Configuration
Explicit description matrices (see the sketch after this list)
Core group membership
Group hierarchy
Crown Consolidation
Account for idle energy
Switch unused cores off
Provable approximation
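A sketch of such an explicit description (the exact representation is an assumption): a boolean core-membership matrix, here for a small crown over p = 6 cores, a size no balanced binary crown can express since 6 ≠ 2^i.

/* Sketch of an explicit crown description: membership[g][c] = 1 iff
 * core c belongs to group g, for a 3-level crown on p = 6 cores
 * (root group, two 3-core islands, six single cores). */
#include <stdio.h>

#define P 6   /* cores  */
#define G 9   /* groups */

static const int membership[G][P] = {
	{1,1,1,1,1,1},                    /* G1: root group        */
	{1,1,1,0,0,0}, {0,0,0,1,1,1},     /* G2,G3: 3-core islands */
	{1,0,0,0,0,0}, {0,1,0,0,0,0}, {0,0,1,0,0,0},
	{0,0,0,1,0,0}, {0,0,0,0,1,0}, {0,0,0,0,0,1},
};

int main(void)
{
	for (int g = 0; g < G; g++) {
		printf("G%d:", g + 1);
		for (int c = 0; c < P; c++)
			if (membership[g][c])
				printf(" P%d", c + 1);
		printf("\n");
	}
	return 0;
}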
Figure: crown configuration for p = 6 cores P1..P6, each with private L1 and L2 caches, organized into groups G1..G11.
Voltage Islands topology influence (pp. 160–163)
Figure: energy quality of schedules for tight, average and loose task classes: the integrated scheduler (Integ.), phase-separated variants (Fast or Bin configuration; ILP, Bal.ILP or LTLG allocation and mapping; ILP or Height scaling; with and without annealing), and the comparison schedulers of Pruhs [2008] (NLP and heuristic) and Xu [2012] (ILP).
Example: Mergesort (pp. 124, 163–164)
Developed by John von Neumann in 1945 (Knuth [1998])
External algorithm
Limits use of slow memories
Stream program: tree structure
Leaves: presort tasks (not streamed)
Other nodes: merge tasks (streamed)
Root task: biggest workload
2nd-level tasks: half the workload of the root task (see the sketch below)
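This halving follows from the tree structure: assuming every merge task forwards all the data it receives, a merge task at depth $d$ handles a fraction $1/2^d$ of the stream, so

\[
\tau(d) = \frac{\tau_{\mathrm{root}}}{2^d},
\]

and each tree level carries the same total merge work $2^d \cdot \tau(d) = \tau_{\mathrm{root}}$.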
Island-aware scheduling: mergesort (p. 124)
Baseline: all cores in one voltage island (100% energy)
2 islands of 16 cores: 47% energy reduction
Individual cores: 30% energy reduction
Islands of 2 cores: 33% energy reduction
Crown Scheduling: energy (pp. 96,98)
Integrated crown schedulers are always best (lower is better)
Phase-separated heuristic variants are close to the ILP variants
Phase-integrated heuristics are close to the integrated ILP
Pruhs et al. [2008] is as good as integrated scheduling for sequential tasks
Lessons learned
Scheduling parallel streaming applications with throughput constraints on many-core homogeneous platforms
Island topology: energy reduction of 30% to 47%
Consolidation: up to 91%
Good solutions despite the crown constraints
A good allocation is expensive to compute
Future work
Memory-islands-aware scheduling
Resource sharing: memory bandwidth
Quasi-static scheduling
Integrate communication costs
Accelerate smart allocation
Experiment on real architectures
Heterogeneous platforms
Drake (pp. 134–135)
Stream programming framework
On-chip pipelining
Moldable tasks
Frequency scaling
Scheduling experiments (Melot et al. [2015a])
Drake: derived from Schedeval (Janzén [2014])
Separate roles in an application
Stream topology
Tasks’ source code
Target platform-specifics
Host application
Drake C Streaming Framework (pp. 134–136)
Takes code to execute on the target platform
Application-specific: mergesort, FFT, etc.
Platform-specific: SCC, Xeon, MPI, etc.
Message passing
Frequency scaling
Generates an executable with monitoring.
Figure: Drake toolchain: the user writes the taskgraph, the Drake modules (C/C++ code) and the main C/C++ program; the static scheduler produces a static schedule from the taskgraph and a platform description; Draketools, the C/C++ compiler and the linker combine everything with the Drake framework library and a platform-specific backend plugin into the final executable binary.
Drake: Stream topology (p. 138)
<node id="n0">
  <data key="v_name">root</data>
  <data key="v_module">root</data>
  <data key="v_workload">2</data>
  <data key="v_max_width">1</data>
  <data key="v_efficiency">exprtk: p == 1 ? 1 : p &lt;= W ? 1 - 0.3 * ((p * p) / (W * W)) : 1e-06</data>
</node>
<node id="n1">
  <data key="v_name">leaf_1</data>
  <data key="v_module">presort</data>
  <data key="v_workload">400</data>
  <data key="v_max_width">1</data>
  <data key="v_efficiency">exprtk: p == 1 ? 1 : p &lt;= W ? 1 - 0.3 * ((p * p) / (W * W)) : 1e-06</data>
</node>
<node id="n2">
  <data key="v_name">leaf_2</data>
  <data key="v_module">presort</data>
  <data key="v_workload">400</data>
  <data key="v_max_width">1</data>
  <data key="v_efficiency">exprtk: p == 1 ? 1 : p &lt;= W ? 1 - 0.3 * ((p * p) / (W * W)) : 1e-06</data>
</node>
<edge source="n1" target="n0">
  <data key="e_consumer_name">left</data>
  <data key="e_producer_name">output</data>
  <data key="e_type">int</data>
  <data key="e_producer_rate">1</data>
  <data key="e_consumer_rate">1</data>
</edge>
<edge source="n2" target="n0">
  <data key="e_consumer_name">right</data>
  <data key="e_producer_name">output</data>
  <data key="e_type">int</data>
  <data key="e_producer_rate">1</data>
  <data key="e_consumer_rate">1</data>
</edge>

Figure: taskgraph: leaf_1 and leaf_2 feed root.
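The v_efficiency attribute above is an exprtk expression evaluated by the scheduler; transcribed to C it reads:

/* Efficiency expression from the taskgraph above: perfect on one core,
 * a quadratic penalty up to the maximum width W, near zero beyond W. */
double efficiency(int p, int W)
{
	if (p == 1)
		return 1.0;
	if (p <= W)
		return 1.0 - 0.3 * ((double)(p * p) / (double)(W * W));
	return 1e-06;
}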
Drake: Task code (p. 141)
#include <drake.h>

/* One-time task initialization; return 1 on success. */
int drake_init(task_t *task, void *aux)
{
	return 1;
}

/* Called before the task's first execution in the steady state. */
int drake_start(task_t *task)
{
	return 1;
}

/* Called at every pipeline round; here the task ends as soon as its
   input channels are depleted. */
int drake_run(task_t *task)
{
	return drake_task_depleted(task);
}

/* Called when the task terminates. */
int drake_kill(task_t *task)
{
	return 1;
}

/* Final cleanup when the task is destroyed. */
int drake_destroy(task_t *task)
{
	return 1;
}
Drake: host application (p. 142)
#include <drake.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
	/* Initialize the platform backend, then create, initialize and run
	   the scheduled stream before tearing everything down. */
	drake_platform_t stream = drake_platform_init(NULL);
	drake_platform_stream_create(stream, APPLICATION);
	drake_platform_stream_init(stream, NULL);
	drake_platform_stream_run(stream);
	drake_platform_stream_destroy(stream);
	drake_platform_destroy(stream);
	return EXIT_SUCCESS;
}
Drake: microbenchmarking (pp. 151–154)
Ping-pong application
Two tasks, ping and pong, exchange tokens for a time period t
No computation
Measures overhead and overhead-hiding capabilities
Local and remote core placements
RCCE for comparison
Figure: message round-trip time [µs] as a function of the number of simultaneous ping-pongs (1 to 8). A measurement sketch follows below.
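A sketch of the measurement loop (illustrative; not Drake's or RCCE's benchmark code; send_token and recv_token stand for the platform's communication primitives and are stubbed out here):

/* Time n token round trips and report the mean round-trip time. */
#include <stdio.h>
#include <time.h>

static void send_token(void) { /* stub: push one token to the peer */ }
static void recv_token(void) { /* stub: wait for the reply token   */ }

static double mean_rtt_us(int rounds)
{
	struct timespec t0, t1;
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < rounds; i++) {
		send_token();  /* ping */
		recv_token();  /* pong */
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	double us = (t1.tv_sec - t0.tv_sec) * 1e6
	          + (t1.tv_nsec - t0.tv_nsec) * 1e-3;
	return us / rounds;
}

int main(void)
{
	printf("mean RTT: %f us\n", mean_rtt_us(1000000));
	return 0;
}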
Drake vs TBB: results (pp. 163–171)
Sorting time
Drake is good with few cores
As good as TBB quicksort (QS) with 12 cores
TBB mergesort (MS) with sequential merge is worse
TBB MS is faster with 12 cores
Drake vs TBB: results (pp. 163–171)
Energy consumption
Drake is better than TBB QS
TBB MS with sequential merge is worst
TBB MS is still better
Conclusion
Challenges in high-performance parallel computing
Programming difficulties
Lack of scalability of architectures
Von Neumann bottleneck
Stream programming
Mature research field
No need for coherent shared memory
Helps reduce the von Neumann bottleneck
Our contributions
On-chip pipelining
Crown Scheduling for Moldable Streaming tasks
Drake Streaming Framework
Future work
Investigate more scheduling techniques
Minimize communication costs
Schedule sleeping modes
Use Drake to experiment on real architectures
Port Drake to other architectures (Epiphany, Myriad 2, etc.)
Implement Synchronous Data Flow for Drake
Optimize on-chip memory usage
Schedule main memory accesses
Questions
Bibliography
K. Avdic, N. Melot, C. Kessler, and J. Keller. Parallel sorting on Intel Single-Chip Cloud Computer. In Proc. A4MMC workshop on applications for multi- and many-core processors at ISCA-2011, 2011. ISBN 978-3-86644-717-2. URL http://liu.diva-portal.org/smash/record.jsf?pid=diva2%3A624512.
Ulrich Drepper. What every programmer should know about memory, 2007. URL http://www.akkadia.org/drepper/cpumemory.pdf.
Johan Janzén. Evaluation of energy-optimizing scheduling algorithms for streaming computations on massively parallel multicore architectures. Master’s thesis, Linköping University, 2014. URL http://liu.diva-portal.org/smash/record.jsf?pid=diva2%3A756758.
Gilles Kahn. The semantics of a simple language for parallel programming. In IFIP Congress, pages 471–475, 1974.
Jörg Keller, Christoph Kessler, and Rikard Hulten. Optimized on-chip-pipelining for memory-intensive computations on multi-core processors with explicit memory hierarchy. Journal of Universal Computer Science, 18(14):1987–2023, 2012.
Christoph Kessler, Nicolas Melot, Patrick Eitschberger, and Jörg Keller. Crown Scheduling: Energy-Efficient Resource Allocation, Mapping and Discrete Frequency Scaling for Collections of Malleable Streaming Tasks. In Proc. of 23rd Int. Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS 2013), 2013.
Donald E. Knuth. The Art of Computer Programming, Volume 3: Sorting and Searching (2nd ed.). Addison Wesley Longman Publishing Co., Inc., Redwood City, CA, USA, 1998. ISBN 0-201-89685-0.
N. Melot, K. Avdic, C. Kessler, and J. Keller. Investigation of main memory bandwidth on Intel Single-Chip Cloud Computer. Intel MARC3 Symposium 2011, Ettlingen, 2011. URL http://liu.diva-portal.org/smash/record.jsf?pid=diva2%3A664296.
Nicolas Melot, Johan Janzén, and Christoph Kessler. Mimer and Schedeval: Tools for Comparing Static Schedulers for Streaming Applications on Manycore Architectures. In Parallel Processing Workshops (ICPPW), IEEE, pages 146–155, Sept 2015a. doi: 10.1109/ICPPW.2015.24.
Nicolas Melot, Christoph Kessler, Jörg Keller, and Patrick Eitschberger. Fast Crown Scheduling Heuristics for Energy-Efficient Mapping and Scaling of Moldable Streaming Tasks on Manycore Systems. ACM Trans. Archit. Code Optim., 11(4):62:1–62:24, January 2015b. ISSN 1544-3566. doi: 10.1145/2687653. URL http://doi.acm.org/10.1145/2687653.
Kirk Pruhs, Rob van Stee, and Patchrawat Uthaisombut. Speed Scaling of Tasks with Precedence Constraints. Theory of Computing Systems, 43(1):67–80, July 2008.
Huiting Xu, Fanxin Kong, and Qingxu Deng. Energy Minimizing for Parallel Real-Time Tasks Based on Level-Packing. In 18th Int. Conf. on Emb. and Real-Time Comput. Syst. and Appl. (RTCSA), pages 98–103, Aug 2012.