Parallel Stream Computing on Many-Core
Architectures
Nicolas Melot
Linköping University
Dept. of Computer and Inf. Science
Linköping, Sweden
Outline
1. Introduction
2. Crown Scheduling
3. Drake
High performance computing (p. 1)
Constant struggle for performance: the Hollerith census machine
Census every 10 years
Tabulating the 1880 census took 8 years
With Hollerith's machine, the 1890 census took 1 year
Picture: “HollerithMachine.CHM” by Adam Schuster, Flickr: Proto IBM. Licensed under CC BY 2.0 via Wikimedia Commons.
http://commons.wikimedia.org/wiki/File:HollerithMachine.CHM.jpg#/media/File:HollerithMachine.CHM.jpg
High performance computing (p. 1)
Constant struggle for performance: big data
Social media
Internet of things
Applications:
Scientific computing
Marketing
Intelligence
Picture: “View inside detector at the CMS cavern LHC CERN” by Tighef, own work. Licensed under CC BY-SA 3.0 via Wikimedia Commons.
http://commons.wikimedia.org/wiki/File:View_inside_detector_at_the_CMS_cavern_LHC_CERN.jpg#/media/File:View_inside_detector_at_the_CMS_cavern_LHC_CERN.jpg
Accelerate computation (pp. 1–5)
How to improve performance?
Miniaturize: end of Moore’s law?
Increase frequency: too high energy consumption
Parallel programming: better energy efficiency, but very challenging
Instruction-level parallelism wall
Scalability issues for consistent shared memory
Von Neumann bottleneck
Von Neumann Bottleneck
Access to main memory is expensive (Drepper [2007])
Figure: memory hierarchy of a multicore chip: per-core registers (R), private L1 and L2 SRAM caches, a shared L3 SRAM cache, and 16 GB of off-chip DRAM.
Stream programming (pp. 11–15)
Organize computation in nodes and channels (Kahn [1974])
Easier to reason about
Pipeline parallelism
No need for shared memory consistency
Opportunities to reduce main memory accesses
On-chip pipelining
Applications
Signal or graphics processing
Large data sets (“big data”)
Online computation
Figure: a streaming application of 5 nodes and 6 channels.
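As a minimal illustration of nodes and channels (a sketch over POSIX threads with assumed names and sizes, not Drake's API): one producer node streams tokens to one consumer node through a bounded FIFO channel.

/* Minimal sketch of one stream channel between two nodes: a producer
 * writes tokens into a bounded FIFO, a consumer reads them. Names and
 * sizes are illustrative assumptions. */
#include <pthread.h>
#include <stdio.h>

#define CAPACITY 8   /* channel buffer size (tokens) */
#define TOKENS   32  /* tokens to stream in this example */

static int buf[CAPACITY];
static int head, tail, count;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t not_full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;

static void channel_push(int token)
{
	pthread_mutex_lock(&lock);
	while (count == CAPACITY)        /* block while the channel is full */
		pthread_cond_wait(&not_full, &lock);
	buf[tail] = token;
	tail = (tail + 1) % CAPACITY;
	count++;
	pthread_cond_signal(&not_empty);
	pthread_mutex_unlock(&lock);
}

static int channel_pop(void)
{
	pthread_mutex_lock(&lock);
	while (count == 0)               /* block while the channel is empty */
		pthread_cond_wait(&not_empty, &lock);
	int token = buf[head];
	head = (head + 1) % CAPACITY;
	count--;
	pthread_cond_signal(&not_full);
	pthread_mutex_unlock(&lock);
	return token;
}

static void *producer_node(void *arg)
{
	for (int i = 0; i < TOKENS; i++)
		channel_push(i * i);         /* the node's computation */
	return NULL;
}

static void *consumer_node(void *arg)
{
	for (int i = 0; i < TOKENS; i++)
		printf("%d\n", channel_pop());
	return NULL;
}

int main(void)
{
	pthread_t prod, cons;
	pthread_create(&prod, NULL, producer_node, NULL);
	pthread_create(&cons, NULL, consumer_node, NULL);
	pthread_join(prod, NULL);
	pthread_join(cons, NULL);
	return 0;
}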
Streaming
Straightforward: communication through shared off-chip main memory
Easy to implement
Main memory bandwidth becomes the performance bottleneck
On-chip pipelining: communication through small on-chip memories (Keller et al. [2012])
Trade core-to-core communication for reduced off-chip memory accesses
Figure: communication through shared off-chip main memory versus on-chip pipelining between cores.
Many-core architectures (pp. 15–24)
Embed more and more cores into a single chip
High throughput rather than low latency
Relaxed memory consistency models
Explicit use of on-chip network
Small, low-latency on-chip memories
Big, high-latency off-chip memories
Discrete voltage and frequency scaling
Sleeping states
Examples: SCC, MPPA, Tilera, Xeon Phi, Epiphany.
Figure: SCC die: a mesh of tiles, each tile holding two P54C cores (16 KiB L1 and 256 KiB L2 per core), a 16 KiB message-passing buffer (MPB), a traffic generator and a mesh interface with router (R); memory controllers (MC) connect the mesh to off-chip DIMMs.
Contribution summary
Early work: microbenchmarking and comparison (pp. 28–57)
Distance between cores matters (Avdic et al. [2011]).
Distance to main memory controllers matters (Melot et al. [2011]).
Speedup from on-chip pipelined mergesort over classic mergesort on SCC (Avdic et al. [2011]).
In this seminar:
Crown Scheduling for parallel streaming applications on many-core (Kessler et al. [2013], Melot et al. [2015b]) (pp. 58–132).
Drake: a retargetable C streaming framework for many-core
Streaming computation (pp. 59–61)
Streaming
Software pipelining
Tasks execute in parallel in steady state
Static scheduling
Moldable tasks
Steady state
Throughput constraint
Optimize energy
Streaming Task Collection
Independent tasks
Balance workload
No communication cost
Figure: streaming taskgraph.
Figure: steady state of the streaming pipeline.
Platform model (p. 62)
Platform
p uniform processors
Discrete frequency set F
Applied to individual cores
Voltage co-scales automatically with frequency
Can change dynamically at any time
Power model
Dynamic power as a function of frequency
Analytic function or measurements
No restriction on its shape
Replaceable (see the sketch below)
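A minimal, replaceable sketch of such a power model (the constants and the cubic shape are assumptions for illustration, not the thesis' calibrated model):

/* Illustrative, replaceable power model: constant static power while
 * the core is on, plus a dynamic part growing cubically with frequency. */
#include <stdio.h>

static double core_power(double f_ghz)
{
	const double p_static = 0.5;  /* W: assumed static power        */
	const double kappa    = 1.2;  /* W/GHz^3: assumed dynamic factor */
	return p_static + kappa * f_ghz * f_ghz * f_ghz;
}

int main(void)
{
	for (double f = 0.5; f <= 2.0; f += 0.5)
		printf("P(%.1f GHz) = %.2f W\n", f, core_power(f));
	return 0;
}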
Voltage & Frequency co-Scaling (pp. 16–17)
Voltage and Frequency Co-scaling
Must increase voltage when increasing frequency
Energy penalty when maintaining a high voltage at a low frequency
Figure: voltage (V) and frequency (Hz) over time: the minimum voltage for the current frequency, and the maximal frequency for the current voltage.
Task model (pp. 62–63)
Moldable task j
Fixed work τ_j
Allocation: run on w_j ≥ 1 cores
Maximum width W_j: w_j ≤ W_j
Arbitrary efficiency function: 0 < e_j(q) ≤ 1 for 1 ≤ q ≤ W_j
No convexity, monotonicity or continuity constraint
Running time proportional to work and inversely proportional to parallel degree and frequency: t_j = τ_j / (q · e_j(q) · f), where q = w_j (see the sketch below)
Figure: an arbitrary efficiency function e_j(q) for 1 ≤ q ≤ p, with values in (0, 1].
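As a sketch in code (illustrative names, directly transcribing the formula above):

/* Running time of moldable task j: work tau_j divided by parallel
 * degree q = w_j, parallel efficiency e_j(q) and frequency f. */
double task_time(double tau_j, int q, double e_jq, double f)
{
	return tau_j / (q * e_jq * f);
}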
Problem formulation (p. 61)
3 static problems:
Resource allocation
Find w_j ≤ min(p, W_j) for each task j
Defines the execution time of task j
Task mapping to cores
Assign tasks to a subset of cores 1..p
Discrete frequency scaling
Assign each task a frequency level in F
Respect a makespan constraint M
Repeated execution of a task sequence
Tasks whose data is not ready are delayed to the next round
All steps influence each other (see the constraint sketch below)
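A sketch of the makespan constraint that couples the three subproblems (notation from the task model; $f_j$ is the frequency level assigned to task $j$):

\[
\forall i \in \{1,\dots,p\}: \qquad \sum_{j\ \text{mapped to core}\ i} \frac{\tau_j}{w_j \, e_j(w_j) \, f_j} \;\le\; M
\]

Allocation fixes $w_j$, mapping decides which tasks contribute to each core's sum, and scaling fixes $f_j$, so no step can be decided in isolation.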
Crown scheduling (pp. 62–65)
Restrict allocation and mapping to O(p) processor subsets (groups)
Figure: binary crown over p = 8 cores P1..P8: 15 groups G1..G15, from the root group G1 (all 8 cores) through the intermediate groups G2..G7 down to the single-core groups G8..G15.
Tasks must be allocated as many cores as the size of a group (see the sketch below).
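The groups of a balanced binary crown can be enumerated mechanically. A minimal sketch (assuming the numbering of the figure: G1 is the root group over all p cores and G_p..G_{2p-1} are the single-core groups):

/* Sketch: enumerate the groups of a balanced binary crown for p = 2^k
 * cores, assuming G1 = all cores, G2/G3 its halves, ..., and
 * G_p..G_{2p-1} = single cores. */
#include <stdio.h>

int main(void)
{
	const int p = 8;                  /* number of cores (a power of 2) */
	for (int g = 1; g < 2 * p; g++) {
		int level = 0;                /* depth of group g in the tree */
		while ((1 << (level + 1)) <= g)
			level++;
		int size  = p >> level;       /* cores per group at this level */
		int first = (g - (1 << level)) * size;
		printf("G%-2d = cores P%d..P%d (%d cores)\n",
		       g, first + 1, first + size, size);
	}
	return 0;
}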
Crown scheduling (pp. 62–66)
Computing a crown schedule
Separated or integrated phases
Crown
configuration
Crown
allocation
Crown
mapping
Crown
scaling
Dynamic
Crown rescaling
Integrated Crown Scheduling
ILP formulations for each step and for integrated scheduler
by
(Kessler et al. [2013])
Phase separation prevents compromises
Phase integrated constrained by the crown structure
Slow and limited in input problem size
Crown Extensions (pp. 76–77, 88–92)
Adapt to realistic processors: p ≠ 2^i
Constraints: frequency islands
Crown Configuration
Explicit description matrices (see the sketch after this list)
Core group membership
Group hierarchy
Crown Consolidation
Account for idle energy
Switch unused cores off
Provable approximation
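A sketch of such an explicit description (the exact representation is an assumption): a boolean core-membership matrix, here for a small crown over p = 6 cores, a size no balanced binary crown can express since 6 ≠ 2^i.

/* Sketch of an explicit crown description: membership[g][c] = 1 iff
 * core c belongs to group g, for a 3-level crown on p = 6 cores
 * (root group, two 3-core islands, six single cores). */
#include <stdio.h>

#define P 6   /* cores  */
#define G 9   /* groups */

static const int membership[G][P] = {
	{1,1,1,1,1,1},                    /* G1: root group        */
	{1,1,1,0,0,0}, {0,0,0,1,1,1},     /* G2,G3: 3-core islands */
	{1,0,0,0,0,0}, {0,1,0,0,0,0}, {0,0,1,0,0,0},
	{0,0,0,1,0,0}, {0,0,0,0,1,0}, {0,0,0,0,0,1},
};

int main(void)
{
	for (int g = 0; g < G; g++) {
		printf("G%d:", g + 1);
		for (int c = 0; c < P; c++)
			if (membership[g][c])
				printf(" P%d", c + 1);
		printf("\n");
	}
	return 0;
}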
Figure: crown configuration for p = 6 cores P1..P6, each with private L1 and L2 caches, organized into groups G1..G11.
Voltage Islands topology influence (pp. 160–163)
Figure: energy quality of schedules for tight, average and loose task classes: the integrated scheduler (Integ.), phase-separated variants (Fast or Bin configuration; ILP, Bal.ILP or LTLG allocation and mapping; ILP or Height scaling; with and without annealing), and the comparison schedulers of Pruhs [2008] (NLP and heuristic) and Xu [2012] (ILP).
Example: Mergesort (pp. 124, 163–164)
Developed by John von Neumann in 1945 (Knuth [1998])
External algorithm
Limits use of slow memories
Stream program: tree structure
Leaves: presort tasks (not streamed)
Other nodes: merge tasks (streamed)
Root task: biggest workload
2nd-level tasks: half the workload of the root task (see the sketch below)
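This halving follows from the tree structure: assuming every merge task forwards all the data it receives, a merge task at depth $d$ handles a fraction $1/2^d$ of the stream, so

\[
\tau(d) = \frac{\tau_{\mathrm{root}}}{2^d},
\]

and each tree level carries the same total merge work $2^d \cdot \tau(d) = \tau_{\mathrm{root}}$.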
Island-aware scheduling: mergesort (p. 124)
Baseline: all cores in one voltage island (100% energy)
2 islands of 16 cores: 47% energy reduction
Individual cores: 30% energy reduction
Islands of 2 cores: 33% energy reduction
Crown Scheduling: energy (pp. 96,98)
Integrated crown schedulers are always best (lower is better)
Phase-separated heuristic variants are close to the ILP variants
Phase-integrated heuristics are close to the integrated ILP
Pruhs et al. [2008] is as good as integrated scheduling for sequential tasks
Lessons learned
Scheduling parallel streaming applications with throughput constraints on many-core homogeneous platforms
Island topology: energy reduction of 30% to 47%
Consolidation: up to 91%
Good solutions despite the crown constraints
A good allocation is expensive to compute
Future work
Memory-islands-aware scheduling
Resource sharing: memory bandwidth
Quasi-static scheduling
Integrate communication costs
Accelerate smart allocation
Experiment on real architectures
Heterogeneous platforms
Drake (pp. 134–135)
Stream programming framework
On-chip pipelining
Moldable tasks
Frequency scaling
Scheduling experiments (Melot et al. [2015a])
Drake: derived from Schedeval (Janzén [2014])
Separate roles in an application
Stream topology
Tasks’ source code
Target platform-specifics
Host application
Drake C Streaming Framework (pp. 134–136)
Takes code to execute on the target platform
Application-specific: mergesort, FFT, etc.
Platform-specific: SCC, Xeon, MPI, etc.
Message passing
Frequency scaling
Generates an executable with monitoring.
Figure: Drake toolchain: the user writes the taskgraph, the Drake modules (C/C++ code) and the main C/C++ program; the static scheduler produces a static schedule from the taskgraph and a platform description; Draketools, the C/C++ compiler and the linker combine everything with the Drake framework library and a platform-specific backend plugin into the final executable binary.
Drake: Stream topology (p. 138)
<node id="n0">
  <data key="v_name">root</data>
  <data key="v_module">root</data>
  <data key="v_workload">2</data>
  <data key="v_max_width">1</data>
  <data key="v_efficiency">exprtk: p == 1 ? 1 : p &lt;= W ? 1 - 0.3 * ((p * p) / (W * W)) : 1e-06</data>
</node>
<node id="n1">
  <data key="v_name">leaf_1</data>
  <data key="v_module">presort</data>
  <data key="v_workload">400</data>
  <data key="v_max_width">1</data>
  <data key="v_efficiency">exprtk: p == 1 ? 1 : p &lt;= W ? 1 - 0.3 * ((p * p) / (W * W)) : 1e-06</data>
</node>
<node id="n2">
  <data key="v_name">leaf_2</data>
  <data key="v_module">presort</data>
  <data key="v_workload">400</data>
  <data key="v_max_width">1</data>
  <data key="v_efficiency">exprtk: p == 1 ? 1 : p &lt;= W ? 1 - 0.3 * ((p * p) / (W * W)) : 1e-06</data>
</node>
<edge source="n1" target="n0">
  <data key="e_consumer_name">left</data>
  <data key="e_producer_name">output</data>
  <data key="e_type">int</data>
  <data key="e_producer_rate">1</data>
  <data key="e_consumer_rate">1</data>
</edge>
<edge source="n2" target="n0">
  <data key="e_consumer_name">right</data>
  <data key="e_producer_name">output</data>
  <data key="e_type">int</data>
  <data key="e_producer_rate">1</data>
  <data key="e_consumer_rate">1</data>
</edge>

Figure: taskgraph: leaf_1 and leaf_2 feed root.
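The v_efficiency attribute above is an exprtk expression evaluated by the scheduler; transcribed to C it reads:

/* Efficiency expression from the taskgraph above: perfect on one core,
 * a quadratic penalty up to the maximum width W, near zero beyond W. */
double efficiency(int p, int W)
{
	if (p == 1)
		return 1.0;
	if (p <= W)
		return 1.0 - 0.3 * ((double)(p * p) / (double)(W * W));
	return 1e-06;
}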
Drake: Task code (p. 141)
#include <drake.h>

/* One-time task initialization; return 1 on success. */
int drake_init(task_t *task, void *aux)
{
	return 1;
}

/* Called before the task's first execution in the steady state. */
int drake_start(task_t *task)
{
	return 1;
}

/* Called at every pipeline round; here the task ends as soon as its
   input channels are depleted. */
int drake_run(task_t *task)
{
	return drake_task_depleted(task);
}

/* Called when the task terminates. */
int drake_kill(task_t *task)
{
	return 1;
}

/* Final cleanup when the task is destroyed. */
int drake_destroy(task_t *task)
{
	return 1;
}
Drake: host application (p. 142)
#include <drake.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
	/* Initialize the platform backend, then create, initialize and run
	   the scheduled stream before tearing everything down. */
	drake_platform_t stream = drake_platform_init(NULL);
	drake_platform_stream_create(stream, APPLICATION);
	drake_platform_stream_init(stream, NULL);
	drake_platform_stream_run(stream);
	drake_platform_stream_destroy(stream);
	drake_platform_destroy(stream);
	return EXIT_SUCCESS;
}
Drake: microbenchmarking (pp. 151–154)
Ping-pong application
Two tasks, ping and pong, exchange tokens for a time period t
No computation
Measures overhead and overhead-hiding capabilities
Local and remote core placements
RCCE for comparison
Figure: message round-trip time [µs] as a function of the number of simultaneous ping-pongs (1 to 8). A measurement sketch follows below.
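A sketch of the measurement loop (illustrative; not Drake's or RCCE's benchmark code; send_token and recv_token stand for the platform's communication primitives and are stubbed out here):

/* Time n token round trips and report the mean round-trip time. */
#include <stdio.h>
#include <time.h>

static void send_token(void) { /* stub: push one token to the peer */ }
static void recv_token(void) { /* stub: wait for the reply token   */ }

static double mean_rtt_us(int rounds)
{
	struct timespec t0, t1;
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < rounds; i++) {
		send_token();  /* ping */
		recv_token();  /* pong */
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	double us = (t1.tv_sec - t0.tv_sec) * 1e6
	          + (t1.tv_nsec - t0.tv_nsec) * 1e-3;
	return us / rounds;
}

int main(void)
{
	printf("mean RTT: %f us\n", mean_rtt_us(1000000));
	return 0;
}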
Drake vs TBB: results (pp. 163–171)
Sorting time
Drake is good with few cores
As good as TBB quicksort (QS) with 12 cores
TBB mergesort (MS) with sequential merge is worse
TBB MS is faster with 12 cores
Drake vs TBB: results (pp. 163–171)
Energy consumption
Drake is better than TBB QS
TBB MS with sequential merge is worst
TBB MS is still better
Conclusion
Challenges in high-performance parallel computing
Programming difficulties
Lack of scalability of architectures
Von Neumann bottleneck
Stream programming
Mature research field
No need for coherent shared memory
Helps reduce the von Neumann bottleneck
Our contributions
On-chip pipelining
Crown Scheduling for Moldable Streaming tasks
Drake Streaming Framework
Future work
Investigate more scheduling techniques
Minimize communication costs
Schedule sleeping modes
Use Drake to experiment on real architectures
Port Drake to other architectures (Epiphany, Myriad 2, etc.)
Implement Synchronous Data Flow for Drake
Optimize on-chip memory usage
Schedule main memory accesses
Questions
Bibliography
K. Avdic, N. Melot, C. Kessler, and J. Keller. Parallel sorting on Intel Single-Chip Cloud Computer. In Proc. A4MMC workshop on applications for multi- and many-core processors at ISCA-2011, 2011. ISBN 978-3-86644-717-2. URL http://liu.diva-portal.org/smash/record.jsf?pid=diva2%3A624512.
Ulrich Drepper. What every programmer should know about memory, 2007. URL http://www.akkadia.org/drepper/cpumemory.pdf.
Johan Janzén. Evaluation of energy-optimizing scheduling algorithms for streaming computations on massively parallel multicore architectures. Master’s thesis, Linköping University, 2014. URL http://liu.diva-portal.org/smash/record.jsf?pid=diva2%3A756758.
Gilles Kahn. The semantics of a simple language for parallel programming. In IFIP Congress, pages 471–475, 1974.
Jörg Keller, Christoph Kessler, and Rikard Hulten. Optimized on-chip-pipelining for memory-intensive computations on multi-core processors with explicit memory hierarchy. Journal of Universal Computer Science, 18(14):1987–2023, 2012.
Christoph Kessler, Nicolas Melot, Patrick Eitschberger, and Jörg Keller. Crown Scheduling: Energy-Efficient Resource Allocation, Mapping and Discrete Frequency Scaling for Collections of Malleable Streaming Tasks. In Proc. of 23rd Int. Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS 2013), 2013.
Donald E. Knuth. The Art of Computer Programming, Volume 3: Sorting and Searching (2nd ed.). Addison Wesley Longman Publishing Co., Inc., Redwood City, CA, USA, 1998. ISBN 0-201-89685-0.
N. Melot, K. Avdic, C. Kessler, and J. Keller. Investigation of main memory bandwidth on Intel Single-Chip Cloud Computer. Intel MARC3 Symposium 2011, Ettlingen, 2011. URL http://liu.diva-portal.org/smash/record.jsf?pid=diva2%3A664296.
Nicolas Melot, Johan Janzén, and Christoph Kessler. Mimer and Schedeval: Tools for Comparing Static Schedulers for Streaming Applications on Manycore Architectures. In Parallel Processing Workshops (ICPPW), IEEE, pages 146–155, Sept 2015a. doi: 10.1109/ICPPW.2015.24.
Nicolas Melot, Christoph Kessler, Jörg Keller, and Patrick Eitschberger. Fast Crown Scheduling Heuristics for Energy-Efficient Mapping and Scaling of Moldable Streaming Tasks on Manycore Systems. ACM Trans. Archit. Code Optim., 11(4):62:1–62:24, January 2015b. ISSN 1544-3566. doi: 10.1145/2687653. URL http://doi.acm.org/10.1145/2687653.
Kirk Pruhs, Rob van Stee, and Patchrawat Uthaisombut. Speed Scaling of Tasks with Precedence Constraints. Theory of Computing Systems, 43(1):67–80, July 2008.
Huiting Xu, Fanxin Kong, and Qingxu Deng. Energy Minimizing for Parallel Real-Time Tasks Based on Level-Packing. In 18th Int. Conf. on Emb. and Real-Time Comput. Syst. and Appl. (RTCSA), pages 98–103, Aug 2012.