Runtime-Guided Management of Multicore Cache Hierarchies

Academic year: 2022
(1)

Runtime-Guided Management of Multicore Cache Hierarchies

Per Stenström

Chalmers University of Technology

Sweden

(2)

Scaling (as we know it) is ending soon…

A radically new way of thinking about compute efficiency is needed

(3)

Slide for Beginners

The computing abstraction stack:
Problem → Algorithm → Program → ISA → Microarchitecture → Circuits → Electrons

(4)

Need to Put the Stake in the Ground

The computing abstraction stack:
Problem → Algorithm → Program → ISA → Microarchitecture → Circuits → Electrons

(5)

MECCA's Approach – Interaction Across Layers

Parallel programming model plus semantic information

Resource Management

ISA + primitives

Microarchitecture

(6)

MECCA’s Vision for Efficient Computing

P1: Parallelism:

Programmers: Unlock parallelism and provide semantic information

Resource manager (runtime): Manage parallelism “under the hood”

P2: Power:

Programmers: Express quality of service attributes

Resource manager (runtime): Translate them into efficient use of hardware resources “under the hood”

P3: Predictability:

Programmers: Express quality of service attributes

Resource manager (runtime): Manage parallelism predictably to meet timing constraints "under the hood"

(7)

Parallelism Management

(8)

Task-based Dataflow Prog. Models

[Task graph: TaskA produces region a; TaskB and TaskC consume it]

#pragma css task output(a)
void TaskA(float a[M][M]);

#pragma css task input(a)
void TaskB(float a[M][M]);

#pragma css task input(a)
void TaskC(float a[M][M]);

• Programmer annotations for task dependences

• Annotations used by runtime system for scheduling

• Dataflow task graph constructed dynamically

Convey semantic information to the runtime for efficient scheduling

(9)

Runtime Management of Cache Hierarchy

View: The runtime is part of the chip and responsible for management of the cache hierarchy, in analogy with the OS managing virtual memory

Runtime-assisted cache management:

• Runtime-assisted cache coherence optimizations

• Runtime-assisted dead-block management

• Runtime-assisted global cache management

(10)

Runtime-Assisted Coherence Management

[Manivannan and Stenstrom IPDPS 2014]

(11)

Vision

Dependence annotations allow for optimizations with high accuracy (as in message passing):

• Prefetching

• Migratory-sharing optimization

• Bulk data transfer

• Forwarding

[Figure: producer/consumer task pairs annotated with Output A[…] and Input A[…], and tasks T1/T2 sharing a region annotated Inout A[…]]

(12)

Baseline System

[Figure: three cores (P), each with a private L1/L2 cache, and a distributed L3 with per-slice directories (Dir), connected by an on-chip interconnect]

Coherence is maintained by a directory-based MESI protocol

(13)

Producer-Consumer Sharing 1(2)

[Figure: baseline system; the producer sends an Upgrade request to the directory]

• Producer must get ownership of data before modifying it

• Upgrades happen block by block

(14)

Producer-Consumer Sharing 2(2)

[Figure: baseline system; the consumer's read triggers a Downgrade at the producer's cache]

• Consumer must remotely read from the producer's cache

Optimization: the producer downgrades its blocks

(15)

Migratory Sharing 1(2)

[Figure: baseline system; Upgrade request to the directory]

• Producer must get ownership of data before modifying it

• Upgrades happen block by block

(16)

Migratory Sharing 2(2)

[Figure: baseline system; the previous node performs Downgrade and Self-invalidate]

• Current node must remotely read exclusively from the previous node

Optimization: self-invalidate and downgrade blocks

(17)

Runtime Triggered Prefetching

[Figure: baseline system; the runtime triggers Prefetching into the private caches]

• Prefetching helps hide LLC access latency, both for producer/consumer and migratory sharing

(18)

Runtime System Support

• Runtime uses dependence information to trigger coherence optimizations and prefetching of data structures

(19)

Experimental Methodology

(20)

Sharing Characteristics

[Figure: two bar charts (0-120) for Cholesky, Matmul, Qr and SparseLU. Left: sharing-pattern characterization (MC, MG, OT). Right: memory stall time split into Remote, Upgrade, LLC shared and LLC private]

The following dominate behavior:

• Migratory sharing

• Remote accesses, upgrades and LLC accesses

(21)

Impact on Memory Stall Time

[Figure: memory stall time (0-120, split into Remote, Upgrade, LLC shared and LLC private) for Cholesky, Matmul, SparseLU and Qr under configurations BL, D, SI, D+SI, SP and SP+D+SI]

(22)

Runtime-Assisted Dead-Block Management

[Manivannan et al. HPCA 2016]

(23)

Motivation

• Most of the blocks in LLCs are dead and consume precious cache space

• State-of-the-art schemes predict dead-blocks without any semantic information from the programming model

RADAR’s approach: Use static information from parallel

programming models to accurately detect and eliminate

dead blocks

(24)

Overview of RADAR

Three schemes:

• Look-Ahead (LA): use the data-dependence graph for prediction

• Look-Back (LB): use past region-access statistics

• Combined: LA ∩ LB and LA ∪ LB

[Flow: specify a task and its region accesses → is the region dead after the current access? → if so, evict (demote to LRU) the region from the LLC]

(25)

Look Ahead Scheme

• Look Ahead does not take size of regions into account

• Look Ahead only accounts for a limited outlook

• A00 is reused by T1-T4

• When T1-T4 are done, A00 can be deemed dead

• A22 is accessed by T8, T12 and T13, and will not be deemed dead

(26)

Look Back Scheme

• Leverages that region accesses are predictable.

• Region miss = all blocks in that region will miss

Approach: use branch-prediction techniques to predict future region hits/misses.

– Predict region MISS: demote all blocks of the region

– Predict region HIT: keep the region in the cache

(27)

Combined Scheme

By default, all region accesses are classified as hits.

LA: set of predicted region misses under Look-Ahead

LB: set of predicted region misses under Look-Back

Combined (conservative): LA ∩ LB

Combined (aggressive): LA ∪ LB

(28)

Experimental Methodology

(29)

Miss Rate Improvements

[Figure: LLC misses normalized to LRU (%, 50-100) for cholesky, matmul, sparselu, gauss, jacobi, redblack and the average, comparing Look-Ahead, Look-Back and RADAR-Combined]

Look-Ahead consistently does better than Look-Back

RADAR-Combined provides additional gains

(30)

Exec. Time Improvements

[Figure: execution time normalized to LRU (%, 80-100) for cholesky, matmul, sparselu, gauss, jacobi, redblack and the average, comparing Look-Ahead, Look-Back and RADAR-Combined]

Memory-bound apps see significant gains

(31)

Miss Reduction: SHiP vs RADAR

[Figure: LLC misses normalized to LRU (%, 50-110) for cholesky, matmul, sparselu, gauss, jacobi, redblack and the average, comparing SHiP-PC (2-bit SRRIP, 16K-entry SHCT) and RADAR-Combined]

SHiP can evict data from LLCs prematurely

RADAR outperforms the best-performing predictor

(32)

Concluding Remarks

ISA + primitives

Resource Management

Microarchitecture

Parallel programming model plus semantic information

MECCA team members:

Nadja Holtryd, Madhavan Manivannan, Mehrzad Nejat, Vasileios Papaefstathiou,

Risat Pathan, Miquel Pericàs, Per Stenstrom, Mohammad Waqar, Petros Voudouris

(33)

?

References
