
Make the Most Out of Last Level Cache in Intel Processors

Alireza Farshin+, Amir Roozbeh+*, Gerald Q. Maguire Jr.+, Dejan Kostic+

Outline:
1. Problem
2. Slice-aware Memory Management
3. CacheDirector
4. Impact on Tail Latency for NFV Service Chains
5. Conclusions

[Figure: layout of the Intel Xeon processor with eight cores (Core 0-7) and eight LLC slices (Slice 0-7) connected to Intel QPI, PCIe + DMA, and the DDR channels, together with a bar chart of the number of cycles per LLC slice (slice number 0-7). Caption: Measuring read access time from Core 0 to different LLC slices.]
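The per-slice latencies in such a figure can be obtained by timing individual loads with the time-stamp counter. The following is a minimal sketch of one way to do this, not the authors' measurement code; it assumes the measuring thread is pinned to Core 0 and that the target cache line is resident in the LLC but has been evicted from the core's private L1/L2 caches.

/* Sketch only: time a single cache-line read with the time-stamp counter. */
#include <stdint.h>
#include <x86intrin.h>                /* __rdtscp, _mm_lfence */

static inline uint64_t time_one_read(volatile const uint64_t *line)
{
    unsigned aux;
    uint64_t start, end;

    _mm_lfence();                     /* order the timer against earlier work */
    start = __rdtscp(&aux);
    (void)*line;                      /* the load being measured */
    end = __rdtscp(&aux);
    _mm_lfence();
    return end - start;               /* elapsed cycles (includes timer overhead) */
}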

In modern (Intel) processors, the Last Level Cache (LLC) is divided into multiple slices. For each core, accessing data stored in a closer slice is faster than accessing data stored in other slices. We introduce a slice-aware memory management scheme, wherein data can be mapped to specific LLC slice(s).
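A minimal sketch of how such a mapping could be realized, assuming the platform-specific mapping from physical address to LLC slice is known (e.g., recovered by measurement or by reverse-engineering the address hash for the given CPU). The helpers phys_to_slice() and virt_to_phys() below are hypothetical placeholders for those platform-specific steps, not part of any real API.

/* Sketch only: walk a hugepage-backed buffer and collect the cache lines
 * that map to the LLC slice closest to the target core. */
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64

extern unsigned phys_to_slice(uint64_t paddr);   /* hypothetical slice hash */
extern uint64_t virt_to_phys(const void *vaddr); /* e.g., via /proc/self/pagemap */

static size_t collect_lines_for_slice(uint8_t *buf, size_t len,
                                      unsigned wanted_slice,
                                      uint8_t **out, size_t max_out)
{
    size_t n = 0;
    for (size_t off = 0; off + CACHE_LINE <= len && n < max_out; off += CACHE_LINE) {
        if (phys_to_slice(virt_to_phys(buf + off)) == wanted_slice)
            out[n++] = buf + off;    /* this line will be cached in wanted_slice */
    }
    return n;
}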

Testbed: Intel® Xeon® E5-2667 v3 with a Mellanox ConnectX-4 NIC, sending/receiving packets via DDIO.

Data Direct I/O (DDIO) technology sends packets to the LLC rather than to DRAM. CacheDirector is a network I/O solution that extends DDIO by implementing slice-aware memory management as an extension to the Data Plane Development Kit (DPDK). It sends the first 64 B of a packet (containing the packet's header) directly to the appropriate LLC slice, so the CPU core responsible for processing the packet can access the packet header in fewer CPU cycles.
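A sketch of the underlying idea (not the actual DPDK changes): shift where the packet is written within a buffer's headroom so that the header's cache line maps to the LLC slice closest to the processing core. phys_to_slice() and virt_to_phys() are the same hypothetical helpers as above.

/* Sketch only: pick a cache-line-aligned offset inside the headroom whose
 * physical address maps to the desired LLC slice. */
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64

extern unsigned phys_to_slice(uint64_t paddr);
extern uint64_t virt_to_phys(const void *vaddr);

/* Returns an offset in [0, headroom) such that buf+offset maps to
 * target_slice, or 0 to fall back to the default placement. */
static size_t header_offset_for_slice(uint8_t *buf, size_t headroom,
                                      unsigned target_slice)
{
    for (size_t off = 0; off + CACHE_LINE <= headroom; off += CACHE_LINE) {
        if (phys_to_slice(virt_to_phys(buf + off)) == target_slice)
            return off;              /* header lands in the closest slice */
    }
    return 0;
}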

Faster links expose processing elements to packets at a higher rate, but processor performance is no longer doubling at its earlier rate, making it harder to keep up with the growth in link speeds. For instance, a server receiving 64 B packets at a rate of 100 Gbps has only 5.12 ns to process a packet before the next packet arrives.
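The 5.12 ns budget follows directly from the frame size and the link rate (counting only the 64 B frame itself): 64 B × 8 bits/B = 512 bits, and 512 bits ÷ 100 Gbit/s = 5.12 ns per packet.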

It is essential to exploit every opportunity to optimize current computer systems. We focus on better management of the LLC.

With a slice-aware memory management scheme, which exploits the latency differences in accessing different LLC slices, it is possible to boost application performance and realize cache isolation. We proposed CacheDirector, a network I/O solution that utilizes slice-aware memory management. CacheDirector increases efficiency and performance while reducing tail latencies over the state of the art.

We evaluate the performance of NFV service chains in the presence of CacheDirector.

CacheDirector reduces tail latencies by up to 21.5% for optimized NFV service chains that are running at 100 Gbps.

+ KTH Royal Institute of Technology (EECS/COM)   * Ericsson Research
farshin@kth.se, amirrsk@kth.se, maguire@kth.se, dmk@kth.se

[Figure: CDF of latency (μs) for Traditional DDIO vs. CacheDirector + DDIO, with Δ ≈ 119 μs at the tail, and speedup (%) at the 75th, 90th, 95th, and 99th percentiles.]

Accessing the closest LLC slice can save up to 20 cycles, i.e., 6.25 ns.
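At the 3.2 GHz nominal clock frequency of the Xeon E5-2667 v3, 20 cycles ÷ (3.2 × 10⁹ cycles/s) = 6.25 ns.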

Stateful NFV service chain: Router, NAPT, Load Balancer.



Lower-level caches (L1i, L1d, L2) are private to each core.


Work supported by SSF, WASP, and ERC.

