Optimizing Intel Data Direct I/O Technology for Multi-hundred-gigabit Networks

Alireza Farshin+, Amir Roozbeh+*, Gerald Q. Maguire Jr.+, Dejan Kostic+
+KTH Royal Institute of Technology (EECS), *Ericsson Research
farshin@kth.se, amirrsk@kth.se, maguire@kth.se, dmk@kth.se

Work supported by SSF, WASP (Wallenberg AI, Autonomous Systems and Software Program), and ERC.
1 What is DDIO?

Data Direct I/O Technology (DDIO) transfers packets directly to the Last Level Cache (LLC) rather than to main memory. DDIO updates a cache line if it is already present in the LLC; otherwise, it allocates the cache line in a limited portion of the LLC (i.e., 2 ways of an n-way set-associative cache). DDIO was introduced to improve the performance of I/O applications by mitigating expensive DRAM accesses.

[Figure: Sending/receiving packets via DDIO injects them into the logical LLC of the CPU socket, whereas traditional DMA loads packets via main memory.]

2 DDIO Can Become a Bottleneck

Faster link speeds cause DDIO to fail to provide the expected benefits, as new incoming packets can repeatedly evict previously received packets (i.e., both not-yet-processed and already-processed packets) from the LLC. The probability of eviction is high with:

• A large number of receive (RX) descriptors
• A high load-imbalance factor
• A receiving rate ≥ 100 Gbps
• An I/O-intensive application
• Packet sizes ≤ 512 B

4 How to Fine-tune DDIO

A little-discussed register called "IIO LLC WAYS" can be used to tune the capacity of DDIO; its default value has only 2 set bits (i.e., 11000000000 for an 11-way LLC), one per cache way that DDIO may allocate into. Fine-tuning DDIO enables us to process packets with a larger number of RX descriptors while providing the same or better performance. We need more RX descriptors for 100 Gbps networks, as additional descriptors reduce the latency incurred by packet loss and PAUSE frames.
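As a rough illustration of tuning DDIO's capacity, the helper below builds an "IIO LLC WAYS" value with a chosen number of contiguous high-order bits set, matching the default pattern above. The 11-way LLC width follows the bit pattern shown; the register's location (reportedly an MSR, 0xC8B, on Skylake servers) is an assumption to verify for your platform before writing anything.

```python
def iio_llc_ways_mask(n_ways: int, llc_ways: int = 11) -> int:
    """Build an IIO LLC WAYS value with n_ways contiguous high-order bits set.

    llc_ways=11 assumes an 11-way set-associative LLC; the default DDIO
    configuration corresponds to n_ways=2 (0b11000000000).
    """
    if not 1 <= n_ways <= llc_ways:
        raise ValueError("n_ways must be between 1 and llc_ways")
    return ((1 << n_ways) - 1) << (llc_ways - n_ways)

# Default: 2 ways -> 0x600; expanding DDIO to 8 ways -> 0x7f8.
print(hex(iio_llc_ways_mask(2)))  # 0x600
print(hex(iio_llc_ways_mask(8)))  # 0x7f8
```

The resulting value could then be written with msr-tools (`wrmsr`); which register to write, and whether it is writable at all, is platform-specific.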
6 Conclusion

There is no one-size-fits-all approach to utilizing DDIO. Therefore, it is important to optimize DDIO based on the characteristics of applications and their workloads, especially for multi-hundred-gigabit networks.
[Figure: 99th-percentile latency (µs) vs. number of RX descriptors (512-4096) for 2, 4, 6, and 8 DDIO ways: with more ways, a larger number of RX descriptors yields lower tail latency.]
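One way to see why the number of DDIO ways interacts with the RX ring size is to compare the memory footprint of the ring's packet buffers against the LLC slice DDIO may allocate into. The sketch below assumes a 33 MB, 11-way LLC and 2 KB of buffer per RX descriptor (typical DPDK mbuf sizing); both numbers are illustrative assumptions, not values from this work.

```python
def ddio_capacity_bytes(llc_bytes: int, ddio_ways: int, llc_ways: int = 11) -> int:
    """LLC capacity DDIO can allocate into, given its share of the cache ways."""
    return llc_bytes * ddio_ways // llc_ways

def ring_fits_in_ddio(n_rx_desc: int, buf_bytes: int,
                      llc_bytes: int, ddio_ways: int) -> bool:
    """Rough check: do all in-flight packet buffers fit in DDIO's LLC share?"""
    return n_rx_desc * buf_bytes <= ddio_capacity_bytes(llc_bytes, ddio_ways)

LLC = 33 * 1024 * 1024  # assumed 33 MB, 11-way LLC (3 MB per way)
BUF = 2048              # assumed 2 KB buffer per RX descriptor

# A 4096-descriptor ring needs 8 MB of buffers: it overflows the default
# 2-way (6 MB) DDIO share but fits once DDIO is widened to 4 ways (12 MB).
print(ring_fits_in_ddio(4096, BUF, LLC, ddio_ways=2))  # False
print(ring_fits_in_ddio(4096, BUF, LLC, ddio_ways=4))  # True
```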
5 Toward 200 Gbps

Problem: DDIO can degrade performance at faster link speeds, due to the higher cache injection rate.

Approach: The LLC could be bypassed for low-priority or DDIO-insensitive applications, thus making room for high-priority or highly DDIO-sensitive applications. Bypassing could be done by:

• Disabling DDIO for a specific I/O device, or
• Exploiting a remote processor's socket to DMA data
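A back-of-the-envelope calculation illustrates the injection-rate problem: the time for line-rate traffic to completely overwrite DDIO's LLC share halves when moving from 100 Gbps to 200 Gbps. The 6 MB share below (2 ways of an assumed 33 MB, 11-way LLC) is an illustrative assumption.

```python
def fill_time_us(ddio_share_bytes: float, link_gbps: float) -> float:
    """Microseconds for line-rate traffic to overwrite DDIO's share of the LLC."""
    bytes_per_us = link_gbps * 1e9 / 8 / 1e6  # Gbps -> bytes per microsecond
    return ddio_share_bytes / bytes_per_us

share = 6 * 1024 * 1024  # assumed: 2 of 11 ways of a 33 MB LLC
for gbps in (100, 200):
    print(f"{gbps} Gbps: DDIO share overwritten every "
          f"{fill_time_us(share, gbps):.0f} µs")
```

At 200 Gbps the share turns over roughly every 250 µs, so any packet not processed within that window has likely been evicted to DRAM.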
[Figure: 99th-percentile latency (µs) at forwarding rates of 100 Gbps and 200 Gbps, with a 30% gap at 200 Gbps.]
3 Sensitivity to DDIO

Different applications have different levels of sensitivity to DDIO. We examined memcached (over TCP and UDP), NVMe (full write, random write, and random read), an NFV chain, an L2 forwarder (FW), and a stateful NFV service chain (router, NAPT, and load balancer).

[Figure: DDIO write/read performance (%) and throughput (Gbps) vs. relative processing time; applications shown in gray see a 5% improvement when DDIO is enabled.]

Increasing the processing time improves DDIO performance but reduces throughput. Moreover, the performance of DDIO only matters when an application is I/O-bound, rather than CPU- or memory-bound.
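One way to reason about whether an application is I/O-bound is to compare its per-packet service time against the packet inter-arrival time at line rate: when processing dominates, the application is CPU/memory-bound and cache-resident packet data helps little. The heuristic and example numbers below are illustrative assumptions, not measurements from this work.

```python
def inter_arrival_ns(link_gbps: float, pkt_bytes: int) -> float:
    """Nanoseconds between back-to-back packets at line rate (payload only)."""
    return pkt_bytes * 8 / link_gbps

def likely_io_bound(per_pkt_ns: float, link_gbps: float, pkt_bytes: int) -> bool:
    """Heuristic: I/O-bound if the CPU can keep up with line-rate arrivals,
    i.e., packets arrive no faster than they can be processed."""
    return per_pkt_ns <= inter_arrival_ns(link_gbps, pkt_bytes)

# 64 B packets at 100 Gbps arrive every ~5.1 ns, so an app spending 20 ns
# per packet is CPU-bound there, but I/O-bound with 1024 B packets (~82 ns).
print(likely_io_bound(per_pkt_ns=20, link_gbps=100, pkt_bytes=64))    # False
print(likely_io_bound(per_pkt_ns=20, link_gbps=100, pkt_bytes=1024))  # True
```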