Optimizing Intel Data Direct I/O Technology for Multi-hundred-gigabit Networks

Alireza Farshin+, Amir Roozbeh+*, Gerald Q. Maguire Jr.+, Dejan Kostic+
+KTH Royal Institute of Technology (EECS), *Ericsson Research
farshin@kth.se, amirrsk@kth.se, maguire@kth.se, dmk@kth.se

Work supported by SSF, WASP (Wallenberg AI, Autonomous Systems and Software Program), and ERC.
1 What is DDIO?

Data Direct I/O Technology (DDIO) transfers packets directly to the Last Level Cache (LLC) rather than to main memory. DDIO updates a cache line if it is already present in the LLC; otherwise, it allocates the cache line in a limited portion of the LLC (i.e., 2 ways of an n-way set-associative cache). DDIO was introduced to improve the performance of I/O applications by mitigating expensive DRAM accesses.

[Figure: Sending/receiving packets via DDIO injects them into the logical LLC of the CPU socket, whereas traditional DMA loads packets via main memory.]

2 DDIO Can Become a Bottleneck

Faster link speeds cause DDIO to fail to provide the expected benefits, as new incoming packets can repeatedly evict previously received packets (i.e., both not-yet-processed and already-processed packets) from the LLC. The probability of eviction is high with:

• A large number of receive (RX) descriptors
• A high load-imbalance factor
• A receiving rate ≥ 100 Gbps
• An I/O-intensive application
• Packet sizes ≤ 512 B

4 How to Fine-tune DDIO

A little-discussed register called "IIO LLC WAYS" can be used to tune the capacity of DDIO; its default value has only 2 set bits (i.e., 11000000000 for an 11-way LLC), one per cache way that DDIO may allocate into. Fine-tuning DDIO enables us to process packets with a larger number of RX descriptors while providing the same or better performance. We need more RX descriptors for 100 Gbps networks, as additional descriptors reduce the latency incurred by packet loss and PAUSE frames.
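As a rough illustration of tuning DDIO's capacity, the helper below builds an "IIO LLC WAYS" value with a chosen number of contiguous high-order bits set, matching the default pattern above. The 11-way LLC width follows the bit pattern shown; the register's location (reportedly an MSR, 0xC8B, on Skylake servers) is an assumption to verify for your platform before writing anything.

```python
def iio_llc_ways_mask(n_ways: int, llc_ways: int = 11) -> int:
    """Build an IIO LLC WAYS value with n_ways contiguous high-order bits set.

    llc_ways=11 assumes an 11-way set-associative LLC; the default DDIO
    configuration corresponds to n_ways=2 (0b11000000000).
    """
    if not 1 <= n_ways <= llc_ways:
        raise ValueError("n_ways must be between 1 and llc_ways")
    return ((1 << n_ways) - 1) << (llc_ways - n_ways)

# Default: 2 ways -> 0x600; expanding DDIO to 8 ways -> 0x7f8.
print(hex(iio_llc_ways_mask(2)))  # 0x600
print(hex(iio_llc_ways_mask(8)))  # 0x7f8
```

The resulting value could then be written with msr-tools (`wrmsr`); which register to write, and whether it is writable at all, is platform-specific.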
6 Conclusion

There is no one-size-fits-all approach to utilizing DDIO. Therefore, it is important to optimize DDIO based on the characteristics of applications and their workloads, especially for multi-hundred-gigabit networks.
[Figure: 99th-percentile latency (µs) vs. number of RX descriptors (512-4096) for 2, 4, 6, and 8 DDIO ways: with more ways, a larger number of RX descriptors yields lower tail latency.]
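One way to see why the number of DDIO ways interacts with the RX ring size is to compare the memory footprint of the ring's packet buffers against the LLC slice DDIO may allocate into. The sketch below assumes a 33 MB, 11-way LLC and 2 KB of buffer per RX descriptor (typical DPDK mbuf sizing); both numbers are illustrative assumptions, not values from this work.

```python
def ddio_capacity_bytes(llc_bytes: int, ddio_ways: int, llc_ways: int = 11) -> int:
    """LLC capacity DDIO can allocate into, given its share of the cache ways."""
    return llc_bytes * ddio_ways // llc_ways

def ring_fits_in_ddio(n_rx_desc: int, buf_bytes: int,
                      llc_bytes: int, ddio_ways: int) -> bool:
    """Rough check: do all in-flight packet buffers fit in DDIO's LLC share?"""
    return n_rx_desc * buf_bytes <= ddio_capacity_bytes(llc_bytes, ddio_ways)

LLC = 33 * 1024 * 1024  # assumed 33 MB, 11-way LLC (3 MB per way)
BUF = 2048              # assumed 2 KB buffer per RX descriptor

# A 4096-descriptor ring needs 8 MB of buffers: it overflows the default
# 2-way (6 MB) DDIO share but fits once DDIO is widened to 4 ways (12 MB).
print(ring_fits_in_ddio(4096, BUF, LLC, ddio_ways=2))  # False
print(ring_fits_in_ddio(4096, BUF, LLC, ddio_ways=4))  # True
```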
5 Toward 200 Gbps

Problem: DDIO can degrade performance at faster link speeds, due to the higher cache injection rate.

Approach: The LLC could be bypassed for low-priority or DDIO-insensitive applications, thus making room for high-priority or highly DDIO-sensitive applications. Bypassing could be done by:

• Disabling DDIO for a specific I/O device, or
• Exploiting a remote processor's socket to DMA data
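A back-of-the-envelope calculation illustrates the injection-rate problem: the time for line-rate traffic to completely overwrite DDIO's LLC share halves when moving from 100 Gbps to 200 Gbps. The 6 MB share below (2 ways of an assumed 33 MB, 11-way LLC) is an illustrative assumption.

```python
def fill_time_us(ddio_share_bytes: float, link_gbps: float) -> float:
    """Microseconds for line-rate traffic to overwrite DDIO's share of the LLC."""
    bytes_per_us = link_gbps * 1e9 / 8 / 1e6  # Gbps -> bytes per microsecond
    return ddio_share_bytes / bytes_per_us

share = 6 * 1024 * 1024  # assumed: 2 of 11 ways of a 33 MB LLC
for gbps in (100, 200):
    print(f"{gbps} Gbps: DDIO share overwritten every "
          f"{fill_time_us(share, gbps):.0f} µs")
```

At 200 Gbps the share turns over roughly every 250 µs, so any packet not processed within that window has likely been evicted to DRAM.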
[Figure: 99th-percentile latency (µs) at forwarding rates of 100 Gbps and 200 Gbps, with a 30% gap at 200 Gbps.]
3 Sensitivity to DDIO

Different applications have different levels of sensitivity to DDIO. We examined memcached (over TCP and UDP), NVMe (full write, random write, and random read), an NFV chain, an L2 forwarder (FW), and a stateful NFV service chain (router, NAPT, and load balancer).

[Figure: DDIO write/read performance (%) and throughput (Gbps) vs. relative processing time; applications shown in gray see a 5% improvement when DDIO is enabled.]

Increasing the processing time improves DDIO performance but reduces throughput. Moreover, the performance of DDIO only matters when an application is I/O-bound, rather than CPU- or memory-bound.
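One way to reason about whether an application is I/O-bound is to compare its per-packet service time against the packet inter-arrival time at line rate: when processing dominates, the application is CPU/memory-bound and cache-resident packet data helps little. The heuristic and example numbers below are illustrative assumptions, not measurements from this work.

```python
def inter_arrival_ns(link_gbps: float, pkt_bytes: int) -> float:
    """Nanoseconds between back-to-back packets at line rate (payload only)."""
    return pkt_bytes * 8 / link_gbps

def likely_io_bound(per_pkt_ns: float, link_gbps: float, pkt_bytes: int) -> bool:
    """Heuristic: I/O-bound if the CPU can keep up with line-rate arrivals,
    i.e., packets arrive no faster than they can be processed."""
    return per_pkt_ns <= inter_arrival_ns(link_gbps, pkt_bytes)

# 64 B packets at 100 Gbps arrive every ~5.1 ns, so an app spending 20 ns
# per packet is CPU-bound there, but I/O-bound with 1024 B packets (~82 ns).
print(likely_io_bound(per_pkt_ns=20, link_gbps=100, pkt_bytes=64))    # False
print(likely_io_bound(per_pkt_ns=20, link_gbps=100, pkt_bytes=1024))  # True
```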