1 LOW-LATENCY, HIGH-BANDWIDTH USE CASES FOR NAHANNI / IVSHMEM Cam Macdonell, Xiaodi Ke, Adam Wolfe Gordon, Paul Lu University of Alberta paullu@cs.ualberta.ca KVM Forum 2011 August 16, 2011
2 Contents 1. For the QEMU/KVM developer Nahanni/ivshmem included as of QEMU 0.13.0, August 2010 2. VMs for Web services Memcached: up to 29% lower inter-vm latencies on workloads 3. VMs for computational science Order-of-magnitude lower latency and higher bandwidth on MPI microbenchmarks Up to 30% faster on MPI application benchmarks (GAMESS, SPEC MPI2007)
3 Part 1: What is Nahanni / ivshmem? User-level library for data movement OS bypass: No guest or host OS involvement Memcached, DDI: pointer-based structured data MPI: message passing, stream data Nahanni device looks like a graphics card Guest OS driver (for initialization only) No impact on guests that do not load the driver New Nahanni virtual PCI device in QEMU -device ivshmem,shm=<name>,size=1024! Creates POSIX shared memory on host Cam Macdonell s Ph.D.
Host-Guest, Inter-VM Shared Memory 4
5 Host-Guest Use Case Copying a file from host into a guest VM Nahanni is faster than Netcat, SCP-HPN, 9P
6 Part 2: Web Services in the Cloud Many companies use the cloud for their Web servers Reddit and FourSquare are on Amazon EC2 Memcached is a key-value cache for databases, etc. Used by Facebook, Twitter, others Conclusion: Nahanni reduces look-up latency by 29% on a read-mostly Yahoo Cloud Serving Benchmark (YCSB) workload, for co-located VMs M.Sc. thesis and NetDB 11 paper by Adam Wolfe Gordon
7 Nahanni Memcached + YCSB 1 million 1 KB records Each VM ran 12 million operations; 36 million in total Each VM ran 2,000 operations per second
8 Nahanni Memcached + YCSB VM 3 can do look-up using pointers and synchronization 1 million 1 KB records Each VM ran 12 million operations; 36 million in total Each VM ran 2,000 operations per second
9 Nahanni vs. virtio/bridging (br) vs. virtio/vhost (vh) Average Total Read Latency (ms) 0.0 0.5 1.0 1.5 2.0 2.5 1.93 1.93 1.4 0.21 0.6 1.02 0.59 0.64 0.23 1.87 0.12 0.15 1.87 No cache (br) memcached (br) Nahanni (br) No cache (vh) memcached (vh) Nahanni (vh) Cache Type CASSANDRA_READ CACHE_READ CACHE_WRITE 0.95 0.41 0.67 0.13 0.11 0.43 0.42
10 Nahanni vs. virtio/bridging (br) vs. virtio/vhost (vh) Average Total Read Latency (ms) 0.0 0.5 1.0 1.5 2.0 2.5 1.93 1.93 1.4 0.21 0.6 1.02 0.59 0.64 0.23 1.87 0.12 0.15 1.87 No cache (br) memcached (br) Nahanni (br) No cache (vh) memcached (vh) Nahanni (vh) Cache Type CASSANDRA_READ CACHE_READ CACHE_WRITE 0.95 0.41 0.67 0.13 0.11 0.43 0.42 Workload: 29% lower latency
11 Nahanni vs. virtio/bridging (br) vs. virtio/vhost (vh) Average Total Read Latency (ms) 0.0 0.5 1.0 1.5 2.0 2.5 1.93 1.93 1.4 0.21 0.6 1.02 0.59 0.64 0.23 1.87 0.12 0.15 1.87 No cache (br) memcached (br) Nahanni (br) No cache (vh) memcached (vh) Nahanni (vh) Cache Type CASSANDRA_READ CACHE_READ CACHE_WRITE 0.95 0.41 0.67 0.13 0.11 0.43 0.42 Workload: 29% lower latency Read Ops: 73% lower latency
12 Part 3: Computational Science in VMs Your HPC application is likely to have an Message- Passing Interface (MPI) version We developed the MPI-Nahanni user-level library Port of MPICH2-Nemesis For co-located VM instances Conclusion: MPI-Nahanni has order-of-magnitude lower latency and higher bandwidth; applications are up to 30% faster. M.Sc. thesis of Xiaodi Ke, Ph.D. thesis of Cam Macdonell
13 NetPIPE 2-sided Bandwidth (Nahanni vs. vhost) ($!!! (#!!! *+,-./0/112 *+,-30456 *+,-./0/112 *+,-30456 (!!!! =/1@A2@60<*=B57C? )!!! &!!! $!!! #!!!!! $ "# #%& #' (&' (#)' *755/879:2;7<=>675? (* )* &$* %(#*
14 NetPIPE 2-sided Bandwidth (Nahanni vs. vhost)!"((!((( +,-./010223 +,-.41567 +,-.41567 )+!"#' >02AB3A71=+>C68D@ $(( %(( #(( "(( (!!"#$ ()(" " %&"'( ()(# # #&"(! ()($ $ '%"#) ()!% *&"#!!% ()&! %'!"'( ()%( &" +866098:;3<8=>?786@ #)'"#!!%'"!+ %#!"$!)"! ")&* "'% #)$(
15 NetPIPE 2-sided Bandwidth (full results) D/19G2960C*DH5;<F ($!!! (#!!! (!!!! )!!! &!!! $!!! #!!! *+,-./0/112 *+,-30456 *+,-7829:; *+,-54<=;6 *+,-30456.4>*-*+,-.;?;525.4>*-*+,-54<=;6 *+,-7829:; *+,-54<=;6.4>*-*+,-.;?;525.4>*-*+,-54<=;6!! $ "# #%& #' (&' (#)' *;55/:;@A2B;CDE6;5F (* )* &$* %(#*
16 NetPIPE 2-sided Latency (Nahanni vs. vhost) (&! (!! )*+,-./.001 )*+,2/345 )*+,2/345 $&! $!!?.560@=;A46@> "&! "!! %&! %!! &!!! " # $" %"# &%" )644.76891:6;<=564> "' #' $"'
17 NetPIPE 2-sided Latency (full results) F.5:0;DBG4:;E (&! (!! $&! $!! "&! "!! %&! %!! &! )*+,-./.001 )*+,2/345 )*+,67189: )*+,43;<:5-3=),)*+,-:>:414-3=),)*+,43;<:5 )*+,-./.001 )*+,2/345 )*+,67189: )*+,43;<:5-3=),)*+,-:>:414-3=),)*+,43;<:5!! " # $" %"# &%" ):44.9:?@1A:BCD5:4E "' #' $"'
18 SPEC MPI2007 SPEC MPI2007 is an industry benchmark for MPI We ran 9 of 13 applications from the medium input set Part of Cam Macdonell s Ph.D. thesis Conclusions: Performance benefit of MPI-Nahanni grows as the number of processes grows Improvement is proportional to time spent in MPI functions
SPEC MPI2007 (8x8) 19
% MPI time (y) vs. % Improvement (x) 20
% MPI time (y) vs. % Improvement (x) 21
% MPI time (y) vs. % Improvement (x) 22
23 GAMESS: Non-MPI Communications GAMESS Quantum Chemistry Not MPI; DDI ported to Nahanni
24 Concluding Remarks Nahanni / ivshmem is an alternative to other mechanisms (e.g., virtual network, virtio+vhost) for inter-vm, intra-host IPC OS bypass: does not modify or use data movement paths in host or guest VM Supports pointers, non-stream data too Web: low latency for client-server, structured data 29% lower on a YCSB + Nahanni memcached workload Computational science: low latency and high bandwidth for message-passing applications No changes to MPI code. Up to 30% faster on full applications. paullu@cs.ualberta.ca