Can High-Performance Interconnects Benefit Memcached and Hadoop? D. K. Panda and Sayantan Sur Network-Based Computing Laboratory Department of Computer Science and Engineering The Ohio State University, USA
Outline Introduction Overview of Memcached and Hadoop A new approach towards OFA in Cloud Experimental Results & Analysis Summary 2
Introduction
- High-Performance Computing (HPC) has adopted advanced interconnects (e.g., InfiniBand, 10 Gigabit Ethernet)
  - Low latency (a few microseconds), high bandwidth (40 Gb/s)
  - Low CPU overhead
- OpenFabrics has been quite successful in the HPC domain
  - Many machines in the Top500 list
- Beginning to draw interest from the enterprise domain
  - Google keynote shows interest in IB for improving RPC cost
  - Oracle has used IB in Exadata
- Performance in the enterprise domain remains a concern
  - The Google keynote also highlighted this
3
MPI (MVAPICH2) Performance over IB
[Figure: two plots vs. message size -- Small Message Latency, reaching 1.51 us, and Unidirectional Bandwidth, reaching 3023.7 MB/s]
2.4 GHz quad-core (Nehalem) Intel servers with Mellanox ConnectX-2 QDR adapters and IB switch
4
Software Ecosystem in Cloud and Upcoming Challenges
- Memcached: scalable distributed caching
  - Widely adopted caching frontend to MySQL and other DBs
- MapReduce: scalable model to process petabytes of data
  - Hadoop MapReduce framework widely adopted
- Both Memcached and Hadoop were designed with sockets
  - The Sockets API itself was designed decades ago
- At the same time, SSDs have improved I/O characteristics
  - The Google keynote also highlighted that I/O costs are coming down a lot
  - Communication cost will dominate in the future
- Can OFA help cloud computing software performance?
5
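To make the "caching frontend to MySQL and other DBs" role concrete, here is a minimal sketch of the cache-aside pattern Memcached is typically used for. A plain dict stands in for a Memcached client, and `query_db` is a hypothetical stand-in for a real database call; neither is part of any real API.

```python
# Cache-aside pattern: check the cache first, fall back to the DB on a miss.
cache = {}

def query_db(key):
    # Stand-in for an expensive SQL query (illustrative only).
    return f"row-for-{key}"

def cached_get(key):
    if key in cache:           # cache hit: no database round trip
        return cache[key]
    value = query_db(key)      # cache miss: fall through to the database
    cache[key] = value         # populate the cache for subsequent reads
    return value

first = cached_get("user:42")   # miss -> goes to the DB
second = cached_get("user:42")  # hit  -> served from the cache
```

In a real deployment the dict lookups would be network operations against a Memcached server, which is exactly why this usage is so network intensive.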
Typical HPC and Cloud Computing Deployments
[Figure: an HPC compute cluster on high-performance interconnects backed by a parallel virtual file system (meta-data server plus I/O servers with data), alongside a cloud deployment of VMs on physical machines with local storage, connected over GigE through a frontend]
- HPC system design is interconnect centric
- Cloud computing environments have complex software and have historically relied on Sockets and Ethernet
6
Outline Introduction Overview of Memcached and Hadoop A new approach towards OFA in Cloud Experimental Results & Analysis Summary 7
Memcached Architecture
[Figure: web-frontend servers reached over the Internet, with Memcached servers forming a distributed caching layer in front of database servers, all connected by a system area network]
- Distributed caching layer: allows aggregating spare memory from multiple nodes
- General purpose: typically used to cache database queries and results of API calls
- Scalable model, but typical usage is very network intensive
8
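The way clients spread keys across the pool of Memcached servers can be sketched with simple modulo hashing. The server addresses below are hypothetical, and real clients such as libmemcached also offer consistent hashing so that adding or removing a server does not remap every key; this is only the basic idea.

```python
import hashlib

# Hypothetical pool of Memcached servers (addresses are illustrative).
servers = ["10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211"]

def server_for(key: str) -> str:
    # Hash the key and pick a server by modulo; every client computes the
    # same mapping, so no central directory is needed.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return servers[h % len(servers)]

owner = server_for("user:42")  # deterministic: same key -> same server
```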
Hadoop Architecture
- Underlying Hadoop Distributed File System (HDFS)
- Fault tolerance by replicating data blocks
- NameNode: stores information on data blocks
- DataNodes: store blocks and host MapReduce computation
- JobTracker: tracks jobs and detects failures
- The model scales, but there is a high amount of communication during intermediate phases
9
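The NameNode's block-replication bookkeeping can be sketched as a toy model. This is an assumption-laden simplification (random placement, no rack awareness, default replication factor 3), not the actual HDFS code.

```python
import random

class ToyNameNode:
    """Toy sketch of HDFS block placement with a replication factor."""

    def __init__(self, datanodes, replication=3):
        self.datanodes = list(datanodes)
        self.replication = replication
        self.block_map = {}  # block id -> list of DataNodes holding replicas

    def allocate_block(self, block_id):
        # Real HDFS placement is rack-aware; random sampling of distinct
        # DataNodes is a simplification for illustration.
        replicas = random.sample(self.datanodes, self.replication)
        self.block_map[block_id] = replicas
        return replicas

nn = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
replicas = nn.allocate_block("blk_0001")
```

Losing one DataNode leaves two live replicas of each of its blocks, which the NameNode can then re-replicate: that is the fault-tolerance mechanism named on this slide.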
Outline Introduction Overview of Memcached and Hadoop A new approach towards OFA in Cloud Experimental Results & Analysis Summary 10
InfiniBand and 10 Gigabit Ethernet
- InfiniBand is an industry-standard packet-switched network
  - Has been increasingly adopted in HPC systems
  - User-level networking with OS-bypass (verbs)
- 10 Gigabit Ethernet: follow-up to Gigabit Ethernet
  - Provides user-level networking with OS-bypass (iWARP)
  - Some vendors have accelerated TCP/IP by putting it on the network card (hardware offload)
- Convergence: possible to use both through OpenFabrics; same software, different networks
11
Modern Interconnects and Protocols
[Figure: protocol stacks from application to switch for five configurations -- 1/10 GigE (sockets over kernel TCP/IP and the Ethernet driver); IPoIB (sockets over kernel TCP/IP on an InfiniBand adapter); 10 GigE-TOE (sockets with TCP/IP hardware-offloaded to the adapter); SDP (sockets mapped onto RDMA over InfiniBand); and native verbs in user space over InfiniBand]
12
A New Approach towards OFA in Cloud
[Figure: three software stacks -- current cloud software design (application over sockets over 1/10 GigE); the current approach towards OFA in cloud (application over accelerated sockets with verbs/hardware offload over 10 GigE or InfiniBand); and our approach (application directly over verbs on 10 GigE or InfiniBand)]
- Sockets were not designed for high performance
  - Stream semantics often mismatch the needs of upper layers (Memcached, Hadoop)
  - Zero-copy is not available for non-blocking sockets (Memcached)
- Significant consolidation in cloud system software
  - Hadoop and Memcached are the developer-facing APIs, not sockets
  - Improving Hadoop and Memcached will benefit many applications immediately!
13
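The stream-semantics mismatch can be made concrete: a TCP byte stream preserves no message boundaries, so sockets-based designs must layer their own framing (and copies) on top, whereas verbs delivers whole messages. A minimal length-prefix framing sketch, with illustrative message contents:

```python
import struct

def frame(msg: bytes) -> bytes:
    # Prefix each message with its 4-byte big-endian length.
    return struct.pack("!I", len(msg)) + msg

def deframe(stream: bytes):
    # Recover the message boundaries that the byte stream itself
    # does not preserve.
    msgs, off = [], 0
    while off < len(stream):
        (n,) = struct.unpack_from("!I", stream, off)
        off += 4
        msgs.append(stream[off:off + n])
        off += n
    return msgs

# Two requests concatenated on the wire, as TCP may deliver them:
wire = frame(b"get k1") + frame(b"set k2 v2")
messages = deframe(wire)
```

Verbs-based designs avoid this layer entirely, which is part of the performance gap this talk quantifies.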
Memcached Design Using Verbs
[Figure: sockets and RDMA clients connecting through a master thread to sockets and verbs worker threads, which all share the Memcached data structures (memory slabs, items)]
- Server and client perform a negotiation protocol
- The master thread assigns clients to the appropriate worker thread
- Once a client is assigned a verbs worker thread, it communicates directly with, and is bound to, that thread
- All other Memcached data structures are shared among RDMA and sockets worker threads
14
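The master/worker dispatch above can be sketched with threads and queues. The transport names ("verbs", "sockets") mirror the figure, but the queue-and-sentinel plumbing is an illustrative assumption, not the actual Memcached code.

```python
import queue
import threading

# One queue per transport-specific worker thread.
workers = {"verbs": queue.Queue(), "sockets": queue.Queue()}
served = []  # records which worker handled which client

def worker_loop(transport):
    while True:
        client = workers[transport].get()
        if client is None:                    # shutdown sentinel
            break
        served.append((client, transport))    # stand-in for request handling

def master_assign(client_id, transport):
    # Outcome of the negotiation protocol: the client is bound to this
    # worker's queue from now on.
    workers[transport].put(client_id)

threads = [threading.Thread(target=worker_loop, args=(t,)) for t in workers]
for t in threads:
    t.start()

master_assign("client-1", "verbs")
master_assign("client-2", "sockets")

for q in workers.values():
    q.put(None)                               # stop the workers
for t in threads:
    t.join()
```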
Outline Introduction Overview of Memcached and Hadoop A new approach towards OFA in Cloud Experimental Results & Analysis Summary 15
Experimental Setup
Memcached experiments
  - Intel Clovertown 2.33 GHz, 6 GB RAM, InfiniBand DDR, Chelsio T320
  - Intel Westmere 2.67 GHz, 12 GB RAM, InfiniBand QDR
  - Memcached server: 1.4.5; Memcached client (libmemcached): 0.45
Hadoop experiments
  - Intel Clovertown 2.33 GHz, 6 GB RAM, InfiniBand DDR, Chelsio T320
  - Intel X25-E 64 GB SSD and 250 GB HDD
  - Hadoop version 0.20.2, Sun/Oracle Java 1.6.0
  - Dedicated NameNode and JobTracker
  - Number of DataNodes used: 2, 4, and 8
- We used unmodified Hadoop for our experiments; OFA used through Sockets
16
Memcached Get Latency
[Figure: Get latency (us) vs. message size (1 byte to 4 KB) for 1G, 10G-TOE, SDP, IPoIB, and the OSU design, on the Intel Clovertown cluster (IB: DDR) and the Intel Westmere cluster (IB: QDR)]
- Memcached Get latency
  - 4 bytes: DDR 6 us; QDR 5 us
  - 4 KB: DDR 20 us; QDR 12 us
- Almost a factor of four improvement over 10GE (TOE) for 4 KB
- We are in the process of evaluating iWARP on 10GE
17
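Per-operation Get latencies like these are typically gathered by timing a tight loop of back-to-back operations and dividing by the iteration count. The sketch below shows that measurement shape; the dictionary `store` is a local stand-in for a Memcached client (so the absolute numbers mean nothing), not a real API.

```python
import time

def avg_latency_us(op, iters=10000):
    # Time many back-to-back operations; report the mean in microseconds.
    start = time.perf_counter()
    for _ in range(iters):
        op()
    return (time.perf_counter() - start) / iters * 1e6

store = {"key": b"x" * 4096}  # a 4 KB value, matching the slide's largest size
latency = avg_latency_us(lambda: store.get("key"))
```

Against a real server each `op` would be a network round trip, so this same loop directly yields the DDR/QDR microsecond figures reported above.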
Memcached Get TPS
[Figure: Get transactions per second (thousands of TPS) for SDP, IPoIB, and the OSU design with 8 and 16 clients, on the Intel Clovertown cluster (IB: DDR) and the Intel Westmere cluster (IB: QDR)]
- Memcached Get transactions per second for 4 bytes
  - On IB DDR: about 700K/s for 16 clients
  - On IB QDR: 1.9M/s for 16 clients
- Almost a factor of six improvement over SDP
- We are in the process of evaluating iWARP on 10GE
18
Hadoop: Java Communication Benchmark
[Figure: ping-pong bandwidth (MB/s) vs. message size (1 byte to 4 MB) for the C and Java versions of the benchmark, comparing 1GE, IPoIB, SDP, 10GE-TOE, and SDP-NIO]
- Sockets-level ping-pong bandwidth test
- Java performance depends on usage of NIO (allocateDirect)
- C and Java versions of the benchmark have similar performance
- HDFS does not use direct allocated blocks or NIO on the DataNode
S. Sur, H. Wang, J. Huang, X. Ouyang and D. K. Panda, "Can High-Performance Interconnects Benefit Hadoop Distributed File System?", MASVDC 10, in conjunction with MICRO 2010, Atlanta, GA.
19
Hadoop: DFS IO Write Performance
[Figure: average write throughput (MB/sec) vs. file size (1-10 GB) for 1GE, IPoIB, SDP, and 10GE-TOE, each with HDD and SSD; four DataNodes]
- DFS IO, included in Hadoop, measures sequential access throughput
- We have two map tasks, each writing to a file of increasing size (1-10 GB)
- Significant improvement with IPoIB, SDP and 10GigE
- With SSD, the performance improvement is almost seven- or eight-fold!
- SSD benefits are not seen without using a high-performance interconnect!
- In line with the comment in the Google keynote about I/O performance
20
Hadoop: RandomWriter Performance
[Figure: execution time (sec) on 2 and 4 DataNodes for 1GE, IPoIB, SDP, and 10GE-TOE, each with HDD and SSD]
- Each map generates 1 GB of random binary data and writes to HDFS
- SSD improves execution time by 50% with 1GigE for two DataNodes
- For four DataNodes, benefits are observed only with an HPC interconnect
- IPoIB, SDP and 10GigE can improve performance by 59% on four DataNodes
21
Hadoop Sort Benchmark
[Figure: execution time (sec) on 2 and 4 DataNodes for 1GE, IPoIB, SDP, and 10GE-TOE, each with HDD and SSD]
- Sort: baseline benchmark for Hadoop
- Sort phase: I/O bound; reduce phase: communication bound
- SSD improves performance by 28% using 1GigE with two DataNodes
- Benefit of 50% on four DataNodes using SDP, IPoIB or 10GigE
22
Summary
- OpenFabrics has come a long way in HPC adoption; it is now facing new frontiers in the cloud computing domain
- Previous attempts at OpenFabrics adoption in the cloud focused on Sockets
- Even using OpenFabrics through Sockets, good gains can be observed
  - 50% faster sorting when OFA is used in conjunction with SSDs
- There is a vast performance gap between Sockets-level and verbs-level performance
  - Factor of four improvement in Memcached Get latency (4 KB)
  - Factor of six improvement in Memcached Get transactions/s (4 bytes)
- Native verbs-level designs will benefit the cloud computing domain
- We are currently working on verbs-level designs of HDFS and HBase
23
Thank You! {panda, surs}@cse.ohio-state.edu Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/ MVAPICH Web Page http://mvapich.cse.ohio-state.edu/ 24