Scaling Hadoop for Multi-Core and Highly Threaded Systems
Jangwoo Kim, Zoran Radovic
Performance Architects, Architecture Technology Group
Sun Microsystems Inc.
AGENDA
Project Overview
Hadoop Updates
CMT Hadoop Systems
Scaling Hadoop on CMT
Virtualization Technologies: Zones, Logical Domains
Case Study: E-mail Discovery
Conclusions
Project Overview
Chip Multi-Threading (CMT) processors and Hadoop are both designed for maximum throughput
Sun's JVM is optimized for CMT
> Java has been widely deployed by many customers on CMT
Hadoop is written in Java, making it an ideal throughput candidate
Seemed like a great fit for Hadoop, with the potential for a greatly reduced footprint
Related work by Ning Sun and Lee Anne Simmons:
> Blueprint: Using Logical Domains and CoolThreads Technology: Improving Scalability and System Utilization
Hadoop Summit '09: 700+ attendees...
Hadoop Expands Beyond the Web...
Some examples from the Summit:
> Genetic Sequence Analysis
> Parallel Data Mining in Telco
> Natural Language Learning
> Business Fraud Detection
> Clinical Trials
> Retail Business Planning...
Map/Reduce Organization
[Diagram: input splits (Split 0-3) in HDFS are read by map tasks; map outputs are copied and sort/merged by reduce tasks, which write output parts (Part 0, Part 1) back to HDFS]
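The flow in the diagram can be sketched in miniature. This is an illustrative in-process word-count pipeline, not Hadoop's implementation: the splits stand in for HDFS blocks, and the shuffle's partition-and-group step is simulated with a hash on the key.

```python
from collections import defaultdict
from itertools import chain

def map_phase(split):
    # Emit (key, value) pairs: one (word, 1) per word, like a word-count mapper.
    return [(word, 1) for word in split.split()]

def shuffle(mapped, num_reducers):
    # Sort/merge stage: route each key to a partition, grouping values per key.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in mapped:
        partitions[hash(key) % num_reducers][key].append(value)
    return partitions

def reduce_phase(partition):
    # Sum the grouped values for each key, producing one output "part".
    return {key: sum(values) for key, values in sorted(partition.items())}

# Input splits stand in for HDFS blocks.
splits = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = list(chain.from_iterable(map_phase(s) for s in splits))
parts = [reduce_phase(p) for p in shuffle(mapped, num_reducers=2)]
totals = {k: v for part in parts for k, v in part.items()}
print(totals["the"])  # 3
```

Each map task sees only its split; only the shuffle brings matching keys together, which is why the copy/sort-merge stage dominates network traffic in real jobs.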
Next-Gen Hadoop: Low Latency Focus
Hadoop is traditionally optimized for throughput
World Record Sort source code changes:
> http://developer.yahoo.net/blogs/hadoop/yahoo2009.pdf
> "Winning a 60 Second Dash with a Yellow Elephant"
Reducer improvements (shuffle):
> Memory-to-memory merge
> Fetch of multiple map outputs from the same node, reducing the number of server connections
> Improved timeout behavior
Better data corruption detection (CRC32 improvements)
Map output compression (45% of the original size)
Improved and multi-threaded data partitioning
Lower latency with faster heartbeat
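The CRC32 corruption-detection idea can be illustrated with a small sketch. This uses Python's zlib.crc32 as a stand-in and is not Hadoop's actual checksumming code: a checksum is stored when a block is written and verified when it is read back.

```python
import zlib

def write_with_checksum(data: bytes):
    # Compute a CRC32 and store it alongside the payload.
    return data, zlib.crc32(data)

def verify(data: bytes, checksum: int) -> bool:
    # Recompute the CRC32 on read and compare it to the stored value.
    return zlib.crc32(data) == checksum

payload, crc = write_with_checksum(b"map output block")
assert verify(payload, crc)        # intact data passes
corrupted = b"mAp output block"    # a single flipped byte
assert not verify(corrupted, crc)  # the mismatch is detected
```

A CRC catches accidental corruption (disk or network bit flips) cheaply; it is not a cryptographic integrity check.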
OpenSolaris 2009.06
OpenSolaris Moves Into Enterprise
UltraSPARC T1 and T2 Support, Sun4u
5 Year Enterprise Support
Datacenter-Ready Installation
New and Modern Networking Stack
> http://opensolaris.org/os/project/crossbow/
> Multi-Core Optimized
> Easy Network Virtualization and Resource Control
Powerful, Built-in and Free Virtualization Technology
> http://opensolaris.org/os/community/ldoms
http://www.opensolaris.com/learn/features/whats-new/200906/
UltraSPARC T2 Processor
8 SPARC V9 cores @ 1.4GHz, 64 threads
> 8 vertical threads per core
> 2 execution pipelines per core
> 1 instruction/cycle per pipeline
> 1 FPU per core
> 1 SPU (crypto) per core
4MB 16-way 8-bank L2$
Memory bandwidth: 42 GB/s read, 21 GB/s write
2.5GHz x8 PCI-Express interface
2 x 10Gb on-chip Ethernet
Power: 84 watts (typical)
http://www.opensparc.net
CMT Hadoop Systems
> T5440: 4U 2P US-T2 Plus platform & Sun Storage J4400
> Blade 6000: 10U US-T2 blades
> T5240: 2U 2P US-T2 Plus server
CMT Hadoop Node and Rack Specs
* http://developer.yahoo.net/blogs/hadoop/2009/05/hadoop_sorts_a_petabyte_in_162.html
Ideal Performance Model
[Timeline: job start → all maps start → all maps finish (mapping); shuffle start → all reduces start → all reduces finish (shuffling, reducing) → job completion]
All tasks start and finish simultaneously
Performance Model with Serialized Tasks
[Timeline: job start → first map starts → last map starts → first map finishes → last map finishes (mapping); last shuffle starts → last reduce starts → last reduce finishes (shuffling, reducing) → job completion]
Launching many tasks can incur significant overhead
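The contrast between the two models can be captured in a toy calculation. This is an illustration with made-up numbers, not measured data: with serialized launching, every task start is delayed by one launch interval, so the last task finishes well after the ideal model predicts.

```python
import math

def ideal_time(num_tasks, slots, task_time):
    # Ideal model: all tasks in a wave start together; waves run back to back.
    waves = math.ceil(num_tasks / slots)
    return waves * task_time

def serialized_time(num_tasks, slots, task_time, launch_overhead):
    # Serialized model: each launch adds one interval on top of the ideal time.
    waves = math.ceil(num_tasks / slots)
    return waves * task_time + num_tasks * launch_overhead

# Hypothetical numbers: 256 map tasks, 128 slots, 60 s per task, 0.5 s per launch.
print(ideal_time(256, 128, 60))            # 120
print(serialized_time(256, 128, 60, 0.5))  # 248.0
```

Even a half-second launch cost per task roughly doubles the job time here, which matches the slide's point: on a 128-thread node the startup term can dominate.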
Distributed Performance Data Collection
Created a set of scripts to facilitate distributed execution for performance data collection and analysis
> Based on traditional single-node system analysis tools: mpstat, nicstat, iostat, vmstat, ...
> Variable sampling frequency to monitor hardware utilization
Pinpoint which resource is a bottleneck at any point
> CPU utilization, network, disk I/O
> Periods where no resource is fully utilized may indicate a poorly tuned Hadoop configuration or other system issues
Hadoop log processing to monitor the Hadoop task timeline
> Examine startup rate, Hadoop phase overlap
Scripts and details are available here: http://blogs.sun.com/jgebis/
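As an illustration of the kind of post-processing such scripts do (the column layout below is a simplified stand-in for mpstat output; the real scripts are at the blog above), a sketch that averages the %idle column to flag under-utilized periods:

```python
def parse_idle(mpstat_lines):
    # mpstat's last column is %idle; skip the header row (which ends in "idl").
    samples = []
    for line in mpstat_lines:
        fields = line.split()
        if fields and fields[-1] != "idl":
            samples.append(int(fields[-1]))
    return samples

def underutilized(idle_samples, threshold=40):
    # Flag a sampling period whose mean idle time exceeds the threshold,
    # i.e. mean CPU utilization is below (100 - threshold)%.
    mean_idle = sum(idle_samples) / len(idle_samples)
    return mean_idle > threshold

lines = [
    "CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl",
    "0 12 0 3 510 200 120 9 4 2 0 900 55 5 0 40",
    "1 10 0 2 480 190 110 8 3 1 0 850 20 4 0 76",
]
idle = parse_idle(lines)
print(idle, underutilized(idle))  # [40, 76] True
```

Correlating flagged periods with nicstat and iostat samples from the same timestamps is what lets the scripts say whether CPU, network, or disk is the bottleneck at each point.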
Serialized Task Launching Overhead
30GB sort on a single T5240 node (128 threads, 128GB RAM, 16 disks)
[Chart: time (min) for mapping, shuffling, reducing vs. (#map, #reduce) settings]
<60% CPU utilization
Significant launching overhead limits scalability
10-Node 150G Sort Task Timeline
Detailed look: one T2 blade (64 threads)
10-Node 150G Sort Utilization Stats
Intra-node Virtualization: Logical Domains (LDOMs)
Hardware-assisted virtualization
> Single hypervisor
> OS-level isolation
> Dedicated H/W threads and memory
[Diagram: Logical Domain 0 runs the Job Tracker and Name Node; Logical Domains 1..N each run a Task Tracker and Data Node; all domains sit on the hypervisor]
Example LDOMs Configuration
Single control domain
> Virtual disk server (vds)
> Virtual network switch (vsw)
> Virtual console concentrator (vcc)
Multiple logical domains:
  ldm add-vcpu 8 ldom0                    (cpu)
  ldm add-memory 16G ldom0                (memory)
  ldm add-vdisk vdisk0 control-vds ldom0  (disk)
  ldm add-vnet vnet0 control-vsw ldom0    (network)
  ldm bind ldom0                          (bind)
  ldm start ldom0                         (boot)
> OS install as usual
Intra-node Virtualization: Zones (Containers)
Software (OS) virtualization
> Single operating system
> Application-level isolation
> No dedicated H/W threads or memory
[Diagram: Zone 0 runs the Job Tracker and Name Node; Zones 1..N each run a Task Tracker and Data Node; all zones share one OS instance]
Example Zones Configuration
Create zones:
  zonecfg -z zone0 -f zone0.config
zone0.config:
  create
  add net; set physical=interface; set address=ip; ...
  add fs; set dir=mount_path; set raw=partition; ...
Zone administration:
  zoneadm -z zone0 boot  (boot)
  zoneadm list           (list)
  zoneadm -z zone0 halt  (halt)
Example 4-LDOM Setup
Evenly distributing H/W resources
* LDOM/Zone administration scripts and details available here: http://blogs.sun.com/jangwook/
Scaling Hadoop with Intra-node Virtualization
30GB sort on a single T5240 node (128 threads, 128GB RAM, 16 disks)
[Chart: time (min) for mapping, shuffling, reducing vs. (#map, #reduce, #virtual nodes) settings]
~100% CPU utilization with 4 logical domains
Scaling Sorting Workload (Without Virtualization)
Large data sorting performance
Sun Blade 6000: 10 nodes, 640 threads, 64GB RAM/node, 4 disks/node
[Chart: time (min) vs. data size]
CMT Hadoop systems scale nicely with larger datasets
E-mail Discovery Overview
Preparing data for searching over a large email corpus
Five phases with different MapReduce profiles:
1. PipelineMapReduce: reads and parses 27GB of raw emails
2. DocumentSeqFileToMapFile: prepares a MapFile to retrieve data
3. PersonNormalization: groups data into unique entities
4. Consumer: creates indices
5. ThreadDetection: detects conversation threads
Output is a set of shards used in an E-mail discovery search application
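The five phases run as a chained pipeline, each consuming the previous phase's output. A minimal driver sketch makes the shape clear (the lambda bodies here are hypothetical placeholders standing in for the real MapReduce jobs, not their actual logic):

```python
def run_pipeline(raw_emails, phases):
    # Each phase is a (name, function) pair consuming the prior phase's output.
    data = raw_emails
    for name, phase in phases:
        data = phase(data)
    return data  # final shards for the search application

# Placeholder phase implementations, for illustration only.
phases = [
    ("PipelineMapReduce",        lambda raw: [e.lower() for e in raw]),
    ("DocumentSeqFileToMapFile", lambda docs: dict(enumerate(docs))),
    ("PersonNormalization",      lambda m: m),
    ("Consumer",                 lambda m: m),
    ("ThreadDetection",          lambda m: list(m.values())),
]
shards = run_pipeline(["Hello Bob", "Re: Hello Bob"], phases)
print(shards)  # ['hello bob', 're: hello bob']
```

Because the phases have different MapReduce profiles (parse-heavy, index-heavy, and so on), no single cluster tuning is optimal for all five, which is what makes this a useful end-to-end benchmark.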
E-mail Discovery (http://www.it-discovery.com/)
E-Discovery Results
Email processing performance
[Chart: time (min) for 1 node/128 threads, 1 node/256 threads, 10 nodes/640 threads, and 15 nodes/60 EC2 units]
CMT Hadoop systems scale for throughput applications
Performance / 40U Rack
Email processing performance normalized to a 40U rack
[Chart: relative performance: 1.0 (40 nodes, 4 EC2 units/node), 2.0X (40 nodes, 64 threads/node), 3.1X (20 nodes, 128 threads/node), 4.6X (5 nodes, 256 threads/node)]
High performance with smaller datacenter footprint
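Normalizing to a 40U rack means scaling each configuration's throughput by how many nodes fit in one rack. A sketch of the arithmetic, with hypothetical node counts and job times chosen only for illustration (not the measured results on this slide):

```python
def rack_relative_perf(configs, baseline):
    # Rack throughput = nodes per rack / per-node job time;
    # report each configuration relative to the baseline.
    def rack_throughput(c):
        return c["nodes_per_rack"] / c["job_time_min"]
    base = rack_throughput(configs[baseline])
    return {name: round(rack_throughput(c) / base, 1)
            for name, c in configs.items()}

# Hypothetical inputs for illustration only.
configs = {
    "commodity": {"nodes_per_rack": 40, "job_time_min": 40.0},
    "cmt":       {"nodes_per_rack": 5,  "job_time_min": 1.25},
}
rel = rack_relative_perf(configs, "commodity")
```

The dense CMT configuration wins per rack even with far fewer nodes because each node finishes the job so much faster, which is the footprint argument the slide makes.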
MySQL Enterprise Solution
Enterprise software and services delivered as an annual subscription
Subscription: MySQL Enterprise
> Software: most up-to-date MySQL software, monthly rapid updates, quarterly service packs, hot-fix program, indemnification
> Database Monitoring: virtual database assistant, global monitoring of all servers, web-based central console, built-in advisors, expert advice, problem query detection/analysis
> Support: online self-help, MySQL Knowledge Base, 24/7 problem resolution with priority escalation, consultative help
License (OEM): Embedded Server
Also: MySQL Cluster Carrier-Grade, Training, Consulting, NRE, High-Availability and Scale-Out
Conclusions
Hadoop and Java scale well on CMT systems
Startup cost dominates performance on highly threaded systems (256 threads per node)
Virtualization techniques enable good scalability, high system utilization and better performance
> Parallelized startup
> Less external node-to-node Ethernet traffic
Hadoop consolidation on CMT systems reduces datacenter footprint, power and cooling costs
Next-gen Hadoop focuses on performance and latency
Software Stack, Pointers to Download
Sun CMT servers
> http://www.sun.com/servers/coolthreads/overview/index.jsp
Hadoop 0.20.0
> http://hadoop.apache.org
JVM from Sun 1.6.0_13
> http://www.java.sun.com
OpenSolaris for SPARC 2009.06
> http://www.opensolaris.org
LDOMs 1.1
> http://opensolaris.org/os/community/ldoms
Learn More (Free)
> Using LDoms and CoolThreads Technology: Improving Scalability and Utilization
> Improving Database Scalability on T5440 Blueprint
> Deploying Web 2.0 Applications on Sun Servers and the OpenSolaris Operating System
> Tech Resources tab at sun.com/mysqlsystems
Try It Yourself
> Try free for 60 days: Sun Enterprise SPARC rack or blade systems and storage
> Test Hadoop on up to 128 threads
> 60 days to decide to buy
> Return and pay nothing (not even shipping) if you don't
> sun.com/tryandbuy
Scaling Hadoop for Multi-Core and Highly Threaded Systems
Jangwoo Kim (jangwoo.kim@sun.com)
Zoran Radovic (zoran.radovic@sun.com)
Denis Sheahan (denis.sheahan@sun.com)
Joseph Gebis (joseph.gebis@sun.com)
This is an extended version of our Hadoop Summit '09 presentation, Santa Clara, CA, June 2009
http://developer.yahoo.com/events/hadoopsummit09