
Applying performance models to understand data-intensive computing efficiency

Elie Krevat, Tomer Shiran, Eric Anderson, Joseph Tucek, Jay J. Wylie, Gregory R. Ganger
Carnegie Mellon University / HP Labs

CMU-PDL-10-108
May 2010

Parallel Data Laboratory
Carnegie Mellon University
Pittsburgh, PA 15213-3890

Abstract

New programming frameworks for scale-out parallel analysis, such as MapReduce and Hadoop, have become a cornerstone for exploiting large datasets. However, there has been little analysis of how these systems perform relative to the capabilities of the hardware on which they run. This paper describes a simple analytical model that predicts the optimal performance of a parallel dataflow system. The model exposes the inefficiency of popular scale-out systems, which take 3-13x longer to complete jobs than the hardware should allow, even in well-tuned systems used to achieve record-breaking benchmark results. To validate the sanity of our model, we present small-scale experiments with Hadoop and a simplified dataflow processing tool called Parallel DataSeries. Parallel DataSeries achieves performance close to the analytic optimal, showing that the model is realistic and that large improvements in the efficiency of parallel analytics are possible.

Acknowledgements: We thank the members and companies of the PDL Consortium (including APC, Data Domain, EMC, Facebook, Google, Hewlett-Packard Labs, Hitachi, IBM, Intel, LSI, Microsoft Research, NEC Laboratories, NetApp, Oracle, Seagate, Sun, Symantec, VMware, and Yahoo! Labs) for their interest, insights, feedback, and support. This research was sponsored in part by an HP Innovation Research Award and by CyLab at Carnegie Mellon University under grant DAAD19-02-1-0389 from the Army Research Office. Elie Krevat is supported in part by an NDSEG Fellowship, which is sponsored by the Department of Defense.


Keywords: data-intensive computing, cloud computing, analytical modeling, Hadoop, MapReduce, performance and efficiency

1 Introduction

Data-intensive scalable computing (DISC) refers to a rapidly growing style of computing characterized by its reliance on huge and growing datasets [7]. Driven by the desire and capability to extract insight from such datasets, data-intensive computing is quickly emerging as a major activity of many organizations. With massive amounts of data arising from such diverse sources as telescope imagery, medical records, online transaction records, and web pages, many researchers are discovering that statistical models extracted from data collections promise major advances in science, health care, business efficiencies, and information access. Indeed, statistical approaches are quickly bypassing expertise-based approaches in terms of efficacy and robustness.

To assist programmers with data-intensive computing, new programming frameworks (e.g., MapReduce [9], Hadoop [1] and Dryad [13]) have been developed. They provide abstractions for specifying data-parallel computations, and they also provide environments for automating the execution of data-parallel programs on large clusters of commodity machines. The map-reduce programming model, in particular, has received a great deal of attention, and several implementations are publicly available [1, 20]. These frameworks can scale jobs to thousands of computers, which is great. However, they currently focus on scalability without concern for efficiency. Worse, anecdotal experiences indicate that they fall far short of fully utilizing hardware resources, effectively wasting large fractions of the computers over which jobs are scaled. If these inefficiencies are real, the same work could (theoretically) be completed at much lower costs.

An ideal approach would provide maximum scalability for a given computation without wasting resources such as the CPU or disk. Given the widespread use and scale of data-intensive computing, it is important that we move toward such an ideal. An important first step is understanding the degree, characteristics, and causes of inefficiency. Unfortunately, little help is currently available. This paper begins to fill the void with a simple model of ideal map-reduce job runtimes and the evaluation of systems relative to it. The model's input parameters describe basic characteristics of the job (e.g., amount of input data, degree of filtering in the map and reduce phases), of the hardware (e.g., per-node disk and network throughputs), and of the framework configuration (e.g., replication factor). The output is the ideal job runtime. An ideal run is hardware-efficient, meaning that the realized throughput matches the maximum throughput for the bottleneck hardware resource, given its usage (i.e., amount of data moved over it). Our model can expose how close or far (currently) a given system is from this ideal. Such throughput will not occur, for example, if the framework does not provide sufficient parallelism to keep the bottleneck resource fully utilized, or it makes poor use of a particular resource (e.g., inflating network traffic). In addition, our model can be used to quantify resources wasted due to imbalance: in an unbalanced system, one resource (e.g., network, disk, or CPU) is under-provisioned relative to others and acts as a bottleneck. The other resources are wasted to the extent that they are over-provisioned and active.

To illustrate these issues, we applied the model to a number of benchmark results (e.g., for the TeraSort and PetaSort benchmarks) touted in the industry. These presumably well-tuned systems achieve runtimes that are 3-13x longer than the ideal model suggests should be possible.
We also report on our own experiments with Hadoop, confirming and partially explaining sources of inefficiency. To confirm that the model's ideal is achievable, we present results from an efficient parallel dataflow system called Parallel DataSeries (PDS). PDS lacks many features of the other frameworks, but its careful engineering and stripped-down feature-set demonstrate that near-ideal hardware-efficiency (within 20%) is possible. In addition to validating the model, PDS provides an interesting foundation for subsequent analyses of the incremental costs associated with features, such as distributed file system functionality, dynamic task distribution, fault tolerance, and task replication.

Data-parallel computation is here to stay, as is scale-out performance. However, we hope that the low efficiency indicated by our model is not. By gaining a better understanding of computational bottlenecks,

and understanding the limits of what is achievable, we hope that our work will lead to improvements in commonly used DISC frameworks.

Figure 1: A map-reduce dataflow.

2 Dataflow parallelism and map-reduce computing

Today's data-intensive computing derives much from earlier work on parallel databases. Broadly speaking, data is read from input files, processed, and stored in output files. The dataflow is organized as a pipeline in which the output of one operator is the input of the following operator. DeWitt and Gray [10] describe two forms of parallelism in such dataflow systems: partitioned parallelism and pipelined parallelism. Partitioned parallelism is achieved by partitioning the data and splitting one operator into many running on different processors. Pipelined parallelism is achieved by streaming the output of one operator into the input of another, so that the two operators can work in series on different data at the same time.

Google's MapReduce(1) [9] offers a simple programming model that facilitates development of scalable parallel applications that process a vast amount of data. Programmers specify a map function that generates values and associated keys from each input data item and a reduce function that describes how all data matching each key should be combined. The runtime system handles details of scheduling, load balancing, and error recovery. Hadoop [1] is an open-source implementation of the map-reduce model.

Figure 1 illustrates the pipeline of a map-reduce computation involving three nodes (computers). The computation is divided into two phases, labeled Phase 1 and Phase 2.

Phase 1: Phase 1 begins with the reading of the input data from disk and ends with the sort operator. It includes the map operators and the exchange of data over the network. The first write operator in Phase 1 stores the output of the map operator. This backup write operator is optional, but used by default in the Google and Hadoop implementations of map-reduce, serving to increase the system's ability to cope with failures or other events that may occur later.

Phase 2: Phase 2 begins with the sort operator and ends with the writing of the output data to disk. In systems that replicate data across multiple nodes, such as the GFS [11] and HDFS [3] distributed file systems used with MapReduce and Hadoop, respectively, the output data must be sent to all other nodes that will store the data on their local disks.

(1) We refer to the programming model as map-reduce and to Google's implementation as MapReduce.
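To make the division of labor concrete, the following is a minimal Python sketch of the map-reduce programming contract described above (our own illustration, not the Hadoop or MapReduce API): the user supplies a map function that emits key/value pairs and a reduce function that combines all values for a key, while the runtime handles grouping.

    # Minimal sketch of the map-reduce programming model (not framework code).
    from collections import defaultdict

    def run_map_reduce(records, map_fn, reduce_fn):
        # "Map" phase: apply map_fn to every record and group the emitted
        # (key, value) pairs by key (the shuffle/sort step).
        groups = defaultdict(list)
        for record in records:
            for key, value in map_fn(record):
                groups[key].append(value)
        # "Reduce" phase: reduce_fn combines all values observed for each key.
        return {key: reduce_fn(key, values) for key, values in groups.items()}

    # Example: a grep-like job that selects matching lines (e_M ~ 0 when few match).
    def map_fn(line):
        if "error" in line:
            yield ("error", line)

    def reduce_fn(key, values):
        return values   # identity reduce: just collect the matches

    if __name__ == "__main__":
        lines = ["ok", "error: disk full", "ok", "error: timeout"]
        print(run_map_reduce(lines, map_fn, reduce_fn))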

Parallelism: In Figure 1, partitioned parallelism takes place on the vertical axis; the input data is split between three nodes, and each operator is, in fact, split into three sub-operators that each run on a different node. Pipelined parallelism takes place on the horizontal axis; each operator within a phase processes data units (e.g., records) as it receives them, rather than waiting for them all to arrive, and passes data units to the next operator as appropriate. The only breaks in pipelined parallelism occur at the boundary between phases. As shown, this boundary is the sort operator. The sort operator can only produce its first output record after it has received all of its input records, since the last input record received might be the first in sorted order.

Quantity of data flow: Figure 1 also illustrates how the amount of data flowing through the system changes throughout the computation. The amount of input data per node is d_i, and the amount of output data per node is d_o. The amount of data per node produced by the map operator and consumed by the reduce operator is d_m. In most applications, the amount of data flowing through the system either remains the same or decreases (i.e., d_i >= d_m >= d_o). In general, the mapper will implement some form of select, filtering out rows, and the reducer will perform aggregation. This reduction in data across the stages can play a key role in the overall performance of the computation. Indeed, Google's MapReduce includes combiner functions to move some of the aggregation work to the map operators and, hence, reduce the amount of data involved in the network exchange [9]. Many map-reduce workloads resemble a grep-like computation, in which the map operator decreases the amount of data (d_i >> d_m and d_m = d_o). In others, such as in a sort, neither the map nor the reduce function decreases the amount of data (d_i = d_m = d_o).

2.1 Related work

Concerns about the performance of map-reduce style systems emerged from the parallel databases community, where similar data processing tasks have been tackled by commercially available systems. In particular, Stonebraker et al. compare Hadoop to a variety of DBMSs and find that Hadoop can be up to 36x slower than a commercial parallel DBMS [25]. In previous work [5], two of the authors of our paper pointed out that many parallel systems (especially map-reduce systems, but also other parallel systems) have focused almost exclusively on absolute throughput and high-end scalability. This focus, as the authors quantify by back-of-the-envelope comparisons, has been at the detriment of other worthwhile metrics. In perhaps the most relevant prior work, Wang et al. use simulation to evaluate how certain design decisions (e.g., network layout and data locality) will affect the performance of Hadoop jobs [27]. Specifically, their MRPerf simulator instantiates fake jobs, which impose fixed times (e.g., job startup) and input-size dependent times (cycles/byte of compute) for the Hadoop parameters under study. The fake jobs generate network traffic (simulated with ns-2) and disk I/O (also simulated). Using execution characteristics accurately measured from small instances of Hadoop jobs, MRPerf accurately predicts (to within 5-12%) the performance of larger clusters. Although simulation techniques like MRPerf are useful for exploring different designs, by relying on measurements of actual behavior (e.g., of Hadoop) such simulations will also emulate any inefficiencies particular to the specific implementation simulated.

3 Performance model

This section presents a model for the runtime of a map-reduce job on a hardware-efficient system.
It includes the model's assumptions, parameters, and equations, along with a description of common workloads.

Assumptions: For a large class of data-intensive workloads, which we assume for our model, computation time is negligible in comparison to I/O speeds. Among others, this assumption holds for grep- and sort-like jobs, such as those described by Dean and Ghemawat [9] as being representative of most MapReduce jobs at Google, but may not hold in other settings. For workloads fitting the assumption, pipelined

parallelism can allow non-I/O operations to execute entirely in parallel with I/O operations, such that overall throughput for each phase will be determined by the I/O resource (network or storage) with the lowest effective throughput. For modeling purposes, we also do not consider specific network topologies or technologies, and we assume that the network core is over-provisioned enough that the internal network topology does not impact the speeds of inter-node data transfers. From our experience, unlimited backplane bandwidth without any performance degradation is probably impractical, although it was not an issue for our experiments and we currently have no evidence for it causing issues on the other large clusters which we analyze in Section 8. The model assumes that input data is evenly distributed across all participating nodes in the cluster, that nodes are homogeneous, and that each node retrieves its initial input from local storage. Most map-reduce systems are designed to fit these assumptions. The model also accounts for output data replication, assuming the common strategy of storing the first replica on the local disks and sending the others over the network to other nodes. Finally, another important assumption is that a single job has full access to the cluster at a time, with no competing jobs or other activities. Production map-reduce clusters may be shared by more than one simultaneous job, but understanding a single job's performance is a useful starting point.

Deriving the model from I/O operations: Table 1 identifies the I/O operations in each map-reduce phase for two variants of the sort operator. When the data fits in memory, a fast in-memory sort can be used. When it does not fit, an external sort is used, which involves sorting each batch of data in memory, writing it out to disk, and then reading and merging the sorted batches into one sorted stream. The (1 - 1/n) d_m term appears in the equation, where n is the number of nodes, because in a well-balanced system each node partitions and transfers that fraction of its mapped data over the network, keeping 1/n of the data for itself.

Table 1: I/O operations in a map-reduce job. The first disk write in Phase 1 is an optional backup to protect against failures.

                 d_m < memory (in-memory sort)     d_m >= memory (external sort)
    Phase 1      Disk read (input): d_i            Disk read (input): d_i
                 Disk write (backup): d_m          Disk write (backup): d_m
                 Network: (1 - 1/n) d_m            Network: (1 - 1/n) d_m
                                                   Disk write (sort): d_m
    Phase 2      Network: (r - 1) d_o              Disk read (sort): d_m
                 Disk write (output): r d_o        Network: (r - 1) d_o
                                                   Disk write (output): r d_o

Table 2 lists the I/O speed and workload property parameters of the model. They include amounts of data flowing through the system, which can be expressed either in absolute terms (d_i, d_m, and d_o) or in terms of the ratios of the map and reduce operators' output and input (e_M and e_R, respectively). Table 3 gives the model equations for the execution time of a map-reduce job in each of four scenarios, representing the cross-product of the Phase 1 backup write option (yes or no) and the sort type (in-memory or external). In each case, the per-byte time to complete each phase (map and reduce) is determined, summed, and multiplied by the number of input bytes per node (i/n). The per-byte value for each phase is the larger (max) of that phase's per-byte disk time and per-byte network time. Using the last row (external sort, with backup write) as an example, the map phase includes three disk transfers and one network transfer: reading each input byte (1/D_r), writing the e_M map output bytes to disk (the backup write; e_M/D_w), writing e_M bytes as part of the external sort (e_M/D_w), and sending (1 - 1/n) of the e_M map output bytes over the network ((1 - 1/n) e_M / N) to other reduce nodes. The corresponding reduce phase includes two disk transfers and one network transfer: reading sorted batches (e_M/D_r), writing the e_M e_R reduce output bytes produced locally and the (r - 1) e_M e_R bytes replicated from other nodes (r e_M e_R / D_w), and sending the e_M e_R bytes produced locally to (r - 1) other nodes ((r - 1) e_M e_R / N). Putting all of this together produces the equation shown.

Table 2: Modeling parameters that include I/O speeds and workload properties.

    Symbol                  Definition
    n                       The number of nodes in the cluster.
    D_w                     The aggregate disk write throughput of a single node. A node with four disks, where each disk provides 65 MB/s writes, would have D_w = 260 MB/s.
    D_r                     The aggregate disk read throughput of a single node.
    N                       The network throughput of a single node.
    r                       The replication factor used for the job's output data. If no replication is used, r = 1.
    i                       The total amount of input data for a given computation.
    d_i = i/n               The amount of input data per node, for a given computation.
    d_m = (i/n) e_M         The amount of data per node after the map operator, for a given computation.
    d_o = (i/n) e_M e_R     The amount of output data per node, for a given computation.
    e_M = d_m / d_i         The ratio between the map operator's output and its input.
    e_R = d_o / d_m         The ratio between the reduce operator's output and its input.

Table 3: Model equations for the execution time of a map-reduce computation on a parallel dataflow system.

    d_m < memory (in-memory sort):
      Without backup write:  t = (i/n) * ( max{ 1/D_r,              (1 - 1/n) e_M / N } + max{ r e_M e_R / D_w,            (r - 1) e_M e_R / N } )
      With backup write:     t = (i/n) * ( max{ 1/D_r + e_M/D_w,    (1 - 1/n) e_M / N } + max{ r e_M e_R / D_w,            (r - 1) e_M e_R / N } )

    d_m >= memory (external sort):
      Without backup write:  t = (i/n) * ( max{ 1/D_r + e_M/D_w,    (1 - 1/n) e_M / N } + max{ e_M/D_r + r e_M e_R / D_w,  (r - 1) e_M e_R / N } )
      With backup write:     t = (i/n) * ( max{ 1/D_r + 2 e_M/D_w,  (1 - 1/n) e_M / N } + max{ e_M/D_r + r e_M e_R / D_w,  (r - 1) e_M e_R / N } )
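To make the Table 3 equations easy to apply, here is a minimal Python sketch of the model (our own code, not part of any framework; the function name model_runtime and its keyword arguments are ours, and parameter names follow Table 2). It is reused in the worked checks of the published benchmarks below.

    # Sketch of the Table 3 runtime model (in-memory vs. external sort, with or
    # without the Phase 1 backup write). Units: data in MB, throughputs in MB/s.

    def phase_time(per_mb_disk, per_mb_net):
        # Each phase is limited by its slower resource (disk or network).
        return max(per_mb_disk, per_mb_net)

    def model_runtime(i, n, Dr, Dw, N, r=1, eM=1.0, eR=1.0,
                      external_sort=False, backup_write=False):
        """Optimal map-reduce job runtime in seconds, per Table 3."""
        d_i = i / n                       # input data per node
        # Phase 1 (map): read input, optional backup write, optional sort spill,
        # and the network shuffle of (1 - 1/n) of the map output.
        disk1 = 1.0 / Dr
        if backup_write:
            disk1 += eM / Dw
        if external_sort:
            disk1 += eM / Dw
        net1 = (1.0 - 1.0 / n) * eM / N
        # Phase 2 (reduce): optional read of sorted runs, write r copies of the
        # output, and send (r - 1) replicas of locally produced output.
        disk2 = r * eM * eR / Dw
        if external_sort:
            disk2 += eM / Dr
        net2 = (r - 1) * eM * eR / N
        return d_i * (phase_time(disk1, net1) + phase_time(disk2, net2))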

Applying the model to common workloads: Many workloads benefit from a parallel dataflow system because they run on massive datasets, either extracting and processing a small amount of interesting data or shuffling data from one representation to another. We focus on parallel sort and grep in analyzing systems and validating our model, which Dean and Ghemawat [9] indicate are representative of most programs written by users of Google's MapReduce. For a grep-like job that selects a very small fraction of the input data, e_M ≈ 0 and e_R = 1, meaning that only a negligible amount of data is (optionally) written to the backup files, sent over the network, and written to the output files. Thus, the best-case runtime is determined by the initial input disk reads:

    t_grep = i / (n D_r)    (1)

A sort workload maintains the same amount of data in both the map and reduce phases, so e_M = e_R = 1. If the amount of data per node is small enough to accommodate an in-memory sort and not warrant a Phase 1 backup, the top equation of Table 3 is used, simplifying to:

    t_sort = (i/n) * ( max{ 1/D_r, (1 - 1/n)/N } + max{ r/D_w, (r - 1)/N } )    (2)

Determining input parameters for the model: Appropriate parameter values are a crucial aspect of model accuracy, whether using the model to evaluate how well a production system is performing or to determine what should be expected from a hypothetical system. The n and r parameters are system configuration choices that can be applied directly in the model for both production and hypothetical systems. The amounts of data flowing through various operators (d_i, d_m, or d_o) depend upon the characteristics of the map and reduce operators and of the data itself. For a production system, they can be measured and then plugged into a model that evaluates the performance of a given workload run on that system. For a hypothetical system, or if actual system measurements are not available, some estimates must be used, such as d_i = d_m = d_o for sort or d_m = d_o = 0 for grep. The determination of which equation to use, based on the backup write option and sort type choices, is also largely dependent on the workload characteristics, but in combination with system characteristics. Specifically, the sort type choice depends on the relationship between d_m and the amount of main memory available for the sort operator. The backup write option is a softer choice, worthy of further study, involving the time to do a backup write (d_m / D_w), the total execution time of the job, and the likelihood of a node failure during the job's execution. Both Hadoop and Google's MapReduce always do the backup write, at least to the local file system cache.

The appropriate values for I/O speed depend on what is being evaluated. For both production and hypothetical systems, specification values for the hardware can be used (for example, 1 Gbps for the network and the maximum streaming bandwidth specified for the given disks). This approach is appropriate for evaluating the efficiency of the entire software stack, from the operating system up. However, if the focus is on the programming framework, using raw hardware specifications can indicate greater inefficiency than is actually present. In particular, some efficiency is generally lost in the underlying operating system's conversion of raw disk and network resources into higher level abstractions, such as file systems and network sockets. To focus attention on programming framework inefficiencies, one should use measurements of the disk and network bandwidths available to applications using the abstractions. As shown in our experiments, such measured values are lower than specified values and often have non-trivial characteristics, such as dependence on file system age or network communication patterns.

4 Existing data-intensive computing systems are far from optimal

Our model indicates that, though they may scale beautifully, popular data-intensive computing systems leave a lot to be desired in terms of efficiency. Figure 2 compares optimal times, as predicted by the model, to reported measurements of a few benchmark landmarks touted in the literature, presumably on well-tuned instances of the programming frameworks utilized. These results indicate that far more machines and disks are often employed than would be needed if the systems were hardware-efficient. The remainder of this section describes the systems and benchmarks represented in Figure 2.

Hadoop TeraSort: In April 2009, Hadoop set a new record [18] for sorting 1 TB of data in the Sort Benchmark [17] format. The setup had the following parameters: i = 1 TB, r = 1, n = 1460, D = 4 disks × 65 MB/s/disk = 260 MB/s, N = 110 MB/s, d_m = i/n = 685 MB.
With only 685 MB per node, the data can be sorted by the individual nodes in memory. A Phase 1 backup write is not needed, given the short runtime. Equation 2 gives a best-case runtime of 8.86 seconds. After fine-tuning the system for this specific benchmark, Yahoo! achieved 62 seconds, 7x slower. An optimal system using the same hardware would achieve better throughput with 209 nodes (instead of 1460).
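As a sanity check, plugging these parameters into the model_runtime sketch above (our own code; units in MB and MB/s) reproduces the 8.86-second figure:

    # Hadoop TeraSort (April 2009): i = 1 TB, n = 1460, Dr = Dw = 260 MB/s,
    # N = 110 MB/s, r = 1, in-memory sort, no backup write.
    t = model_runtime(i=1e6, n=1460, Dr=260, Dw=260, N=110, r=1)
    print(round(t, 2))   # ~8.86 seconds; the reported run took 62 s (~7x slower)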

Figure 2: Published benchmarks of popular parallel dataflow systems. Each bar represents the reported throughput relative to the ideal throughput indicated by our performance model, parameterized according to a cluster's hardware.

MapReduce TeraSort: In November 2008, Google reported TeraSort results for 1000 nodes with 12 disks per node [8]. The following parameters were used: i = 1 TB, r = 1, n = 1000, D = 12 × 65 = 780 MB/s, N = 110 MB/s, d_m = i/n = 1000 MB. Equation 2 gives a best-case runtime of 10.4 seconds. Google achieved 68 seconds, over 6x slower. An optimal system using the same hardware would achieve better throughput with 153 nodes (instead of 1000).

MapReduce PetaSort: Google's PetaSort experiment [8] is similar to TeraSort, with three differences: (1) an external sort is required with a larger amount of data per node (d_m = 250 GB), (2) output was stored on GFS with three-way replication, (3) a Phase 1 backup write is justified by the longer runtimes. In fact, Google ran the experiment multiple times, and at least one disk failed during each execution. The setup is described as follows: i = 1 PB, r = 3, n = 4000, D = 12 × 65 = 780 MB/s, N = 110 MB/s, d_m = i/n = 250 GB. The bottom cell of Table 3 gives a best-case runtime of 6818 seconds. Google achieved 21,720 seconds, approximately 3.2x slower. An optimal system using the same hardware would achieve better throughput with 1256 nodes (instead of 4000). Also, according to our model, for the purpose of sort-like computations, Google's nodes are over-provisioned with disks. In an optimal system, the network would be the bottleneck even if each node had only 6 disks instead of 12.

Hadoop PetaSort: Yahoo!'s PetaSort experiment [18] is similar to Google's, with one difference: the output was stored on HDFS with two-way replication. The setup is described as follows: i = 1 PB, r = 2, n = 3658, D = 4 × 65 = 260 MB/s, N = 110 MB/s, d_m = i/n = 273 GB. The bottom cell of Table 3 gives a best-case runtime of 6308 seconds. Yahoo! achieved 58,500 seconds, about 9.3x slower. An optimal system using the same hardware would achieve better throughput with 400 nodes (instead of 3658).

MapReduce Grep: The original MapReduce paper [9] described a distributed grep computation that was executed on MapReduce. The setup is described as follows: i = 1 TB, n = 1800, D = 2 × 40 = 80 MB/s, N = 110 MB/s, d_m = 9.2 MB, e_M = 9.2/1,000,000 ≈ 0, e_R = 1. The paper does not specify the throughput of the disks, so we used 40 MB/s, conservatively estimated based on disks of the timeframe (2004). Equation 1 gives a best-case runtime of 6.94 seconds. Google achieved 150 seconds including startup overhead, or 90 seconds without that overhead, still about 13x slower. An optimal system using the same hardware would achieve better throughput with 139 nodes (instead of 1800). The 60-second startup time experienced by MapReduce on a cluster of 1800 nodes would also have been much shorter on a cluster of 139 nodes.
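The other benchmark figures quoted above fall out of the same model_runtime sketch; for example, the Google PetaSort and MapReduce grep bounds:

    # Google PetaSort: external sort with backup write, three-way replication.
    t_peta = model_runtime(i=1e9, n=4000, Dr=780, Dw=780, N=110, r=3,
                           external_sort=True, backup_write=True)
    print(round(t_peta))     # ~6818 s vs. the reported 21,720 s (~3.2x slower)

    # MapReduce grep (Equation 1): only the input read matters since e_M ~ 0.
    t_grep = 1e6 / (1800 * 80)
    print(round(t_grep, 2))  # ~6.94 s vs. ~90 s without startup overhead (~13x)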

5 Exploring the efficiency of data-intensive computing

The model indicates that there is substantial inefficiency in popular data-intensive computing systems. The remainder of the paper reports and analyzes results of experiments exploring such inefficiency. This section describes our cluster and quantifies efficiency lost to OS functionality. Section 6 confirms the Hadoop inefficiency indicated in the benchmark analyses, and Section 7 uses a stripped-down framework to validate that the model's optimal runtimes can be approached. Section 8 discusses these results and ties together our observations of the sources of inefficiency with opportunities for future work in this area.

Experimental cluster: Our experiments used 1-25 nodes of a cluster. Each node is configured with two quad-core Intel Xeon E5430 processors, four 1 TB Seagate Barracuda ES.2 SATA drives, 16 GB of RAM, and a Gigabit Ethernet link to a Force10 switch. The I/O speeds indicated by the hardware specifications are N = 1 Gbps and D_r = D_w = 108 MB/s (for the outer-most disk zone). All machines run the Linux 2.6.24 Xen kernel, but none of our experiments were run in virtual machines; they were all run directly on domain zero. The kernel's default TCP implementation (TCP NewReno using up to 1500 byte packets) was used. Except where otherwise noted, the XFS file system was used to manage a single one of the disks for every node in our experiments.

Disk bandwidth for applications: For sufficiently large or sequential disk transfers, seek times have a negligible effect on performance; raw disk bandwidth approaches the maximum transfer rate to/from the disk media, which is dictated by the disk's rotation speed and data-per-track values [21]. For modern disks, "sufficiently large" is on the order of 8 MB [26]. Most applications do not access the raw disk, instead accessing the disk indirectly via a file system. Using the raw disk, we observe 108 MB/s, which is in line with the specifications for our disks. Nearly the same bandwidth (within 1%) can be achieved for large sequential file reads on ext3 and XFS file systems. For writes, our measurements indicate more interesting behavior. Using the dd utility with the sync option, a 64 MB block size, and input from the /dev/zero pseudo-device, we observe steady-state write bandwidths of 84 MB/s and 102 MB/s, respectively (for ext3 and XFS). When writing an amount of data less than or close to the file system cache size, the reported bandwidth is up to another 10% lower, since the file system does not start writing the data to disk immediately; that is, disk writing is not occurring during the early portion of the utility runtime. This difference between read and write bandwidths causes us to use two values (D_r and D_w) in the model; our original model used one value for both. The difference is not due to the underlying disks, which have the same media transfer rate for both reads and writes. Rather, it is caused by file system decisions regarding coalescing and ordering of write-backs, including the need to update metadata. XFS and ext3 both maintain a write-ahead log for data consistency, which also induces some overhead on new data writes. ext3's relatively higher write penalty is likely caused by its block allocator, which allocates one 4 KB block at a time, in contrast to XFS's variable-length extent-based allocator.(2)

The 108 MB/s value, and the dd measurements discussed above, are for the first disk zone. Modern disks have multiple zones, each with a different data-per-track value and, thus, media transfer rate [22].
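As a rough illustration of the kind of streaming-write measurement described above, the following is our own sketch, not the exact dd/XFS methodology from the text; the path, file size, and block size are placeholders.

    # Write a large file sequentially, flush it to disk, and report MB/s.
    import os, time

    def measure_write_bandwidth(path, total_mb=4096, block_mb=64):
        block = b"\0" * (block_mb * 1024 * 1024)
        start = time.time()
        with open(path, "wb") as f:
            for _ in range(total_mb // block_mb):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())   # include the time to push data to disk
        return total_mb / (time.time() - start)

    # Example (hypothetical mount point):
    # print(measure_write_bandwidth("/mnt/data/testfile"))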
When measuring an XFS filesystem on a partition covering the entire disk, read speeds remained consistent at 108 MB/s, but write speeds fluctuated across a range of 92-102 MB/s with an average of 97 MB/s over 10 runs. In reporting optimal values for experiments with our cluster, we use 108 MB/s and 97 MB/s for the disk read and write speeds, respectively.

(2) To address some of these shortcomings, the ext4 file system improves the design and performance of ext3 by adding, among other things, multi-block allocations [16].

Network bandwidth for applications: Although a full-duplex 1 Gbps Ethernet link could theoretically transfer 125 MB/s in each direction, maximum achievable data transfer bandwidths are lower due to unavoidable protocol overheads. Using the iperf tool with the maximum kernel-allowed 256 KB TCP window size, we measured sustained bandwidths between two machines of approximately 112.5 MB/s, which is

in line with expected best-case data bandwidth. However, we observed lower bandwidths with more nodes in the all-to-all pattern used in map-reduce jobs. For example, in a 5-16 node all-to-all network transfer, we observed 102-106 MB/s aggregate node-to-node bandwidths over any one link. These lower values are caused by NewReno's known slow convergence on using full link bandwidths on high-speed networks [14]. Such bandwidth reductions under some communication patterns may make the use of a single network bandwidth (N) inappropriate for some environments. For evaluating data-intensive computing on our cluster, we use a conservative value of N = 110 MB/s. We also ran experiments using the newer CUBIC [12] congestion control algorithm, which is the default on Linux 2.6.26 and is tuned to support high-bandwidth links. It achieved higher throughput (up to 115 MB/s per node with 10 nodes), but exhibited significant unfairness between flows, yielding skews in completion times of up to 86% of the total time. CUBIC's unfairness and stability issues are known and are prompting continuing research toward better algorithms [14].

6 Experiences with Hadoop

We experimented with Hadoop on our cluster to confirm and better understand the inefficiency exposed by our analysis of reported benchmark results.

Tuning Hadoop's settings: Default Hadoop settings fail to use most nodes in a cluster, using only two (total) map tasks and one reduce task. Even increasing those values to use four map and reduce tasks per node, a better number for our cluster, with no replication, still results in lower-than-expected performance. We improved the Hadoop sort performance by an additional 2x by adjusting a number of configuration settings as suggested by Hadoop cluster setup documentation and other sources [2, 24, 19]. Table 4 describes our changes, which include reducing the replication level, increasing block sizes, increasing the numbers of map and reduce tasks per node, and increasing heap and buffer sizes. Interestingly, we found that speculative execution did not improve performance for our cluster. Occasional map task failures and lagging nodes can and do occur, especially when running over more nodes. However, they are less common for our smaller cluster size (one namenode and 1-25 slave nodes), and surprisingly they had little effect on the overall performance when they did occur. When using speculative execution, it is generally advised to set the number of total reduce tasks to 95-99% of the cluster's reduce capacity to allow for a node to fail and still finish execution in a single wave. Since failures are less of an issue for our experiments, we optimized for the failure-free case and chose enough map and reduce tasks for each job to fill every machine at 100% capacity.

Sort measurements and comparison to the model: Figure 3 shows sort results for different numbers of nodes using our tuned Hadoop configuration. Each measurement sorts 4 GB of data per node (up to 100 GB total over 25 nodes). Random 100-byte input records were generated with the TeraGen program, spread across active nodes via HDFS, and sorted with the standard TeraSort Hadoop program. Before every sort, the buffer cache was flushed with sync to prevent previously cached writes from interfering with the measurement. Additionally, the buffer cache was dropped from the kernel to force disk read operations for the input data. The sorted output is written to the file system, but not synced to disk before completion is reported; thus, the reported results are a conservative reflection of actual Hadoop sort execution times. The results confirm that Hadoop scales well, since the average runtime only increases 6% (14 seconds) from 1 node up to 25 nodes (as the workload increases in proportion).
For comparison, we also include the optimal sort times in Figure 3, calculated from our performance model. The model's optimal values reveal a large constant inefficiency for the tuned Hadoop setup: each sort requires 3x the optimal runtime to complete, even without syncing the output data to disk. The 6% higher total runtime at 25 nodes is due to skew in the completion times of the nodes; this is the source of the 9% additional inefficiency at 25 nodes. The inefficiency due to OS abstractions is already accounted for, as discussed in Section 5.

Table 4: Hadoop configuration settings used in our experiments.

    Hadoop Setting                   Default    Tuned     Effect
    Replication level                3          1         The replication level was set to 1 to avoid extra disk writes.
    HDFS block size                  64 MB      128 MB    Larger block sizes in HDFS make large file reads and writes faster, amortizing the overhead for starting each map task.
    Speculative execution            true       false     Failures are uncommon on small clusters; avoid extra work.
    Maximum map tasks per node       2          4         Our nodes can handle more map tasks in parallel.
    Maximum reduce tasks per node    1          4         Our nodes can handle more reduce tasks in parallel.
    Map tasks                        2          4n        For a cluster of n nodes, maximize the map tasks per node.
    Reduce tasks                     1          4n        For a cluster of n nodes, maximize the reduce tasks per node.
    Java VM heap size                200 MB     1 GB      Increase the Java VM heap size for each child task.
    Daemon heap size                 1 GB       2 GB      Increase the heap size for Hadoop daemons.
    Sort buffer memory               100 MB     600 MB    Use more buffer memory when sorting files.
    Sort streams factor              10         30        Merge more streams at once when sorting files.

One potential explanation for part of the inefficiency is that Hadoop uses a backup write for the map output, even though the runtimes are short enough to make it of questionable merit. As shown by the dotted line in Figure 3a, using the model equation with a backup write would yield an optimal runtime that is 39 seconds longer. This would explain approximately 25% of the inefficiency. However, as with the sort output, the backup write is sent to the file system but not synced to disk; with 4 GB of map output per node and 16 GB of memory per node, most of the backup write data may not actually be written to disk during the map phase. It is unclear what fraction of the potential 25% is actually explained by Hadoop's use of a backup write.

Another possible source of inefficiency could be unbalanced distribution of the input data or the reduce data. However, we found that the input data is spread almost evenly across the cluster. Also, the difference between the ideal split of data and what is actually sent to each reduce node is less than 3%. Therefore, the random input generation along with TeraSort's sampling and splitting algorithms is partitioning work evenly, and the workload distribution is not to blame for the loss of efficiency.

Another potential source of inefficiency could be poor scheduling and task assignment by Hadoop. However, Hadoop actually did a good job at scheduling map tasks to run on the nodes that store the data, allowing local disk access (rather than network transfers) for over 95% of the input data. The fact that this value was below 100% is due to skew of completion times, where some nodes finish processing their local tasks a little faster than others and take over some of the load from the slower nodes.

We do not yet have a full explanation for Hadoop's inefficiency. Although we have not been able to verify it in the complex Hadoop code, some of the inefficiency appears to be caused by insufficiently pipelined parallelism between operators, causing serialization of activities (e.g., input read, CPU processing, and network write) that should ideally proceed in parallel. Part of the inefficiency is commonly attributed to CPU overhead induced by Hadoop's Java-based implementation. Of course, Hadoop may also not be using I/O resources at full efficiency. More diagnosis of Hadoop's inefficiency is a topic for continuing research.
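For readers who want to reproduce a similar tuning, the rows of Table 4 roughly correspond to Hadoop 0.20-era configuration properties. The mapping below is our own sketch; the paper does not list property names, so the exact keys and values should be checked against the Hadoop version in use.

    # Best-guess mapping of Table 4 to Hadoop 0.20-era configuration keys;
    # the property names are assumptions, not taken from the paper.
    tuned_settings = {
        "dfs.replication": 1,                               # replication level
        "dfs.block.size": 128 * 1024 * 1024,                # HDFS block size
        "mapred.map.tasks.speculative.execution": False,    # speculative exec.
        "mapred.reduce.tasks.speculative.execution": False,
        "mapred.tasktracker.map.tasks.maximum": 4,          # map slots per node
        "mapred.tasktracker.reduce.tasks.maximum": 4,       # reduce slots per node
        "mapred.child.java.opts": "-Xmx1024m",              # child task heap
        "io.sort.mb": 600,                                  # sort buffer memory
        "io.sort.factor": 30,                               # sort streams factor
    }
    # The daemon heap size (2 GB) would be set separately (e.g., HADOOP_HEAPSIZE
    # in hadoop-env.sh), and the total map/reduce task counts are chosen per job.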

Figure 3: Measured and optimal sort runtimes for a tuned Hadoop cluster. Performance is about 3 times slower than optimal, and 2 times slower than an optimal sort that includes an extra backup write for the map output, which is currently Hadoop's behavior. Hadoop scales well with 4 GB per node up to 25 nodes, but it is inefficient. The measured runtime, optimal calculation, and optimal-with-backup-write calculation are shown in (a), scaling a Hadoop sort benchmark up to 25 nodes; the breakdown of runtime into map and reduce phases is shown in (b).

7 Verifying the model with Parallel DataSeries

The Hadoop results above clearly diverge from the predicted optimal. The large extent to which they diverge, however, brings the accuracy of the model into question. To validate our model, we present Parallel DataSeries (PDS), a data analysis tool that attempts to closely approach the maximum possible throughput.

PDS design: Parallel DataSeries builds on DataSeries, an efficient and flexible data format and runtime library optimized for analyzing structured data [4]. DataSeries files are stored as a sequence of extents, where each extent is a series of records. The records themselves are typed, following a schema defined for each extent. Data is analyzed at the record level, but I/O is performed at the much larger extent level. DataSeries supports passing records in a pipeline fashion through a series of modules. PDS extends DataSeries with modules that support parallelism over multiple cores (intra-node parallelism) and multiple nodes (inter-node parallelism), to support parallel flows across modules as depicted in Figure 4.

Sort evaluation: We built a parallel sort module in PDS that implements a dataflow pattern similar to map-reduce. In Phase 1, data is partitioned and shuffled across the network. As soon as a node receives all data from the shuffle, it exits Phase 1 and begins Phase 2 with a local sort. To generate input data for experiments, we used GenSort, which is the sort benchmark [17] input generator on which TeraGen is based. The GenSort input set is separated into partitions, one for each node. PDS doesn't currently utilize a distributed filesystem, so we manually partition the input, with 40 million records (4 GB) at each node. We converted the GenSort data to DataSeries format without compression, which expands the input by 4%.

We measured PDS to see how closely it performed to the optimal predicted performance on the same cluster used for the Hadoop experiments. Figure 5 presents the equivalent sort task as run for Hadoop. We repeated all experiments 10 times, starting from a cold cache and syncing all data to disk before terminating the measurement. As with the earlier Hadoop measurements, time is broken down into each phase. Furthermore, average per-node times are included for the actual sort, as well as a "stragglers" category that represents the average wait time of a node from the time it completes all its work until the last node involved in the parallel sort also finishes.

Figure 4: Parallel DataSeries is a carefully-tuned parallel runtime library for structured data analysis. Incoming data is queued and passed in a pipeline through a number of modules in parallel (a source module feeding parallel processing modules and output modules).

PDS performed well, within 12-24% of optimal. About 4% of that is the aforementioned input expansion. The sort itself takes a little over 2 seconds, which accounts for another 3% of the overhead. Much of this CPU time could be overlapped with I/O (PDS doesn't currently do so), and it is sufficiently small to justify excluding CPU time from the model. These two factors explain most of the 12% overhead of the single node case, leaving a small amount of natural coordination and runtime overhead in the framework. As the parallel sort is scaled to 25 nodes, besides the additional coordination overhead from code structures that enable partitioning and parallelism, the remaining divergence can be mostly explained by two factors: (1) straggler nodes, and (2) network slowdown effects from many competing transfers. Stragglers (broken out in Figure 5b) can be the result of generally slow (i.e., "bad") nodes, skew in network transfers, or variance in disk write times. The up to 5% observed straggler overhead is reasonable. The network slowdown effects were identified in Section 5 using iperf measurements, and are mostly responsible for the slight time increase starting around 4 nodes. However, even if the effective network goodput speeds were 100 MB/s instead of the 110 MB/s used with the model, that would eliminate only 4% of the additional overhead for our PDS results compared to the predicted optimal time. As more nodes are added at scale, the straggler effects and network slowdowns become more pronounced.

When we originally ran these experiments and inspected the results of the 25 node case, we noticed that 6 of the nodes consistently finished later and were processing about 10% more work than the other 19. It turned out that our data partitioner was using only the first byte of the key to split up the space into 256 bins, so it partitioned the data unevenly for clusters that were not a power of 2. After designing a fairer partitioner that used more bytes of the key, and applying it to the 25 node parallel sort, we were able to bring down the overhead from 30% to 24% (a small sketch of this issue appears at the end of this section).

To see how both the model and PDS react to the network as a bottleneck, we configured our network switches to negotiate 100 Mbps Ethernet. Just as the (1 - 1/n) term in the model predicts increasingly longer sort times which converge in scale as more nodes participate, Figure 6 demonstrates that our actual results with PDS match up very well to that pattern. The PDS sort results vary between 12-27% slower than optimal. For clusters of size 16 and 25, 5% of the time is spent waiting for stragglers. The slow speed of the network amplifies the effects of skew; we observed a few nodes finishing their second phase before the most delayed nodes had received all of their data from the first phase.
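To make the partitioner issue concrete, here is a small self-contained sketch (our own illustration, not PDS code) comparing a 1-byte key prefix, which yields 256 bins that cannot be divided evenly across 25 nodes, with a wider prefix that splits the key space almost evenly.

    # With a 1-byte prefix, 256 bins over 25 nodes give some nodes 11 bins and
    # others 10, i.e., roughly 10% more data on the heavier nodes.
    import collections, os

    def node_for_key(key: bytes, n_nodes: int, key_bytes: int = 1) -> int:
        # Interpret the first `key_bytes` bytes as an integer and scale it into
        # one of n_nodes contiguous ranges of the key space.
        space = 256 ** key_bytes
        prefix = int.from_bytes(key[:key_bytes], "big")
        return prefix * n_nodes // space

    keys = [os.urandom(10) for _ in range(200_000)]   # GenSort-style random keys
    for kb in (1, 4):
        counts = collections.Counter(node_for_key(k, 25, kb) for k in keys)
        skew = max(counts.values()) / (len(keys) / 25)
        print(f"{kb}-byte prefix: busiest node gets {skew:.2f}x its fair share")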

Figure 5: Using Parallel DataSeries to sort up to 100 GB, it is possible to approach within 12-24% of the optimal sort times as predicted by our performance model. PDS scales well for an in-memory sort with 4 GB per node up to 25 nodes in (a), although there is a small time increase starting around 4 nodes due to network effects. Also shown for the 25 node case is the performance of our older, unbalanced partitioner, which had an additional 6% performance overhead from optimal. A breakdown of time in (b) shows that the time increases at scale are mostly in the first phase of a map-reduce dataflow, which includes the network data shuffle, and in the time nodes spend waiting for stragglers due to effects of skew.

8 Discussion

The experiments with PDS demonstrate that our model is not wildly optimistic: it is possible to get close to the optimal runtime. Thus, the inefficiencies indicated for our Hadoop cluster and the published benchmark results are real. We do not have complete explanations for the 3-13x longer runtimes for current data-intensive computing frameworks, but we have identified a number of contributors.

One class of inefficiencies comes from duplication of work or unnecessary use of a bottleneck resource. For example, Hadoop and Google's MapReduce always write Phase 1 map output to the file system, whether or not a backup write is warranted, and then read it from the file system when sending it to the reducer node. This file system activity, which may translate into disk I/O, is unnecessary for completing the job and inappropriate for shorter jobs.

One significant effect faced by map-reduce systems is that a job only completes when the last node finishes its work. For our cluster, we analyzed the penalty induced by such stragglers, finding that it grows to 4% of the runtime for Hadoop over 25 nodes. Thus, it is not the source of most of the inefficiency at that scale. For much larger scale systems, such as the 1000+ node systems used for the benchmark results, this straggler effect is expected to be much more significant; it is possible that this effect explains much of the difference between our measured 3x higher-than-optimal runtimes and the published 6x higher-than-optimal runtime of the Hadoop record-setting TeraSort benchmark. The straggler effect is also why Google's MapReduce and Hadoop dynamically distribute map and reduce tasks among nodes. Support for speculative execution also can help mitigate this effect, although fault tolerance is its primary value. If the straggler effect really is the cause of poor end-to-end performance at scale, then it motivates changes to these new data-parallel systems to examine and adapt the load balancing techniques used in works like River [6] or Flux [23].

It is tempting to blame lack of sufficient bisection bandwidth in the network topology for much of the inefficiency at scale. This would exhibit itself as over-estimation of each node's true network bandwidth, assuming uniform communication patterns, since the model does not account for such a bottleneck. However, this is not an issue for the measured Hadoop results on our small-scale cluster because all nodes are attached across two switches with sufficient backplane bandwidth.

Figure 6: With 100 Mbps Ethernet as the bottleneck resource, a 100 GB sort benchmark on Parallel DataSeries matches up well with the model's prediction and stays within 12-27% of optimal. As more data is sent over the network with larger cluster sizes in (a), both the model and PDS predict longer sort times that eventually converge. A breakdown of time in (b) shows that the predicted and actual time increases occur during the first map-reduce phase, which includes the network data shuffle.

The network topology was not disclosed for most of the published benchmarks, but for many we don't believe bisection bandwidth was an issue. For example, MapReduce grep involves minimal data exchange because e_M ≈ 0. Also, for Hadoop PetaSort, Yahoo! used 91 racks, each with 40 nodes, one switch, and an 8 Gbps connection to a core switch (via 8 trunked 1 Gbps Ethernet links). For this experiment, the average bandwidth per node was 4.7 MB/s. Thus, the average bandwidth per uplink was only 1.48 Gb/s in each direction, well below 8 Gbps. Other benchmarks may have involved a bisection bandwidth limitation, but such an imbalance would have meant that far more machines were used (per rack and overall) than were appropriate for the job, resulting in significant wasted resources.

Naturally, deep instrumentation and analysis of Hadoop will provide more insight into its inefficiency. Also, PDS in particular provides a promising starting point for understanding the sources of inefficiency. For example, replacing the current manual data distribution with a distributed file system is necessary for any useful system. Adding that feature to PDS, which is known to be efficient, would allow one to quantify its incremental cost. The same approach can be taken with other features, such as dynamic task distribution and fault tolerance.

9 Conclusion

Data-intensive computing is an increasingly popular style of computing that is being served by scalable, but inefficient, systems. A simple model of optimal map-reduce job runtimes shows that popular map-reduce systems take 3-13x longer to execute jobs than their hardware resources should allow. With Parallel DataSeries, our simplified dataflow processing tool, we demonstrated that the model's runtimes can be approached, validating the model and confirming the inefficiency of Hadoop and Google's MapReduce. Our model and results highlight and begin to explain the inefficiency of existing systems, providing insight into areas for continued improvements.

References

[1] Apache Hadoop, http://hadoop.apache.org/.

[2] Hadoop Cluster Setup Documentation, http://hadoop.apache.org/common/docs/r0.20.2/cluster_setup.html.

[3] HDFS, http://hadoop.apache.org/core/docs/current/hdfs_design.html.

[4] Eric Anderson, Martin Arlitt, Charles B. Morrey, III, and Alistair Veitch, DataSeries: an efficient, flexible data format for structured serial data, SIGOPS Oper. Syst. Rev. 43 (2009), no. 1, 70-75.

[5] Eric Anderson and Joseph Tucek, Efficiency Matters!, HotStorage '09: Proceedings of the Workshop on Hot Topics in Storage and File Systems (2009).

[6] Remzi H. Arpaci-Dusseau, Eric Anderson, Noah Treuhaft, David E. Culler, Joseph M. Hellerstein, David Patterson, and Kathy Yelick, Cluster I/O with River: making the fast case common, IOPADS '99: Proceedings of the sixth workshop on I/O in parallel and distributed systems (New York, NY, USA), ACM, 1999, pp. 10-22.

[7] Randal E. Bryant, Data-Intensive Supercomputing: The case for DISC, Tech. report, Carnegie Mellon University, 2007.

[8] Grzegorz Czajkowski, Sorting 1PB with MapReduce, November 2008, http://googleblog.blogspot.com/2008/11/sorting-1pb-with-mapreduce.html.

[9] Jeffrey Dean and Sanjay Ghemawat, MapReduce: simplified data processing on large clusters, Communications of the ACM 51 (2008), no. 1, 107-113.

[10] David DeWitt and Jim Gray, Parallel database systems: the future of high performance database systems, Commun. ACM 35 (1992), no. 6, 85-98.

[11] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, The Google file system, SOSP '03: Proceedings of the nineteenth ACM symposium on Operating systems principles (New York, NY, USA), ACM, 2003, pp. 29-43.

[12] Sangtae Ha, Injong Rhee, and Lisong Xu, CUBIC: A new TCP-friendly high-speed TCP variant, SIGOPS Oper. Syst. Rev. 42 (2008), no. 5, 64-74.

[13] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly, Dryad: distributed data-parallel programs from sequential building blocks, EuroSys '07: Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007 (New York, NY, USA), ACM, 2007, pp. 59-72.

[14] Vishnu Konda and Jasleen Kaur, RAPID: Shrinking the Congestion-control Timescale, INFOCOM '09, April 2009, pp. 1-9.

[15] Michael A. Kozuch, Michael P. Ryan, Richard Gass, Steven W. Schlosser, James Cipar, Elie Krevat, Michael Stroucken, Julio López, and Gregory R. Ganger, Tashi: Location-aware Cluster Management, ACDC '09: First Workshop on Automated Control for Datacenters and Clouds, June 2009.

[16] Aneesh Kumar K.V, Mingming Cao, Jose R. Santos, and Andreas Dilger, Ext4 block and inode allocator improvements, Proceedings of the Linux Symposium (2008), 263-274.

[17] Chris Nyberg and Mehul Shah, Sort Benchmark, http://sortbenchmark.org/.

[18] Owen O'Malley and Arun C. Murthy, Winning a 60 Second Dash with a Yellow Elephant, April 2009, http://sortbenchmark.org/yahoo2009.pdf.

[19] Intel White Paper, Optimizing Hadoop Deployments, October 2009.

[20] Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, Gary Bradski, and Christos Kozyrakis, Evaluating MapReduce for Multi-core and Multiprocessor Systems, HPCA '07: Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture (Washington, DC, USA), IEEE Computer Society, 2007, pp. 13-24.

[21] Chris Ruemmler and John Wilkes, An introduction to disk drive modeling, IEEE Computer 27 (1994), 17-28.

[22] Jiri Schindler, John Linwood Griffin, Christopher R. Lumb, and Gregory R. Ganger, Track-aligned Extents: Matching Access Patterns to Disk Drive Characteristics, In proceedings of the 1st USENIX Symposium on File and Storage Technologies, 2002, pp. 259-274.

[23] Mehul A. Shah, Joseph M. Hellerstein, Sirish Chandrasekaran, and Michael J. Franklin, Flux: An Adaptive Partitioning Operator for Continuous Query Systems, International Conference on Data Engineering (2003), p. 25.

[24] Sanjay Sharma, Advanced Hadoop Tuning and Optimisation, December 2009, http://www.slideshare.net/impetusinfo/ppt-on-advanced-hadoop-tuning--optimisation.

[25] Michael Stonebraker, Daniel Abadi, David J. DeWitt, Sam Madden, Erik Paulson, Andrew Pavlo, and Alexander Rasin, MapReduce and Parallel DBMSs: Friends or Foes?, CACM (2010).

[26] Matthew Wachs, Michael Abd-El-Malek, Eno Thereska, and Gregory R. Ganger, Argon: Performance insulation for shared storage servers, In Proceedings of the 5th USENIX Conference on File and Storage Technologies, USENIX Association, 2007, pp. 61-76.

[27] Guanying Wang, Ali R. Butt, Prashant Pandey, and Karan Gupta, A Simulation Approach to Evaluating Design Decisions in MapReduce Setups, 17th IEEE/ACM MASCOTS, September 2009.