SARAH: Statistical Analysis for Resource Allocation in Hadoop

Appeared in: 3rd IEEE Conference on Big Data Science and Engineering (BSDE14), September 2014.


Bruce Martin
Cloudera, Inc.
Palo Alto, California, USA

Abstract: Improving the performance of big data applications requires understanding the size and distribution of the input and intermediate data sets. Obtaining this understanding and then translating it into resource settings is challenging. SARAH provides a set of tools that analyze input and intermediate data sets and recommend configuration settings and performance optimizations. Statistics generated by SARAH are persistently stored, incrementally updated and usable across the several processing frameworks available in Apache Hadoop. In this paper we present the SARAH tool set, describe several Hadoop use cases for utilizing statistics and illustrate the effectiveness of utilizing statistics for balancing the reduce workload of Map-Reduce jobs on web server log file data.

Keywords: big data; statistical analysis; Hadoop; Map-Reduce; performance tuning

I. INTRODUCTION

The performance of big data applications is typically a function of the size and distribution of input, intermediate and output data sets. The Apache Hadoop platform [2] offers developers, system administrators, data scientists and analysts (throughout the paper we refer to the developer, administrator, data scientist and analyst collectively as the user) dozens of configuration parameters to specify the cluster resources needed by a big data application and to influence how the big data application executes. While taking advantage of such flexibility can result in a finely tuned system, the challenge of effectively setting those parameters is great. It requires understanding the size and distribution of input and intermediate data sets and the algorithms of the big data application. It also requires understanding the operation and configuration of Hadoop processing frameworks, the available resources of a given cluster and the overall workload of the cluster.

Consider the problem in the Map-Reduce [8] framework of determining the number of reducers that an application needs and balancing the load of intermediate data across those reducers [10]. In the Map-Reduce and Hive [3] frameworks, the user sets a property to specify the number of reducers. In the Pig [1] and Spark [11] frameworks, the user specifies an optional parameter to commands that typically run in reducers. To come up with a meaningful value, the user needs to understand the size and distribution of records in the intermediate data sets and, given that understanding, have a way to influence the assignment of records to reducers. At best, the informed user understands the intermediate data and can carefully calculate the number of reducers. At worst, the uninformed user accepts system defaults or makes a random guess.

Advanced relational database systems gather and utilize statistics about tables in query optimization [16]. Such systems are closed systems; they offer a single relational model and a single query language, and the storage format is defined by the system. Hadoop, on the other hand, supports unlimited storage formats defined by the user, multiple models and multiple processing frameworks with varying degrees of metadata. The Pig, Hive and Impala frameworks utilize some statistics about a job's data, but the statistics generated in one framework are not usable in the others [14], [15], [18].

Computing statistics for big data sets is expensive.
It makes little sense to spend more time computing statistics than the statistics save through more efficient use of resources. On the other hand, if the statistics are persistently saved, used across subsequent executions of applications and available in multiple processing frameworks, then this cost can be amortized over time. Furthermore, if updates to analyzed data sets only require an incremental statistical analysis cost, then the cost of generating statistics can be amortized over a long time. We view persistence, cross-framework access and incremental update as requirements for any big data environment that gathers and utilizes big data statistics.

SARAH (Statistical Analysis for Resource Allocation in Hadoop) is a test bed we have built to experiment with the generation of statistics about big data and the use of those statistics at runtime. SARAH-generated statistics are used to enhance performance and to help the user set resource properties. Statistics generated by SARAH are persistently saved, incrementally updatable and usable across processing frameworks.

Concretely, SARAH is a set of tools run by users on their data sets for their big data applications. SARAH contains a tool set for each supported processing framework to generate, store and update statistics. The statistics generated by the tool in one framework can be used at runtime in other frameworks. Having systems automatically generate, update and use statistics without user involvement is appealing. SARAH takes a more pragmatic approach, requiring user involvement, but in a high-level, productive fashion.

User input is needed to determine when statistics should be gathered and incrementally updated, and to map those statistics across platforms.

II. HADOOP USE CASES FOR STATISTICAL ANALYSIS

The Hadoop Map-Reduce, Pig, Hive, Impala and Spark frameworks have many configurations that allow or require users to specify resources. Our goal is that SARAH-generated statistics support these and other use cases.

A. Smart Input Split

Hadoop processing frameworks divide the input data set into subsets of records for parallel processing. The default behavior is to split each file into 64 MB blocks and assign each block to a map task. This approach is often adequate because the amount of work each parallel map task does is not sensitive to the distribution of the data, as it is with reducers. Each map task is given 64 MB of data. There is overhead in creating and initializing a task, most notably the overhead of creating and initializing a Java Virtual Machine. If a task has too little work, this overhead dominates and a larger block size is appropriate. Statistical analysis of the cost of applying the map function to the input data can estimate an effective value for the block size. An input data set that consists of many small files, that is, files that are smaller than 64 MB, results in too many small map tasks because the default behavior in this case is to assign one map task to each file. Statistical analysis of the input data set can recognize this.

B. Appropriate Number of Balanced Reducers

Users can specify the number of reducers to use in a Map-Reduce job, including those executed by the Hive framework. Similarly, users of the Pig and Spark frameworks can set an additional parameter in the commands that are usually executed in parallel reducers. Statistical analysis of the intermediate data can estimate the number of reducers. Such analysis needs to take into account the size of the intermediate data and the overhead of creating and initializing a task. Estimating the number of reducers using only the size of the intermediate data is insufficient. Intermediate data is susceptible to data skew. Statistical analysis of the distribution of the intermediate data can break the intermediate data into similarly sized partitions. We describe this use case with SARAH in more detail in Section IV.
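As a point of reference for this use case, the sketch below shows the standard knob for setting the reducer count in a plain Map-Reduce job written in Java; the value used is a placeholder, not a SARAH recommendation. Hive exposes an equivalent session property, Pig a PARALLEL clause on statements such as GROUP and JOIN, and Spark an optional partition-count argument to operations such as reduceByKey.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ReducerCountExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "weblog analysis");

            // The reducer count is fixed per job before it runs; without
            // statistics on the intermediate data the user must guess it.
            job.setNumReduceTasks(5); // placeholder value

            // ... set mapper, reducer, input and output paths as usual ...
        }
    }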
C. Skewed Joins

Joining two large data sets can result in unbalanced parallel reducers if the joined data is skewed [7]. Statistical analysis of both data sets can estimate the number of reducers. Furthermore, by analyzing the distribution of the joined data, multiple reducers can be assigned to popular join keys. This approach requires replicating some of the records across reducers. Pig does this kind of analysis for skewed joins [14]; however, the analysis does not persist, cannot be incrementally updated and is not available across processing frameworks.

D. Combiner Benefit

In the Hadoop processing frameworks, a combiner is a function that is applied to subsets of intermediate data. For large intermediate data, a combiner almost always improves performance and reduces network utilization. When a reduce function is not commutative and associative, the reduce function cannot simply be reused as a combiner. Instead, the user must program a separate function. Statistical analysis of the input and intermediate data can advise on the benefits of coding an additional combiner function.

In the Pig framework, combiners are automatically determined by the execution plan. The Pig framework does not apply combiners when the script invokes a user-defined function, because it treats the function as a black box. Pig does apply combiners, however, if the user code is declared as algebraic and provided as initial, intermediate and final functions [13]. Again, statistical analysis of the input and intermediate data can advise on the benefits of this additional coding.

E. Task Memory Allocation

Hadoop processing frameworks define several properties that specify task memory requirements. These properties are defined prior to executing the job. The properties include a map task's heap size, the size of the map task's in-memory buffer for intermediate data and a reduce task's heap size. Statistical analysis of input and intermediate data can estimate values for these memory specifications. While not required by the Hadoop framework, some reducers buffer all of the values associated with a key. Statistical analysis of intermediate data can estimate an upper bound on the amount of memory such a reduce function requires.

F. Balanced Total Order Sort

The Map-Reduce framework sorts partitions by key. It does not, however, sort across all the partitions. The total order partitioner [6] ensures that the sorted keys in one partition are less than the sorted keys in the next partition, effectively producing a total sort of the data set. The user provides keys that divide the partitions and the partitioner builds the partitions at runtime. The partitions can be unbalanced, since they depend on the keys provided by the user. Statistical analysis of the intermediate data can estimate the distribution of the keys and calculate keys for balanced partitions.

G. Compressed Intermediate Data

Hadoop processing frameworks transmit intermediate data over the network. The user can specify whether this data should be compressed and the compression algorithm that should be used. If the amount of intermediate data is large and the overhead of compressing and decompressing the data is not too great, then compressing the data improves performance. Statistical analysis of intermediate data can estimate whether compression is worthwhile.
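For reference, the properties below control compression of the intermediate (map output) data in the Map-Reduce framework; whether turning them on pays off is exactly what the statistical analysis is meant to estimate. The choice of the Snappy codec here is only an assumption for illustration.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapreduce.Job;

    public class IntermediateCompressionExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Compress map output before it is shuffled over the network.
            conf.setBoolean("mapreduce.map.output.compress", true);
            conf.setClass("mapreduce.map.output.compress.codec",
                          SnappyCodec.class, CompressionCodec.class);

            Job job = Job.getInstance(conf, "compressed shuffle example");
            // ... configure mapper, reducer and paths as usual ...
        }
    }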

H. Parallel Data Transfer

Hadoop processing frameworks transmit intermediate data over the network in parallel. In the Map-Reduce framework, reducers pull sorted intermediate data from multiple mappers and merge the sorted data. A property controls how many streams are received and sorted in parallel. Statistical analysis of the intermediate data can estimate appropriate values for this property.

I. Estimating Cluster Workload

The previous use cases utilize statistical analysis of input, intermediate and output data sets and algorithms to optimize the performance of a single job. Data and algorithm statistics can also be used across jobs and over time. Job and data statistics are useful in expanding a cluster, that is, in determining additional hardware to deploy. The analysis can also be useful in determining service level agreements and scheduling policies.

III. THE SARAH TEST BED

SARAH is a test bed for generating and using cross-framework, persistent and incremental statistics for Hadoop. Some of the SARAH tools generate statistics; other tools produce artifacts from the generated statistics. Some of the artifacts are used at runtime for better resource utilization, others are used to configure a job, others are used when developing and testing software and still others are used for informational purposes.

SARAH tools are framework specific. For Hadoop's Map-Reduce framework, SARAH provides a set of generic Map-Reduce jobs to compute statistics and generate artifacts. For the Pig framework, SARAH provides a set of parameterized Pig scripts. For the Hive and Impala frameworks, SARAH provides a set of parameterized HiveQL scripts. For the Spark framework, SARAH provides a Scala API for computing statistics on intermediate RDDs.

Metadata differ between frameworks. In Hive and Impala, data sets are completely described as tables, and the table definitions are stored in the Hive metastore. In the Map-Reduce, Pig and Spark frameworks, metadata are embedded in programs and incomplete. Such differences necessitate separate tools for generating statistics. SARAH artifact-generating tools are also framework specific because execution costs differ between frameworks and many of the artifacts themselves are framework specific.

Since generating statistics on big data sets is costly, SARAH saves generated statistics persistently. Furthermore, SARAH tracks changes to input data sets, and users can request SARAH to incrementally update the generated statistics. All of the tools represent generated statistics in a common format. Statistics generated by one tool set can be used in another framework. In particular, artifact-generating tools from one framework can use statistics generated in another framework.

A. Input Data Sets

Hadoop processing frameworks typically operate on sets of files stored in HDFS. Hive and Impala equate tables with HDFS directories and support partitioning of tables as subdirectories. Other frameworks are more flexible, allowing the user to define an input data set as an arbitrary set of files. SARAH requires users to name and define data sets. Data sets are either a directory in the file system or an explicitly defined set of files. SARAH then bases incremental updates to generated statistics on the addition or deletion of files in the input data set. SARAH does not track changes within a file, since HDFS files are immutable.

SARAH calculates and stores basic statistics from input data sets, including the number of records in the data set, the size of the data set in bytes, the minimum, maximum and average record size in bytes and the distribution of record sizes.
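A minimal sketch of the kind of map-only pass such a tool can make over an input data set is shown below; it tallies only the record count and total bytes with Hadoop counters, whereas SARAH also derives the minimum, maximum, average and full size distribution. The class and counter names are illustrative, not SARAH's.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Map-only pass over the input data set: every record is read once and
    // simple data-set-level statistics are accumulated in job counters.
    public class InputStatsMapper
            extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

        @Override
        protected void map(LongWritable offset, Text record, Context context)
                throws IOException, InterruptedException {
            long bytes = record.getLength();
            context.getCounter("input-stats", "records").increment(1);
            context.getCounter("input-stats", "bytes").increment(bytes);
            // A record-size distribution could be kept by bucketing the size,
            // e.g. incrementing a counter named "size-bucket-" + (bytes / 128).
        }
    }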
B. Intermediate Data Sets and Functions

An intermediate data set results from applying a function to an input data set. A function is named and realized in different ways in different frameworks. A given named function can have multiple realizations. By naming different realizations of the function with the same name, the user indicates that they produce the same intermediate data set, given the same input data set. In the Map-Reduce framework, a map method in a Java mapper class is an example of a function; it produces an intermediate data set. In Hive and Impala, a simple query that operates on a single record is an example. In Pig, a simple script that operates on a single record is an example. In Spark, a Scala or Java function that is defined on a single record is an example.

Unlike input data sets, intermediate data sets are not necessarily realized as files in HDFS. An intermediate data set in the Map-Reduce framework is initially generated and partitioned at all the mappers and then transmitted over the network to the reducers. It is always a distributed data structure, never stored in a single place. In Spark, an intermediate data set may live only in the cluster-wide cache.

Advanced database systems, Hive [5] and Impala [18] generate so-called column statistics. Such systems are essentially generating statistics on intermediate data sets produced by simple column-selection functions.

An intermediate data set can represent a resource-intensive state of a big data application. It represents the result of applying a function to an input data set. Statistics about the intermediate data set are useful in different resource allocation contexts in different frameworks. For each function applied to an input data set, SARAH calculates and stores basic statistics from the resulting intermediate data set, including the average execution time of the function, the number of intermediate records, the size of the intermediate data set in bytes, the minimum, maximum and average intermediate record size in bytes and the distribution of intermediate records.
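To make the notion of a named function concrete, the mapper below is one possible Map-Reduce realization of a hypothetical months() function over weblog records, of the kind referred to in the discussion of Figure 1 below. The record format and field positions are assumptions made only for this sketch.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // One realization of a hypothetical function named "months": it maps each
    // weblog record to its month, producing an intermediate data set of
    // (month, record) pairs whose size and distribution can then be analyzed.
    public class MonthsMapper extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void map(LongWritable offset, Text record, Context context)
                throws IOException, InterruptedException {
            // Assumed Apache common log format: the timestamp is the field in
            // square brackets, e.g. [10/Oct/2014:13:55:36 -0700].
            String line = record.toString();
            int open = line.indexOf('[');
            int close = line.indexOf(']', open + 1);
            if (open < 0 || close < 0) {
                return; // skip malformed records in this sketch
            }
            String timestamp = line.substring(open + 1, close);
            String month = timestamp.split("/")[1]; // e.g. "Oct"
            context.write(new Text(month), record);
        }
    }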

C. Artifacts

Once statistics are generated for input and intermediate data sets, SARAH tools can generate useful artifacts from those statistics. Some of the artifacts are used at runtime for better resource utilization, others are used to configure a job, others are used when developing and testing software and still others are used for informational purposes.

An example of a runtime artifact in the Map-Reduce framework is an interval file that can be used with the Total Order Partitioner [6] to balance or sort the load across reducers. The number of reducers in a Map-Reduce job, the value of a PARALLEL parameter in a Pig statement and an estimate of the amount of memory that a task needs to process a data set are all examples of configuration artifacts.

SARAH computes random samples of input and intermediate data sets for generating statistics. Users specify a sample percentage between 0 and 100%. The generated samples are saved artifacts and users can use them for development, testing and analysis. Besides random samples, SARAH can create other kinds of samples of input and intermediate data, including samples with outliers and samples of data sets to be joined that reflect the resulting distribution of joined data. Such samples are useful in development, testing and analysis of big data applications. The join samples are useful in understanding and addressing skewed joins.

SARAH also generates artifacts that help a user understand the input and intermediate data sets. SARAH can generate visualizations of data distributions. Figure 1 visualizes the distribution obtained by applying the months() function to a weblog input data set. SARAH generates the distribution data as well as an R [17] script to create the graphic.

IV. USING SARAH TO BALANCE LOAD ACROSS REDUCERS

We now illustrate the use of SARAH to estimate an appropriate number of reducers in the Map-Reduce framework and to balance the load across those reducers. While this use case is also relevant to Hive, Pig and Spark, we limit the description to generating statistics and artifacts for Hadoop Map-Reduce jobs.

To begin, the user issues the following command:

    sarah statistics [sample-%] input-data-set function1 .. functionN

The statistics command generates or updates statistics for an input data set and the n intermediate data sets defined by function1 .. functionN. The user can specify an optional sample percentage. The first phase of the statistics command computes and saves the n+1 samples in a single map-only job. The mappers randomly select records in the input to generate the input sample. The mappers also apply each function to each record in the input data set and randomly select records in each intermediate data set for each intermediate sample. The first, sample-generating phase is efficient: it requires only a single, parallel processing of the input data. Each record is considered once, in parallel, and the n functions are applied to it.

The second phase of the statistics command generates statistics from the samples. For samples that are small enough to fit in memory, SARAH generates all statistics in a single, efficient map-only job; each mapper processes a single sample. For larger samples, SARAH generates multiple Map-Reduce jobs to compute the statistics for all of the data sets.

To obtain SARAH's recommendations for the number of reducers and artifacts for balancing the load across reducers, the user issues the following command:

    sarah balanced-reducers [split-values] input-data-set function1 .. functionN

For each function, SARAH estimates a number of reducers by simply dividing the estimated number of records in the associated intermediate data set by the ideal partition size; that is, it divides the estimate by the number of records ideally processed by each reducer. The latter is a constant.
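A sketch of this arithmetic is shown below, using the figure of roughly 6 million intermediate records from the experiment in Section IV.A; the ideal partition size of 1.2 million records per reducer is an assumed value for illustration, not a constant stated by SARAH.

    public class ReducerEstimate {
        public static void main(String[] args) {
            // Estimated number of records in the intermediate data set,
            // taken from the generated statistics for one function.
            long estimatedIntermediateRecords = 6_000_000L;

            // Ideal partition size: records each reducer should ideally handle.
            // This is a constant; 1,200,000 is an assumed value for illustration.
            long idealRecordsPerReducer = 1_200_000L;

            long reducers = (long) Math.ceil(
                    (double) estimatedIntermediateRecords / idealRecordsPerReducer);

            System.out.println("Recommended reducers: " + reducers); // prints 5
        }
    }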
For each function, SARAH also generates an interval file. The interval file is an artifact that serves as input to a get_partition function used by the Map-Reduce framework to assign records to reducers. By default, SARAH generates an interval file that can be provided to the TotalOrderPartitioner in the Map-Reduce framework. This partitioner does not split the set of values that are associated with a key. This limits the ability to balance the load across reducers.
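For context, this is roughly how an interval (partition) file is wired into a Map-Reduce job with the stock TotalOrderPartitioner; the file path is a placeholder, and SARAH's extended partitioner described below would be substituted for the stock class when values are split.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

    public class IntervalFileExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "balanced reducers");

            job.setMapOutputKeyClass(Text.class);
            job.setNumReduceTasks(5); // the recommended reducer count

            // Point the partitioner at the interval (partition) file artifact.
            // The path below is a placeholder.
            TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
                    new Path("/sarah/artifacts/weblog/months.intervals"));
            job.setPartitionerClass(TotalOrderPartitioner.class);

            // ... set mapper, reducer, input and output paths as usual ...
        }
    }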

If the number of values associated with a single key is greater than the ideal partition size, the partitioning is less than optimal. If the user sets the optional split-values parameter to true, the intermediate data set is exactly balanced over all of the reducers. For keys with a split set of values, the interval file also contains the percentage of records that are included in each partition. SARAH provides an extension to the TotalOrderPartitioner that uses the percentages to split the set of values associated with a single key. The extended partitioner violates the rule that all values associated with a single key are provided to a single reducer. If the user wishes to do this, the application must accommodate the non-standard partitioning.

A. Measuring SARAH Effectiveness in Balancing Reducers

Measuring the value of SARAH requires comparing the performance and resource utilization of big data applications that have been configured using SARAH artifacts to those that have been configured manually. We compare the utilization of reducers on a Map-Reduce job using SARAH artifacts to the same job configured manually. We consider three cases of manual configuration for the balanced reducer use case:

- The naïve user understands neither the Map-Reduce job being executed nor the input and intermediate data sets. The naïve user accepts all of the defaults for running the job. Since the default number of reducers is 1, there is no need to partition the intermediate data set.
- The rule-of-thumb user learned some simple rule for allocating reducers. For our purposes the rule is based on the size of the input data set; Hive and Pig provide this as a default [9], [12]. The job configured by the rule-of-thumb user runs with the default HashPartitioner, which simply assigns records to reducers by hashing the key.
- The educated-guess user attempts to understand the intermediate data, but the understanding is incomplete. In particular, the user measures the size of the intermediate data set by running the mapper but fails to measure its distribution. Again, the job configured by the educated-guess user runs with the default HashPartitioner.

We ran a Map-Reduce job that processes an 18-million-record weblog data set generated by an Apache web server. The job analyzes the weblog data to understand the differences by month of users who access the web server in the evening. The intermediate data consisted of 6 million records.

The naïve user ran the job with a single reducer. The single reducer processed all 6 million records. This approach obviously does not scale: the intermediate data set grows as a function of the input data set, eventually overloading the single reducer.

The rule-of-thumb user ran the job with a number of reducers derived from the input size, namely 12 reducers. Figure 2 illustrates the result of running the job configured by the rule-of-thumb user. The partitioning of data to reducers was not very good. Three reducers had no records to process, and the workload of the remaining reducers was not balanced. Furthermore, the balanced reducers were underutilized, processing fewer than 600,000 records each.

We used SARAH to generate a sample of the intermediate data and used the tool's recommendation of 5 reducers and the interval file generated by the tool. We did not split the set of values, so an exact partitioning is not possible. Figure 3 illustrates the distribution of records to reducers that SARAH generated. The first four reducers are fairly balanced.
The fifth reducer handled a single key with 1.7 million records.

Since we did not choose to break up sets of records, the fifth reducer had more work than the others.

Finally, the educated-guess user ran the job with 5 reducers but without considering the distribution of the data. Figure 4 illustrates the distribution of records to reducers as a result of running the job configured by the educated-guess user. The number of reducers is appropriate, but the skew in the intermediate data is not handled. Notice that in this case the third, rather than the fifth, reducer processed the 1.7 million records associated with the same key.

V. CONCLUSIONS AND FUTURE WORK

The SARAH test bed produces statistics for large input and intermediate data sets. The statistics are persistently saved, incrementally updated and usable across frameworks. From these statistics, SARAH generates artifacts that are useful for allocating resources in the desired framework. We illustrated this with the problem of allocating reducers and balancing the workload across those reducers. The analogous problem exists in the Pig, Hive and Spark frameworks.

We continue to develop the SARAH test bed for the Map-Reduce, Hive, Impala, Pig and Spark frameworks. We continue experimenting with different algorithms and use cases. We continue to expand the SARAH test cases to better experiment with the algorithms. To date, we have concentrated on using SARAH to estimate resources for a single computation. We have not yet addressed any of the use cases that use collected data set statistics over time to estimate cluster-wide sizing and scheduling. Finally, in building SARAH we realized the need to define a runtime API and service that makes the statistics and generated artifacts available to Hadoop applications and frameworks.

VI. REFERENCES

[1] A. Gates, O. Natkovich, S. Chopra, P. Kamath, S. Narayanam, C. Olston, B. Reed, S. Srinivasan and U. Srivastava, "Building a High Level Dataflow System on top of Map-Reduce: The Pig Experience," Proceedings of the VLDB Endowment, vol. 2, no. 2.
[2] Apache Hadoop Project.
[3] A. Thusoo, J. Sen Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Antony, H. Liu and R. Murthy, "Hive - A Petabyte Scale Data Warehouse Using Hadoop," 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010), 2010.
[4] Cloudera Impala Project.
[5] Column Statistics in Apache Hive.
[6] D. Miner and A. Shook, MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems, O'Reilly Media, December.
[7] D. J. DeWitt, J. F. Naughton, D. A. Schneider and S. Seshadri, "Practical Skew Handling in Parallel Joins," Proceedings of the 18th International Conference on Very Large Data Bases, pp. 27-40, August 23-27, 1992.
[8] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Communications of the ACM, vol. 51, no. 1, 2008.
[9] E. Capriolo, D. Wampler and J. Rutherglen, Programming Hive, O'Reilly Media, September.
[10] L. Kolb, A. Thor and E. Rahm, "Load Balancing for MapReduce-based Entity Resolution," Proceedings of the 2012 IEEE 28th International Conference on Data Engineering, April 1-5, 2012.
[11] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker and I. Stoica, "Spark: Cluster Computing with Working Sets," Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, pp. 10-10, June 22-25, 2010, Boston, MA.
[12] Pig Reducer Estimation.
[13] Pig User-defined Functions.
[14] PigSkewedJoinSpec.
[15] Statistics in Hive.
[16] S. Chakkappen, T. Cruanes, B. Dageville, L. Jiang, U. Shaft, H. Su and M. Zait, "Efficient and Scalable Statistics Gathering for Large Databases in Oracle 11g," Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, June 9-12, 2008, Vancouver, Canada.
[17] The R Project for Statistical Computing.
[18] Tuning Impala for Performance, docs/cdh5/latest/impala/installing-and-using-Impala/ciiu_performance.html.


More information

Fault Tolerance in Hadoop for Work Migration

Fault Tolerance in Hadoop for Work Migration 1 Fault Tolerance in Hadoop for Work Migration Shivaraman Janakiraman Indiana University Bloomington ABSTRACT Hadoop is a framework that runs applications on large clusters which are built on numerous

More information

A Brief Outline on Bigdata Hadoop

A Brief Outline on Bigdata Hadoop A Brief Outline on Bigdata Hadoop Twinkle Gupta 1, Shruti Dixit 2 RGPV, Department of Computer Science and Engineering, Acropolis Institute of Technology and Research, Indore, India Abstract- Bigdata is

More information

Data processing goes big

Data processing goes big Test report: Integration Big Data Edition Data processing goes big Dr. Götz Güttich Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors,

More information

Native Connectivity to Big Data Sources in MSTR 10

Native Connectivity to Big Data Sources in MSTR 10 Native Connectivity to Big Data Sources in MSTR 10 Bring All Relevant Data to Decision Makers Support for More Big Data Sources Optimized Access to Your Entire Big Data Ecosystem as If It Were a Single

More information

Hadoop: The Definitive Guide

Hadoop: The Definitive Guide FOURTH EDITION Hadoop: The Definitive Guide Tom White Beijing Cambridge Famham Koln Sebastopol Tokyo O'REILLY Table of Contents Foreword Preface xvii xix Part I. Hadoop Fundamentals 1. Meet Hadoop 3 Data!

More information

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Kanchan A. Khedikar Department of Computer Science & Engineering Walchand Institute of Technoloy, Solapur, Maharashtra,

More information

HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM

HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM 1. Introduction 1.1 Big Data Introduction What is Big Data Data Analytics Bigdata Challenges Technologies supported by big data 1.2 Hadoop Introduction

More information

In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet

In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet Ema Iancuta iorhian@gmail.com Radu Chilom radu.chilom@gmail.com Buzzwords Berlin - 2015 Big data analytics / machine

More information

Bringing Big Data Modelling into the Hands of Domain Experts

Bringing Big Data Modelling into the Hands of Domain Experts Bringing Big Data Modelling into the Hands of Domain Experts David Willingham Senior Application Engineer MathWorks david.willingham@mathworks.com.au 2015 The MathWorks, Inc. 1 Data is the sword of the

More information

COURSE CONTENT Big Data and Hadoop Training

COURSE CONTENT Big Data and Hadoop Training COURSE CONTENT Big Data and Hadoop Training 1. Meet Hadoop Data! Data Storage and Analysis Comparison with Other Systems RDBMS Grid Computing Volunteer Computing A Brief History of Hadoop Apache Hadoop

More information

Mining Large Datasets: Case of Mining Graph Data in the Cloud

Mining Large Datasets: Case of Mining Graph Data in the Cloud Mining Large Datasets: Case of Mining Graph Data in the Cloud Sabeur Aridhi PhD in Computer Science with Laurent d Orazio, Mondher Maddouri and Engelbert Mephu Nguifo 16/05/2014 Sabeur Aridhi Mining Large

More information

Big Data. Donald Kossmann & Nesime Tatbul Systems Group ETH Zurich

Big Data. Donald Kossmann & Nesime Tatbul Systems Group ETH Zurich Big Data Donald Kossmann & Nesime Tatbul Systems Group ETH Zurich MapReduce & Hadoop The new world of Big Data (programming model) Overview of this Lecture Module Background Google MapReduce The Hadoop

More information