On Modeling CPU Utilization of MapReduce Applications
Nikzad Babaii Rizvandi 1,2, Young Choon Lee 1, Albert Y. Zomaya 1
1 Centre for Distributed and High Performance Computing, School of IT, University of Sydney, Australia
2 National ICT Australia (NICTA), Australian Technology Park
[email protected]

Abstract — In this paper, we present an approach to predict the total CPU utilization, in terms of CPU clock ticks, of applications running on the MapReduce framework. Our approach has two key phases: profiling and modeling. In the profiling phase, an application is run several times with different sets of MapReduce configuration parameters to profile the total CPU clock ticks of the application on a given platform. In the modeling phase, multilinear regression is used to map the sets of MapReduce configuration parameters (number of Mappers, number of Reducers, size of the file system (HDFS) and the size of the input file) to the total CPU clock ticks of the application. The derived model can then be used to predict the total CPU requirements of the same application when using the MapReduce framework on the same platform. Our approach aims to eliminate error-prone manual processes and presents a fully automated solution. Three standard applications (WordCount, Exim Mainlog parsing and TeraSort) are used to evaluate our modeling technique on pseudo-distributed MapReduce platforms. Results show that our automated model generation procedure can effectively characterize the total CPU clock ticks of these applications, with average prediction errors of 3.5%, 4.05% and 2.75%, respectively.

Keywords — CPU utilization, CPU clock tick, MapReduce, Modeling, Prediction, Regression

1. Introduction

Cloud computing has received a lot of attention from both the research community and industry due to the deployment and growth of commercial cloud platforms and the associated services (e.g., Amazon EC2, Microsoft Azure and Google AppEngine) [2].
These cloud services enable customers to change, or dynamically supplement, their own IT infrastructures with a large choice of computational and storage resources that are accessible on demand. On the other side, cloud providers charge customers based on their usage or reservation of datacenter resources (CPU hours, storage capacity, network bandwidth, etc.), which results in a typical dependency between service level agreements (SLAs) and resource availability. Therefore, the accurate prediction of resource usage in such a scenario is important: it helps customers decide how many nodes they should hire from the cloud providers, and for how long [3]. Moreover, such a prediction can be used by cloud providers to guide scheduling and resource management decisions, and by realistic workload generators to evaluate the choice of policies prior to full production deployment.

Recently, businesses have started using MapReduce as a popular computation framework for processing large amounts of data in both public and private clouds. For example, many web-based service providers like Facebook utilize MapReduce for analysing their core business and mining their produced data. Therefore, understanding the performance characteristics of MapReduce-style computations brings significant benefit to application developers in terms of improved application performance and resource utilization. Generally, MapReduce users run a small number of applications for a long time. For example, Facebook uses Hadoop (the Apache implementation of MapReduce in Java) to read log files produced daily and filter database information depending on incoming queries. These tasks are repeated millions of times per day. Another example is Yahoo!, where the majority of jobs (around 80-90%) are based on Hadoop. These jobs include searching among large quantities of data, indexing documents, and returning appropriate information. Just like Facebook's, these applications run many times per day for different purposes.

One major factor that directly influences the performance of MapReduce applications is the tuning of their configuration parameters, and MapReduce users face the issue of how to set these parameters properly [4]. Clearly, profiling and modeling the performance of MapReduce applications and their resource usage under different values of the configuration parameters are of great practical importance for both cloud users and service providers. For example, Amazon can use the results of such modeling to effectively schedule MapReduce applications on the Amazon Elastic MapReduce service.
In this paper, we present a technique to model the resource usage of MapReduce applications in terms of total CPU utilization, measured in CPU clock ticks. We have chosen four major configuration parameters: number of Mappers, number of Reducers, size of the file system (Hadoop Distributed File System, or HDFS), and size of the input file. For a given MapReduce platform, applications are run iteratively with different values of those parameters and the total CPU clock ticks of these applications are gathered. Then, for each application, a linear model is constructed by applying multilinear regression to the set of configuration parameter values (as input) and the obtained total CPU clock ticks of the application (as output). Obviously, the proposed modeling technique can be extended to other configuration parameters, or used for modeling other resources such as storage, network bandwidth and memory. Although our modeling technique can be applied to other applications on different platforms, two issues should be taken into account: first, the model obtained for an application on a specific platform may not be used for predicting the same application on another platform; and second, the model of an application on one platform is not applicable to predicting other applications on the same platform.
2. Related Work

MapReduce, introduced by Google in 2004 [5], is a framework for processing large quantities of data on distributed systems. The computation of this framework consists of a map phase and a reduce phase; hence the name MapReduce (Figure 1). The MapReduce framework can be seen as a practical application of the traditional data-parallel model, single program, multiple data (SPMD). The process of converting an algorithm into independent mappers and reducers makes MapReduce inefficient for algorithms of a sequential nature. Typical MapReduce applications are designed for computing on significantly large quantities of data rather than performing complicated computation on a small amount of data [6]. Due to its simple structure, MapReduce suffers from some serious issues, especially in scheduling, energy efficiency and resource allocation. Therefore, predicting an application's resource requirements before running it on a native system can make a significant contribution to improving MapReduce scheduling and resource allocation.

MapReduce: Early works on analyzing/improving MapReduce performance started around 2005, such as the approach by Zaharia et al. [7], which addressed the problem of improving the performance of Hadoop in heterogeneous environments. Their approach was based on the critical assumptions in Hadoop that cluster nodes are homogeneous and that tasks progress linearly; Hadoop utilizes these assumptions to efficiently schedule tasks and (re)execute stragglers. Their work introduced a new scheduling policy to overcome these assumptions. Besides their work, there are many other approaches to enhance or analyze the performance of different parts of MapReduce frameworks, particularly in scheduling [8], energy efficiency [3, 9-15] and workload optimization [16]. A statistics-driven workload modeling was introduced in [10] to effectively evaluate design decisions in scaling, configuration and scheduling.
The framework in that work was utilized to make appropriate suggestions to improve the energy efficiency of MapReduce. A modeling method was proposed in [9] for finding the total execution time of a MapReduce application. It used Kernel Canonical Correlation Analysis to obtain the correlation between the performance feature vectors extracted from MapReduce job logs, and the map time, reduce time, and total execution time. These features were acknowledged as critical characteristics for establishing any scheduling decisions. Recent works in [17-18] reported a basic model for MapReduce computation utilization. Here, the map and reduce phases were first modeled independently using dynamic linear programming; then, these phases were combined to build a global optimal strategy for MapReduce scheduling and resource allocation. In [1, 19-23], linear regression is applied to model the relationship between the total number of CPU clock ticks/the execution time an application needs and four MapReduce configuration parameters: number of Mappers, number of Reducers, size of the file system and size of the input file.

System profiling and modeling: System profiling is a mechanism used to gather information about both the software and hardware configurations of a system in order to model it properly. In high performance computing systems, profiling system specifications is utilized for modeling different parts of the system. For example, in [24] a combination of performance modeling and prediction was applied to reduce execution times with respect to a predefined upper limit on energy usage. After creating models for both execution time and energy consumption, the key parameters of the models are estimated by executing a program a small number of times and then regressing the estimated parameters. In recent work [25], the idea of profiling and modeling was used for power metering in virtual machines (VMs): the relation between the major hardware components, such as CPU, memory and disk, and energy is modeled as a linear function of their utilizations, and multilinear regression is then used to find the coefficients by profiling thousands of traces.

Figure 1. MapReduce workflow [1]

The work in [26] is probably the closest to this paper. Wood et al. [26] present a linear regression-based model to predict the CPU resource overhead between running an application in a VM and in the native system. The values of a set of 11 variables capturing the real values of CPU, disk and network resources over time for an application in the native system are extracted by the SysStat tool [27]. The CPU usage over time is also captured in the VM using the XenTop and XenMon tools. Then a linear regression model is used to form a linear relation between the VM CPU usage and these 11 variables as:

U_vm(t) = beta_0 + beta_1 x_1(t) + ... + beta_11 x_11(t)

where x_1(t), ..., x_11(t) are the 11 input variables of the native system at time t. In our design, we use SysStat to capture the total CPU utilization (in terms of CPU clock ticks) of an application over time in MapReduce. After calculating the total CPU clock ticks C of the application, we model the relationship between the four MapReduce configuration parameters and C using linear regression as:

C ≈ beta_0 + beta_1 M + beta_2 R + beta_3 H + beta_4 D

where M, R, H and D are the number of Mappers, the number of Reducers, the size of the file system and the size of the input file, respectively. Then, we apply the derived model to the same application but with different MapReduce configuration parameters and use it to predict the CPU resource requirements of the application before it actually runs.
Figure 3. The flow of the MapReduce application (left) and the CPU utilization time series extracted from the actual system (right). This value is then converted to total CPU clock ticks based on the platform's operating frequency.

3. Application Modeling in MapReduce

3.1. Problem definition

In distributed computing systems, MapReduce has been known as a large-scale data processing or CPU-intensive job [4, 28-30], which implies that CPU utilization is the most important part of running an application on MapReduce. Therefore, predicting the CPU capacity that an application needs becomes important both for customers, to hire enough CPU resources from cloud providers, and for cloud providers, to schedule incoming jobs properly. In this paper we study the dependency between MapReduce configuration parameters and the CPU utilization of the system. We expect the CPU utilization of an application in MapReduce to be highly correlated with and proportional to the MapReduce configuration parameters. Therefore, by modeling this dependency, it becomes possible to predict the finish time of an experiment, which has a significant impact on job scheduling. Clearly, there is a strong dependency between the total CPU clock ticks of an application and changes in these configuration parameters. The straightforward benefit of finding a model between the MapReduce configuration parameters and total CPU clock ticks is that one can approximately calculate the best values of these parameters to optimize the total CPU clock ticks of the application. This means that if one is to run a given application in a cloud, modeling makes it possible to approximately determine how many virtual nodes the application needs, and for how long.

3.2. Profiling CPU utilization

For each application, we generate a set of experiments with different values of the four MapReduce configuration parameters on a given platform.
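As a rough sketch, the set of experiments described above can be enumerated as a grid over the four configuration parameters. The parameter value ranges below are hypothetical placeholders, not the ones used in our evaluation:

```python
from itertools import product

# Hypothetical parameter ranges (placeholders, not the paper's actual values)
mappers = [4, 8, 16, 32]      # number of Mappers
reducers = [4, 8, 16, 32]     # number of Reducers
hdfs_sizes_gb = [1, 2, 4]     # size of the file system (HDFS)
input_sizes_gb = [1, 2, 4]    # size of the input file

# One experiment per combination of the four configuration parameters
experiments = [
    {"mappers": m, "reducers": r, "hdfs_gb": h, "input_gb": d}
    for m, r, h, d in product(mappers, reducers, hdfs_sizes_gb, input_sizes_gb)
]

print(len(experiments))  # 4 * 4 * 3 * 3 = 144 experiments
```

Each such experiment is then executed and profiled to produce one training point for the model.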
While running each experiment, the average CPU utilization of the experiment is gathered to build a trace for later use as training data for the model (this data can be gathered easily in Linux using the SysStat monitoring package, which has low overhead). We sample the CPU usage of the experiment in the native system, from the time the mappers start to the time the reducers finish, with a time interval of one second (Figure 3-left). If the platform has an operating frequency of f_platform (in Hz), then the obtained trace is converted to CPU clock ticks by taking the product of the average CPU utilization and f_platform. The total CPU clock ticks of a particular experiment is then the summation of all CPU usage samples over the time period of that experiment (Figure 3-right).

Figure 4. Prune algorithm

Figure 5. Modeling and prediction algorithms

Because of temporal changes, it is expected that several runs of an experiment with the same configuration parameters may result in slightly different total CPU clock ticks. Therefore, utilizing a mechanism to prune unsuitable data from the training dataset should improve the modeling accuracy. In [26], Robust Stepwise Linear Regression was used as a post-processing stage to refine the outcome of the model by giving weights to data points with high error. In this study, we use a straightforward technique to prune the data set, described in Figure 4. As a result, the final value of total CPU clock ticks of an experiment is equal to the final calculated mean in line 8 of that algorithm. The same procedure must be followed for the other experiments.

3.3. Model creation using linear regression

The next step is to create a model for a MapReduce application by characterizing the relationship between the set of MapReduce configuration parameters and the CPU resource utilization metric. Such a linear-regression-based modeling problem involves choosing suitable coefficients for the model in order to approximate the real system response [26, 31]. Consider the linear algebraic equations for K experiments of an application with different sets of the four configuration parameter values:

C(j) = beta_0 + beta_1 M(j) + beta_2 R(j) + beta_3 H(j) + beta_4 D(j) + e(j),   j = 1, ..., K   (2)

where C(j) is the actual value of total CPU clock ticks of the application in the j-th experiment on MapReduce, and S(j) = (M(j), R(j), H(j), D(j)) are the MapReduce configuration parameters, namely the number of mappers (M(j)), the number of reducers (R(j)), the size of the file system (H(j)) and the size of the input file (D(j)) for this experiment. Using the above definition, the approximation problem becomes one of estimating the values of (beta_0, ..., beta_4) to optimize a cost function between the approximated values and the actual values of total CPU clock ticks. An approximated total CPU clock ticks Ĉ(j) of the application for the j-th experiment is then predicted as:

Ĉ(j) = beta_0 + beta_1 M(j) + beta_2 R(j) + beta_3 H(j) + beta_4 D(j)

There are a variety of well-known mathematical methods in the literature to calculate the coefficients (beta_0, ..., beta_4). One widely used in many application domains is Least Square Regression, which calculates the parameters in Eqn. 2 by minimizing the error:

E = Σ_{j=1}^{K} ( C(j) - Ĉ(j) )^2

Figure 6. The modeling justification

Least Square Regression theory states that if X is the K x 5 matrix whose j-th row is (1, M(j), R(j), H(j), D(j)), C = (C(1), ..., C(K))^T, and X^T X is non-singular, then the model minimizing the above error is calculated as [31-32]:

beta = (X^T X)^{-1} X^T C

where ^T denotes the matrix transpose. The coefficient vector beta = (beta_0, ..., beta_4) is the model that approximately describes the relationship between the total CPU clock ticks of an application and the four MapReduce configuration parameters. In other words, the relationship between the total CPU clock ticks of a MapReduce application and the configuration parameters is:

C ≈ beta_0 + beta_1 M + beta_2 R + beta_3 H + beta_4 D

Once a model has been created, it can be applied to CPU resource utilization traces of the same application to estimate what its total CPU requirements will be if the values of the MapReduce configuration parameters change. It should also be noted that the model obtained for one application may differ from that of another application. The modeling and prediction algorithms related to our technique are briefly described in Figure 5.
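As a concrete illustration of the least-squares computation above, the following sketch fits the coefficient vector beta = (X^T X)^{-1} X^T C on synthetic data. The parameter values and coefficients are invented for illustration, not profiled measurements:

```python
import numpy as np

# Synthetic training set: K experiments, each described by the four
# configuration parameters (mappers M, reducers R, HDFS size H, input size D).
# All values below are invented for illustration.
rng = np.random.default_rng(0)
K = 40
params = rng.integers(1, 33, size=(K, 4)).astype(float)

# Hypothetical "true" coefficients used only to generate the targets.
true_beta = np.array([5.0, 2.0, 3.0, 1.5, 4.0])   # (beta_0, ..., beta_4)
X = np.hstack([np.ones((K, 1)), params])           # design matrix: rows (1, M, R, H, D)
C = X @ true_beta                                  # total CPU clock ticks per experiment

# Least-squares estimate: beta = (X^T X)^{-1} X^T C
beta = np.linalg.solve(X.T @ X, X.T @ C)

# Predict the total clock ticks of an unseen configuration (1, M, R, H, D).
new_cfg = np.array([1.0, 16.0, 8.0, 2.0, 4.0])
prediction = new_cfg @ beta   # ≈ 80.0 with the synthetic coefficients above
print(prediction)
```

In practice the rows of X would come from the profiled experiments of Section 3.2, with C measured via SysStat; numpy.linalg.lstsq would also be a numerically safer choice than forming X^T X explicitly.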
TABLE 1. The mean and variance of prediction errors

Application          | Mean prediction error | Variance of prediction errors
WordCount            | 4.37%                 | %
TeraSort             | %                     | %
Exim Mainlog parsing | %                     | %

4. Experimental Validation

In this section, we evaluate the effectiveness of our models with three standard applications and two different hardware platforms.

4.1. Experimental setting

Three widely used text processing applications are used to evaluate the effectiveness of our method. Our method has been implemented and evaluated on a private cloud as shown in Figure 5. The cloud in our evaluation has the following specifications:

Physical H/W: five servers, each an Intel Genuine with a 3.00GHz clock, 1GB memory, 1GB cache and 50GB of shared SAN hard disk. For virtualization, the Xen cloud platform (XCP) has been used on top of the physical H/W. The Xen-API [33] provides functionality to manage virtual machines inside the XCP directly, with bindings in high-level languages such as Java, C# and Python; using those bindings it is possible to measure the performance of all virtual machines in a datacenter and live-migrate them. Virtual nodes (/servers) are implemented on top of the XCP. The number of virtual nodes is chosen from [5, 10, 15, 20, 25], each with a Linux (Debian) image. Each virtual node runs Hadoop, the Apache implementation of MapReduce developed in Java [34]; at the same time, the SysStat package (installed in the image on each node) is executed in the background to monitor/extract the CPU utilization time series of applications (in the native system) [27]. For an experiment with a specific set of MapReduce configuration parameter values, statistics are gathered from the job start to the job completion (arrows in Figure 3-left) with a sampling time interval of one second. All CPU usage samples are then combined to form the CPU utilization time series of the experiment.
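The profiling pipeline just described, one-second utilization samples converted to clock ticks via the platform frequency and summed, can be sketched as follows. The sample values and node frequency are hypothetical:

```python
# Hypothetical one-second samples of average CPU utilization (%) for one
# experiment, from mapper start to reducer finish (not measured data).
utilization_pct = [10.0, 55.0, 80.0, 75.0, 20.0]

f_platform = 3.0e9  # platform operating frequency in Hz (a 3.00 GHz node)

# Each one-second sample contributes (utilization / 100) * f_platform clock
# ticks; the experiment's total CPU clock ticks is the sum over all samples.
ticks_per_sample = [u / 100.0 * f_platform for u in utilization_pct]
total_clock_ticks = sum(ticks_per_sample)

print(total_clock_ticks)  # 7.2e9 for the sample values above
```

Repeating this for every run of every experiment yields the training data on which the pruning step and the regression of Section 3.3 operate.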
For each application there are sets of experiments in which the number of mappers and the number of reducers each take a value from a predefined set and the size of the input data is varied over a predefined set. To capture statistical information, each experiment is also repeated 10 times. Our benchmark applications are WordCount, TeraSort, and Exim Mainlog parsing. These benchmarks are used because of their striking differences and also because other studies [1, 19-20, 35-39] have relied on them.

4.2. Results

As mentioned earlier, there is a strong dependency between a MapReduce application's execution time and the number of Mappers and Reducers. Figure 7 shows the dependency between these two configuration parameters and execution time for the applications. As can be seen from this figure, the applications behave differently as the number of Mappers and Reducers increases. In general, executing the same size of data results in different execution times across applications; in many cases, WordCount's execution time is double that of Exim Mainlog parsing. In addition, although both applications show the minimum execution time for 20 mappers and five reducers, WordCount shows more fluctuation than Exim for other numbers of mappers/reducers. The reason why these numbers of mappers and reducers give the lowest execution time is not clear to us; perhaps moving from one platform to another causes this variation. Therefore, even though the modeling is valid for applications on different platforms, the coefficients of the model may change when migrating from one platform to another. This result also supports our initial claim that the number of mappers and reducers has a significant impact on the total execution time of the application.

To test the accuracy of an application's model, we use it to predict the total CPU clock ticks of several experiments of the application with random values of the four configuration parameters in the predefined range. We then run the experiments on the real system and gather the total CPU utilization in terms of clock ticks to determine the prediction error. For evaluation, the outcomes of these models for three standard MapReduce applications (WordCount, Exim Mainlog parsing and TeraSort) are compared with their actual total CPU clock ticks. Figure 7 shows the prediction accuracies and prediction errors between the actual values of total CPU clock ticks and their predicted values, while Table 1 gives the statistical mean and variance of the prediction errors for the three applications.
We find that the average error between the actual values of total CPU clock ticks and their predicted counterparts is less than 5% for the tested applications. Some of the error comes from model inaccuracy, but it can also be caused by temporal changes in the system that increase the total CPU clock ticks for a short time. The spikes of prediction error in Figure 7 occur at low values of total CPU clock ticks and are most likely caused by background processes running during the execution of the applications. For example, in Hadoop one of the main background processes comes from streaming, used when the mapper and reducer are written in a programming language other than Java. As these background processes constantly consume a certain amount of CPU capacity, their influence becomes significant when the total CPU clock ticks of the MapReduce application are low. Although the obtained model can successfully predict the level of total CPU capacity required by a MapReduce application, it cannot give information about how application performance, such as response time, changes or how CPU utilization varies over time. Finally, our approach can help cloud customers and providers to approximate the total amount of CPU resources that have to be allocated to a MapReduce application in order to prevent significant performance reduction due to CPU resource limitation.
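For reference, the mean and variance statistics of the kind reported in Table 1 can be computed from actual and predicted clock ticks along these lines. The values below are made-up placeholders, not our experimental results:

```python
import statistics

# Hypothetical actual and predicted total CPU clock ticks for a few
# experiments (placeholder values, not measured results).
actual =    [9.0e9, 7.5e9, 1.2e10, 6.0e9]
predicted = [8.7e9, 7.8e9, 1.15e10, 6.3e9]

# Relative prediction error (%) per experiment.
errors_pct = [abs(a - p) / a * 100.0 for a, p in zip(actual, predicted)]

mean_error = statistics.mean(errors_pct)     # mean prediction error
var_error = statistics.pvariance(errors_pct) # variance of prediction errors
print(mean_error, var_error)
```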
Figure 7. The prediction accuracy and error between the actual and predicted total CPU clock ticks for the applications: (a) WordCount, (b) TeraSort, (c) Exim Mainlog parsing (legend: R = 4, 8, 12, 16, 20, 24, 28, 32 reducers).

6. Conclusion
The motivation behind this work is that the accurate modeling and prediction of the resource usage of a MapReduce application, before running it on the actual cluster/cloud, can greatly help both application performance and effective resource utilization. In this paper, we have presented an approach to model/profile the CPU usage of a MapReduce application from the native system and applied a linear regression model to identify the correlation between four major MapReduce configuration parameters and the CPU utilization of the application. Our modeling technique can be used by both users/consumers (e.g., application developers) and service providers in the cloud for effective resource utilization. Evaluation results show that our modeling technique can effectively predict the total computation cost of three standard applications with a median prediction error of less than 5%.

Acknowledgment

Mr. N. Babaii Rizvandi's work is supported by National ICT Australia (NICTA). Professor A. Y. Zomaya's work is supported by an Australian Research Council Grant LP.

References

[1] N. B. Rizvandi, et al., "Preliminary Results on Modeling CPU Utilization of MapReduce Programs," University of Sydney, TR665.
[2] R. Nathuji, et al., "Q-clouds: managing performance interference effects for QoS-aware clouds," presented at the 5th European Conference on Computer Systems, Paris, France.
[3] Y. Chen, et al., "Towards Understanding Cloud Performance Tradeoffs Using Statistical Workload Analysis and Replay," University of California at Berkeley, Technical Report No. UCB/EECS.
[4] S. Babu, "Towards automatic optimization of MapReduce programs," presented at the 1st ACM Symposium on Cloud Computing, Indianapolis, Indiana, USA.
[5] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Commun. ACM, vol. 51.
[6] Hadoop Developer Training. Available:
[7] M. Zaharia, et al., "Improving MapReduce Performance in Heterogeneous Environments," 8th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2008), 18 December.
[8] M. Zaharia, et al., "Job Scheduling for Multi-User MapReduce Clusters," University of California at Berkeley, Technical Report No. UCB/EECS.
[9] J. Leverich and C. Kozyrakis, "On the Energy (In)efficiency of Hadoop Clusters," SIGOPS Oper. Syst. Rev., vol. 44.
[10] Y. Chen, et al., "Statistical Workloads for Energy Efficient MapReduce," University of California at Berkeley, Technical Report No. UCB/EECS.
[11] N. Kamyabpour and D. B. Hoang, "A hierarchy energy driven architecture for wireless sensor networks," presented at the 24th IEEE International Conference on Advanced Information Networking and Applications (AINA-2010), Perth, Australia.
[12] N. Kamyabpour and D. B. Hoang, "A Task Based Sensor-Centric Model for overall Energy Consumption," CoRR.
[13] N. Kamyabpour and D. B. Hoang, "A study on Modeling of Dependency between Configuration Parameters and Overall Energy Consumption in Wireless Sensor Network (WSN)," CoRR.
[14] K. Almiani, et al., "RMC: An Energy-Aware Cross-Layer Data-Gathering Protocol for Wireless Sensor Networks," presented at the 22nd International Conference on Advanced Information Networking and Applications (AINA), Ginowan, Okinawa, Japan.
[15] K. Almiani, et al., "Energy-efficient data gathering with tour length-constrained mobile elements in wireless sensor networks," presented at the 35th Annual IEEE Conference on Local Computer Networks (LCN), Denver, Colorado, USA.
[16] T. Sandholm and K. Lai, "MapReduce optimization using regulated dynamic prioritization," presented at the Eleventh International Joint Conference on Measurement and Modeling of Computer Systems, Seattle, WA, USA.
[17] A. Wieder, et al., "Brief Announcement: Modelling MapReduce for Optimal Execution in the Cloud," presented at the 29th ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing, Zurich, Switzerland.
[18] A. Wieder, et al., "Conductor: orchestrating the clouds," presented at the 4th International Workshop on Large Scale Distributed Systems and Middleware, Zurich, Switzerland.
[19] N. B. Rizvandi, et al., "On using Pattern Matching Algorithms in MapReduce Applications," presented at the 9th IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA), Busan, South Korea.
[20] N. B. Rizvandi, et al., "Preliminary Results on Using Matching Algorithms in Map-Reduce Applications," University of Sydney, TR672.
[21] N. B. Rizvandi, et al., "Preliminary Results: Modeling Relation between Total Execution Time of MapReduce Applications and Number of Mappers/Reducers," University of Sydney, 2011.
[22] N. B. Rizvandi, et al., "A Study on Using Uncertain Time Series Matching Algorithms in Map-Reduce Applications," CoRR.
[23] N. B. Rizvandi, et al., "On Modeling Dependency between MapReduce Configuration Parameters and Total Execution Time," CoRR.
[24] R. Springer, et al., "Minimizing execution time in MPI programs on an energy-constrained, power-scalable cluster," presented at the Eleventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, New York, New York, USA.
[25] A. Kansal, et al., "Virtual machine power metering and provisioning," presented at the 1st ACM Symposium on Cloud Computing, Indianapolis, Indiana, USA.
[26] T. Wood, et al., "Profiling and Modeling Resource Usage of Virtualized Applications," presented at the ACM/IFIP/USENIX 9th International Middleware Conference, Leuven, Belgium.
[27] SysStat. Available:
[28] H. Wang, et al., "Distributed Systems Meet Economics: Pricing in the Cloud," presented at the 2nd USENIX Conference on Hot Topics in Cloud Computing, Boston, MA.
[29] K. Kambatla, et al., "Towards Optimizing Hadoop Provisioning in the Cloud," presented at the 2009 Conference on Hot Topics in Cloud Computing, San Diego, California.
[30] S. Groot and M. Kitsuregawa, "Jumbo: Beyond MapReduce for Workload Balancing," presented at the 36th International Conference on Very Large Data Bases, Singapore.
[31] N. B. Rizvandi, et al., "An Accurate FIR Approximation of Ideal Fractional Delay Filter with Complex Coefficients in Hilbert Space," Journal of Circuits, Systems, and Computers, vol. 14.
[32] N. B. Rizvandi, et al., "An accurate FIR approximation of ideal fractional delay in Hilbert space," presented at the Fourth IEEE International Symposium on Signal Processing and Information Technology.
[33] D. Chisnall, The Definitive Guide to the Xen Hypervisor, first ed.: Prentice Hall Press.
[34] Hadoop. Available:
[35] A. Mao, et al., "Optimizing MapReduce for Multicore Architectures," Massachusetts Institute of Technology, 2010.
[36] G. Wang, et al., "Using realistic simulation for performance analysis of MapReduce setups," presented at the 1st ACM Workshop on Large-Scale System and Application Performance, Garching, Germany.
[37] G. Wang, et al., "A Simulation Approach to Evaluating Design Decisions in MapReduce Setups," presented at MASCOTS.
[38] "Optimizing Hadoop Deployments," Intel Corporation, 2009.
[39] N. B. Rizvandi, et al., "MapReduce Implementation of Prestack Kirchhoff Time Migration (PKTM) on Seismic Data," presented at the 12th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT), Gwangju, Korea.
