2013 IEEE International Conference on Big Data Evaluating Task Scheduling in Hadoop-based Cloud Systems Shengyuan Liu, Jungang Xu College of Computer and Control Engineering University of Chinese Academy of Sciences, Beijing, China Email:liushengyuan12@mails.ucas.ac.cn, xujg@ucas.ac.cn Zongzhen Liu Computer Network Information Center Chinese Academy of Sciences Beijing, China Email: liuzongzhen@cstnet.cn Xu Liu Department of Computer Science Rice University Houston, America Email: xu.liu@rice.edu Abstract Nowadays, private clouds are widely used for resource sharing. Hadoop-based clusters are the most popular implementations for private clouds. However, because workload traces are not publicly available, few previous work compares and evaluates different cloud solutions with publicly available benchmarks. In this paper, we use a recently-released Cloud benchmarks suite---cloudrank-d to quantitatively evaluate five different Hadoop task schedulers, including FIFO, capacity, naïve fair sharing, fair sharing with delay, and HOD (Hadoop On Demand) scheduling. Our experiments show that with an appropriate scheduler, the throughput of a private cloud can be improved by 20%. Keywords-Cloud, Hadoop, Taks scheduling, Evaluation. I. INTRODUCTION Private clouds are widely used in modern enterprises [1][2][3][4][5]. A well-known implementation of private clouds is Apache Hadoop [6], which is an open-source software framework for processing a large volume of data on a cluster. Commonly, a private cloud serves multiple users. Each user may have different priorities, task types, and data sizes. Usually, such user variations are significant, leading to a challenge for task scheduling. There are five task schedulers implemented in Hadoop: FIFO (First In First Out), capacity [7], naïve fair sharing [2], fair sharing with delay [3], and HOD (Hadoop On Demand) scheduling [8]. FIFO, the default scheduler of Hadoop, executes tasks according to their arrival order. Capacity, a multi-user scheduler, designs a multi-level resource constraint to make full use of resources. In the fair sharing with delay scheduler, Zaharia et al. [3] propose a simple algorithm called delay scheduling for improving data locality: when the job that should be scheduled next according to fairness cannot launch a local task, it waits for a small amount of time, letting other jobs launch tasks instead. HOD, built on Torque [9] resource manager, targets virtual Hadoop clusters. Previous work [3] presents an evaluation among FIFO, naïve fair and fair with delay scheduling. However, its workload traces are not publicly available, and hence users can not repeat the experiments to compare or optimize different task scheduling algorithm. In this paper, we use a publicly available Cloud benchmarks suite---cloudrank-d to evaluate five different Hadoop task schedulers, including FIFO, capacity, naïve fair sharing, fair sharing with delay, and HOD (Hadoop On Demand) scheduling. In comparison with that in [3], the workloads in CloudRank-D are more diverse, including basic operations, data mining, and data warehouse operations, and user can increase or decrease both the data and workload scales according to their requirements. The main contribution of this paper is that we use a publicly available benchmark suite to quantitatively evaluate five Hadoop schedulers. We reports the performance of a small-scale Hadoop deployment, including average turnaround time, throughput, average execution time of job, average waiting time, and data processed per second. Our work provides a basis for further optimization of task scheduling in Hadoop-based systems. The rest of the paper is organized as follows: Section II introduces Hadoop task schedulers as background knowledge. Section III describes the CloudRank-D. Section IV presents the evaluation of Hadoop s five schedulers. Finally, section V concludes the paper and shows future work. II. HADOOP TASKSCHEDULER A. Hadoop Overview The Apache Hadoop is an open-source software frame work that is proposed for distributed processing of large data sets across clusters of computers. Hadoop consists of the MapReduce [10] model and the Hadoop Distributed File System (HDFS). In MapReduce model, a job is divided two parts: Map task and Reduce task. Firstly, the data set is divided into several dependent blocks of data and delivered to Map tasks as key-value pairs. Map tasks produce intermediate keyvalue pairs, and then the intermediate key-value pairs are transferred to Reduce tasks in the shuffle phase. At last, the 978-1-4799-1293-3/13/$31.00 2013 IEEE 47
intermediate data is processed by Reduce tasks to generate the final output key-value pairs. The HDFS is designed for storing all input and output data, and multiple replicas of each data block are stored on many different cluster nodes. B. Hadoop In Hadoop, the task scheduler is a pluggable and important module, which is used to assign the idle system resources to jobs according to a certain strategy. Because the task scheduler is pluggable, the users can select and design the schedulers according to their requirements. Several schedulers are proposed by different organizations and researchers to satisfy their specific requirements. Here we just introduce the five common schedulers in Hadoop mentioned above. 1) FIFO FIFO scheduler is the default scheduler in Hadoop. In FIFO scheduling, the first job in the work queue in terms of job submission time will be chose and performed by JobTracker firstly, and jobs size or priority will be ignored. The FIFO algorithm is efficient and easy to implement, but its drawback is also obvious, such as low resource utilization rates and no support for multi-user executions. 2) Naïve Fair sharing The core idea of the fair scheduler is to assign resources to all jobs on average as far as possible, and each job can get the same share of resources. When only one job is performed in the system, it will monopolize the entire cluster and use all computing resources. Once other new jobs are submitted, the system will assign idle task slots to them, in order to ensure that each job can achieve approximately the same amount of computing resources. The jobs that require less time can access the CPU and can be completed within a reasonable time, but the jobs that require more time to execute are also able to access resources at the same time [11]. All jobs are grouped and pushed into a set of job pools created by Hadoop in advance. Each job pool can be assigned minimum of share resources, and all pools have equal shares by default. But we can configure job pools as required, such as assigning shares depending on job type, and constraining the numbers of jobs executing at the same time. Naïve Fair sharing supports multi-users, and each user is assigned a job pool, which is assigned by equal share of cluster resources like other users, no matter how many jobs the user has submitted [11]. The advantage of naïve fair is that it can improve the quality of service by supporting job classification schedule, in which resources are assigned by job type, and takes full use of resource, by the manual configuration of the number of jobs in parallel. But, it ignores the actual load condition of its nodes, easy to result in the load imbalance of nodes. 3) Fair Sharing with Delay Scheduling In the case of Naïve Fair sharing, a job may be launched with local data. So the fair sharing with delay scheduling is proposed to solve this problem. When a new job is submitted, the system will choose to delay task execution if there is no required data on the current assigned node, and then allow other jobs that meet data locality [3] to be scheduled first. 4) Capacity Scheduling Capacity scheduling supports multiple queues, and each queue is allocated on a certain amount of resources, within the upper and lower limit. Each user can also set a certain limit to prevent the misuse of resources. Idle resources can be allocated to heavy load queues dynamically. [12] Each queue uses FIFO scheduling. During the period of scheduling, the scheduler can choose an appropriate queue by calculating the ratio between the numbers of running tasks and allocated computing resources, and choose the smallest ratio; then, it can choose a job by the priority and the submitted time of the job, considering the limit of user resources and memory at the same time [7]. The advantages of capacity scheduler are listed as follow: it can improve resources utilization by supporting parallel executions of multiple jobs and sharing cluster for multiple users; it have high flexibility and execution efficiency by dynamically adjusting resource allocation; it can be used on large clusters and support jobs under intensive resources. The disadvantage of capacity scheduler is that users need to learn a large amount of system information so as to select and set a queue. 5) HOD The HOD is not a true scheduler; it is proposed for quickly building several independent virtual Hadoop clusters in a shared physical cluster, in order to achieve different purposes. The HOD allocates nodes for a virtual Hadoop cluster with the Torque resource manager according to the needs of the virtual cluster. Then, the HOD starts every daemon in MapReduce and HDFS on the allocated nodes, and automatically creates configuration files for Hadoop daemons and clients [8]. The advantages of HOD are listed as follows: it is adaptive for the changing load, and the physical cluster resources can achieve the highest efficiency in this way; its security is quite higher because the nodes rarely share resources. But its performance is inefficient, and its configuration relies upon a solid understanding of cluster systems. III. THE INTRODUCTION OF CLOUDRANK-D CloudRank-D [13] is a benchmark suite for private cloud, which can help researchers to simulate various multi-user applications in industrial scenarios. As shown in Table I, the benchmark suite provides a set of 13 representative data analysis tools, including basic operations for data analysis, classification, clustering, recommendation, sequence 48
learning, association rule mining applications, and data warehouse operations. In the CloudRank-D design, diversity of data characteristics of workloads are considered, including data semantics, data models, data sizes, and characteristics of data-centric computation, e.g., the ratio of the size of data input to that of data output. Moreover, the users can easily increase or decrease both data and workload scale according to their requirements, e.g., different cluster size. TABLE I. DATA SOURCES OF EACH PROGRAM IN CLOUDRANK-D [13] Application Sort Word count Grep Naive Bayes Support vector machine K-means Item based collaborative filtering Frequent pattern growth Data sources Automatically generated News and Wikipedia Scientist search Sougou corpus Ratings on movies Retail market basket data Click-stream data of an on-line news portal Traffic accident data Collection of web html document Another important attribution of workloads is input data size. CloudRank-D follows the distribution of input data size reported in workload characterization of a large electronic commence---taobao [17]. Table III shows the percentage of different input data in our experiments. The total size of input data of all workloads is about 1.7TB. TABLE II. WORKLOAD BREAKDOWN IN CLOUDRANK-D Category Application Jobs Basic Operations Data Mining Operations Data Warehouse Operations Sort 9 Word count 11 Grep 11 Naïve Bayes 6 Support vector machine 6 K-means 7 Item based collaborative 3 Frequent pattern growth 7 Hidden Markov model 6 Grep select Ranking select user visits aggregation user visits-rankings join 34 Hidden Markov model Grep select Ranking select User visits aggregation Scientist search Automatically generated table Data storage 17% Image processing 2% Text indexing 16% User visits-rankings join IV. EVALUATION We use the CloudRank-D benchmark to evaluate five typical Hadoop schedulers including FIFO, capacity, naïve fair sharing, fair sharing with delay scheduling and HOD. A. Benchmarks and Methodology The real applications in private clouds running big data applications can vary dynamically. Reference [14] In order to evaluate the Hadoop task schedulers in a proper way, and considering the scale of cluster we used, we synthesize a 100-jobs mixed workload with Cloud-D. Please note that with CloudRank-D, users can scale-up or scale-down the workload traces in terms of both data and workload scales according to their requirements. Please refer to the details in [15]. Table II shows the workload breakdown in CloudRank-D in terms of job number: 31% (Basic Operations), 35% (Data Mining Operations), 34% (Data Warehouse Operations). As shown in Fig. 1, according to the explanation in [15], the ratios of different workloads are consistent with usage percentages of applications in private clouds reported in [16]. Reporting 17% Figure 1. Usage percentages of applications in private clouds reported in [16]. TABLE III. Machine learning 11% Data mining 7% THE PERCENTATAGE OF DIFFERENT INPUT DATA IN WORKLOADS. Input Data size Percentage <25MB 40.57% 25MB-625MB 39.33% 1.2GB-5GB 12.03% Log processing 15% Web crawling 15% >5GB 8.07% Reported in [3], job submission in Facebook roughly follows an exponential distribution with a mean of 14 49
seconds [3]. With CloudRank-D, workloads are submitted in a random order following this distribution. About the metrics, we select turnaround time (total time taken from the submission of a job till the end of the job execution), which is a performance metric from perspective of users and select the throughput in terms of the number of jobs finished per minute) to measure system capacity. And then we break down the turnaround time into average running time (the execution time of job) and average waiting time (the time of job wait in queue). We have also chosen data processed per second (DPS) to measure the data processing capability. According to [13][15], the metric of data processed per second is defined as the total amount of data inputs of all jobs divided by the total running time from the submission time of the first job to the finished time of the last job. Hadoop parameter configuration can often have a great impact on system performance. In order to ensure all the five schedulers in the best situation in the case of evaluation, we optimize the Hadoop configuration according to the physical node in cluster before evaluation. And then we optimize the schedulers according to the cluster configuration and workload characteristics, modify Hadoop configuration file to use five schedulers to run the same mixed workloads, respectively. B. Testbed To evaluate the Hadoop task schedulers, we deploy a Hadoop cluster with 5 nodes (one NameNode and four DataNodes). Each node has the same hardware configuration: Intel Xenon E5645CPU, 16GBMemory, and 8TB Disk. These nodes are connected by Gigabit switch. Each node has also the same software stack: CentOS 5.5 with kernel 2.6.34, Hadoop 1.0.2, Mahout 0.6, and Hive 0.11.Table IV lists the detail parameter configuration of the Hadoop cluster. TABLE IV. CONFIGURATION DETAILS OF NODES CPU Type Intel CPU Core Intel Xeon E5645 6 cores@2.40g L1 D/I Cache L2 Cache L3 Cache Memory Disk 6 32 KB 6 256 KB 12MB 16GB 8TB OS Hadoop Mahout Hive CentOS 5.5 1.0.2 0.6 0.11 C. Hadoop Scheduler Evaluation Before the evaluation, we should optimize the Hadoop configuration. Some Hadoop parameters and its values are shown in Table V. We should also optimize the performance of five schedulers according to the cluster configuration and workload characteristics. And then when it is in a better performance, we run the mixed workload with five schedulers respectively by changing the profile of Hadoop. The submitted sequence and job-interval are all the same. A shell script to submit 100-job in parallel is written automatically; during the process of shell script writing, Hadoop logs are also recorded for debugging. 1) Data Processed per Second The data processed rate can show the performance of private cloud system on a whole level. Fig. 2 shows the total running time of 5 different schedulers by running the whole workload. The total running time is about 5-6 hours. TABLE V. HADOOP PARAMETERS AND ITS VALUES Hadoop Parameter Value Description The maximum number of mapred.tasktracker.map.tasks.ma map tasks that will be 12 ximum executed simultaneously by a task tracker. The maximum number of mapred.tasktracker.reduce.tasks. reduce tasks that will be 12 maximum executed simultaneously by a task tracker. Maximum number of mapred.map.tasks 48 concurrent running reduce task. Maximum number of mapred.reduce.tasks 45 concurrent running map task. The actual number of dfs.replication 2 replications specified when the file is created. mapreduce.tasktracker.outofband Open the out of band TRUE.heartbeat heartbeat. Total running time (10 3 s) 25 20 15 10 5 0 Figure 2. The total running time (10 3 sec) of running full workload by using five schedulers respectively. As shown in Fig. 3, because the fair sharing with delay scheduling can solve locality problem, its DPS is 1.17 times that of Naïve Fair sharing and 1.2 times of FIFO scheduler. Capacity scheduler can enhance the utilization of system; it can improve the DPS by 4% than that of FIFO scheduler. 50
DPS (MB/s) Figure 3. The Data Processed per Second (Megabytes processed per second) of running full workload by using five schedulers respectively. 2) Turnaround time Turnaround Time can often show an intuitive view of system performance at the user level, Fig. 4 shows the average turnaround time of five schedulers. Compared to HOD and FIFO schedulers, naïve fair, fair with delay scheduling and capacity schedulers can significantly reduce the job average turnaround time. Especially, fair with delay scheduling consumes the least amount of time. It mainly because that these three schedulers can reduce network cost. At the same time, these three schedulers limit the number of concurrent jobs, this limitation can reduce the cost of scheduling between jobs in task pool, and system can get a performance bonus. But from Hadoop log file we can see that fair with delay scheduling can affect some jobs with large size data, if the job doesn t meet data locality, and the pool will have some other jobs to run, the job without data locality have to wait for some time, so some jobs with large size data will have longer time to finish than usual jobs. Because of the cost of competition of each concurrent job in queue, FIFO scheduler performed not very well, and due to the extra virtualization cost, HOD scheduler performed even worse than FIFO scheduler. Turn around time (10 3 s) 12 10 8 6 4 2 0 1.2 1.0 0.8 0.6 0.4 0.2 0.0 Figure 4. The average job turnaround time (10 3 sec) of running full workload by using five schedulers respectively. a) Running time Fig. 5 presents the average running time of five schedulers. Through limiting the number of concurrent jobs, Hadoop can enhance the performance. Because of approving data locality, fair with delay scheduling can reduce average running time significantly, which is only 29% of that of FIFO scheduler and 26% of that of HOD scheduler. The HOD scheduler also has the longest average running time. b) Waiting Time Fig. 6 shows the average waiting time of five schedulers. Waiting time indicates the time of job in the queue waiting to be executed, because naive fair, fair with delay and capacity schedulers limits the number of jobs in parallel, its average waiting time is much longer than the other two schedulers. Fair with delay scheduler can enhance system throughput, resource release is faster than Fair scheduler, so the average waiting time should be less than Fair scheduler which doesn t use the delay scheduling algorithm. The limitation on the number of concurrent jobs in capacity scheduler is a little larger than Fair scheduler, so its average waiting time is shorter than that of Fair scheduler. As to FIFO and HOD schedulers, the mixed workload did not reach its maximum of concurrent jobs, and submitted jobs are running in parallel, so their average waiting time is almost zero. Thus, it can be seen that the cost of job scheduler in Hadoop is pretty large when jobs are executed in parallel, so to limit the number of concurrent jobs in certain size can significantly improve overall system performance. Running time (10 3 s) 1.2 1.0 0.8 0.6 0.4 0.2 0.0 Figure 5. The average job running time (10 3 sec) of running full workload by using five schedulers respectively. Waiting time (sec.) 250 200 150 100 50 0 Figure 6. Average job waiting time (second) of running full workload by using five schedulers respectively. 51
3) Throughput Through the throughput, system administrators can evaluate system performance clearly, higher throughput indicates that the system can complete more jobs in unit time, and the system resources can be utilized sufficiently. Fig. 7 presents the throughput (number of jobs processed in one minute) of five schedulers under mixed workload. Because the mixed workload contains many large jobs, the value of throughput is low. By solving the locality problem, the throughput of fair with delay scheduling can gain an outstanding improvement, which is 1.2 times of that of default FIFO scheduler. Although Naïve Fair sharing and capacity schedulers have different turnaround time, their throughputs are almost same; they are all larger than FIFO scheduler and HOD scheduler. Throughput (jobs/min) 0.40 0.35 0.30 0.25 0.20 0.15 0.10 0.05 0.00 Figure 7. The throughput (number of jobs processed in one minute) of running full workload by using five schedulers respectively. Through the above experiments, we find fair with delay scheduling scheduler is the most efficient scheduler; it can be shown in DPS, average turnaround time, average running time, and throughput. But due to the affection of data locality some jobs with large size will have longer time to finish than usual jobs. For running mixed workload the Naïve Fair sharing and capacity scheduler have the nearly same performance in throughput and DPS two administrator s perspective, but naïve fair has the shorter average turnaround time and capacity has the shorter average waiting time. Fair with delay scheduling, naïve fair, capacity, these three schedulers are all have the better performance than default FIFO scheduler. HOD scheduler preformed not very well, affected by the extra cost of virtualization. V. CONCLUSIONS With the popularization of Hadoop, numerous private clouds based on Hadoop clusters are established by many organizations. Since these data centers are often used by multiple users, optimizing the performance of Hadoop clusters is very necessary and significant. In this paper, we evaluate the five common Hadoop schedulers by using CloudRank-D, a publicly benchmark suite for private cloud. Throughput, Turnaround time and Data Processed per Second are chosen as the metrics of evaluation. Experimental results show that different task schedulers have different influences on system performance in different situation, so the choice of task schedulers is very critical for system performance improvement of Hadoop cluster. Specially, fair sharing with delay scheduling can improve data locality and reduce the cost of network communication between nodes, and DPS is improved by 20% than that of FIFO scheduler. Optimization and design of the scheduler need to refer to the characteristics of the workload. In the future, we will use more complex workloads to study and evaluate more efficient task schedulers for Hadoop based cloud systems. ACKNOWLEDGMENT Our work is supported in part by the National Natural Science Foundation of China under Grant No. 61372171 and the National Key Technology R&D Program of China under Contract No. 2012BAH23B03. REFERENCES [1] J. Zhan, L. Wang, X. Li, W. Shi, C. Weng, W. Zhang, and X. Zang, Cost-aware cooperative resource provisioning for heterogeneous workloads in data centers, IEEE Transl. Computers, in press. [2] M. Zaharia, D. Borthakur, J. S. Sarma, K. Elmeleegy, S. Shenker, and I. Stoica, Job scheduling for multi-user MapReduce clusters, Technical Report UCB/EECS-2009-55, EECS Department, University of California at Berkeley, April 2009. [3] M. Zaharia, D. Borthakur, J. S. Sarma, K. Elmeleegy, S. Shenker, and I. Stoica, Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling, Proc. the 5th European conference on Computer systems (EuroSys 2010), ACM Press, April 2010, pp. 265 278, doi: 10.1145/1755913.1755940. [4] L. Wang, J. Zhan, W. Shi, and Y. Liang, In cloud, can scientific communities benefit from the economies of scale? IEEE Transl. Parallel and Distributed Systems, vol. 23, pp. 296-303, Feb. 2012, doi: 10.1109/TPDS.2011.144. [5] W. Gao, Y. Zhu, Z. Jia, C. Luo, L. Wang, Z. Li, et al., Bigdatabench: a big data benchmark suite from web search engines, Proc. the third workshop on Architectures and Systems for Big Data (ASBD 2013) in conjunction with the 40th International Symp. Computer Architecture (ISCA 2013), arxiv preprint, May 2013, arxiv:1307.0320. [6] Apache Hadoop, http://hadoop.apache.org. [7] Capacity Scheduler Guide, http://hadoop.apache.org/docs/stable/capacity_scheduler.html. [8] Hadoop On Demand Documentation, http://hadoop.apache.org/docs/stable/hod_scheduler.html. [9] TORQUE Resource Manager, http://www.adaptivecomputing.com/products/open-source/torque/. [10] J. Dean and S. Ghemawat, MapReduce: simplified data processing on large clusters, Communications of the ACM, vol. 51, January 2008, pp. 107-113, doi: 10.1145/1327452.1327492. [11] Fair Scheduler Documentation, http://hadoop.apache.org/docs/stable/fair_scheduler.html. [12] The Hadoop Map-Reduce Capacity Scheduler, http://developer.yahoo.com/blogs/hadoop/hadoop-map-reducecapacity-scheduler-2531.html. [13] C. Luo, J. Zhan, Z. Jia, L. Wang, G. Lu, L. Zhang, et al., CloudRank-D: benchmarking and ranking cloud computing systems for data processing applications, Frontiers of Computer Science, vol. 6, August 2012, pp. 347-362, doi: 10.1007/s11704-012-2118-7. [14] Z. Jia, L. Wang, J. Zhan, L. Zhang, and C. Luo, Characterizing data analysis workloads in data centers, Proc. IEEE Symp. Workload 52
Characterization (IISWC2013), arxiv preprint, July 2013, arxiv:1307.8013. [15] J. Quan, CloudRank-D: a benchmark suite for private cloud systems, High Volume Computing (HVC) tutorial in conjunction with The 19th IEEE Symp. High Performance Computer Architecture (HPCA 2013), February 2013, http://prof.ict.ac.cn/cloudrank/cloudrank- D_HPCA_tutorial.pdf. [16] Hadoop Powered By List, http://wiki.apache.org/hadoop/poweredby. [17] Z. Ren, X. Xu, J. Wan, W. Shi, and M. Zhou, Workload characterization on a production Hadoop cluster: a case study on Taobao, Proc. IEEE International Symposium on Workload Characterization (IISWC2012), IEEE Press, November 2012, pp. 3-13, doi: 10.1109/IISWC.2012.6402895. 53