Do You Feel the Lag of Your Hadoop?


Yuxuan Jiang, Zhe Huang, and Danny H.K. Tsang
Department of Electronic and Computer Engineering
The Hong Kong University of Science and Technology, Hong Kong
{yjiangad, ecefelix,

Abstract: The configuration of a Hadoop cluster is critical to its performance: an improper configuration can severely degrade job execution. Unfortunately, systematic guidelines on how to configure a Hadoop cluster are still missing. In this paper, we undertake an empirical study of the key operations and mechanisms of Hadoop job execution, including the task assignment strategy and speculative execution. Based on the experiments, we provide suggestions on system configuration, particularly on matching the hardware resource partitioning scheme to the job splitting granularity.

I. INTRODUCTION

Recent years have witnessed a rapidly increasing demand for large-scale data processing, such as webpage indexing, data mining, scientific simulation, and spam detection. For example, Facebook processed more than 00 TB of new data every day according to a study conducted in the year 0 []. MapReduce [] has emerged as a promising parallel processing framework for big data analytics, and Apache Hadoop [] is its de facto open-source standard implementation. Hadoop has been adopted by numerous users throughout the world, including Twitter, eBay, Yahoo!, Facebook, and Hulu []. Its popularity is demonstrated by a recent report showing that the production Hadoop cluster operated by Yahoo! successfully processed over a thousand jobs from various users over a period of ten months []. Hadoop carries out enormous data-analysis jobs on computing clusters in a scale-out manner. The framework parallelizes data analysis by separating the processing into two parts: Map tasks, which perform filtering and sorting, and Reduce tasks, which perform a summary operation.
The performance of a Hadoop cluster depends on how Map and Reduce tasks are scheduled onto the nodes of the cluster. On the one hand, job schedulers [], [], [] have been proposed to address different performance issues in the resource allocation process across jobs. On the other hand, enhancing data locality through task-level scheduling inside jobs is also important [] []. However, resource allocation in Hadoop is complicated and depends on many system parameters and design decisions. Without careful fine-tuning, system performance can be far from optimal. Unfortunately, the resource allocation mechanisms in Hadoop are not documented in detail; for most users, they operate as black boxes. This motivates us to study these mechanisms empirically through extensive experiments. The goal of this paper is to shed some light on how to properly configure the Hadoop system so as to improve job execution performance. To understand how Hadoop interacts with system configurations, we investigate its detailed behaviors. In particular, we are interested in the following issues: (1) the Map and Reduce task assignment preference; (2) the Hadoop speculative execution mechanism for tasks; (3) the hardware granularity of Hadoop task slots; and (4) the granularity of job splitting. More specifically, the task assignment preference determines which Map or Reduce task is assigned to which node for execution. The speculative execution mechanism determines when and which task should be proactively duplicated as a backup for fault tolerance. The hardware granularity of Hadoop task slots decides how resources are partitioned and shared among multiple tasks, and the granularity of job splitting determines the size of each task of a job. These aspects are closely entangled, and an improper configuration can easily create a severe performance bottleneck. The official documents discuss these issues only at a very high level [].
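The four knobs above map onto concrete configuration properties. As an illustration only (the property values below are hypothetical, not the settings used in this paper), a classic Hadoop 1.x `mapred-site.xml` might pin down the slot granularity and speculative execution like this:

```xml
<!-- mapred-site.xml (Hadoop 1.x property names; values are examples only) -->
<configuration>
  <!-- hardware granularity: task slots per slave node -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>1</value>
  </property>
  <!-- speculative execution on/off for Map and Reduce tasks -->
  <property>
    <name>mapred.map.tasks.speculative.execution</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.reduce.tasks.speculative.execution</name>
    <value>true</value>
  </property>
</configuration>
```

Job splitting granularity is controlled separately, chiefly through the HDFS block size of the input data and the per-job hint `mapred.map.tasks`.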
Currently, users mainly rely on their own experience to choose the configuration parameters. From the experiments in this paper, our key observations and conclusions are:

- Hadoop task assignment depends only on data locality. When data locality is taken out of the picture, performance bottlenecks emerge due to the lack of workload balancing.
- The Hadoop speculative execution mechanism is simple and heuristic. Imperative backup tasks may be delayed or prevented, while unnecessary backups are likely to be created.
- Matching the hardware resource partitioning granularity with the job splitting granularity significantly improves job execution performance. However, this matching requires the user to have detailed knowledge of both the jobs and the hardware configuration of the cluster.

These points provide a general guideline for configuring the system parameters. More importantly, they open up new directions for improving the current Hadoop implementation. The rest of this paper is organized as follows. Key factors in Hadoop job execution are presented in Section II. Our experimental settings are described in Section III. Experimental results for task assignment, speculative execution, and granularity matching are reported and analyzed in Sections IV, V, and VI, respectively. Finally, Section VII concludes the paper.

II. BACKGROUND OF HADOOP JOB EXECUTION

A. Hadoop Resource Provisioning and Job Splitting

Computing task parallelization lies at the core of the MapReduce framework. A particular Hadoop job is parallelized

into multiple Map and Reduce tasks. Each Map task processes a portion of the data stored in the Hadoop Distributed File System (HDFS); Reduce tasks then sort and combine the Map results. Many factors influence the execution performance of a Hadoop job. One such factor is the job splitting scheme. The data associated with a job are split into blocks and stored in the HDFS. The number of Map tasks of one job should be no smaller than the number of data blocks associated with the job. By default, the two are exactly equal, and only one Reduce task is generated. The total number of tasks of a job should be carefully determined. If a job is split into a large number of tasks, each task requires less time and fewer resources to process, which makes task scheduling more flexible. But an excessive number of tasks introduces unnecessary queueing delay and extra task initialization overhead. Conversely, if the number of tasks is too small, scheduling becomes inflexible: once all the tasks have been scheduled, the Hadoop cluster cannot start new ones, even if it has spare capacity. Another key factor that influences job execution performance is the hardware resource partitioning granularity. In classical Hadoop (i.e., before Hadoop 2.0), the computing resources of a cluster are divided into capacity units called task slots, each able to process exactly one Hadoop task. From Hadoop 2.0 onward, Apache YARN [] manages the hardware resources, and the basic capacity units are called containers. To simplify the discussion, in this paper we refer to the capacity required to execute one task as a task slot. There exists a trade-off between the number of simultaneous task executions and the program parallelization efficiency.
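Both granularities introduced above are simple counts. The default splitting rule (one Map task per HDFS block of the input) makes the task count easy to compute; a minimal sketch, with purely illustrative sizes rather than this paper's settings:

```python
import math

def num_map_tasks(input_bytes: int, block_bytes: int) -> int:
    """By default, Hadoop creates one Map task per HDFS block of the
    input data, so the task count is the block count (rounded up)."""
    return math.ceil(input_bytes / block_bytes)

# An illustrative 6400 MB input with a 64 MB HDFS block size:
print(num_map_tasks(6400 * 1024**2, 64 * 1024**2))   # -> 100
# Halving the block size to 32 MB doubles the number of Map tasks,
# making scheduling more flexible but adding per-task start-up overhead.
print(num_map_tasks(6400 * 1024**2, 32 * 1024**2))   # -> 200
```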
When the number of tasks that can be executed in parallel is too small, jobs experience unnecessary queueing delay, because unscheduled tasks must wait for vacant task slots. However, if the cluster configuration allows an excessive number of simultaneous task executions, multi-threaded scheduling overheads and resource-sharing bottlenecks among slots emerge and deteriorate performance.

B. Speculative Execution

Speculative execution is a fault-tolerance mechanism in Hadoop. It constantly monitors the progress of tasks; if any task falls behind and is in danger of failure, the mechanism proactively creates an identical backup task in an available slot. Because backup tasks consume extra resources, unnecessary backups can slow down the execution of other active jobs. As a result, the speculative execution mechanism must be carefully designed.

III. EXPERIMENTAL SETUP

In the following experiments, Hadoop is deployed onto a computing cluster. Since this paper focuses on the Hadoop operations and mechanisms for a single job, a relatively small cluster with one master node and six slave nodes is sufficient. The master node is hosted by an HP Compaq desktop, and the six slave nodes are hosted by homogeneous virtual machines (VMs) in our private computing cloud, which consists of Dell PowerEdge servers. Each VM is allocated four virtual processing cores from Intel Xeon CPUs and a fixed allotment of memory. The HDFS is built on local storage on each server rather than on network-attached storage. In the experiments, the word-count job is adopted as the benchmark application. According to a trace study of a Yahoo! production Hadoop cluster [], a large fraction of submitted jobs are map-only or map-mostly; the word-count job falls into the map-mostly category.
This program is widely used in Hadoop performance analysis experiments, e.g., by MIT [] and Intel []. Since data locality in Hadoop is a well-studied subject [] [], we investigate other factors that influence performance by eliminating the data locality issue: in all of the following experiments, the HDFS data replication parameter is set to the number of slave nodes, so that each slave node maintains a complete copy of all the data. The Hadoop capacity scheduler [] is used to submit jobs. Only one job is submitted at a time, so the microscopic performance of a single job can be studied in detail.

IV. HADOOP TASK ASSIGNMENT PREFERENCE

The task assignment preference of Hadoop determines how the workload is balanced among slave nodes. In this part of the experiments, we investigate how Hadoop assigns job tasks to active slave nodes. In total, 0. GB of English Wikipedia dump data [] are split into data blocks in the HDFS and used to create the Map tasks; the number of tasks is chosen to create uneven workloads among the slave nodes. One Reduce task summarizes the Map results. To study the task assignment behavior, we take Map task assignment as an example. The experiment is repeated four times, and in each run the hardware resources of the Hadoop cluster are partitioned into a different number of task slots. Figure 1(a) shows the execution time of each Map task when the cluster is configured with a fixed number of Map slots per slave node; job tasks assigned to the same slave node are drawn in the same color. The results indicate that some slaves are left totally idle, one slave hosts only a single task, and the remaining slaves are fully loaded with all the remaining tasks. Figures 1(b), 1(c), and 1(d) show similar results when each slave node is configured with other numbers of Map slots.
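The assignment pattern just described (some nodes packed full, the rest idle) can be reproduced by a toy model that contrasts a fill-one-node-first policy with an even spread. This is a sketch of the assignment order only, with hypothetical node and slot counts, ignoring data locality and task lengths:

```python
def greedy_assign(num_tasks, num_nodes, slots_per_node):
    """Fill every slot on the current node before moving on to the
    next one (models a single scheduling wave)."""
    load = [0] * num_nodes
    node = 0
    for _ in range(num_tasks):
        if load[node] == slots_per_node:
            node += 1           # current node full, move to the next
        load[node] += 1
    return load

def round_robin_assign(num_tasks, num_nodes):
    """Even-spreading alternative: rotate tasks over all active nodes."""
    load = [0] * num_nodes
    for t in range(num_tasks):
        load[t % num_nodes] += 1
    return load

print(greedy_assign(9, 6, 4))    # -> [4, 4, 1, 0, 0, 0]: three nodes idle
print(round_robin_assign(9, 6))  # -> [2, 2, 2, 1, 1, 1]: balanced load
```

Under the greedy policy, the lone task on the third node runs on an under-loaded machine while the packed nodes contend for shared disk and memory, which is the asymmetry visible in Figure 1.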
In the above experiments, the results clearly indicate that job tasks are assigned to slave nodes one by one: the task scheduler assigns tasks to another slave node only when the current node's task slots are fully occupied. As a result, there is exactly one under-utilized slave node, which hosts whatever tasks remain. More importantly, the tasks hosted by this under-utilized slave node complete much earlier than the tasks running on fully utilized slave nodes (see the early-finishing tasks in Figures 1(a)-(d)). This observation contradicts the belief that Hadoop slots are homogeneous computational resources.

Fig. 1. Execution Time Distribution for Map Tasks (panels (a)-(d) vary the total number of Map slots).

The performance of a Hadoop task slot on a slave node depends on the total workload on that node. This can be explained by the resource bottleneck caused by sharing among all the task slots on a single slave node: for example, the storage bottleneck of a slave node prevents a large number of Map tasks from reading data at the same time. We conclude that Hadoop task assignment depends mainly on data locality, and that the lack of workload balancing can seriously degrade performance. We therefore suggest incorporating load balancing into Hadoop task assignment; for example, a simple heuristic would be to distribute tasks evenly among all active slave nodes while still respecting data locality.

V. HADOOP SPECULATIVE EXECUTION

The speculative execution mechanism launches backups for active tasks with slow progress. However, the official Hadoop documents give no clear description of the criterion that triggers backup tasks []. In this section, we study the triggering conditions of backup tasks in speculative execution. To simplify the experiment, an ebook downloaded from the Gutenberg project [] is submitted to the cluster, and several Map tasks are created. Artificial delay is programmed into the Map function of the Hadoop word-count job, so the progress of each Map task can be slowed down evenly at a fine granularity. The cluster is configured with enough Map slots to schedule all the tasks in a single wave. Without the injected delay, our measurements show that a Map task completes within seconds.
The experimental settings above let us freely control the duration of each individual Map task. Figure 2(a) shows the progress of the Map tasks over time when the same amount of delay is introduced for all of them; all the tasks have similar progress rates, and no backup task is triggered. Figure 2(b) shows that two backup tasks are generated when extra delay is introduced for two of the tasks, giving them slow progress. In Figure 2(c), where one task receives less extra delay than the slowest task but more than all the others, a backup is triggered only for the slowest task. These results imply that one necessary condition for launching a backup is the existence of a slow-running task whose progress rate falls below a certain threshold relative to the healthy tasks. However, the progress rate threshold is not the only triggering criterion. In the next part of the experiment, artificial delay is introduced for a single task only. By continuously increasing the injected delay from zero, the time condition that triggers backup tasks can be located. Figure 2(d) shows that no backup task is created if the delayed task finishes early enough, while Figure 2(e) shows that one backup task is launched once its execution time exceeds a threshold. Note that the delayed task has a much slower progress rate than its peers in both Figures 2(d) and 2(e); if the progress rate threshold were the only condition, backups would be triggered in both cases. More interestingly, whenever the delayed task runs long enough to trigger a backup, the backup is always launched at around one minute into the job. This suggests that speculative execution begins monitoring existing tasks for backup creation only after roughly one minute.
Fig. 2. Transient Progressions of Map Tasks: (a) uniform delay; (b) extra delay for two tasks; (c) higher delay for one task; (d) delay introduced for only one task; (e) more delay for that task; (f) uniform delay but fewer Map slots.

In Figure 2(f), each slave node is configured with only one Map task slot, so the Map tasks cannot all be scheduled within a single wave of the Map phase. The same uniform artificial delay as in Figure 2(a) is introduced for all the Map tasks. In this case, backup tasks are launched for the second-wave tasks right after the original tasks start. Combined with the observations from Figures 2(d) and 2(e), we conclude that backup tasks can only be launched after the job has been running for some time, an absolute period of around one minute in our experiment. Additionally, Hadoop checks the progress of the targeted normal task before launching its backup: if the targeted task is approaching completion, no backup is launched. Figure 2(f) also illustrates another feature of speculative execution. In this figure, the execution times of all the Map tasks are close to one another. Nevertheless, for the tasks scheduled in the second wave of the Map phase, backup tasks are launched almost immediately after the corresponding normal tasks are created; these tasks are considered at risk of failure from the very beginning of their execution. This phenomenon implies that what is compared among peer tasks is the absolute progress of a task at each time point, not its progress relative to the time at which the task was scheduled. This strategy obviously increases the probability that backup tasks are launched. However, the design is reasonable in terms of fully utilizing the computing resources of the cluster: in Hadoop, unscheduled normal tasks have a higher priority for free task slots than backup tasks, and if free slots remain after all normal tasks have been scheduled, it is better to use them for proactive backups than to leave them idle. Nevertheless, a potential design weakness of slow-task detection is exposed here: genuine slow-running tasks in terms of their relative progress rates may find no free slots for their backups if those slots are occupied by unnecessary backups of tasks that look slow by absolute progress but are healthy by relative progress.

We summarize our observations into the following conditions for speculative execution to launch a backup task:

- All normal tasks have already been scheduled, and there exists a free task slot in the cluster.
- There exists a slow-running task whose progress rate is behind some progress rate threshold.
- The job has lasted for at least around one minute.
- The slow-running task is not approaching completion at the time its backup would be triggered.

Our observations agree with the description of the speculative execution mechanism by Dinu et al. []. In summary, the current implementation of speculative execution is simple and heuristic: unnecessary backups are likely to be generated, while genuine slow-running tasks in terms of relative progress rates may not be granted resources for backups.
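The conditions above can be collected into a single predicate. The sketch below is our reading of the observed behaviour, not Hadoop's actual code; the threshold values (60 s job age, 20% rate gap, 90% near-completion cut-off) are illustrative assumptions:

```python
def should_launch_backup(progress, job_elapsed, mean_peer_rate,
                         all_scheduled, free_slots,
                         min_job_age=60.0, rate_gap=0.2, near_done=0.9):
    """Decide whether to launch a backup for one task, following the
    observed conditions: a free slot exists and all normal tasks are
    scheduled; the job is older than about a minute; the task is not
    nearly done; and the task's *absolute* progress rate (progress
    divided by time since the job, not the task, started) trails its
    peers by some margin."""
    if not all_scheduled or free_slots <= 0:
        return False
    if job_elapsed < min_job_age:
        return False
    if progress >= near_done:
        return False
    rate = progress / job_elapsed
    return rate < (1.0 - rate_gap) * mean_peer_rate

# A straggler at 20% progress after 90 s, peers progressing at ~0.9%/s:
print(should_launch_backup(0.20, 90.0, 0.009, True, 1))   # -> True
# Same straggler, but the job is only 30 s old: no backup yet.
print(should_launch_backup(0.20, 30.0, 0.009, True, 1))   # -> False
# A second-wave task that has barely started looks slow by absolute
# rate, so it is (perhaps unnecessarily) flagged right away:
print(should_launch_backup(0.05, 90.0, 0.009, True, 1))   # -> True
```

The last call reproduces the weakness discussed above: because the rate is measured from the job's start rather than the task's, freshly scheduled second-wave tasks are immediately judged slow.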
Also, the creation of backup tasks may be delayed by the condition that backups can only be launched after the job has lasted for about one minute. To improve the speculative execution mechanism, we suggest applying more advanced algorithms (e.g., machine-learning techniques) to the detection of slow-running tasks.

VI. RESOURCE PARTITION AND JOB SPLITTING SCHEMES

Hardware resource partitioning and job splitting granularity determine how efficiently the workload is parallelized and distributed inside the cluster. Given fixed hardware resources, we aim to find the optimal number of task slots per slave node, together with the job splitting strategy that best matches the hardware resource partitioning scheme. In the following experiments, the same 0. GB of English Wikipedia dump data is used, and the number of Map task slots on each slave node is varied. We focus mainly on the performance of the Map phase, because it accounts for most of the total job execution time in our experiment. To quantitatively measure the effects of job splitting, the job execution time is measured while the job is split into different numbers of Map tasks, for a range of total Map slot counts. The measurement results are reported in Table I. Reading down the columns, for a fixed job splitting scheme, as the total number of Map slots increases the job execution time follows a generic trend: it first decreases, reaches its optimum (the shortest time), and then increases again. Insufficient partitioning of task slots on a node leads to severe queueing delay because there is not enough parallelization, while an excessive number of task slots on a node incurs extra multi-threaded scheduling overhead.
Also, in this situation, a heavy workload on one node can cause a resource-sharing bottleneck among task slots, as indicated in Section IV, delaying task execution further.

TABLE I. MEASUREMENT RESULTS OF WORD-COUNT JOB EXECUTION TIMES (SEC). Rows index the total number of Map task slots; columns index the number of job splits.

We speculate from Table I that, in general, the optimal number of total task slots in a cluster is slightly larger than the total number of processing cores in that cluster. Given the total number of Map slots, the job execution time varies with the job splitting scheme, as the rows of Table I show; an appropriate match between the total number of task slots and the number of job splits is therefore required. Table I shows that splitting a job into exactly as many Map tasks as there are Map slots in the cluster achieves the best performance (execution times highlighted in red in the table). By the observations in Section IV, this splitting strategy ensures that every slot is assigned a normal task and the whole Map phase takes just one wave. If the total number of Map tasks is smaller than the total number of Map slots, performance degrades due to resource under-utilization: some slots or even whole nodes are left idle or used only for backups. Likewise, splitting a job into an integral multiple of the total number of Map slots achieves sub-optimal performance (execution times highlighted in blue in the table), because on the whole each slot is assigned one normal task in each wave. However, splitting a job into too many Map tasks deteriorates performance and should be avoided: the data processing is divided into tiny units, and initialization overheads dominate the task execution. In practice, tasks with longer execution times have higher probabilities of failure, so a job over a huge volume of data is better split into a larger number of tasks to reduce the cost of failure recovery.
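The wave structure behind these observations admits a crude back-of-the-envelope model. The sketch below is our own illustration, not the paper's model: it assumes perfectly even splits, a fixed per-task start-up overhead, synchronized waves, and hypothetical numbers throughout:

```python
import math

def estimated_map_time(num_tasks, total_slots, total_work, per_task_overhead):
    """Crude Map-phase makespan model: the phase runs in ceil(tasks/slots)
    waves, and each wave costs one task's share of the serial work plus a
    fixed start-up overhead (even splits, synchronized waves assumed)."""
    waves = math.ceil(num_tasks / total_slots)
    time_per_task = total_work / num_tasks + per_task_overhead
    return waves * time_per_task

SLOTS, WORK, OVERHEAD = 24, 2400.0, 2.0   # hypothetical cluster and job

print(estimated_map_time(24, SLOTS, WORK, OVERHEAD))   # -> 102.0 (one wave: optimal)
print(estimated_map_time(25, SLOTS, WORK, OVERHEAD))   # -> 196.0 (one straggling wave)
print(estimated_map_time(48, SLOTS, WORK, OVERHEAD))   # -> 104.0 (two full waves: sub-optimal)
print(estimated_map_time(480, SLOTS, WORK, OVERHEAD))  # -> 140.0 (overhead dominates)
```

Even this toy model reproduces the qualitative pattern of Table I: the minimum sits at exactly one task per slot, integral multiples come close, a slight mismatch (25 tasks on 24 slots) is disproportionately expensive, and very fine splitting pays mounting overhead.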
As a result, for jobs over huge volumes of data, the number of tasks is encouraged to be an integral multiple of the total number of slots in the cluster.

VII. CONCLUSION

How to configure the Hadoop system according to its job execution operations and mechanisms is of great significance to execution performance. In this paper, we perform extensive experiments to gain insights from a practical perspective. Based on the experimental observations, we provide suggestions on system configuration, particularly on granularity determination.

REFERENCES

[1] A. Menon, "Big data @ Facebook," in Proc. ACM Workshop on Management of Big Data Systems.
[2] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107-113, 2008.
[3] T. White, Hadoop: The Definitive Guide. O'Reilly Media, Inc.
[4] Hadoop Wiki: Powered By. [Online]. Available: hadoop/poweredby
[5] S. Kavulya, J. Tan, R. Gandhi, and P. Narasimhan, "An analysis of traces from a production MapReduce cluster," in Proc. IEEE/ACM Int. Conf. Cluster, Cloud and Grid Computing (CCGrid).
[6] K. Kc and K. Anyanwu, "Scheduling Hadoop jobs to meet deadlines," in Proc. IEEE Int. Conf. Cloud Computing Technology and Science (CloudCom).
[7] T. Sandholm and K. Lai, "Dynamic proportional share scheduling in Hadoop," in Job Scheduling Strategies for Parallel Processing. Springer.
[8] Z. Guo, G. Fox, and M. Zhou, "Investigation of data locality in MapReduce," in Proc. IEEE/ACM Int. Symp. Cluster, Cloud and Grid Computing (CCGrid).
[9] X. Zhang, Z. Zhong, S. Feng, B. Tu, and J. Fan, "Improving data locality of MapReduce by scheduling in homogeneous computing environments," in Proc. IEEE Int. Symp. Parallel and Distributed Processing with Applications (ISPA).
[10] J. Jin, J. Luo, A. Song, F. Dong, and R. Xiong, "BAR: An efficient data locality driven task scheduling algorithm for cloud computing," in Proc. IEEE/ACM Int. Symp. Cluster, Cloud and Grid Computing (CCGrid).
[11] W. Wang, K. Zhu, L. Ying, J. Tan, and L. Zhang, "A throughput optimal algorithm for map task scheduling in MapReduce with data locality," ACM SIGMETRICS Performance Evaluation Review.
[12] Hadoop YARN. [Online]. Available: current/hadoop-yarn/hadoop-yarn-site/yarn.html
[13] Y. Mao, R. Morris, and M. F. Kaashoek, "Optimizing MapReduce for multicore architectures," Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Tech. Rep.
[14] Optimizing Hadoop Deployments. [Online]. Available: cloud-computing-optimizing-hadoop-deployments-paper.pdf
[15] Wiki Dump. [Online].
[16] Free ebooks: Project Gutenberg. [Online]. Available: gutenberg.org/
[17] F. Dinu and T. Ng, "Understanding the effects and implications of compute node related failures in Hadoop," in Proc. ACM Int. Symp. High-Performance Parallel and Distributed Computing (HPDC).


More information

Keywords: Big Data, HDFS, Map Reduce, Hadoop

Keywords: Big Data, HDFS, Map Reduce, Hadoop Volume 5, Issue 7, July 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Configuration Tuning

More information

Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture

Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture He Huang, Shanshan Li, Xiaodong Yi, Feng Zhang, Xiangke Liao and Pan Dong School of Computer Science National

More information

Introduction to Hadoop

Introduction to Hadoop Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction

More information

An Oracle White Paper June 2012. High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database

An Oracle White Paper June 2012. High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database An Oracle White Paper June 2012 High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database Executive Overview... 1 Introduction... 1 Oracle Loader for Hadoop... 2 Oracle Direct

More information

Scheduling Algorithms in MapReduce Distributed Mind

Scheduling Algorithms in MapReduce Distributed Mind Scheduling Algorithms in MapReduce Distributed Mind Karthik Kotian, Jason A Smith, Ye Zhang Schedule Overview of topic (review) Hypothesis Research paper 1 Research paper 2 Research paper 3 Project software

More information

MapReduce and Hadoop Distributed File System

MapReduce and Hadoop Distributed File System MapReduce and Hadoop Distributed File System 1 B. RAMAMURTHY Contact: Dr. Bina Ramamurthy CSE Department University at Buffalo (SUNY) bina@buffalo.edu http://www.cse.buffalo.edu/faculty/bina Partially

More information

Computing Load Aware and Long-View Load Balancing for Cluster Storage Systems

Computing Load Aware and Long-View Load Balancing for Cluster Storage Systems 215 IEEE International Conference on Big Data (Big Data) Computing Load Aware and Long-View Load Balancing for Cluster Storage Systems Guoxin Liu and Haiying Shen and Haoyu Wang Department of Electrical

More information

Energy Efficient MapReduce

Energy Efficient MapReduce Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing

More information

Big Data and Apache Hadoop s MapReduce

Big Data and Apache Hadoop s MapReduce Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23

More information

Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan

Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan Abstract Big Data is revolutionizing 21st-century with increasingly huge amounts of data to store and be

More information

A Framework for Performance Analysis and Tuning in Hadoop Based Clusters

A Framework for Performance Analysis and Tuning in Hadoop Based Clusters A Framework for Performance Analysis and Tuning in Hadoop Based Clusters Garvit Bansal Anshul Gupta Utkarsh Pyne LNMIIT, Jaipur, India Email: [garvit.bansal anshul.gupta utkarsh.pyne] @lnmiit.ac.in Manish

More information

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Introduction to Big Data! with Apache Spark UC#BERKELEY# Introduction to Big Data! with Apache Spark" UC#BERKELEY# This Lecture" The Big Data Problem" Hardware for Big Data" Distributing Work" Handling Failures and Slow Machines" Map Reduce and Complex Jobs"

More information

DynamicMR: A Dynamic Slot Allocation Optimization Framework for MapReduce Clusters

DynamicMR: A Dynamic Slot Allocation Optimization Framework for MapReduce Clusters IEEE TRANSACTIONS ON CLOUD COMPUTING 1 DynamicMR: A Dynamic Slot Allocation Optimization Framework for MapReduce Clusters Shanjiang Tang, Bu-Sung Lee, Bingsheng He Abstract MapReduce is a popular computing

More information

Scheduling Data Intensive Workloads through Virtualization on MapReduce based Clouds

Scheduling Data Intensive Workloads through Virtualization on MapReduce based Clouds ABSTRACT Scheduling Data Intensive Workloads through Virtualization on MapReduce based Clouds 1 B.Thirumala Rao, 2 L.S.S.Reddy Department of Computer Science and Engineering, Lakireddy Bali Reddy College

More information

http://www.paper.edu.cn

http://www.paper.edu.cn 5 10 15 20 25 30 35 A platform for massive railway information data storage # SHAN Xu 1, WANG Genying 1, LIU Lin 2** (1. Key Laboratory of Communication and Information Systems, Beijing Municipal Commission

More information

GraySort on Apache Spark by Databricks

GraySort on Apache Spark by Databricks GraySort on Apache Spark by Databricks Reynold Xin, Parviz Deyhim, Ali Ghodsi, Xiangrui Meng, Matei Zaharia Databricks Inc. Apache Spark Sorting in Spark Overview Sorting Within a Partition Range Partitioner

More information

Big Data Storage Architecture Design in Cloud Computing

Big Data Storage Architecture Design in Cloud Computing Big Data Storage Architecture Design in Cloud Computing Xuebin Chen 1, Shi Wang 1( ), Yanyan Dong 1, and Xu Wang 2 1 College of Science, North China University of Science and Technology, Tangshan, Hebei,

More information

Analysis of Information Management and Scheduling Technology in Hadoop

Analysis of Information Management and Scheduling Technology in Hadoop Analysis of Information Management and Scheduling Technology in Hadoop Ma Weihua, Zhang Hong, Li Qianmu, Xia Bin School of Computer Science and Technology Nanjing University of Science and Engineering

More information

A Middleware Strategy to Survive Compute Peak Loads in Cloud

A Middleware Strategy to Survive Compute Peak Loads in Cloud A Middleware Strategy to Survive Compute Peak Loads in Cloud Sasko Ristov Ss. Cyril and Methodius University Faculty of Information Sciences and Computer Engineering Skopje, Macedonia Email: sashko.ristov@finki.ukim.mk

More information

Open source Google-style large scale data analysis with Hadoop

Open source Google-style large scale data analysis with Hadoop Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical

More information

ISSN:2320-0790. Keywords: HDFS, Replication, Map-Reduce I Introduction:

ISSN:2320-0790. Keywords: HDFS, Replication, Map-Reduce I Introduction: ISSN:2320-0790 Dynamic Data Replication for HPC Analytics Applications in Hadoop Ragupathi T 1, Sujaudeen N 2 1 PG Scholar, Department of CSE, SSN College of Engineering, Chennai, India 2 Assistant Professor,

More information

Optimization of Distributed Crawler under Hadoop

Optimization of Distributed Crawler under Hadoop MATEC Web of Conferences 22, 0202 9 ( 2015) DOI: 10.1051/ matecconf/ 2015220202 9 C Owned by the authors, published by EDP Sciences, 2015 Optimization of Distributed Crawler under Hadoop Xiaochen Zhang*

More information

Cost-effective Resource Provisioning for MapReduce in a Cloud

Cost-effective Resource Provisioning for MapReduce in a Cloud 1 -effective Resource Provisioning for MapReduce in a Cloud Balaji Palanisamy, Member, IEEE, Aameek Singh, Member, IEEE Ling Liu, Senior Member, IEEE Abstract This paper presents a new MapReduce cloud

More information

An Experimental Study of Load Balancing of OpenNebula Open-Source Cloud Computing Platform

An Experimental Study of Load Balancing of OpenNebula Open-Source Cloud Computing Platform An Experimental Study of Load Balancing of OpenNebula Open-Source Cloud Computing Platform A B M Moniruzzaman 1, Kawser Wazed Nafi 2, Prof. Syed Akhter Hossain 1 and Prof. M. M. A. Hashem 1 Department

More information

Benchmarking Hadoop & HBase on Violin

Benchmarking Hadoop & HBase on Violin Technical White Paper Report Technical Report Benchmarking Hadoop & HBase on Violin Harnessing Big Data Analytics at the Speed of Memory Version 1.0 Abstract The purpose of benchmarking is to show advantages

More information

A Dynamic Resource Management with Energy Saving Mechanism for Supporting Cloud Computing

A Dynamic Resource Management with Energy Saving Mechanism for Supporting Cloud Computing A Dynamic Resource Management with Energy Saving Mechanism for Supporting Cloud Computing Liang-Teh Lee, Kang-Yuan Liu, Hui-Yang Huang and Chia-Ying Tseng Department of Computer Science and Engineering,

More information

Reduction of Data at Namenode in HDFS using harballing Technique

Reduction of Data at Namenode in HDFS using harballing Technique Reduction of Data at Namenode in HDFS using harballing Technique Vaibhav Gopal Korat, Kumar Swamy Pamu vgkorat@gmail.com swamy.uncis@gmail.com Abstract HDFS stands for the Hadoop Distributed File System.

More information

Hadoop on a Low-Budget General Purpose HPC Cluster in Academia

Hadoop on a Low-Budget General Purpose HPC Cluster in Academia Hadoop on a Low-Budget General Purpose HPC Cluster in Academia Paolo Garza, Paolo Margara, Nicolò Nepote, Luigi Grimaudo, and Elio Piccolo Dipartimento di Automatica e Informatica, Politecnico di Torino,

More information

R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5

R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 Distributed data processing in heterogeneous cloud environments R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 1 uskenbaevar@gmail.com, 2 abu.kuandykov@gmail.com,

More information

Exploiting Cloud Heterogeneity for Optimized Cost/Performance MapReduce Processing

Exploiting Cloud Heterogeneity for Optimized Cost/Performance MapReduce Processing Exploiting Cloud Heterogeneity for Optimized Cost/Performance MapReduce Processing Zhuoyao Zhang University of Pennsylvania, USA zhuoyao@seas.upenn.edu Ludmila Cherkasova Hewlett-Packard Labs, USA lucy.cherkasova@hp.com

More information

Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications

Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications Ahmed Abdulhakim Al-Absi, Dae-Ki Kang and Myong-Jong Kim Abstract In Hadoop MapReduce distributed file system, as the input

More information

A SIMULATOR FOR LOAD BALANCING ANALYSIS IN DISTRIBUTED SYSTEMS

A SIMULATOR FOR LOAD BALANCING ANALYSIS IN DISTRIBUTED SYSTEMS Mihai Horia Zaharia, Florin Leon, Dan Galea (3) A Simulator for Load Balancing Analysis in Distributed Systems in A. Valachi, D. Galea, A. M. Florea, M. Craus (eds.) - Tehnologii informationale, Editura

More information

Mobile Cloud Computing for Data-Intensive Applications

Mobile Cloud Computing for Data-Intensive Applications Mobile Cloud Computing for Data-Intensive Applications Senior Thesis Final Report Vincent Teo, vct@andrew.cmu.edu Advisor: Professor Priya Narasimhan, priya@cs.cmu.edu Abstract The computational and storage

More information

An Adaptive Scheduling Algorithm for Dynamic Heterogeneous Hadoop Systems

An Adaptive Scheduling Algorithm for Dynamic Heterogeneous Hadoop Systems An Adaptive Scheduling Algorithm for Dynamic Heterogeneous Hadoop Systems Aysan Rasooli, Douglas G. Down Department of Computing and Software McMaster University {rasooa, downd}@mcmaster.ca Abstract The

More information

SCHEDULING IN CLOUD COMPUTING

SCHEDULING IN CLOUD COMPUTING SCHEDULING IN CLOUD COMPUTING Lipsa Tripathy, Rasmi Ranjan Patra CSA,CPGS,OUAT,Bhubaneswar,Odisha Abstract Cloud computing is an emerging technology. It process huge amount of data so scheduling mechanism

More information

Big Data Analysis and Its Scheduling Policy Hadoop

Big Data Analysis and Its Scheduling Policy Hadoop IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 1, Ver. IV (Jan Feb. 2015), PP 36-40 www.iosrjournals.org Big Data Analysis and Its Scheduling Policy

More information

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN Hadoop MPDL-Frühstück 9. Dezember 2013 MPDL INTERN Understanding Hadoop Understanding Hadoop What's Hadoop about? Apache Hadoop project (started 2008) downloadable open-source software library (current

More information

HiBench Introduction. Carson Wang (carson.wang@intel.com) Software & Services Group

HiBench Introduction. Carson Wang (carson.wang@intel.com) Software & Services Group HiBench Introduction Carson Wang (carson.wang@intel.com) Agenda Background Workloads Configurations Benchmark Report Tuning Guide Background WHY Why we need big data benchmarking systems? WHAT What is

More information

Figure 1. The cloud scales: Amazon EC2 growth [2].

Figure 1. The cloud scales: Amazon EC2 growth [2]. - Chung-Cheng Li and Kuochen Wang Department of Computer Science National Chiao Tung University Hsinchu, Taiwan 300 shinji10343@hotmail.com, kwang@cs.nctu.edu.tw Abstract One of the most important issues

More information

Non-intrusive Slot Layering in Hadoop

Non-intrusive Slot Layering in Hadoop 213 13th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing Non-intrusive Layering in Hadoop Peng Lu, Young Choon Lee, Albert Y. Zomaya Center for Distributed and High Performance Computing,

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON HIGH PERFORMANCE DATA STORAGE ARCHITECTURE OF BIGDATA USING HDFS MS.

More information

Performance and Energy Efficiency of. Hadoop deployment models

Performance and Energy Efficiency of. Hadoop deployment models Performance and Energy Efficiency of Hadoop deployment models Contents Review: What is MapReduce Review: What is Hadoop Hadoop Deployment Models Metrics Experiment Results Summary MapReduce Introduced

More information

CURTAIL THE EXPENDITURE OF BIG DATA PROCESSING USING MIXED INTEGER NON-LINEAR PROGRAMMING

CURTAIL THE EXPENDITURE OF BIG DATA PROCESSING USING MIXED INTEGER NON-LINEAR PROGRAMMING Journal homepage: http://www.journalijar.com INTERNATIONAL JOURNAL OF ADVANCED RESEARCH RESEARCH ARTICLE CURTAIL THE EXPENDITURE OF BIG DATA PROCESSING USING MIXED INTEGER NON-LINEAR PROGRAMMING R.Kohila

More information

MapReduce and Hadoop Distributed File System V I J A Y R A O

MapReduce and Hadoop Distributed File System V I J A Y R A O MapReduce and Hadoop Distributed File System 1 V I J A Y R A O The Context: Big-data Man on the moon with 32KB (1969); my laptop had 2GB RAM (2009) Google collects 270PB data in a month (2007), 20000PB

More information

A Novel Way of Deduplication Approach for Cloud Backup Services Using Block Index Caching Technique

A Novel Way of Deduplication Approach for Cloud Backup Services Using Block Index Caching Technique A Novel Way of Deduplication Approach for Cloud Backup Services Using Block Index Caching Technique Jyoti Malhotra 1,Priya Ghyare 2 Associate Professor, Dept. of Information Technology, MIT College of

More information

Introduction to Hadoop

Introduction to Hadoop 1 What is Hadoop? Introduction to Hadoop We are living in an era where large volumes of data are available and the problem is to extract meaning from the data avalanche. The goal of the software tools

More information

Hadoop Cluster Applications

Hadoop Cluster Applications Hadoop Overview Data analytics has become a key element of the business decision process over the last decade. Classic reporting on a dataset stored in a database was sufficient until recently, but yesterday

More information

Improving MapReduce Performance in Heterogeneous Environments

Improving MapReduce Performance in Heterogeneous Environments UC Berkeley Improving MapReduce Performance in Heterogeneous Environments Matei Zaharia, Andy Konwinski, Anthony Joseph, Randy Katz, Ion Stoica University of California at Berkeley Motivation 1. MapReduce

More information

MapReduce Approach to Collective Classification for Networks

MapReduce Approach to Collective Classification for Networks MapReduce Approach to Collective Classification for Networks Wojciech Indyk 1, Tomasz Kajdanowicz 1, Przemyslaw Kazienko 1, and Slawomir Plamowski 1 Wroclaw University of Technology, Wroclaw, Poland Faculty

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015 RESEARCH ARTICLE OPEN ACCESS Ensuring Reliability and High Availability in Cloud by Employing a Fault Tolerance Enabled Load Balancing Algorithm G.Gayathri [1], N.Prabakaran [2] Department of Computer

More information

Task Scheduling in Hadoop

Task Scheduling in Hadoop Task Scheduling in Hadoop Sagar Mamdapure Munira Ginwala Neha Papat SAE,Kondhwa SAE,Kondhwa SAE,Kondhwa Abstract Hadoop is widely used for storing large datasets and processing them efficiently under distributed

More information

Energy-Saving Cloud Computing Platform Based On Micro-Embedded System

Energy-Saving Cloud Computing Platform Based On Micro-Embedded System Energy-Saving Cloud Computing Platform Based On Micro-Embedded System Wen-Hsu HSIEH *, San-Peng KAO **, Kuang-Hung TAN **, Jiann-Liang CHEN ** * Department of Computer and Communication, De Lin Institute

More information

Methodology for predicting the energy consumption of SPMD application on virtualized environments *

Methodology for predicting the energy consumption of SPMD application on virtualized environments * Methodology for predicting the energy consumption of SPMD application on virtualized environments * Javier Balladini, Ronal Muresano +, Remo Suppi +, Dolores Rexachs + and Emilio Luque + * Computer Engineering

More information

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012 MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte

More information

Shareability and Locality Aware Scheduling Algorithm in Hadoop for Mobile Cloud Computing

Shareability and Locality Aware Scheduling Algorithm in Hadoop for Mobile Cloud Computing Shareability and Locality Aware Scheduling Algorithm in Hadoop for Mobile Cloud Computing Hsin-Wen Wei 1,2, Che-Wei Hsu 2, Tin-Yu Wu 3, Wei-Tsong Lee 1 1 Department of Electrical Engineering, Tamkang University

More information

CloudSim: A Toolkit for Modeling and Simulation of Cloud Computing Environments and Evaluation of Resource Provisioning Algorithms

CloudSim: A Toolkit for Modeling and Simulation of Cloud Computing Environments and Evaluation of Resource Provisioning Algorithms CloudSim: A Toolkit for Modeling and Simulation of Cloud Computing Environments and Evaluation of Resource Provisioning Algorithms Rodrigo N. Calheiros, Rajiv Ranjan, Anton Beloglazov, César A. F. De Rose,

More information

Text Mining Approach for Big Data Analysis Using Clustering and Classification Methodologies

Text Mining Approach for Big Data Analysis Using Clustering and Classification Methodologies Text Mining Approach for Big Data Analysis Using Clustering and Classification Methodologies Somesh S Chavadi 1, Dr. Asha T 2 1 PG Student, 2 Professor, Department of Computer Science and Engineering,

More information

How To Balance In Cloud Computing

How To Balance In Cloud Computing A Review on Load Balancing Algorithms in Cloud Hareesh M J Dept. of CSE, RSET, Kochi hareeshmjoseph@ gmail.com John P Martin Dept. of CSE, RSET, Kochi johnpm12@gmail.com Yedhu Sastri Dept. of IT, RSET,

More information

Federated Big Data for resource aggregation and load balancing with DIRAC

Federated Big Data for resource aggregation and load balancing with DIRAC Procedia Computer Science Volume 51, 2015, Pages 2769 2773 ICCS 2015 International Conference On Computational Science Federated Big Data for resource aggregation and load balancing with DIRAC Víctor Fernández

More information

The Improved Job Scheduling Algorithm of Hadoop Platform

The Improved Job Scheduling Algorithm of Hadoop Platform The Improved Job Scheduling Algorithm of Hadoop Platform Yingjie Guo a, Linzhi Wu b, Wei Yu c, Bin Wu d, Xiaotian Wang e a,b,c,d,e University of Chinese Academy of Sciences 100408, China b Email: wulinzhi1001@163.com

More information

Evaluating partitioning of big graphs

Evaluating partitioning of big graphs Evaluating partitioning of big graphs Fredrik Hallberg, Joakim Candefors, Micke Soderqvist fhallb@kth.se, candef@kth.se, mickeso@kth.se Royal Institute of Technology, Stockholm, Sweden Abstract. Distributed

More information

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Kanchan A. Khedikar Department of Computer Science & Engineering Walchand Institute of Technoloy, Solapur, Maharashtra,

More information

CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES

CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES 1 MYOUNGJIN KIM, 2 CUI YUN, 3 SEUNGHO HAN, 4 HANKU LEE 1,2,3,4 Department of Internet & Multimedia Engineering,

More information

Performance Prediction, Sizing and Capacity Planning for Distributed E-Commerce Applications

Performance Prediction, Sizing and Capacity Planning for Distributed E-Commerce Applications Performance Prediction, Sizing and Capacity Planning for Distributed E-Commerce Applications by Samuel D. Kounev (skounev@ito.tu-darmstadt.de) Information Technology Transfer Office Abstract Modern e-commerce

More information

A PERFORMANCE ANALYSIS of HADOOP CLUSTERS in OPENSTACK CLOUD and in REAL SYSTEM

A PERFORMANCE ANALYSIS of HADOOP CLUSTERS in OPENSTACK CLOUD and in REAL SYSTEM A PERFORMANCE ANALYSIS of HADOOP CLUSTERS in OPENSTACK CLOUD and in REAL SYSTEM Ramesh Maharjan and Manoj Shakya Department of Computer Science and Engineering Dhulikhel, Kavre, Nepal lazymesh@gmail.com,

More information

Analysis and Modeling of MapReduce s Performance on Hadoop YARN

Analysis and Modeling of MapReduce s Performance on Hadoop YARN Analysis and Modeling of MapReduce s Performance on Hadoop YARN Qiuyi Tang Dept. of Mathematics and Computer Science Denison University tang_j3@denison.edu Dr. Thomas C. Bressoud Dept. of Mathematics and

More information

Index Terms : Load rebalance, distributed file systems, clouds, movement cost, load imbalance, chunk.

Index Terms : Load rebalance, distributed file systems, clouds, movement cost, load imbalance, chunk. Load Rebalancing for Distributed File Systems in Clouds. Smita Salunkhe, S. S. Sannakki Department of Computer Science and Engineering KLS Gogte Institute of Technology, Belgaum, Karnataka, India Affiliated

More information

An Efficient Hybrid P2P MMOG Cloud Architecture for Dynamic Load Management. Ginhung Wang, Kuochen Wang

An Efficient Hybrid P2P MMOG Cloud Architecture for Dynamic Load Management. Ginhung Wang, Kuochen Wang 1 An Efficient Hybrid MMOG Cloud Architecture for Dynamic Load Management Ginhung Wang, Kuochen Wang Abstract- In recent years, massively multiplayer online games (MMOGs) become more and more popular.

More information

Map-Parallel Scheduling (mps) using Hadoop environment for job scheduler and time span for Multicore Processors

Map-Parallel Scheduling (mps) using Hadoop environment for job scheduler and time span for Multicore Processors Map-Parallel Scheduling (mps) using Hadoop environment for job scheduler and time span for Sudarsanam P Abstract G. Singaravel Parallel computing is an base mechanism for data process with scheduling task,

More information

PROBLEM DIAGNOSIS FOR CLOUD COMPUTING

PROBLEM DIAGNOSIS FOR CLOUD COMPUTING PROBLEM DIAGNOSIS FOR CLOUD COMPUTING Jiaqi Tan, Soila Kavulya, Xinghao Pan, Mike Kasick, Keith Bare, Eugene Marinelli, Rajeev Gandhi Priya Narasimhan Carnegie Mellon University Automated Problem Diagnosis

More information

Survey on Job Schedulers in Hadoop Cluster

Survey on Job Schedulers in Hadoop Cluster IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 15, Issue 1 (Sep. - Oct. 2013), PP 46-50 Bincy P Andrews 1, Binu A 2 1 (Rajagiri School of Engineering and Technology,

More information

Chapter 7. Using Hadoop Cluster and MapReduce

Chapter 7. Using Hadoop Cluster and MapReduce Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in

More information

Performance Modeling of MapReduce Jobs in Heterogeneous Cloud Environments Zhuoyao Zhang University of Pennsylvania zhuoyao@seas.upenn.

Performance Modeling of MapReduce Jobs in Heterogeneous Cloud Environments Zhuoyao Zhang University of Pennsylvania zhuoyao@seas.upenn. Performance Modeling of MapReduce Jobs in Heterogeneous Cloud Environments Zhuoyao Zhang University of Pennsylvania zhuoyao@seas.upenn.edu Ludmila Cherkasova Hewlett-Packard Labs lucy.cherkasova@hp.com

More information

The Performance Characteristics of MapReduce Applications on Scalable Clusters

The Performance Characteristics of MapReduce Applications on Scalable Clusters The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 wottri_k1@denison.edu ABSTRACT Many cluster owners and operators have

More information

Residual Traffic Based Task Scheduling in Hadoop

Residual Traffic Based Task Scheduling in Hadoop Residual Traffic Based Task Scheduling in Hadoop Daichi Tanaka University of Tsukuba Graduate School of Library, Information and Media Studies Tsukuba, Japan e-mail: s1421593@u.tsukuba.ac.jp Masatoshi

More information

High Performance Computing MapReduce & Hadoop. 17th Apr 2014

High Performance Computing MapReduce & Hadoop. 17th Apr 2014 High Performance Computing MapReduce & Hadoop 17th Apr 2014 MapReduce Programming model for parallel processing vast amounts of data (TBs/PBs) distributed on commodity clusters Borrows from map() and reduce()

More information