Evaluating Task Scheduling in Hadoop-based Cloud Systems

2013 IEEE International Conference on Big Data

Shengyuan Liu, Jungang Xu
College of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing, China

Zongzhen Liu
Computer Network Information Center, Chinese Academy of Sciences, Beijing, China

Xu Liu
Department of Computer Science, Rice University, Houston, USA

Abstract: Nowadays, private clouds are widely used for resource sharing, and Hadoop-based clusters are the most popular implementations of private clouds. However, because workload traces are not publicly available, little previous work compares and evaluates different cloud solutions with publicly available benchmarks. In this paper, we use a recently released cloud benchmark suite, CloudRank-D, to quantitatively evaluate five different Hadoop task schedulers: FIFO, capacity, naïve fair sharing, fair sharing with delay, and HOD (Hadoop On Demand) scheduling. Our experiments show that with an appropriate scheduler, the throughput of a private cloud can be improved by 20%.

Keywords: Cloud, Hadoop, task scheduling, evaluation.

I. INTRODUCTION

Private clouds are widely used in modern enterprises [1][2][3][4][5]. A well-known implementation of private clouds is Apache Hadoop [6], an open-source software framework for processing large volumes of data on a cluster. Commonly, a private cloud serves multiple users, and each user may have different priorities, task types, and data sizes. Such user variations are usually significant, which makes task scheduling challenging. Five task schedulers are implemented in Hadoop: FIFO (First In First Out), capacity [7], naïve fair sharing [2], fair sharing with delay [3], and HOD (Hadoop On Demand) scheduling [8]. FIFO, the default scheduler of Hadoop, executes tasks according to their arrival order. Capacity, a multi-user scheduler, applies multi-level resource constraints to make full use of resources.
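As a point of reference for the comparisons that follow, FIFO's behavior is trivial to model. The class below is a toy sketch under assumed names, not Hadoop's JobTracker code:

```python
from collections import deque

class FifoScheduler:
    """Toy model of FIFO scheduling: jobs run strictly in arrival order,
    ignoring size and priority (illustrative, not Hadoop's JobTracker)."""

    def __init__(self):
        self._queue = deque()

    def submit(self, job):
        self._queue.append(job)

    def next_job(self):
        # When a task slot frees up, the head of the queue runs first.
        return self._queue.popleft() if self._queue else None

sched = FifoScheduler()
for job in ["small-sort", "large-grep", "k-means"]:
    sched.submit(job)
assert sched.next_job() == "small-sort"  # arrival order, regardless of job size
```

This simplicity is exactly why FIFO suffers in multi-user settings: a large job at the head of the queue delays everything behind it.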
In the fair sharing with delay scheduler, Zaharia et al. [3] propose a simple algorithm called delay scheduling to improve data locality: when the job that should be scheduled next according to fairness cannot launch a local task, it waits for a small amount of time, letting other jobs launch tasks instead. HOD, built on the Torque resource manager [9], targets virtual Hadoop clusters.

Previous work [3] presents an evaluation of FIFO, naïve fair, and fair with delay scheduling. However, its workload traces are not publicly available, and hence users cannot repeat the experiments to compare or optimize different task scheduling algorithms. In this paper, we use a publicly available cloud benchmark suite, CloudRank-D, to evaluate five different Hadoop task schedulers: FIFO, capacity, naïve fair sharing, fair sharing with delay, and HOD (Hadoop On Demand) scheduling. Compared with the workloads in [3], the workloads in CloudRank-D are more diverse, including basic operations, data mining, and data warehouse operations, and users can increase or decrease both the data and workload scales according to their requirements.

The main contribution of this paper is that we use a publicly available benchmark suite to quantitatively evaluate five Hadoop schedulers. We report the performance of a small-scale Hadoop deployment, including average turnaround time, throughput, average job execution time, average waiting time, and data processed per second. Our work provides a basis for further optimization of task scheduling in Hadoop-based systems.

The rest of the paper is organized as follows: Section II introduces Hadoop task schedulers as background knowledge. Section III describes CloudRank-D. Section IV presents the evaluation of Hadoop's five schedulers. Finally, Section V concludes the paper and discusses future work.

II. HADOOP TASK SCHEDULER

A. Hadoop Overview

Apache Hadoop is an open-source software framework for distributed processing of large data sets across clusters of computers. Hadoop consists of the MapReduce [10] programming model and the Hadoop Distributed File System (HDFS). In the MapReduce model, a job is divided into two parts: Map tasks and Reduce tasks. First, the data set is divided into several independent blocks, which are delivered to Map tasks as key-value pairs. Map tasks produce intermediate key-value pairs, which are then transferred to Reduce tasks in the shuffle phase. Finally, the

intermediate data is processed by Reduce tasks to generate the final output key-value pairs. HDFS is designed to store all input and output data, and multiple replicas of each data block are stored on different cluster nodes.

B. Hadoop Task Schedulers

In Hadoop, the task scheduler is a pluggable and important module that assigns idle system resources to jobs according to a certain strategy. Because the task scheduler is pluggable, users can select or design schedulers according to their requirements, and several schedulers have been proposed by different organizations and researchers to satisfy specific needs. Here we introduce the five common Hadoop schedulers mentioned above.

1) FIFO
The FIFO scheduler is the default scheduler in Hadoop. In FIFO scheduling, the first job in the work queue in terms of submission time is chosen and executed by the JobTracker first; job size and priority are ignored. The FIFO algorithm is efficient and easy to implement, but its drawbacks are also obvious, such as low resource utilization and no support for multi-user execution.

2) Naïve fair sharing
The core idea of the fair scheduler is to assign resources to all jobs as evenly as possible, so that each job gets the same share of resources. When only one job is running in the system, it monopolizes the entire cluster and uses all computing resources. Once new jobs are submitted, the system assigns idle task slots to them, ensuring that each job receives approximately the same amount of computing resources. Jobs that require less time can access the CPU and complete within a reasonable time, while jobs that require more time to execute are still able to access resources at the same time [11]. All jobs are grouped and pushed into a set of job pools created by Hadoop in advance. Each job pool can be assigned a minimum share of resources, and all pools have equal shares by default.
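The equal-share idea above can be sketched as a small allocation routine. Pool names and the slot model are illustrative assumptions, not the Fair Scheduler's real data structures:

```python
def fair_shares(demands, total_slots):
    """Toy equal-share allocation: every pool with pending tasks gets the
    same share of slots, and slots a pool cannot use flow back to the rest.
    Illustrative only, not the Hadoop Fair Scheduler's real algorithm."""
    shares = {p: 0 for p in demands}
    active = {p for p, d in demands.items() if d > 0}
    free = total_slots
    while free > 0 and active:
        per = max(free // len(active), 1)
        for p in sorted(active):
            grant = min(per, demands[p] - shares[p], free)
            shares[p] += grant
            free -= grant
            if free == 0:
                break
        # Pools whose demand is satisfied stop claiming slots.
        active = {p for p in active if shares[p] < demands[p]}
    return shares

# A lone job monopolizes the cluster; once others arrive, slots even out.
assert fair_shares({"A": 100}, 12) == {"A": 12}
assert fair_shares({"A": 100, "B": 100, "C": 2}, 12) == {"A": 5, "B": 5, "C": 2}
```

Note how pool C's unused share (it only needs 2 slots) is redistributed to A and B, which matches the fair scheduler's intent of keeping the cluster fully used.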
Job pools can also be configured as required, for example by assigning shares depending on job type or by constraining the number of jobs executing at the same time. Naïve fair sharing supports multiple users: each user is assigned a job pool, which receives an equal share of cluster resources regardless of how many jobs the user has submitted [11]. The advantage of naïve fair sharing is that it can improve quality of service by supporting job-classification scheduling, in which resources are assigned by job type, and it makes full use of resources through manual configuration of the number of parallel jobs. However, it ignores the actual load condition of the nodes, which easily results in load imbalance across nodes.

3) Fair sharing with delay scheduling
Under naïve fair sharing, a job may be launched without local data. Fair sharing with delay scheduling is proposed to solve this problem: when a new job is submitted, the system chooses to delay task execution if the required data is not on the currently assigned node, allowing other jobs that meet data locality [3] to be scheduled first.

4) Capacity scheduling
Capacity scheduling supports multiple queues, and each queue is allocated a certain amount of resources within upper and lower limits. Each user can also be given a limit to prevent the misuse of resources. Idle resources can be allocated to heavily loaded queues dynamically [12]. Each queue uses FIFO scheduling. During scheduling, the scheduler chooses an appropriate queue by calculating the ratio between the number of running tasks and the allocated computing resources, choosing the queue with the smallest ratio; it then chooses a job by the priority and submission time of the job, while respecting per-user resource and memory limits [7].
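The queue-selection step just described, picking the queue with the smallest ratio of running tasks to allocated capacity, can be sketched as follows (queue fields are illustrative names, not Hadoop's implementation):

```python
def pick_queue(queues):
    """Capacity-scheduling sketch: serve the queue whose ratio of running
    tasks to allocated capacity is smallest, i.e. the most underserved
    queue; within a queue, jobs would then run in FIFO order."""
    return min(queues, key=lambda q: q["running"] / q["capacity"])

queues = [
    {"name": "research", "running": 8, "capacity": 10},   # ratio 0.8
    {"name": "reporting", "running": 2, "capacity": 10},  # ratio 0.2
]
assert pick_queue(queues)["name"] == "reporting"
```

The ratio test is what lets idle capacity flow to busy queues: a queue running far below its allocation always wins the next free slot.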
The advantages of the capacity scheduler are as follows: it improves resource utilization by supporting parallel execution of multiple jobs and sharing the cluster among multiple users; it has high flexibility and execution efficiency by dynamically adjusting resource allocation; and it can be used on large clusters and supports resource-intensive jobs. Its disadvantage is that users need to learn a large amount of system information in order to select and configure a queue.

5) HOD
HOD is not a true scheduler; it is designed to quickly build several independent virtual Hadoop clusters in a shared physical cluster, in order to serve different purposes. HOD allocates nodes for a virtual Hadoop cluster with the Torque resource manager according to the needs of the virtual cluster. Then, HOD starts every MapReduce and HDFS daemon on the allocated nodes and automatically creates configuration files for Hadoop daemons and clients [8]. The advantages of HOD are as follows: it adapts to changing load, so the physical cluster resources can be used with high efficiency, and its security is higher because nodes rarely share resources. However, its performance is inefficient, and its configuration relies on a solid understanding of cluster systems.

III. THE INTRODUCTION OF CLOUDRANK-D

CloudRank-D [13] is a benchmark suite for private clouds that helps researchers simulate various multi-user applications in industrial scenarios. As shown in Table I, the benchmark suite provides a set of 13 representative data analysis tools, covering basic operations for data analysis, classification, clustering, recommendation, sequence learning, association rule mining, and data warehouse operations. In the CloudRank-D design, the diversity of workload data characteristics is considered, including data semantics, data models, data sizes, and characteristics of data-centric computation, e.g., the ratio of the size of data input to that of data output. Moreover, users can easily increase or decrease both the data and workload scales according to their requirements, e.g., for different cluster sizes.

TABLE I. DATA SOURCES OF EACH PROGRAM IN CLOUDRANK-D [13]
Applications: Sort, Word count, Grep, Naive Bayes, Support vector machine, K-means, Item-based collaborative filtering, Frequent pattern growth, Hidden Markov model, Grep select, Ranking select, User visits aggregation, User visits-rankings join.
Data sources: automatically generated data, news and Wikipedia articles, scientist search, Sougou corpus, ratings on movies, retail market basket data, click-stream data of an on-line news portal, traffic accident data, a collection of web HTML documents, and automatically generated tables.

Another important attribute of the workloads is input data size. CloudRank-D follows the distribution of input data sizes reported in a workload characterization of a large e-commerce site, Taobao [17]. Table III shows the percentage of different input data sizes in our experiments. The total size of the input data of all workloads is about 1.7 TB.

TABLE III. THE PERCENTAGE OF DIFFERENT INPUT DATA SIZES IN WORKLOADS
Input data size / Percentage: <25MB, 40.57%; 25MB-625MB, 39.33%; 1.2GB-5GB, 12.03%; >5GB, 8.07%.

IV. EVALUATION

We use the CloudRank-D benchmark to evaluate five typical Hadoop schedulers: FIFO, capacity, naïve fair sharing, fair sharing with delay scheduling, and HOD.

A. Benchmarks and Methodology

The real applications running in private clouds for big data can vary dynamically [14]. In order to evaluate the Hadoop task schedulers properly, and considering the scale of the cluster we used, we synthesize a 100-job mixed workload with CloudRank-D. Please note that with CloudRank-D, users can scale the workload traces up or down in terms of both data and workload scales according to their requirements; please refer to [15] for details. Table II shows the workload breakdown in CloudRank-D in terms of job number: 31% basic operations, 35% data mining operations, and 34% data warehouse operations. As shown in Fig. 1, according to the explanation in [15], the ratios of the different workloads are consistent with the usage percentages of applications in private clouds reported in [16].

TABLE II. WORKLOAD BREAKDOWN IN CLOUDRANK-D
Basic Operations: Sort (9 jobs), Word count (11), Grep (11).
Data Mining Operations: Naïve Bayes (6), Support vector machine (6), K-means (7), Item-based collaborative filtering (3), Frequent pattern growth (7), Hidden Markov model (6).
Data Warehouse Operations: Grep select, Ranking select, User visits aggregation, User visits-rankings join (34 jobs in total).

Figure 1. Usage percentages of applications in private clouds reported in [16]: data storage 17%, reporting 17%, text indexing 16%, log processing 15%, web crawling 15%, machine learning 11%, data mining 7%, image processing 2%.

As reported in [3], job submission at Facebook roughly follows an exponential distribution with a mean of 14 seconds. With CloudRank-D, workloads are submitted in a random order following this distribution.

Regarding metrics, we select turnaround time (the total time taken from the submission of a job until the end of its execution), a performance metric from the users' perspective, and throughput (the number of jobs finished per minute) to measure system capacity. We then break turnaround time down into average running time (the execution time of a job) and average waiting time (the time a job waits in the queue). We also choose data processed per second (DPS) to measure data processing capability. According to [13][15], data processed per second is defined as the total amount of input data of all jobs divided by the total running time from the submission time of the first job to the finish time of the last job.

Hadoop parameter configuration can have a great impact on system performance. In order to ensure that all five schedulers are evaluated in their best condition, we optimize the Hadoop configuration according to the physical nodes in the cluster before the evaluation. We then tune each scheduler according to the cluster configuration and workload characteristics, and modify the Hadoop configuration file to run the same mixed workload with each of the five schedulers in turn.

B. Testbed

To evaluate the Hadoop task schedulers, we deploy a Hadoop cluster with 5 nodes (one NameNode and four DataNodes). Each node has the same hardware configuration: an Intel Xeon E5645 CPU, 16 GB memory, and 8 TB disk. The nodes are connected by a Gigabit switch. Each node also has the same software stack: CentOS 5.5, Hadoop 1.0.2, Mahout 0.6, and Hive 0.11. Table IV lists the detailed parameter configuration of the Hadoop cluster.

TABLE IV. CONFIGURATION DETAILS OF NODES
CPU type: Intel Xeon E5645
CPU cores: 6
L1 D/I cache: 32 KB
L2 cache: KB
L3 cache: 12 MB
Memory: 16 GB
Disk: 8 TB
OS: CentOS 5.5
Hadoop: 1.0.2
Mahout: 0.6
Hive: 0.11

C. Hadoop Scheduler Evaluation

Before the evaluation, we optimize the Hadoop configuration; some Hadoop parameters and their values are shown in Table V. We also tune the performance of the five schedulers according to the cluster configuration and workload characteristics. Then we run the mixed workload with each of the five schedulers by changing the Hadoop profile; the submission sequence and job intervals are kept the same in all runs. A shell script that submits the 100 jobs is generated automatically, and Hadoop logs are recorded for debugging.

TABLE V. HADOOP PARAMETERS AND THEIR VALUES
mapred.tasktracker.map.tasks.maximum = 12: the maximum number of map tasks that will be executed simultaneously by a task tracker.
mapred.tasktracker.reduce.tasks.maximum = 12: the maximum number of reduce tasks that will be executed simultaneously by a task tracker.
mapred.map.tasks = 48: the maximum number of concurrently running map tasks.
mapred.reduce.tasks = 45: the maximum number of concurrently running reduce tasks.
dfs.replication = 2: the actual number of replications specified when a file is created.
mapreduce.tasktracker.outofband.heartbeat = TRUE: enables the out-of-band heartbeat.

1) Data Processed per Second
The data processing rate reflects the performance of the private cloud system as a whole. Fig. 2 shows the total running time of the five schedulers over the whole workload; the total running time is about 5-6 hours.

Figure 2. The total running time (10³ s) of running the full workload with the five schedulers respectively.

As shown in Fig. 3, because fair sharing with delay scheduling solves the locality problem, its DPS is 1.17 times that of naïve fair sharing and 1.2 times that of the FIFO scheduler. The capacity scheduler enhances system utilization and improves DPS by 4% over the FIFO scheduler.
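The delay rule behind this locality gain, described in Section II, can be sketched in a few lines. The job fields (`local_nodes`, `waiting_since`) and the `max_wait` threshold are illustrative assumptions, not Hadoop's actual implementation:

```python
def pick_job(jobs_in_fair_order, node, now, max_wait=5.0):
    """Delay-scheduling sketch: the job that is next by fairness skips its
    turn for up to max_wait seconds when the free node holds none of its
    input blocks, letting later jobs launch node-local tasks instead."""
    for job in jobs_in_fair_order:
        if node in job["local_nodes"]:
            job["waiting_since"] = None      # launched locally, reset timer
            return job
        if job.get("waiting_since") is None:
            job["waiting_since"] = now       # start this job's wait timer
        elif now - job["waiting_since"] >= max_wait:
            return job                       # waited long enough, run non-locally
        # Otherwise skip this job and let a later one use the node.
    return None
```

The small per-job wait trades a little latency for a large locality gain, which is exactly the effect visible in the DPS numbers above.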

Figure 3. The data processed per second (MB/s) of running the full workload with the five schedulers respectively.

2) Turnaround Time
Turnaround time gives an intuitive view of system performance at the user level. Fig. 4 shows the average turnaround time of the five schedulers. Compared to the HOD and FIFO schedulers, the naïve fair, fair with delay, and capacity schedulers significantly reduce the average job turnaround time; fair with delay scheduling consumes the least time. This is mainly because these three schedulers can reduce network cost. At the same time, these three schedulers limit the number of concurrent jobs; this limitation reduces the cost of scheduling among the jobs in the task pool, giving the system a performance bonus. However, the Hadoop log files show that fair with delay scheduling can hurt some jobs with large data sizes: if a job does not meet data locality and the pool has other jobs to run, the job without data locality has to wait for some time, so some large jobs take longer to finish than usual. Because of the cost of competition among concurrent jobs in the queue, the FIFO scheduler did not perform very well, and due to the extra virtualization cost, the HOD scheduler performed even worse than the FIFO scheduler.

Figure 4. The average job turnaround time (10³ s) of running the full workload with the five schedulers respectively.

a) Running time
Fig. 5 presents the average running time of the five schedulers. By limiting the number of concurrent jobs, Hadoop can enhance performance. Because it improves data locality, fair with delay scheduling reduces the average running time significantly, to only 29% of that of the FIFO scheduler and 26% of that of the HOD scheduler. The HOD scheduler has the longest average running time.

b) Waiting time
Fig. 6 shows the average waiting time of the five schedulers.
Waiting time is the time a job spends in the queue waiting to be executed. Because the naïve fair, fair with delay, and capacity schedulers limit the number of parallel jobs, their average waiting times are much longer than those of the other two schedulers. Fair with delay scheduling enhances system throughput and releases resources faster than the fair scheduler, so its average waiting time is lower than that of the fair scheduler without the delay algorithm. The limit on the number of concurrent jobs in the capacity scheduler is a little higher than in the fair scheduler, so its average waiting time is shorter than that of the fair scheduler. As for the FIFO and HOD schedulers, the mixed workload did not reach their maximum number of concurrent jobs and submitted jobs ran in parallel, so their average waiting time is almost zero. It can thus be seen that the cost of job scheduling in Hadoop is considerable when many jobs execute in parallel, so limiting the number of concurrent jobs to a certain level can significantly improve overall system performance.

Figure 5. The average job running time (10³ s) of running the full workload with the five schedulers respectively.

Figure 6. The average job waiting time (seconds) of running the full workload with the five schedulers respectively.
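The four time metrics used in this section follow directly from per-job timestamps. A small helper, with hypothetical field names, illustrates the definitions given in Section IV.A:

```python
def summarize(jobs):
    """Compute the evaluation metrics from per-job records (hypothetical
    fields: submit, start, finish in seconds; input_mb in megabytes).
    turnaround = finish - submit, running = finish - start,
    waiting = start - submit; throughput and DPS use the full makespan."""
    n = len(jobs)
    avg_turnaround = sum(j["finish"] - j["submit"] for j in jobs) / n
    avg_running = sum(j["finish"] - j["start"] for j in jobs) / n
    avg_waiting = sum(j["start"] - j["submit"] for j in jobs) / n
    makespan = max(j["finish"] for j in jobs) - min(j["submit"] for j in jobs)
    throughput = n / (makespan / 60.0)                  # jobs per minute
    dps = sum(j["input_mb"] for j in jobs) / makespan   # MB processed per second
    return avg_turnaround, avg_running, avg_waiting, throughput, dps
```

For example, two back-to-back 60-second jobs submitted together, each over 600 MB of input, give an average turnaround of 90 s, an average waiting time of 30 s, a throughput of 1 job/min, and a DPS of 10 MB/s.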

3) Throughput
Throughput lets system administrators evaluate system performance clearly: higher throughput indicates that the system completes more jobs per unit time and that system resources are utilized sufficiently. Fig. 7 presents the throughput (number of jobs processed per minute) of the five schedulers under the mixed workload. Because the mixed workload contains many large jobs, the throughput values are low. By solving the locality problem, fair with delay scheduling gains an outstanding improvement in throughput, 1.2 times that of the default FIFO scheduler. Although naïve fair sharing and the capacity scheduler have different turnaround times, their throughputs are almost the same, and both are higher than those of the FIFO and HOD schedulers.

Figure 7. The throughput (number of jobs processed per minute) of running the full workload with the five schedulers respectively.

From the above experiments, we find that fair with delay scheduling is the most efficient scheduler, as shown by its DPS, average turnaround time, average running time, and throughput; however, due to the effect of data locality, some jobs with large data sizes take longer to finish than usual. For the mixed workload, naïve fair sharing and the capacity scheduler have nearly the same performance in the two administrator-perspective metrics, throughput and DPS, but naïve fair has the shorter average turnaround time and capacity has the shorter average waiting time. Fair with delay scheduling, naïve fair, and capacity all perform better than the default FIFO scheduler. The HOD scheduler did not perform very well, affected by the extra cost of virtualization.

V. CONCLUSIONS

With the popularization of Hadoop, numerous private clouds based on Hadoop clusters have been established by many organizations.
Since these data centers are often used by multiple users, optimizing the performance of Hadoop clusters is necessary and significant. In this paper, we evaluate the five common Hadoop schedulers by using CloudRank-D, a publicly available benchmark suite for private clouds. Throughput, turnaround time, and data processed per second are chosen as the evaluation metrics. Experimental results show that different task schedulers influence system performance differently in different situations, so the choice of task scheduler is critical for improving the system performance of a Hadoop cluster. In particular, fair sharing with delay scheduling can improve data locality and reduce the cost of network communication between nodes, and its DPS is 20% higher than that of the FIFO scheduler. The optimization and design of a scheduler need to take the characteristics of the workload into account. In the future, we will use more complex workloads to study and evaluate more efficient task schedulers for Hadoop-based cloud systems.

ACKNOWLEDGMENT

Our work is supported in part by the National Natural Science Foundation of China under Grant No. and the National Key Technology R&D Program of China under Contract No. 2012BAH23B03.

REFERENCES

[1] J. Zhan, L. Wang, X. Li, W. Shi, C. Weng, W. Zhang, and X. Zang, "Cost-aware cooperative resource provisioning for heterogeneous workloads in data centers," IEEE Trans. Computers, in press.
[2] M. Zaharia, D. Borthakur, J. S. Sarma, K. Elmeleegy, S. Shenker, and I. Stoica, "Job scheduling for multi-user MapReduce clusters," Technical Report UCB/EECS, EECS Department, University of California at Berkeley, April.
[3] M. Zaharia, D. Borthakur, J. S. Sarma, K. Elmeleegy, S. Shenker, and I. Stoica, "Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling," Proc. 5th European Conference on Computer Systems (EuroSys 2010), ACM Press, April 2010.
[4] L. Wang, J. Zhan, W. Shi, and Y. Liang, "In cloud, can scientific communities benefit from the economies of scale?" IEEE Trans. Parallel and Distributed Systems, vol. 23, Feb. 2012.
[5] W. Gao, Y. Zhu, Z. Jia, C. Luo, L. Wang, Z. Li, et al., "BigDataBench: a big data benchmark suite from web search engines," Proc. Third Workshop on Architectures and Systems for Big Data (ASBD 2013), in conjunction with the 40th International Symposium on Computer Architecture (ISCA 2013), arXiv preprint, May 2013.
[6] Apache Hadoop.
[7] Capacity Scheduler Guide.
[8] Hadoop On Demand Documentation.
[9] TORQUE Resource Manager.
[10] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, January 2008.
[11] Fair Scheduler Documentation.
[12] The Hadoop Map-Reduce Capacity Scheduler.
[13] C. Luo, J. Zhan, Z. Jia, L. Wang, G. Lu, L. Zhang, et al., "CloudRank-D: benchmarking and ranking cloud computing systems for data processing applications," Frontiers of Computer Science, vol. 6, August 2012.
[14] Z. Jia, L. Wang, J. Zhan, L. Zhang, and C. Luo, "Characterizing data analysis workloads in data centers," Proc. IEEE International Symposium on Workload Characterization (IISWC 2013), arXiv preprint, July 2013.
[15] J. Quan, "CloudRank-D: a benchmark suite for private cloud systems," High Volume Computing (HVC) tutorial in conjunction with the 19th IEEE Symposium on High Performance Computer Architecture (HPCA 2013), February 2013, D_HPCA_tutorial.pdf.
[16] Hadoop Powered By List.
[17] Z. Ren, X. Xu, J. Wan, W. Shi, and M. Zhou, "Workload characterization on a production Hadoop cluster: a case study on Taobao," Proc. IEEE International Symposium on Workload Characterization (IISWC 2012), IEEE Press, November 2012, pp. 3-13.


More information

GraySort on Apache Spark by Databricks

GraySort on Apache Spark by Databricks GraySort on Apache Spark by Databricks Reynold Xin, Parviz Deyhim, Ali Ghodsi, Xiangrui Meng, Matei Zaharia Databricks Inc. Apache Spark Sorting in Spark Overview Sorting Within a Partition Range Partitioner

More information

Evaluating HDFS I/O Performance on Virtualized Systems

Evaluating HDFS I/O Performance on Virtualized Systems Evaluating HDFS I/O Performance on Virtualized Systems Xin Tang xtang@cs.wisc.edu University of Wisconsin-Madison Department of Computer Sciences Abstract Hadoop as a Service (HaaS) has received increasing

More information

Efficient Data Replication Scheme based on Hadoop Distributed File System

Efficient Data Replication Scheme based on Hadoop Distributed File System , pp. 177-186 http://dx.doi.org/10.14257/ijseia.2015.9.12.16 Efficient Data Replication Scheme based on Hadoop Distributed File System Jungha Lee 1, Jaehwa Chung 2 and Daewon Lee 3* 1 Division of Supercomputing,

More information

A Game Theory Based MapReduce Scheduling Algorithm

A Game Theory Based MapReduce Scheduling Algorithm A Game Theory Based MapReduce Scheduling Algorithm Ge Song 1, Lei Yu 2, Zide Meng 3, Xuelian Lin 4 Abstract. A Hadoop MapReduce cluster is an environment where multi-users, multijobs and multi-tasks share

More information

The Improved Job Scheduling Algorithm of Hadoop Platform

The Improved Job Scheduling Algorithm of Hadoop Platform The Improved Job Scheduling Algorithm of Hadoop Platform Yingjie Guo a, Linzhi Wu b, Wei Yu c, Bin Wu d, Xiaotian Wang e a,b,c,d,e University of Chinese Academy of Sciences 100408, China b Email: wulinzhi1001@163.com

More information

Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture

Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture He Huang, Shanshan Li, Xiaodong Yi, Feng Zhang, Xiangke Liao and Pan Dong School of Computer Science National

More information

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com Hadoop, Why? Need to process huge datasets on large clusters of computers

More information

Telecom Data processing and analysis based on Hadoop

Telecom Data processing and analysis based on Hadoop COMPUTER MODELLING & NEW TECHNOLOGIES 214 18(12B) 658-664 Abstract Telecom Data processing and analysis based on Hadoop Guofan Lu, Qingnian Zhang *, Zhao Chen Wuhan University of Technology, Wuhan 4363,China

More information

Apache Hadoop. Alexandru Costan

Apache Hadoop. Alexandru Costan 1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open

More information

Performance and Energy Efficiency of. Hadoop deployment models

Performance and Energy Efficiency of. Hadoop deployment models Performance and Energy Efficiency of Hadoop deployment models Contents Review: What is MapReduce Review: What is Hadoop Hadoop Deployment Models Metrics Experiment Results Summary MapReduce Introduced

More information

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Jeffrey D. Ullman slides. MapReduce for data intensive computing Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very

More information

MAPREDUCE/ [1] is a programming model for. An Optimized Algorithm for Reduce Task Scheduling

MAPREDUCE/ [1] is a programming model for. An Optimized Algorithm for Reduce Task Scheduling 794 JOURNAL OF COMPUTERS, VOL. 9, NO. 4, APRIL 2014 An Optimized Algorithm for Reduce Task Scheduling Xiaotong Zhang a, Bin Hu b, Jiafu Jiang c a,b,c School of Computer and Communication Engineering, University

More information

HiBench Introduction. Carson Wang (carson.wang@intel.com) Software & Services Group

HiBench Introduction. Carson Wang (carson.wang@intel.com) Software & Services Group HiBench Introduction Carson Wang (carson.wang@intel.com) Agenda Background Workloads Configurations Benchmark Report Tuning Guide Background WHY Why we need big data benchmarking systems? WHAT What is

More information

Hadoop Architecture. Part 1

Hadoop Architecture. Part 1 Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,

More information

Keywords: Big Data, HDFS, Map Reduce, Hadoop

Keywords: Big Data, HDFS, Map Reduce, Hadoop Volume 5, Issue 7, July 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Configuration Tuning

More information

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay Weekly Report Hadoop Introduction submitted By Anurag Sharma Department of Computer Science and Engineering Indian Institute of Technology Bombay Chapter 1 What is Hadoop? Apache Hadoop (High-availability

More information

Do You Feel the Lag of Your Hadoop?

Do You Feel the Lag of Your Hadoop? Do You Feel the Lag of Your Hadoop? Yuxuan Jiang, Zhe Huang, and Danny H.K. Tsang Department of Electronic and Computer Engineering The Hong Kong University of Science and Technology, Hong Kong Email:

More information

Federated Big Data for resource aggregation and load balancing with DIRAC

Federated Big Data for resource aggregation and load balancing with DIRAC Procedia Computer Science Volume 51, 2015, Pages 2769 2773 ICCS 2015 International Conference On Computational Science Federated Big Data for resource aggregation and load balancing with DIRAC Víctor Fernández

More information

Figure 1. The cloud scales: Amazon EC2 growth [2].

Figure 1. The cloud scales: Amazon EC2 growth [2]. - Chung-Cheng Li and Kuochen Wang Department of Computer Science National Chiao Tung University Hsinchu, Taiwan 300 shinji10343@hotmail.com, kwang@cs.nctu.edu.tw Abstract One of the most important issues

More information

CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES

CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES 1 MYOUNGJIN KIM, 2 CUI YUN, 3 SEUNGHO HAN, 4 HANKU LEE 1,2,3,4 Department of Internet & Multimedia Engineering,

More information

A USE CASE OF BIG DATA EXPLORATION & ANALYSIS WITH HADOOP: STATISTICS REPORT GENERATION

A USE CASE OF BIG DATA EXPLORATION & ANALYSIS WITH HADOOP: STATISTICS REPORT GENERATION A USE CASE OF BIG DATA EXPLORATION & ANALYSIS WITH HADOOP: STATISTICS REPORT GENERATION Sumitha VS 1, Shilpa V 2 1 M.E. Final Year, Department of Computer Science Engineering (IT), UVCE, Bangalore, gvsumitha@gmail.com

More information

Analysis and Modeling of MapReduce s Performance on Hadoop YARN

Analysis and Modeling of MapReduce s Performance on Hadoop YARN Analysis and Modeling of MapReduce s Performance on Hadoop YARN Qiuyi Tang Dept. of Mathematics and Computer Science Denison University tang_j3@denison.edu Dr. Thomas C. Bressoud Dept. of Mathematics and

More information

Mobile Storage and Search Engine of Information Oriented to Food Cloud

Mobile Storage and Search Engine of Information Oriented to Food Cloud Advance Journal of Food Science and Technology 5(10): 1331-1336, 2013 ISSN: 2042-4868; e-issn: 2042-4876 Maxwell Scientific Organization, 2013 Submitted: May 29, 2013 Accepted: July 04, 2013 Published:

More information

Non-intrusive Slot Layering in Hadoop

Non-intrusive Slot Layering in Hadoop 213 13th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing Non-intrusive Layering in Hadoop Peng Lu, Young Choon Lee, Albert Y. Zomaya Center for Distributed and High Performance Computing,

More information

BigDataBench. Khushbu Agarwal

BigDataBench. Khushbu Agarwal BigDataBench Khushbu Agarwal Last Updated: May 23, 2014 CONTENTS Contents 1 What is BigDataBench? [1] 1 1.1 SUMMARY.................................. 1 1.2 METHODOLOGY.............................. 1 2

More information

SCHEDULING IN CLOUD COMPUTING

SCHEDULING IN CLOUD COMPUTING SCHEDULING IN CLOUD COMPUTING Lipsa Tripathy, Rasmi Ranjan Patra CSA,CPGS,OUAT,Bhubaneswar,Odisha Abstract Cloud computing is an emerging technology. It process huge amount of data so scheduling mechanism

More information

BPOE Research Highlights

BPOE Research Highlights BPOE Research Highlights Jianfeng Zhan ICT, Chinese Academy of Sciences 2013-10- 9 http://prof.ict.ac.cn/jfzhan INSTITUTE OF COMPUTING TECHNOLOGY What is BPOE workshop? B: Big Data Benchmarks PO: Performance

More information

Towards a Resource Aware Scheduler in Hadoop

Towards a Resource Aware Scheduler in Hadoop Towards a Resource Aware Scheduler in Hadoop Mark Yong, Nitin Garegrat, Shiwali Mohan Computer Science and Engineering, University of Michigan, Ann Arbor December 21, 2009 Abstract Hadoop-MapReduce is

More information

The Performance Characteristics of MapReduce Applications on Scalable Clusters

The Performance Characteristics of MapReduce Applications on Scalable Clusters The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 wottri_k1@denison.edu ABSTRACT Many cluster owners and operators have

More information

An efficient Mapreduce scheduling algorithm in hadoop R.Thangaselvi 1, S.Ananthbabu 2, R.Aruna 3

An efficient Mapreduce scheduling algorithm in hadoop R.Thangaselvi 1, S.Ananthbabu 2, R.Aruna 3 An efficient Mapreduce scheduling algorithm in hadoop R.Thangaselvi 1, S.Ananthbabu 2, R.Aruna 3 1 M.E: Department of Computer Science, VV College of Engineering, Tirunelveli, India 2 Assistant Professor,

More information

Hadoop & its Usage at Facebook

Hadoop & its Usage at Facebook Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System dhruba@apache.org Presented at the The Israeli Association of Grid Technologies July 15, 2009 Outline Architecture

More information

A Task Scheduling Algorithm for Hadoop Platform

A Task Scheduling Algorithm for Hadoop Platform JOURNAL OF COMPUTERS, VOL. 8, NO. 4, APRIL 2013 929 A Task Scheduling Algorithm for Hadoop Platform Jilan Chen College of Computer Science, Beijing University of Technology, Beijing, China Email: bjut_chen@126.com

More information

Distributed Framework for Data Mining As a Service on Private Cloud

Distributed Framework for Data Mining As a Service on Private Cloud RESEARCH ARTICLE OPEN ACCESS Distributed Framework for Data Mining As a Service on Private Cloud Shraddha Masih *, Sanjay Tanwani** *Research Scholar & Associate Professor, School of Computer Science &

More information

http://www.paper.edu.cn

http://www.paper.edu.cn 5 10 15 20 25 30 35 A platform for massive railway information data storage # SHAN Xu 1, WANG Genying 1, LIU Lin 2** (1. Key Laboratory of Communication and Information Systems, Beijing Municipal Commission

More information

Hadoop. Apache Hadoop is an open-source software framework for storage and large scale processing of data-sets on clusters of commodity hardware.

Hadoop. Apache Hadoop is an open-source software framework for storage and large scale processing of data-sets on clusters of commodity hardware. Hadoop Source Alessandro Rezzani, Big Data - Architettura, tecnologie e metodi per l utilizzo di grandi basi di dati, Apogeo Education, ottobre 2013 wikipedia Hadoop Apache Hadoop is an open-source software

More information

GraySort and MinuteSort at Yahoo on Hadoop 0.23

GraySort and MinuteSort at Yahoo on Hadoop 0.23 GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters

More information

Survey on Job Schedulers in Hadoop Cluster

Survey on Job Schedulers in Hadoop Cluster IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 15, Issue 1 (Sep. - Oct. 2013), PP 46-50 Bincy P Andrews 1, Binu A 2 1 (Rajagiri School of Engineering and Technology,

More information

Hadoop & its Usage at Facebook

Hadoop & its Usage at Facebook Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System dhruba@apache.org Presented at the Storage Developer Conference, Santa Clara September 15, 2009 Outline Introduction

More information

Mobile Cloud Computing for Data-Intensive Applications

Mobile Cloud Computing for Data-Intensive Applications Mobile Cloud Computing for Data-Intensive Applications Senior Thesis Final Report Vincent Teo, vct@andrew.cmu.edu Advisor: Professor Priya Narasimhan, priya@cs.cmu.edu Abstract The computational and storage

More information

Hadoop implementation of MapReduce computational model. Ján Vaňo

Hadoop implementation of MapReduce computational model. Ján Vaňo Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed

More information

R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5

R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 Distributed data processing in heterogeneous cloud environments R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 1 uskenbaevar@gmail.com, 2 abu.kuandykov@gmail.com,

More information

Phoenix Cloud: Consolidating Different Computing Loads on Shared Cluster System for Large Organization

Phoenix Cloud: Consolidating Different Computing Loads on Shared Cluster System for Large Organization Phoenix Cloud: Consolidating Different Computing Loads on Shared Cluster System for Large Organization Jianfeng Zhan, Lei Wang, Bibo Tu, Yong Li, Peng Wang, Wei Zhou, Dan Meng Institute of Computing Technology

More information

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Introduction to Big Data! with Apache Spark UC#BERKELEY# Introduction to Big Data! with Apache Spark" UC#BERKELEY# This Lecture" The Big Data Problem" Hardware for Big Data" Distributing Work" Handling Failures and Slow Machines" Map Reduce and Complex Jobs"

More information

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE

More information

Locality and Network-Aware Reduce Task Scheduling for Data-Intensive Applications

Locality and Network-Aware Reduce Task Scheduling for Data-Intensive Applications Locality and Network-Aware Reduce Task Scheduling for Data-Intensive Applications Engin Arslan University at Buffalo (SUNY) enginars@buffalo.edu Mrigank Shekhar Tevfik Kosar Intel Corporation University

More information

Performance Modeling of MapReduce Jobs in Heterogeneous Cloud Environments Zhuoyao Zhang University of Pennsylvania zhuoyao@seas.upenn.

Performance Modeling of MapReduce Jobs in Heterogeneous Cloud Environments Zhuoyao Zhang University of Pennsylvania zhuoyao@seas.upenn. Performance Modeling of MapReduce Jobs in Heterogeneous Cloud Environments Zhuoyao Zhang University of Pennsylvania zhuoyao@seas.upenn.edu Ludmila Cherkasova Hewlett-Packard Labs lucy.cherkasova@hp.com

More information

An Experimental Approach Towards Big Data for Analyzing Memory Utilization on a Hadoop cluster using HDFS and MapReduce.

An Experimental Approach Towards Big Data for Analyzing Memory Utilization on a Hadoop cluster using HDFS and MapReduce. An Experimental Approach Towards Big Data for Analyzing Memory Utilization on a Hadoop cluster using HDFS and MapReduce. Amrit Pal Stdt, Dept of Computer Engineering and Application, National Institute

More information

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org June 3 rd, 2008

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org June 3 rd, 2008 Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org June 3 rd, 2008 Who Am I? Hadoop Developer Core contributor since Hadoop s infancy Focussed

More information

POSIX and Object Distributed Storage Systems

POSIX and Object Distributed Storage Systems 1 POSIX and Object Distributed Storage Systems Performance Comparison Studies With Real-Life Scenarios in an Experimental Data Taking Context Leveraging OpenStack Swift & Ceph by Michael Poat, Dr. Jerome

More information

Snapshots in Hadoop Distributed File System

Snapshots in Hadoop Distributed File System Snapshots in Hadoop Distributed File System Sameer Agarwal UC Berkeley Dhruba Borthakur Facebook Inc. Ion Stoica UC Berkeley Abstract The ability to take snapshots is an essential functionality of any

More information

DyScale: a MapReduce Job Scheduler for Heterogeneous Multicore Processors

DyScale: a MapReduce Job Scheduler for Heterogeneous Multicore Processors JOURNAL OF L A T E X CLASS FILES, VOL. 6, NO. 1, JULY 214 1 DyScale: a MapReduce Job Scheduler for Heterogeneous Multicore Processors Feng Yan, Member, IEEE, Ludmila Cherkasova, Member, IEEE, Zhuoyao Zhang,

More information

Big Data and Apache Hadoop s MapReduce

Big Data and Apache Hadoop s MapReduce Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23

More information

MapReduce and Hadoop Distributed File System

MapReduce and Hadoop Distributed File System MapReduce and Hadoop Distributed File System 1 B. RAMAMURTHY Contact: Dr. Bina Ramamurthy CSE Department University at Buffalo (SUNY) bina@buffalo.edu http://www.cse.buffalo.edu/faculty/bina Partially

More information

Journal of science STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS)

Journal of science STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS) Journal of science e ISSN 2277-3290 Print ISSN 2277-3282 Information Technology www.journalofscience.net STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS) S. Chandra

More information

Research Article Hadoop-Based Distributed Sensor Node Management System

Research Article Hadoop-Based Distributed Sensor Node Management System Distributed Networks, Article ID 61868, 7 pages http://dx.doi.org/1.1155/214/61868 Research Article Hadoop-Based Distributed Node Management System In-Yong Jung, Ki-Hyun Kim, Byong-John Han, and Chang-Sung

More information

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging

More information

Introduction to Hadoop

Introduction to Hadoop Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction

More information

Hadoop IST 734 SS CHUNG

Hadoop IST 734 SS CHUNG Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to

More information

Exploiting Cloud Heterogeneity for Optimized Cost/Performance MapReduce Processing

Exploiting Cloud Heterogeneity for Optimized Cost/Performance MapReduce Processing Exploiting Cloud Heterogeneity for Optimized Cost/Performance MapReduce Processing Zhuoyao Zhang University of Pennsylvania, USA zhuoyao@seas.upenn.edu Ludmila Cherkasova Hewlett-Packard Labs, USA lucy.cherkasova@hp.com

More information

Dell Reference Configuration for Hortonworks Data Platform

Dell Reference Configuration for Hortonworks Data Platform Dell Reference Configuration for Hortonworks Data Platform A Quick Reference Configuration Guide Armando Acosta Hadoop Product Manager Dell Revolutionary Cloud and Big Data Group Kris Applegate Solution

More information

DESIGN ARCHITECTURE-BASED ON WEB SERVER AND APPLICATION CLUSTER IN CLOUD ENVIRONMENT

DESIGN ARCHITECTURE-BASED ON WEB SERVER AND APPLICATION CLUSTER IN CLOUD ENVIRONMENT DESIGN ARCHITECTURE-BASED ON WEB SERVER AND APPLICATION CLUSTER IN CLOUD ENVIRONMENT Gita Shah 1, Annappa 2 and K. C. Shet 3 1,2,3 Department of Computer Science & Engineering, National Institute of Technology,

More information

Scheduling Algorithms in MapReduce Distributed Mind

Scheduling Algorithms in MapReduce Distributed Mind Scheduling Algorithms in MapReduce Distributed Mind Karthik Kotian, Jason A Smith, Ye Zhang Schedule Overview of topic (review) Hypothesis Research paper 1 Research paper 2 Research paper 3 Project software

More information

Open source Google-style large scale data analysis with Hadoop

Open source Google-style large scale data analysis with Hadoop Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical

More information

Unstructured Data Accelerator (UDA) Author: Motti Beck, Mellanox Technologies Date: March 27, 2012

Unstructured Data Accelerator (UDA) Author: Motti Beck, Mellanox Technologies Date: March 27, 2012 Unstructured Data Accelerator (UDA) Author: Motti Beck, Mellanox Technologies Date: March 27, 2012 1 Market Trends Big Data Growing technology deployments are creating an exponential increase in the volume

More information

TUNING THE PERFORMANCE OF HADOOP MAP REDUCE JOBS BY ALTERINGVARIOUS PARAMETERS

TUNING THE PERFORMANCE OF HADOOP MAP REDUCE JOBS BY ALTERINGVARIOUS PARAMETERS TUNING THE PERFORMANCE OF HADOOP MAP REDUCE JOBS BY ALTERINGVARIOUS PARAMETERS Dr. D Rajya Lakshmi 1, Mr. R Praveen Kumar 2, Mr. N K Sumanth 3 1 Professor of CSE, JNTUK-UCEV, Vizianagaram, AP, (India)

More information

Leveraging BlobSeer to boost up the deployment and execution of Hadoop applications in Nimbus cloud environments on Grid 5000

Leveraging BlobSeer to boost up the deployment and execution of Hadoop applications in Nimbus cloud environments on Grid 5000 Leveraging BlobSeer to boost up the deployment and execution of Hadoop applications in Nimbus cloud environments on Grid 5000 Alexandra Carpen-Amarie Diana Moise Bogdan Nicolae KerData Team, INRIA Outline

More information

Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications

Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications Ahmed Abdulhakim Al-Absi, Dae-Ki Kang and Myong-Jong Kim Abstract In Hadoop MapReduce distributed file system, as the input

More information

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of

More information

Reference Architecture and Best Practices for Virtualizing Hadoop Workloads Justin Murray VMware

Reference Architecture and Best Practices for Virtualizing Hadoop Workloads Justin Murray VMware Reference Architecture and Best Practices for Virtualizing Hadoop Workloads Justin Murray ware 2 Agenda The Hadoop Journey Why Virtualize Hadoop? Elasticity and Scalability Performance Tests Storage Reference

More information

FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data

FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez To cite this version: Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez. FP-Hadoop:

More information

Improving MapReduce Performance in Heterogeneous Environments

Improving MapReduce Performance in Heterogeneous Environments UC Berkeley Improving MapReduce Performance in Heterogeneous Environments Matei Zaharia, Andy Konwinski, Anthony Joseph, Randy Katz, Ion Stoica University of California at Berkeley Motivation 1. MapReduce

More information

Data-Intensive Computing with Map-Reduce and Hadoop

Data-Intensive Computing with Map-Reduce and Hadoop Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan humbetov@gmail.com Abstract Every day, we create 2.5 quintillion

More information

Large scale processing using Hadoop. Ján Vaňo

Large scale processing using Hadoop. Ján Vaňo Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine

More information

A Hadoop MapReduce Performance Prediction Method

A Hadoop MapReduce Performance Prediction Method A Hadoop MapReduce Performance Prediction Method Ge Song, Zide Meng, Fabrice Huet, Frederic Magoules, Lei Yu and Xuelian Lin University of Nice Sophia Antipolis, CNRS, I3S, UMR 7271, France Ecole Centrale

More information

Heterogeneous Workload Consolidation for Efficient Management of Data Centers in Cloud Computing

Heterogeneous Workload Consolidation for Efficient Management of Data Centers in Cloud Computing Heterogeneous Workload Consolidation for Efficient Management of Data Centers in Cloud Computing Deep Mann ME (Software Engineering) Computer Science and Engineering Department Thapar University Patiala-147004

More information

GeoGrid Project and Experiences with Hadoop

GeoGrid Project and Experiences with Hadoop GeoGrid Project and Experiences with Hadoop Gong Zhang and Ling Liu Distributed Data Intensive Systems Lab (DiSL) Center for Experimental Computer Systems Research (CERCS) Georgia Institute of Technology

More information

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after

More information

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Executive Summary The explosion of internet data, driven in large part by the growth of more and more powerful mobile devices, has created

More information